Machine Learning for Hackers: Case Studies and Algorithms to Get You Started
Drew Conway and John Myles White


To see how we can learn about a function's parameters in R, suppose that we had never used the read.delim function before and needed to read its help files. Alternatively, assume that we do not know that read.delim exists and need to find a function to read delimited data into a data frame. R offers several useful functions for searching for help:

?read.delim                  # Access a function's help file.
??base::delim                # Search for 'delim' in all help files for functions in 'base'.
help.search("delimited")     # Search for 'delimited' in all help files.
RSiteSearch("parsing text")  # Search for the term 'parsing text' on the R site.
ages <- read.csv('data/longevity.csv')

constant.guess <- with(ages, mean(AgeAtDeath))
with(ages, sqrt(mean((AgeAtDeath - constant.guess) ^ 2)))

smokers.guess <- with(subset(ages, Smokes == 1), mean(AgeAtDeath))
non.smokers.guess <- with(subset(ages, Smokes == 0), mean(AgeAtDeath))

ages <- transform(ages,
                  NewPrediction = ifelse(Smokes == 0,
                                         non.smokers.guess,
                                         smokers.guess))
with(ages, sqrt(mean((AgeAtDeath - NewPrediction) ^ 2)))
library('ggplot2')

heights.weights <- read.csv('data/01_heights_weights_genders.csv',
                            header = TRUE,
                            sep = ',')

ggplot(heights.weights, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = 'lm')
coef(fitted.regression)
#(Intercept)      Height
#-350.737192    7.717288

Understanding this output is, in some sense, what it means to understand linear regression. The simplest way to make sense of the output from coef is to write out explicitly the relationship that its output implies:

intercept <- coef(fitted.regression)[1]
slope <- coef(fitted.regression)[2]

# predicted.weight <- intercept + slope * observed.height
# predicted.weight == -350.737192 + 7.717288 * observed.height
First, we want to get a sense of where our model is wrong. We do this by computing the model’s predictions and comparing them against the inputs. To find the model’s predictions, we use the predict function in R:

predict(fitted.regression)

Once we have our predictions in hand, we can calculate the difference between our predictions and the truth with a simple subtraction:

true.values <- with(heights.weights, Weight)
errors <- true.values - predict(fitted.regression)

The errors we calculate this way are called residuals because they’re the part of our data that’s left over after accounting for the part that a line can explain. You can get these residuals directly in R, without using the predict function, by calling the residuals function instead:

residuals(fitted.regression)
plot(fitted.regression, which = 1)

Note
Here we’ve asked R to plot only the first regression diagnostic plot by specifying which = 1. There are other plots you can get, and we encourage you to experiment to see if you find any of the others helpful.
The simplest measurement of error is to (1) take all the residuals, (2) square them to get the squared errors for our model, and (3) sum them together.

x <- 1:10
y <- x ^ 2

fitted.regression <- lm(y ~ x)

errors <- residuals(fitted.regression)
squared.errors <- errors ^ 2
sum(squared.errors)
This simple sum of squared errors quantity is useful for comparing different models, but it has some quirks that most people end up disliking.

First, the sum of squared errors is larger for big data sets than for small data sets. To convince yourself of this, imagine a fixed data set and the sum of squared errors for that data set. Now add one more data point that isn’t predicted perfectly. This new data point has to push the sum of squared errors up, because adding a number to the previous sum of squared errors can only make the sum larger. But there’s a simple way to solve that problem: take the mean of the squared errors rather than the sum. That gives us the MSE measure we used earlier in this chapter.

Although MSE won’t grow consistently as we get more data the way that the raw sum of squared errors will, MSE still has a problem: because we square the errors before taking their mean, MSE isn’t on the same scale as the original data. If every prediction is off by 5, for example, the mean squared error will be 25.

x <- 1:10
y <- x ^ 2

fitted.regression <- lm(y ~ x)

errors <- residuals(fitted.regression)
squared.errors <- errors ^ 2
mse <- mean(squared.errors)
mse
#[1] 52.8

The solution to this problem of scale is trivial: we take the square root of the mean squared error to get the root mean squared error, which is the RMSE measurement we also tried out earlier in this chapter. RMSE is a very popular measure of performance for assessing machine learning algorithms, including algorithms that are far more sophisticated than linear regression. For just one example, the Netflix Prize was scored using RMSE as the definitive metric of how well the contestants’ algorithms performed.

x <- 1:10
y <- x ^ 2

fitted.regression <- lm(y ~ x)

errors <- residuals(fitted.regression)
squared.errors <- errors ^ 2
mse <- mean(squared.errors)
rmse <- sqrt(mse)
rmse
#[1] 7.266361
One complaint that people have with RMSE is that it’s not immediately clear what counts as mediocre performance. Perfect performance clearly gives you an RMSE of 0, but the pursuit of perfection is not a realistic goal in these tasks. Likewise, it isn’t easy to recognize when a model is performing very poorly. For example, if everyone’s height is 5' and you predict 5,000', you’ll get a huge value for RMSE. And you can do worse than that by predicting 50,000', and still worse by predicting 5,000,000'. The unbounded values that RMSE can take on make it difficult to know whether your model’s performance is reasonable.

When you’re running linear regressions, the solution to this problem is to use R2 (pronounced “R squared”). The idea of R2 is to see how much better your model does than we’d do if we were just using the mean. To make it easy to interpret, R2 will always be between 0 and 1. If you’re doing no better than the mean, it will be 0. And if you predict every data point perfectly, R2 will be 1. Because R2 is always between 0 and 1, people tend to multiply it by 100 and say that it’s the percent of the variance in the data you’ve explained with your model. This is a handy way to build up an intuition for how accurate a model is, even in a new domain where you don’t have any experience about the standard values for RMSE. To calculate R2, you need to compute the…
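The passage is cut off here, but a minimal sketch of the standard R2 computation, assuming a fitted model named lm.fit and its response vector y (placeholders, not the book’s exact code), looks like this:

# Standard R-squared: 1 minus the ratio of the model's squared error to the
# squared error of a mean-only model. `lm.fit` and `y` are assumed to exist.
ss.residual <- sum(residuals(lm.fit) ^ 2)   # error left over after the fit
ss.total <- sum((y - mean(y)) ^ 2)          # error from predicting mean(y) everywhere
r.squared <- 1 - (ss.residual / ss.total)
r.squared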
This density plot (shown in Figure 5-7) looks just as impenetrable as the earlier scatterplot. When you see nonsensical density plots, it’s a good idea to try taking the log of the values you’re trying to analyze and make a density plot of those log-transformed values. We can do that with a simple call to R’s log function:

ggplot(top.1000.sites, aes(x = log(PageViews))) +
  geom_density()
This density plot (shown in Figure 5-8) looks much more reasonable. For that reason, we’ll start using the log-transformed PageViews and UniqueVisitors from now on. Recreating our earlier scatterplot on a log scale is easy:

ggplot(top.1000.sites, aes(x = log(PageViews), y = log(UniqueVisitors))) +
  geom_point()

Note
The ggplot2 package also contains convenience functions to change the scale of an axis to the log; you can use scale_x_log10 or scale_y_log10 in this case. Also, recall from our discussion in Chapter 4 that in some cases you will want to use the log1p function in R to avoid the errors related to taking the log of zero. In this case, however, that is not a problem.
The resulting scatterplot, shown in Figure 5-9, looks like there’s a potential line to be drawn using regression. Before we use lm to fit a regression line, let’s use geom_smooth with the method = 'lm' argument to see what the regression line will look like:

ggplot(top.1000.sites, aes(x = log(PageViews), y = log(UniqueVisitors))) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
The “Std. Error”, for instance, can be used to produce a 95% confidence interval for the coefficient. The interpretation of this interval can be confusing, and it is occasionally presented in a misleading way. The interval indicates the bounds for which we can say, “95% of the time, the algorithm we use to construct intervals will include the true coefficient inside of the intervals it produces.”
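As a concrete illustration (our addition, not code from the book), R’s built-in confint function computes these intervals directly from a fitted lm object:

# 95% confidence intervals for the intercept and slope of lm.fit,
# assuming lm.fit is the model fit with lm() earlier in this chapter.
confint(lm.fit, level = 0.95)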
The “t-value” and the p-value (written as “Pr(>|t|)” in the output from summary) are both measures of how confident we are that the true coefficient isn’t zero. This is used to say that we are confident that there is a real relationship between the output and the input we’re currently examining. For example, here we can use the “t value” column to assess how sure we are that PageViews really are related to UniqueVisitors. In our minds, these two numbers can be useful if you understand how to work with them, but they are sufficiently complicated that their usage has helped to encourage people to assume they’ll never fully understand statistics.
Warning t-values and p-values are useful for deciding whether a relationship between two columns in your data is real or just an accident of chance. Deciding that a relationship exists is valuable, but understanding the relationship is another matter entirely. Regressions can’t help you do that. People try to force them to, but in the end, if you want to understand the reasons why two things are related, you need more information than a simple regression can ever provide.
Indeed, you might have noticed in our earlier output that the “t value” for log(UniqueVisitors) was exactly the number of standard errors that the coefficient for log(UniqueVisitors) was away from 0. That relationship between a t-value and the number of standard errors a coefficient lies from 0 holds in general, so we suggest that you don’t work with p-values at all.
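To make that relationship concrete (our illustration, not code from the book), you can pull the coefficient table out of summary and divide the estimates by their standard errors:

# summary(lm.fit)$coefficients is a matrix with columns
# 'Estimate', 'Std. Error', 't value', and 'Pr(>|t|)'.
coef.table <- summary(lm.fit)$coefficients
coef.table[, 'Estimate'] / coef.table[, 'Std. Error']   # reproduces the 't value' column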
The final pieces of information we get are related to the predictive power of the linear model we’ve fit to our data. The first piece, the “Residual standard error,” is essentially the RMSE of the model, which we could compute with sqrt(mean(residuals(lm.fit) ^ 2)). (Strictly speaking, R divides the sum of squared residuals by the residual degrees of freedom rather than by the number of data points, so the two values differ slightly.) The “degrees of freedom” refers to the notion that we’ve effectively used up two data points in our analysis by fitting two coefficients: the intercept and the coefficient for log(UniqueVisitors). This number, 998, is relevant because it’s not very impressive to have a low RMSE if you’ve used 500 coefficients in your model to fit 1,000 data points. Using lots of coefficients when you have very little data is a form of overfitting that we’ll talk about more in Chapter 6.
After that, we see the “Multiple R-squared”. This is the standard “R-squared” we described earlier, which tells us what percentage of the variance in our data was explained by our model. Here we’re explaining 46% of the variance using our model, which is pretty good. The “Adjusted R-squared” is a second measure that penalizes the “Multiple R-squared” for the number of coefficients you’ve used. In practice, we personally tend to ignore this value because we think it’s slightly ad hoc, but there are many people who are very fond of it.
Finally, the last piece of information you’ll see is the “F-statistic.” This is a measure of the improvement of your model over using just the mean to make predictions. It’s an alternative to “R-squared” that allows one to calculate a “p-value.” Because we think that a “p-value” is usually deceptive, we encourage you to not put too much faith in the F-statistic. “p-values” have their uses if you completely understand the mechanism used to calculate them, but otherwise they can provide a false sense of security that will make you forget that the gold standard of model performance is predictive power on data that wasn’t used to fit your model, rather than the performance of your model on the data that it was fit to. We’ll talk about methods for assessing your model’s ability to predict new data in Chapter 6.
Now that we’ve introduced linear regression, let’s take a quick digression to discuss in more detail the word “correlation.” In the strictest sense, two variables are correlated if the relationship between them can be described using a straight line. In other words, correlation is just a measure of how well linear regression could be used to model the relationship between two variables. A correlation of 0 indicates that there’s no interesting line that relates the two variables. A correlation of 1 indicates that there’s a perfectly positive straight line (going up) that relates the two variables. And a correlation of –1 indicates that there’s a perfectly negative straight line (going down) that relates the two variables. To make all of…
How could we compute the correlation for ourselves rather than using cor? We can do that using lm, but first we need to scale both x and y. Scaling involves subtracting the mean of both variables and then dividing out by the standard deviation. You can perform scaling in R using the scale function:
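The code itself is missing from this excerpt; a minimal reconstruction consistent with the coefficients printed below (with x and y standing in for whatever pair of variables you’re examining) would be:

# Correlation recovered from a regression on scaled variables.
# The slope of this fit equals cor(x, y); the intercept is numerically zero.
cor(x, y)
coef(lm(scale(y) ~ scale(x)))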
#-1.386469e-16  9.745586e-01 As you can see, in this case the correlation between x and y is exactly equal to the coefficient relating the two in linear regression after scaling both of them. This is a general fact about how correlations work, so you can always use linear regression to help you envision exactly what it means for two variables to be correlated. Because correlation is just a measure of how linear the relationship between two variables is, it tells us nothing about causality. This leads to the maxim that “correlation is not causation.” Nevertheless, it’s very important to know whether two things are correlated if you want to use one of them to make predictions about the other.
One technical way to describe this is to say that regressions are good at interpolation but not very good at extrapolation.
We’ll use a sine wave to create a data set in which the relationship between x and y could never be described by a simple line.

set.seed(1)

x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)

df <- data.frame(X = x, Y = y)

ggplot(df, aes(x = X, y = Y)) +
  geom_point()
One way to do this is to follow the logic we exploited at the start of this chapter and add new features to our data set. This time we’ll add both the second power of x and the third power of x to give ourselves more wiggle room. As you see here, this change improves our predictive power considerably:

df <- transform(df, X2 = X ^ 2)
df <- transform(df, X3 = X ^ 3)
The problem here is that the new columns we’re adding with larger and larger powers of X are so correlated with the old columns that the linear regression algorithm breaks down and can’t find coefficients for all of the columns separately. Thankfully, there is a solution to this problem that can be found in the mathematical literature: instead of naively adding simple powers of x, we add more complicated variants of x that work like successive powers of x, but aren’t correlated with each other in the way that x and x^2 are. These variants on the powers of x are called orthogonal polynomials,[10] and you can easily generate them using the poly function in R. Instead of adding 14 powers of x to your data frame directly, you simply type poly(X, degree = 14) to transform x into something similar to X + X^2 + X^3 + ... + X^14, but with orthogonal columns that won’t generate a singularity when running lm.
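For example (our sketch, using the df, X, and Y names from this chapter), the full degree-14 model can be fit in one line:

# Fit a degree-14 polynomial using orthogonal polynomial columns,
# avoiding the singularities that raw powers of X would cause.
poly.fit <- lm(Y ~ poly(X, degree = 14), data = df)
summary(poly.fit)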
Let’s do the split and then talk about the details:

n <- length(x)

indices <- sort(sample(1:n, round(0.5 * n)))

training.x <- x[indices]
training.y <- y[indices]

test.x <- x[-indices]
test.y <- y[-indices]

training.df <- data.frame(X = training.x, Y = training.y)
test.df <- data.frame(X = test.x, Y = test.y)

Here we’ve constructed a random vector of indices that defines the training set. It’s always a good idea to randomly split the data when making a training set/test set split because you don’t want the training set and test set to differ systematically — as might happen, for example, if you put only the smallest values for x into the training set and the largest values into the test set. The randomness comes from the use of R’s sample function, which produces a random sample from a given vector. In our case we provide a vector of integers, 1 through n, and sample half of them. Once we’ve set the values for indices, we pull the training and test sets apart using R’s vector indexing rules. Finally, we construct a data frame to store the data because it’s easier to work with lm when using data frames.
Then we’ll loop over a set of polynomial degrees, which are the integers between 1 and 12:

performance <- data.frame()

for (d in 1:12)
{
  poly.fit <- lm(Y ~ poly(X, degree = d), data = training.df)

  performance <- rbind(performance,
                       data.frame(Degree = d,
                                  Data = 'Training',
                                  RMSE = rmse(training.y, predict(poly.fit))))

  performance <- rbind(performance,
                       data.frame(Degree = d,
                                  Data = 'Test',
                                  RMSE = rmse(test.y, predict(poly.fit,
                                                              newdata = test.df))))
}

During each iteration of this loop, we’re fitting a polynomial regression of degree d to the data in training.df and then assessing its performance on both training.df and test.df. We store the results in a data frame so that we can quickly analyze the results after finishing the loop.
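The rmse helper used inside this loop is not shown in this excerpt; a minimal definition consistent with how it is called here (true values first, predictions second) would be:

# Root mean squared error between a vector of true values y
# and a vector of predictions h.
rmse <- function(y, h)
{
  return(sqrt(mean((y - h) ^ 2)))
}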
Here we use a slightly different data frame construction technique than we have in the past. We begin by assigning an empty data frame to the variable performance, to which we then iteratively add rows with the rbind function. As we have mentioned before, there are often many ways to accomplish the same data manipulation task in R, but rarely is a loop of this nature the most efficient. We use it here because the data is quite small and it allows the process of the algorithm to be understood most clearly, but keep this in mind when writing your own cross-validation tests.
So we’ll use an alternative measure of complexity: we’ll say that a model is complicated when its coefficients are large. For example, we would say that the model y ~ 5 * x + 2 is more complicated than the model y ~ 3 * x + 2. By the same logic, we’ll say that the model y ~ 1 * x^2 + 1 * x + 1 is more complicated than the model y ~ 1 * x + 1. To use this definition in practice, we can fit a model using lm and then measure its complexity by summing over the values returned by coef:

lm.fit <- lm(y ~ x)

model.complexity <- sum(coef(lm.fit) ^ 2)

Here we’ve squared the coefficients before summing them so that they don’t cancel each other out when we add them all together. This sum-of-squares approach is based on the L2 norm. An alternative approach is to sum the absolute values of the coefficients instead; that approach is based on the L1 norm. Here we compute the two of them in R:

lm.fit <- lm(y ~ x)

l2.model.complexity <- sum(coef(lm.fit) ^ 2)
l1.model.complexity <- sum(abs(coef(lm.fit)))
For now, let’s talk about the tools you can use to work with regularization in R. For this chapter, we’ll stick to using the glmnet package, which provides a function called glmnet that fits linear models using regularization. To see how glmnet works, let’s go back one last time to our sine wave data:

set.seed(1)

x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)

To use glmnet, we first have to convert our vector copy of x to a matrix using the matrix function. After that, we call glmnet with the arguments in the reverse order of the one you would use with lm and its ~-operator syntax: the predictors come first and the response second.

x <- matrix(x)

library('glmnet')

glmnet(x, y)
But this time we’ll loop over values of Lambda instead of degrees. Thankfully, we don’t have to refit the model each time, because glmnet stores the fitted model for many values of Lambda after a single fitting step.

library('glmnet')

glmnet.fit <- with(training.df, glmnet(poly(X, degree = 10), Y))

lambdas <- glmnet.fit$lambda

performance <- data.frame()

for (lambda in lambdas)
{
  performance <- rbind(performance,
                       data.frame(Lambda = lambda,
                                  RMSE = rmse(test.y,
                                              with(test.df,
                                                   predict(glmnet.fit,
                                                           poly(X, degree = 10),
                                                           s = lambda)))))
}

Having computed the model’s performance for different values of Lambda, we can construct a plot to see where in the range of lambdas we’re testing we can get the best performance on new data:

ggplot(performance, aes(x = Lambda, y = RMSE)) +
  geom_point() +
  geom_line()
Looking at Figure 6-7, it seems like we get the best possible performance with Lambda near 0.05. So to fit a model to the full data, we can select that value and train our model on the entire data set. Here we do just that:

best.lambda <- with(performance, Lambda[which(RMSE == min(RMSE))])

glmnet.fit <- with(df, glmnet(poly(X, degree = 10), Y))

After fitting our final model to the whole data set, we can use coef to examine the structure of our regularized model:

coef(glmnet.fit, s = best.lambda)

#11 x 1 sparse Matrix of class "dgCMatrix"
#                     1
#(Intercept)  0.0101667
#1           -5.2132586
#2            0.0000000
#3            4.1759498
#4            0.0000000
#5           -0.4643476
#6            0.0000000
#7            0.0000000
#8            0.0000000
#9            0.0000000
#10           0.0000000
To start, let’s add class labels to our data set: y <- rep(c(1, 0), each = 50)
To start, we’ll fit a logistic regression to the entire data set just so you can see how it’s done: regularized.fit <- glmnet(x, y, family = 'binomial') The only difference between this call to glmnet and the call we made earlier to perform linear regression is that we set an additional parameter, family, which controls the type of errors you expect to see when making predictions. We didn’t discuss this in Chapter 5, but linear regression assumes that the errors you see have a Gaussian distribution, whereas logistic regression assumes that the errors are binomially distributed. We won’t discuss the details of the binomial distribution, but you should know that it’s the statistical distribution of data you’d get from flipping a coin. For that reason, the binomial distribution produces errors that are all 0s or 1s, which, of course, is what we need for classification.
To elaborate on that, let’s show you three calls you could make to glmnet:

regularized.fit <- glmnet(x, y)
regularized.fit <- glmnet(x, y, family = 'gaussian')
regularized.fit <- glmnet(x, y, family = 'binomial')

The first call is the one we made earlier in this chapter to perform linear regression. The second call is actually equivalent to the first call, except that we’ve made explicit the default value of the family parameter, which is set to 'gaussian' automatically. The third call is the way we perform logistic regression. As you can see, switching an analysis from linear regression to logistic regression is just a matter of switching the error family.[11]
The second is to convert these raw predictions into probabilities, which we can more easily interpret — though we’d have to do the thresholding again at 0.5 to get the 0/1 predictions we generated previously. To convert raw logistic regression predictions into probabilities, we’ll use the inv.logit function from the boot package:

library('boot')

inv.logit(predict(regularized.fit, newx = x, s = 0.001))
#1   0.992494427
#2   0.998132627
#3   0.992550485
...
#98  0.002578403
#99  0.003411583
#100 0.006989922

If you want to know the details of how inv.logit works, we recommend looking at the source code for that function to figure out the mathematical transformation that it’s computing. For our purposes, we just want you to understand that logistic regression expects its outputs to be transformed through the inverse logit function before you can produce probabilities as outputs. For that reason, logistic regression is often called the logit model.
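For reference (our illustration, not from the book), the inverse logit is just the logistic function, so the transformation that inv.logit applies is easy to write by hand:

# The inverse logit (logistic) function: maps any real number z
# to a probability between 0 and 1.
manual.inv.logit <- function(z)
{
  return(1 / (1 + exp(-z)))
}

# Should match boot::inv.logit on the same inputs.
manual.inv.logit(c(-2, 0, 2))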
As you’ll notice, the code to do this is essentially identical to the code we used earlier to fit our rank prediction model using linear regression:

set.seed(1)

performance <- data.frame()

for (i in 1:250)
{
  indices <- sample(1:100, 80)

  training.x <- x[indices, ]
  training.y <- y[indices]

  test.x <- x[-indices, ]
  test.y <- y[-indices]

  for (lambda in c(0.0001, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1))
  {
    glm.fit <- glmnet(training.x, training.y, family = 'binomial')

    predicted.y <- ifelse(predict(glm.fit, test.x, s = lambda) > 0, 1, 0)

    error.rate <- mean(predicted.y != test.y)

    performance <- rbind(performance,
                         data.frame(Lambda = lambda,
                                    Iteration = i,
                                    ErrorRate = error.rate))
  }
}

The algorithmic changes in this code snippet relative to the one we used for linear regression are few: (1) the calls to glmnet, where we use the binomial error family parameter for logistic regression; (2) the thresholding step that produces 0/1 predictions from the raw logistic predictions; and (3) the use of error rates rather than RMSE as our measure of model performance. Another change you might notice is that we’ve chosen to perform 250 splits of the data rather than 50 so that we can get a better sense of our mean error rate for each setting of lambda. We’ve done this because the error rates end up being very close to 50%, and we wanted to confirm that we’re truly doing better than chance when making predictions.
Let’s go through some code that implements this approach:

squared.error <- function(heights.weights, a, b)
{
  predictions <- with(heights.weights, height.to.weight(Height, a, b))
  errors <- with(heights.weights, Weight - predictions)
  return(sum(errors ^ 2))
}

Let’s evaluate squared.error at some specific values of a and b to get a sense of how this works (the results are in Table 7-1):

for (a in seq(-1, 1, by = 1))
{
  for (b in seq(-1, 1, by = 1))
  {
    print(squared.error(heights.weights, a, b))
  }
}
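The height.to.weight helper that squared.error calls is not included in this excerpt; a minimal definition consistent with how a and b are used here, assuming a is the intercept and b the slope of a linear prediction rule, would be:

# Predict weight as a linear function of height: intercept a, slope b.
height.to.weight <- function(height, a, b)
{
  return(a + b * height)
}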
Code Breaking as Optimization

Moving beyond regression models, almost every algorithm in machine learning can be viewed as an optimization problem in which we try to minimize some measure of prediction error. But sometimes our parameters aren’t simple numbers, and so evaluating your error function at a single point doesn’t give you enough information about nearby points to use optim. For these problems, we could use grid search, but there are other approaches that work better than grid search. We’ll focus on one approach that’s fairly intuitive and very powerful. The big idea behind this new approach, which we’ll call stochastic optimization, is to move through the range of possible parameters slightly randomly, but making sure to head in directions where your error function tends to go down rather than up. This approach is related to a lot of popular optimization algorithms that you may have heard of, including simulated annealing, genetic algorithms, and Markov chain Monte Carlo (MCMC). The specific algorithm we’ll use is called the Metropolis method; versions of the Metropolis method power a lot of modern machine learning algorithms.
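To make the idea concrete, here is a minimal Metropolis-style sketch over a single numeric parameter (our toy illustration, not the book’s code-breaking implementation; the error function and the fixed probability of accepting an uphill move are simplifying assumptions, standing in for the usual temperature-based acceptance rule):

# Toy Metropolis-style optimization of one parameter `theta`.
# error.fn is the quantity we want to make small.
error.fn <- function(theta)
{
  return((theta - 3) ^ 2)
}

set.seed(1)

theta <- 0  # starting guess

for (i in 1:10000)
{
  proposal <- theta + rnorm(1, 0, 0.1)  # small random local move

  # Always accept moves that lower the error; occasionally accept
  # moves that raise it, so the search doesn't get stuck.
  if (error.fn(proposal) < error.fn(theta) || runif(1) < 0.1)
  {
    theta <- proposal
  }
}

theta  # should end up near 3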
Our raw data set isn’t in the format we’d like to work with, so we need to do some preprocessing. The first step is to translate all of the raw datestamps in our data set to properly encoded date variables. To do that, we use the lubridate package from CRAN. This package provides a nice function called ymd that translates strings in year-month-day format into date objects:

library('lubridate')

prices <- transform(prices, Date = ymd(Date))

Once we’ve done this, we can use the cast function in the reshape package to create a data matrix like the table we saw earlier in this chapter. In this table, the rows will be days and the columns will be separate stocks. We do this as follows:

library('reshape')

date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

The cast function has you specify which column should be used to define the rows in the output matrix on the lefthand side of the tilde; the columns of the result are specified after the tilde. The actual entries in the result are specified using value.
After exploring the result of producing this big date-stock matrix, we notice that there are some missing entries. So we go back to the prices data set, remove the problem entries, and then rerun cast:

prices <- subset(prices, Date != ymd('2002-02-01'))
prices <- subset(prices, Stock != 'DDR')

date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')
Once that’s done, we can find the correlations between all of the numeric columns in the matrix using the cor function. After doing that, we turn the correlation matrix into a single numeric vector and create a density plot of the correlations to get a sense of both a) the mean correlation and b) the frequency with which low correlations occur:

cor.matrix <- cor(date.stock.matrix[, 2:ncol(date.stock.matrix)])
correlations <- as.numeric(cor.matrix)

ggplot(data.frame(Correlation = correlations),
       aes(x = Correlation, fill = 1)) +
  geom_density() +
  theme(legend.position = 'none')
After doing that, we extract the parts of the DJI we’re interested in, which are the daily closing prices and the dates on which they were recorded. Because they’re in the opposite order of our current data set, we use the rev function to reverse them:

dji <- with(dji.prices, rev(Close))
dates <- with(dji.prices, rev(Date))
Now we can make some simple graphical plots to compare our market index generated using PCA with the DJI:

comparison <- data.frame(Date = dates,
                         MarketIndex = market.index,
                         DJI = dji)

ggplot(comparison, aes(x = MarketIndex, y = DJI)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
We can easily make that comparison. First, we use the melt function to get a data.frame that’s easy to work with for visualizing both indices at once. Then we make a line plot in which the x-axis is the date and the y-axis is the price of each index.

alt.comparison <- melt(comparison, id.vars = 'Date')

names(alt.comparison) <- c('Date', 'Index', 'Price')

ggplot(alt.comparison, aes(x = Date, y = Price, group = Index, color = Index)) +
  geom_point() +
  geom_line()

Our first pass doesn’t look so good, because the DJI takes on very high values, whereas our index takes on very small values. But we can fix that using scale, which puts both indices on a common scale:

comparison <- transform(comparison, MarketIndex = -scale(MarketIndex))
comparison <- transform(comparison, DJI = scale(DJI))

alt.comparison <- melt(comparison, id.vars = 'Date')

names(alt.comparison) <- c('Date', 'Index', 'Price')

p <- ggplot(alt.comparison, aes(x = Date, y = Price, group = Index, color = Index)) +
  geom_point() +
  geom_line()
print(p)
The files are located at data/roll_call/, so we use the list.files function to generate a character vector called data.files with all of the data filenames in it.

library(foreign)
library(ggplot2)

data.dir <- "data/roll_call/"
data.files <- list.files(data.dir)
The winning entries for the contest all used kNN as part of their recommendation algorithm, though it was generally only one part of many for the winning algorithms. The other important algorithm that was incorporated into most of the winning solutions is called matrix factorization. We won’t describe the algorithm in detail, but anyone building a production-level recommendation system should consider combining recommendations from both kNN and matrix factorization models. In fact, the best systems often combine kNN, matrix factorization, and other classifiers together into a super-model. The tricks for combining many classifiers together are usually called ensemble methods. We’ll discuss them briefly in Chapter 12.
What we’re going to do is convert this “long form” data into a “wide form” of data in which each row corresponds to a user and each column corresponds to a package. We can do that using the cast function from the reshape package:

library('reshape')

user.package.matrix <- cast(installations, User ~ Package, value = 'Installed')

user.package.matrix[, 1]
# [1]  1  3  4  5  6  7  8  9 11 13 14 15 16 19 21 23 25 26 27 28 29 30 31 33 34
#[26] 35 36 37 40 41 42 43 44 45 46 47 48 49 50 51 54 55 56 57 58 59 60 61 62 63
#[51] 64 65

user.package.matrix[, 2]
# [1] 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1
#[39] 1 1 1 1 1 1 1 1 0 1 1 1 1 1

row.names(user.package.matrix) <- user.package.matrix[, 1]

user.package.matrix <- user.package.matrix[, -1]

First, we’ve used cast to make our user-package matrix. Once we inspect the first column, we realize that it just stores the user IDs, so we remove it after storing that information in the row.names of our matrix. Now we have a proper user-package matrix that makes it trivial to compute similarity measures. For simplicity, we’ll use the correlation between columns as our measure of similarity. To compute that, we can simply call the cor function from R:
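The call itself is not included in this excerpt; a minimal version, assuming we want the pairwise correlations between the package columns, would be:

# Pairwise correlations between package columns: entry [i, j] measures
# how similarly packages i and j are installed across users.
similarities <- cor(user.package.matrix)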
If you are not familiar with RESTful APIs or have never heard of OAuth, don’t worry. For the purposes of this exercise we will not worry about those details. If you are interested in learning more about any of them, we suggest reading the documentation for the specific service you are attempting to access. In the case of Twitter, the best reference is the API FAQ: https://dev.twitter.com/docs/api-faq.
The idea with SGA is to use that single bit of information to collate your entire social graph across multiple services. If you’re curious about how long your trail of digital exhaust is in Google, type the following in your favorite web browser: https://socialgraph.googleapis.com/lookup?q=@EMAIL_ADDRESS&fme=1&pretty=1 If you replace the EMAIL_ADDRESS with a proper address, the API will return raw JSON to the browser window, which you can explore to see how much of your social graph Google is already storing! If you have a Twitter account registered to the email address you entered, you will likely also notice that your Twitter page’s URL shows up as one of the social graph services that resolves to the email address. Twitter is one of the many social networking sites the SGA crawls to store public social graph data. Therefore, we can use the SGA to query the social network of a specific user and then build out the network.
To work through the rest of the case study, we begin by loading the R packages we will need to generate the Twitter graphs. We will be using three R packages for building these graphs from the SGA. The RCurl package provides an interface to the libcurl library, which we will use for making HTTP requests to the SGA. Next, we will use RJSONIO to parse the JSON returned by the SGA into R lists. As it happens, both of these packages are developed by Duncan Temple Lang, who has developed many indispensable R packages (see http://www.omegahat.org/). Finally, we will use igraph to construct and store the network objects. The igraph package is a powerful R package for creating and manipulating graph objects, and its flexibility will be very valuable to us as we begin to work with our network data.
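As a rough sketch of how these pieces fit together (our illustration, not the book’s code; the email address is a placeholder, and the SGA itself has since been retired by Google), a lookup could be fetched and parsed like this:

library('RCurl')
library('RJSONIO')

# Build the Social Graph API lookup URL for a given identifier.
# 'user@example.com' is a placeholder; substitute a real address or URL.
sga.url <- paste('https://socialgraph.googleapis.com/lookup?q=@',
                 'user@example.com', '&fme=1&pretty=1', sep = '')

# Fetch the raw JSON over HTTP and parse it into nested R lists.
sga.json <- getURL(sga.url)
sga.data <- fromJSON(sga.json)

# Inspect the top-level structure of the response.
str(sga.data, max.level = 1)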