Binge studying “Statistical Learning with R” from Stanford

[Figure: matplot() output showing y (black), x1 (red), and x2 (blue) plotted against the row index]

After finishing “House of Cards”, I am now binge-studying the Stanford online course “Statistical Learning”. It is the best class on the subject I have ever taken: comprehensible explanations, no rush, well-placed jokes, no excess material, a free downloadable book that closely follows the class, and lovely R sessions run by Trevor Hastie himself.

I have learned new tricks in R, for example the handy matplot() function, whose output is shown above. On a single plot, it draws the values of every variable (column) against the row index (x-axis) of the data frame. The plot reveals strong autocorrelation between consecutive rows: each data point is in effect repeated about 10-20 times, so the effective sample size is much smaller than the number of rows. Ignoring this would seriously underestimate the standard error (s.e.) of the regression coefficients when modeling y (black on the plot) as a linear function of x1 (red) and x2 (blue): y = b0 + b1*x1 + b2*x2. For b1, the standard bootstrap gave an s.e. of 0.028, while the block bootstrap (with blocks of 100 rows) gave 0.196.
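For anyone who wants to try the comparison, here is a minimal sketch in base R. It does not use the course data: the data frame `dat` is simulated, and the run length, the number of bootstrap replicates `B`, and the helper names `hold`, `b1`, and `block_rows` are my own assumptions for illustration. It draws the matplot() picture and then estimates the s.e. of b1 with a standard bootstrap and with a block bootstrap.

```r
set.seed(1)

# Hypothetical data (NOT the course data): 100 independent draws, each held
# for 10 consecutive rows, plus a little noise, giving 1000 correlated rows.
n <- 1000
run <- 10
hold <- function(z) rep(z, each = run)
x1 <- hold(rnorm(n / run)) + rnorm(n, sd = 0.05)
x2 <- hold(rnorm(n / run)) + rnorm(n, sd = 0.05)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + hold(rnorm(n / run)) + rnorm(n, sd = 0.05)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# One line per column, plotted against the row index.
matplot(dat, type = "l", lty = 1, col = c("black", "red", "blue"),
        xlab = "row", ylab = "value")

# Coefficient of x1 from a least-squares fit on the given rows.
b1 <- function(rows) coef(lm(y ~ x1 + x2, data = dat[rows, ]))[["x1"]]

# Standard bootstrap: resample individual rows with replacement.
B <- 200
se_standard <- sd(replicate(B, b1(sample(n, n, replace = TRUE))))

# Block bootstrap: resample whole blocks of consecutive rows instead,
# so the dependence inside each block is preserved.
block_size <- 100
starts <- seq(1, n, by = block_size)
block_rows <- function() {
  picked <- sample(starts, length(starts), replace = TRUE)
  unlist(lapply(picked, function(s) s:(s + block_size - 1)))
}
se_block <- sd(replicate(B, b1(block_rows())))

c(standard = se_standard, block = se_block)
```

On data built this way the block estimate should come out noticeably larger than the standard one, which is the same pattern as the 0.028 versus 0.196 above. The exact numbers depend on the simulated data and will not match those of the course.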

While I was satisfying my curiosity and was busy launching a new project at work (a new analytical tool for IEDB.org, due out in summer), a lot has happened in the world of Big Data that I am eager to understand and report on.
