After finishing “House of Cards” I am on a binge, studying the Stanford online course “Statistical Learning”. It is the best class on the subject I have ever taken: comprehensible explanations, no rush, appropriate jokes, no excessive material, a free downloadable book that closely follows the class material, and lovely R sessions run by Trevor Hastie himself.
I have learned new tricks in R. For example, the handy matplot() function, whose output is shown above. On a single plot, it shows the values of each variable (column) against the row index (x-axis) of the data frame. One can see strong autocorrelation between consecutive rows: each data point repeats about 10–20 times, so the effective sample size is much smaller than the number of rows. If y (shown in black on the plot) is modeled as a linear function of x1 (in red) and x2 (in blue), y = b0 + b1*x1 + b2*x2, this would lead to a serious underestimate of the standard error (s.e.) of the regression coefficients. Indeed, for b1 the standard bootstrap gave an s.e. of 0.028, while the block bootstrap (with blocks of 100 rows) gave 0.196.
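The original data set isn't reproduced here, so the following is a minimal R sketch under assumed conditions: simulated data where each point repeats in runs of 10 (mimicking the autocorrelation visible in the matplot() output), a matplot() call with the same color scheme, and hand-rolled standard and block bootstraps of the coefficient b1. The specific numbers will differ from the ones quoted above, but the block-bootstrap s.e. should come out noticeably larger.

```r
set.seed(1)
n <- 1000
# Simulated stand-in for the data: each value repeated 10 times in a row,
# producing the strong autocorrelation seen in the plot
x1 <- rep(rnorm(n / 10), each = 10)
x2 <- rep(rnorm(n / 10), each = 10)
y  <- 1 + 2 * x1 + 3 * x2 + rep(rnorm(n / 10), each = 10)
df <- data.frame(y, x1, x2)

# One plot, one line per column: y in black, x1 in red, x2 in blue
matplot(df, type = "l", lty = 1, col = c("black", "red", "blue"))

# Statistic of interest: the coefficient b1 of x1
b1 <- function(d) coef(lm(y ~ x1 + x2, data = d))["x1"]

# Standard bootstrap: resample rows independently,
# which ignores the autocorrelation
se_standard <- sd(replicate(1000, {
  idx <- sample(nrow(df), replace = TRUE)
  b1(df[idx, ])
}))

# Block bootstrap: resample whole blocks of 100 consecutive rows,
# preserving the dependence within each block
block  <- 100
starts <- seq(1, nrow(df), by = block)
se_block <- sd(replicate(1000, {
  idx <- unlist(lapply(sample(starts, replace = TRUE),
                       function(s) s:(s + block - 1)))
  b1(df[idx, ])
}))

c(standard = se_standard, block = se_block)
```

The same comparison can be done with the boot package (boot() for the standard version, tsboot() with sim = "fixed" for fixed-length blocks); the loops above just make the resampling scheme explicit.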
While I was satisfying my intellectual curiosity and busy launching a new project at work (implementation of a new analytical tool for IEDB.org, to be released in the summer), a lot has happened in the world of Big Data that I crave to comprehend and report on.