# 10 million steps with R

## Random walk data generation and plotting in R^{3} (9.01.13)

I have a particular fascination with code that is computationally intense, yet very dense in terms of lines. A sort of David / Goliath kinda thing, where a few choice keystrokes can take hours to process. Additionally, when the computation can result in aesthetically pleasing output, like a 3D plot of a random walk using R's `rgl` package, it's hard to resist tapping away on the keyboard for a bit.

One simple and easy example of this sort of David-script is random walk data generation. With a bunch of free time this Labor Day weekend, I decided to take R for a long walk; the code for a single step is sketched below.

Since R is all about concise vector manipulation, I decided to implement the walk through iterative vector addition. Given a previous position vector in R^{3}, a random "move" vector (`m`), whose components are each drawn at random from the set (-1, 0, 1), is created and added. To turn the steps into a walk, I decided on a number of iterations and built up the dataset.
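
A minimal sketch of such a single-step function, here called `step()` to match its use further down, might look like this:

```r
# One step: draw a random "move" vector m, with each component
# sampled from {-1, 0, 1}, and add it to the previous position
step <- function(prev) {
  m <- sample(c(-1, 0, 1), size = 3, replace = TRUE)
  prev + m
}
```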

It was at this point that I got a real-world example of the overhead differences between raw, fixed-dimension matrices and unbounded data frames. My initial thought was to build up a data frame object from scratch, along the following lines:
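
A minimal reconstruction of that first, data-frame-based `walk()` (assuming the `step()` helper above) might be:

```r
# Naive walk: grow a data frame by one row per step (very slow for large n)
walk <- function(n) {
  path <- data.frame(x = 0, y = 0, z = 0)  # start at the origin
  for (s in 2:n) {
    path[s, ] <- step(path[s - 1, ])
  }
  path
}
```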

After running `walk()`, as defined above, for 1,000,000 steps, my computer quickly became unresponsive. At first I thought that this was simply due to the large number of iterations, which caused each random sample call to add up. However, having just poked around in Norman Matloff's *The Art of R Programming*, I remembered the discussion of R data object types.

My original implementation turned out to have two unnecessarily taxing components. Firstly, the fact that data frames are an extension of the list class means that they inherit a bunch of unnecessary list attributes, such as support for arbitrary component types. Secondly, building the data frame up from a small original size incurs huge cumulative costs, since with each call of

```r
path[s, ] <- step(path[s - 1, ])
```

R must copy the entire data frame, create a new one that is one row larger, and assign the new step.

To avoid both of these issues, I decided to define a correctly sized matrix object up front and assign each step to the relevant index. The definition of `walk()` then became the following:
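
Again as a sketch, the preallocated version might look like this:

```r
# Preallocated walk: fill a fixed-size numeric matrix instead of
# growing a data frame, writing each step into its own row
walk <- function(n) {
  path <- matrix(0, nrow = n, ncol = 3)  # start at the origin
  for (s in 2:n) {
    path[s, ] <- step(path[s - 1, ])
  }
  path
}
```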

Now that the appropriate amount of memory for the storage object has been allocated up front, there is no need to create a new, expanded matrix at each step. Using the `system.time()` command, we can see the immense efficiency increase:

```r
> steps <- 100000

# Using an unbounded data frame
> system.time(path <- walk(steps))
    user   system  elapsed
2257.740   85.384 2337.593

# Using a defined matrix
> system.time(path <- walk(steps))
  user  system elapsed
 1.050   0.005   1.059
```

Having determined a speedier method, I walked R for 10 million steps and plotted the matrix using `plot3d()` from `rgl`. Here is the path, color-coded chronologically and starting at the origin (click for a gif, but beware - 17MB):
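
For reference, a minimal version of that plot (using the preallocated `walk()` above; the `rainbow()` palette is a stand-in for whatever chronological coloring was actually used) might be:

```r
library(rgl)

path <- walk(1e7)  # 10 million steps

# Color-code chronologically: map each step's index into a rainbow palette
cols <- rainbow(256)[cut(seq_len(nrow(path)), breaks = 256, labels = FALSE)]
plot3d(path, col = cols, size = 1)
```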

As this little project was all about making R work really hard, I decided to test the limits of my computer, as well as R's ability to deal with really big data. Knowing that R's maximum vector length is 2^{31} - 1 elements, I calculated the longest possible walk R could take. Since a matrix is just one long vector, with the columns concatenated one after another, the longest 3-dimensional walk possible is (2^{31} - 1)/3 = 715,827,882 steps (rounding down).
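
Since `.Machine$integer.max` is exactly 2^{31} - 1, the arithmetic can be checked directly at the console:

```r
> .Machine$integer.max         # 2^31 - 1
[1] 2147483647
> .Machine$integer.max %/% 3   # longest possible 3D walk, in steps
[1] 715827882
```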

After running the long(est) walk overnight (with around 3 hours of processing time), R saved a max-length walk dataset. However, my computer understandably refused to plot it, so I'll leave it here for anyone with interest and serious computational resources.

Kudos to Inside-R for their cool syntax highlighting tool.
