One of the more important pieces of software that powers R is its BLAS, which stands for Basic Linear Algebra Subprograms. This is the suite of programs which, as its name implies, performs basic linear algebra routines such as vector copying, scaling and dot products; linear combinations; and matrix multiplication. It is the engine on which LINPACK, EISPACK, and the now industry-standard LAPACK and its variants are built. As a lower-level software suite, the BLAS can be fine-tuned to different computer architectures, and a tuned BLAS can create a dramatic speedup in many fundamental R routines. There are a number of projects developing tuned BLAS suites such as the open source ATLAS, GotoBLAS2, and OpenBLAS, and the closed source Intel MKL and AMD ACML libraries.
In R for Windows, the BLAS is stored in the shared library file Rblas.dll, and a properly compiled tuned version can simply be “dropped in” to the bin subdirectory and used immediately. For people working in Linux, there is significant support for specially tuned BLAS (and LAPACK) libraries, described in great detail in the R Installation and Administration manual; the support for Windows is somewhat less robust, to be charitable. For many years, I was constrained to working in a 32-bit Windows environment, and took advantage of the 32-bit Windows-compiled versions that existed in R’s repository of tuned BLAS files. However, at that time, the most recent version was for a Pentium 4, so over the next few years I struggled through compiling ATLAS-based Rblas files for 32-bit Windows myself. I did this for the Core 2 Duo (C2D) and the quad-core SandyBridge (Core i7, called C2i7 in the repository). The speedup was dramatic (see this blog, which uses the Core 2 BLAS I compiled, for an example). However, I found ATLAS difficult to compile and subject to any number of issues, and not really being a programmer, I often ran into problems I could not solve. Moreover, I was never able to successfully compile a 64-bit BLAS which passed the comprehensive make check-all test suite.
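As an aside, a quick way to confirm a swapped-in Rblas.dll is actually being used is to time a large, BLAS-bound operation before and after the swap, restarting R in between. This is just an illustrative snippet of mine, not any official procedure:

```r
## Run once under the reference Rblas.dll and once after dropping in the
## tuned one (restart R between runs). Matrix multiplication goes straight
## to the BLAS, so the timing should fall sharply with a tuned library.
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)
```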
Recently, I made the complete switch to 64-bit Windows at both work and home, so finding a 64-bit Rblas became much more important. There are some pre-compiled 64-bit BLAS files for R, graciously compiled by Dr. Ei-ji Nakama, which can be found here. Using these binaries, I found a dramatic increase in speed over the reference BLAS. However, the most recent processor-specific BLAS in that repository is for the Nehalem architecture, and it cannot take advantage of the SSE4 and AVX instructions built into the newer SandyBridge and IvyBridge cores. Much to my frustration, I had many failures trying to compile a 64-bit ATLAS-based BLAS which, when used in the R compilation, would pass all the checks. With each attempt taking between six and twelve hours, I eventually gave up trying to use ATLAS and resigned myself to living with the GotoBLAS-based files, which honestly was not much of a resignation.
A bit more recently, I came across the OpenBLAS project, which claims to have reached near-MKL speeds on the Sandy/IvyBridge architectures thanks to some elegant hand-coded optimization, and I was hoping to take advantage of this in R. Unfortunately, there are no pre-compiled binaries to make use of, so I had to attempt the compilation on my own. What made this more difficult is that, officially, R for Windows does not support using GotoBLAS- or ACML-based BLAS routines, and even ATLAS has only limited support (see this thread, specifically Dr. Ripley’s response). This called for a lot of trial and error, initially resulting in dismal failure.
Serendipitously, around the time of the R 3.0.1 release, there was an OpenBLAS update as well. Trying again, I was finally successful in compiling a single-threaded, OpenBLAS v0.2.8-based BLAS for the SandyBridge architecture on 64-bit Windows which, when used in the R compilation, created an Rblas that passed make check-all! For those interested in compiling their own: once the OpenBLAS is compiled, it can be treated as an ATLAS BLAS in R’s Makefile.local, with the only additional change being pointing to the compiled .a file in \src\extra\blas\Makefile.win.
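For orientation only, here is a minimal sketch of what that might look like, assuming the ATLAS hooks documented in the R Installation and Administration manual; the path below is illustrative, not necessarily the one I used:

```make
## Makefile.local: treat the compiled OpenBLAS as though it were ATLAS.
## USE_ATLAS and ATLAS_PATH are the documented ATLAS settings; the path
## here is purely illustrative.
USE_ATLAS = YES
ATLAS_PATH = /path/to/OpenBLAS/lib

## Then, in \src\extra\blas\Makefile.win, point the ATLAS link line at
## the compiled libopenblas.a instead of the ATLAS libraries.
```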
Once I was successful with the SandyBridge-specific file, I compiled an Rblas that was not SandyBridge-dependent but could be used on any i386 machine. I plan on submitting both to Dr. Uwe Ligges at CRAN, and hope that, like the other BLAS’s I submitted, they will be posted eventually.
To demonstrate the increase in speed that using a tuned BLAS can provide in R, I ran a few tests. First, I created two 1000×1000 matrices populated with random normal variates. Specifically:
```r
A <- matrix(rnorm(1e6, 1000, 100), 1000, 1000)
B <- matrix(rnorm(1e6, 1000, 100), 1000, 1000)
write.csv(A, file="C:/R/A.csv", row.names=FALSE)
write.csv(B, file="C:/R/B.csv", row.names=FALSE)
```
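Note that no seed was set above, so those exact draws are not reproducible; something like the following hypothetical tweak would make them so:

```r
set.seed(2013)  # any fixed seed; this particular value is arbitrary
A <- matrix(rnorm(1e6, 1000, 100), 1000, 1000)  # now reproducible
```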
I can provide the specific matrices for anyone interested. I then compiled a basically vanilla R 3.0.2 for 64-bit Windows, using only -mtune=corei7-avx -O3 for optimizations; since -mtune (unlike -march) only tunes instruction scheduling without emitting architecture-specific instructions, the code should still run on any i386. I followed the compilation steps for a full installation, so it included base, bitmapdll, cairodevices, recommended, vignettes, manuals, and rinstaller for completeness. Using a Dell M4700 (i7-3740QM 2.7GHz, Windows 7 Professional 64-bit, 8GB RAM), I tested the following BLAS’s (a sketch of where such flags live in the build configuration follows the list):
- Reference
- GotoBLAS Dynamic
- GotoBLAS Nehalem
- OpenBLAS Dynamic
- OpenBLAS SandyBridge
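For those compiling R themselves, this is roughly where the optimization flags above would go. A hedged sketch only: EOPTS is the hook the Windows build rules document for extra optimization flags, though the exact file name may vary by R version:

```make
## Sketch of the local Windows build rules (assumption: EOPTS as
## documented for the R Windows build); these are the flags used here.
EOPTS = -mtune=corei7-avx -O3
```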
I updated all packages and installed the microbenchmark package (all from source). To test the effects of the different BLAS’s, I copied the appropriate BLAS into place as Rblas.dll each time. The test ran multiple repetitions of crossprod, solve, qr, svd, eigen, and lu (the last needs the Matrix package, but that is a recommended package). The actual test code is:
```r
library(microbenchmark)
library(Matrix)
A <- as.matrix(read.csv(file="C:/R/BLAS/A.csv", colClasses='numeric'))
B <- as.matrix(read.csv(file="C:/R/BLAS/B.csv", colClasses='numeric'))
colnames(A) <- colnames(B) <- NULL
microbenchmark(crossprod(A, B), solve(A), qr(A, LAPACK=TRUE), svd(A),
               eigen(A), lu(A), times=100L, unit='ms')
```
The results are illuminating:
```
Reference BLAS:
Unit: milliseconds
                 expr       min        lq    median        uq       max neval
      crossprod(A, B) 1173.4761 1184.7500 1190.9430 1198.4409 1291.9620   100
             solve(A) 1000.1138 1011.5769 1018.7613 1027.0789 1172.6551   100
 qr(A, LAPACK = TRUE)  625.3756  633.2889  638.0123  644.8645  713.7831   100
               svd(A) 3045.0855 3074.7472 3093.8418 3138.0180 3621.9590   100
             eigen(A) 5491.4986 5563.1129 5586.9327 5632.5695 5836.6824   100
                lu(A)  170.8486  172.5362  175.5632  178.7300  247.1116   100
```
```
GotoBLAS_Dynamic:
Unit: milliseconds
                 expr         min          lq      median         uq        max neval
      crossprod(A, B)    63.68471    83.16222    92.33265   104.5118   184.4941    20
             solve(A)   207.38801   229.81548   250.53656   276.6736   391.5336    20
 qr(A, LAPACK = TRUE)   171.93558   175.22074   178.57133   186.3217   996.7584    20
               svd(A)   892.90194   920.55781   969.79708  1069.5740 12185.9414    20
             eigen(A) 14870.65481 14943.13804 15113.49240 15513.8285 29100.0788    20
                lu(A)    70.65831    76.07561    83.52785   146.1552   152.5949    20
```
```
GotoBLAS_Nehalem:
Unit: milliseconds
                 expr         min          lq      median          uq         max neval
      crossprod(A, B)    52.86457    64.42491    73.75266    77.89007    82.05677    20
             solve(A)   203.96266   209.78135   220.36275   229.09164   298.06924    20
 qr(A, LAPACK = TRUE)   171.22182   174.58935   175.90882   181.63144   247.92046    20
               svd(A)   895.20834   904.17256   950.78584   970.47127  1057.90653    20
             eigen(A) 15429.49102 15470.62708 15552.95074 15627.39700 15759.67103    20
                lu(A)    67.13746    72.27017    74.26725    77.52482    85.24281    20
```
```
OpenBLAS_Dynamic:
Unit: milliseconds
                 expr        min         lq     median         uq       max neval
      crossprod(A, B)  102.54157  102.99851  104.42070  105.95056  177.8572   100
             solve(A)  180.64641  182.99257  186.41507  188.91836  261.6805   100
 qr(A, LAPACK = TRUE)  188.08381  194.37999  197.10319  199.66507  281.2404   100
               svd(A) 1027.00070 1040.84424 1055.35816 1108.67206 1193.3480   100
             eigen(A) 2114.96652 2190.11265 2208.06810 2250.41861 2476.9335   100
                lu(A)   57.47394   58.53336   59.80013   62.60323  139.1076   100
```
```
OpenBLAS_SandyBridge:
Unit: milliseconds
                 expr        min         lq     median        uq       max neval
      crossprod(A, B)  102.41031  102.71278  104.14695  105.4141  179.8048   100
             solve(A)  180.52238  181.97006  184.94760  187.6423  260.5379   100
 qr(A, LAPACK = TRUE)  186.89180  191.01798  193.69381  199.2825  286.8123   100
               svd(A) 1024.47896 1034.60627 1043.91214 1099.8361 1153.3881   100
             eigen(A) 2108.22006 2172.05087 2192.65913 2216.7981 2375.9831   100
                lu(A)   57.55384   59.10672   61.08402   62.9213  133.6456   100
```
All the tuned BLAS results are much better than the reference, with the exception of the eigendecomposition under the GotoBLAS-based Rblas files. I do not know why that is the case, but the difference was so severe that I ran those benchmarks only 20 times in order to get results in a reasonable time. For the other routines, sometimes the OpenBLAS-based version is quicker, other times not. I personally will use it exclusively, as the minor lag in comparison with some of the GotoBLAS timings is more than compensated for by the eigenvalue speedup, and overall it is still anywhere between roughly 2.5 and 11 times faster than the reference BLAS. Regardless, it is clear that using a tuned BLAS can speed up calculations considerably.
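As a quick check on that range, one can divide the reference medians by the OpenBLAS SandyBridge medians transcribed from the tables above:

```r
## Median timings (ms) transcribed from the benchmark tables above.
ref  <- c(crossprod = 1190.9430, solve = 1018.7613, qr = 638.0123,
          svd = 3093.8418, eigen = 5586.9327, lu = 175.5632)
open <- c(crossprod = 104.14695, solve = 184.94760, qr = 193.69381,
          svd = 1043.91214, eigen = 2192.65913, lu = 61.08402)
round(ref / open, 1)  # speedup factors: roughly 2.5x (eigen) to 11.4x (crossprod)
```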
Using a tuned BLAS is not the only way to increase R’s speed. For anyone compiling R themselves, passing the proper flags during compilation can squeeze a bit more speed out of it, and activating byte-compilation can do a bit more. In my next post, I hope to show similar timing numbers, but this time using an R compiled for my specific machine (IvyBridge) in concert with the tuned BLAS.
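As a taste of the byte-compilation point, here is a small, self-contained illustration using the base compiler package; the toy function is mine, purely for demonstration:

```r
library(compiler)  # part of base R since 2.13.0

## A deliberately loop-heavy toy function; interpreted loops like this
## are where the byte-code compiler tends to help the most.
slowSumSq <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]^2
  s
}
fastSumSq <- cmpfun(slowSumSq)  # byte-compiled version of the same function

x <- rnorm(1e6)
system.time(slowSumSq(x))
system.time(fastSumSq(x))       # typically noticeably faster
```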
I’m looking forward to something like this becoming generally available. I usually use R on a very similar system (R 3.0, Dell, Intel i7 CPU, 16GB RAM, Windows 64-bit), though I’m not sure how much it would really help me because most of my wait time (which can be hours or days) is spent in the gbm, earth, and nnet packages, which I think do their calculations in their own C code.
I’m not that familiar with those packages, Andrew, but it stands to reason. The nnet source package has C source code containing functions like “sigmoid” and “Build_Net”, and gbm has a slew of C++ files, so it is likely that a faster BLAS will not help much, although it cannot hurt to try. Have you considered porting any specific routines you have built to compiled code, for example using the Rcpp package?
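To make the suggestion concrete, here is a minimal sketch, assuming Rcpp and Rtools are installed; the function is just a toy, not one of the routines in those packages:

```r
library(Rcpp)

## Compile a small C++ function from within R; tight numeric loops like
## this are where moving from R to C++ usually pays off.
cppFunction('
double sumSq(NumericVector x) {
    double total = 0.0;
    for (int i = 0; i < x.size(); ++i) {
        total += x[i] * x[i];
    }
    return total;
}')

sumSq(rnorm(10))  # callable like any ordinary R function
```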
What BLAS should (or could) I use for an Intel Xeon E5620 Westmere 2.4GHz (4-core)?
Thanks.
According to Wikipedia, the Westmere is the architecture between the Nehalem and the SandyBridge, and it has the AES instruction set but not the AVX instruction set, so I would suggest the GotoBLAS compiled for Nehalem for now. If (hopefully when) the dynamic OpenBLAS I submitted to CRAN gets approved, that would be another option as well.