"R is not well suited for working with data larger than 10%-20% of a computer's RAM."
— The R Installation and Administration Manual
If the RAM is not enough, the data will be moved to disk. If the disk is not enough, the program will crash.
Also, since the disk is much slower than the RAM, execution time increases.
The usual workaround is to process such data in chunks:
1. Move a subset into RAM.
2. Process the subset.
3. Keep the result and discard the subset.
A big.matrix is aimed at dense matrices that take up roughly 20% or more of the size of RAM. The bigmemory package implements the big.matrix data type, which is used to create, store, access, and manipulate matrices stored on the disk. Data are kept on the disk and moved to RAM implicitly.
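As a running example, the iris data can be converted into a file-backed big.matrix (a minimal sketch; the file names are illustrative). data.matrix() turns the Species factor into the numeric codes 1-3, since a big.matrix holds a single numeric type:

library(bigmemory)
# create a file-backed big.matrix; "iris.bin" holds the data and
# "iris.desc" is the descriptor used to re-attach it later
big.iris <- as.big.matrix(data.matrix(iris),
                          backingfile    = "iris.bin",
                          descriptorfile = "iris.desc")
dim(big.iris)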
(Figure: a big.matrix compared with R's in-memory matrix.)

A big.matrix object:
- is stored on the disk
- can be shared with multiple R sessions
- has reference semantics: assigning a big.matrix object to a variable does not create a copy
- is copied explicitly with the deepcopy() function

Now that the big.matrix object is on the disk, we can use the information stored in the descriptor file to instantly make it available during an R session. This means that you don't have to re-import the data set, which takes more time for larger files. You can simply point the bigmemory package at the existing structures on the disk and begin accessing data without the wait.
If bigmemory has not been loaded yet, the attach fails:

Error in attach.big.matrix("iris.desc") :
  could not find function "attach.big.matrix"
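With the package loaded, the descriptor file created earlier re-attaches the data instantly:

library(bigmemory)
big.iris <- attach.big.matrix("iris.desc")  # no re-import of the raw data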
If you want to copy a big.matrix
object, then you need to use the deepcopy()
function. This can be useful, especially if you want to create smaller big.matrix
objects.
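A short illustration of both points, assuming the big.iris object from above:

alias <- big.iris                        # no copy: both names share the data
old <- alias[1, 1]
alias[1, 1] <- 99
big.iris[1, 1]                           # 99: visible through both names
alias[1, 1] <- old                       # restore the original value
small <- deepcopy(big.iris, cols = 1:3)  # an independent 3-column copy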
Individual elements are retrieved with ordinary indexing; big.iris[1, 1], for example, returns

[1] 5.1

Rows can be selected the same way. Below, five rows whose Sepal.Length equals 5 have been extracted:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,] 5 3.6 1.4 0.2 1
[2,] 5 3.4 1.5 0.2 1
[3,] 5 3.0 1.6 0.2 1
[4,] 5 3.2 1.2 0.2 1
[5,] 5 3.3 1.4 0.2 1
A plain which() would extract the corresponding column into RAM, do the logical comparison, and return a logical vector whose length equals the column length. mwhich() requires no memory overhead and returns only a vector of the matching row indices.
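The five rows shown above are consistent with a call like the following (a plausible reconstruction):

rows <- mwhich(big.iris, "Sepal.Length", 5, "eq")  # matching row indices
big.iris[head(rows, 5), ]                          # pull those rows into RAM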
A family of companion packages builds on big.matrix:
- biganalytics: summarizing
- bigtabulate: splitting and tabulating
- bigalgebra: linear algebra
- bigpca: PCA
- bigFastLM: linear models
- biglasso: penalized linear regression and logistic regression
- bigrf: random forests

A final advantage to using big.matrix
is that if you know how to use R’s matrices, then you know how to use a big.matrix.
You can subset columns and rows just as you would a regular matrix, using a numeric or character vector, and the object returned is an ordinary R matrix.
Likewise, assignments work the same as with R matrices; once made, they are written to disk and can be used in the current and future R sessions.
One thing to remember is that $
is not valid for getting a column of either a matrix
or a big.matrix.
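For example (a plausible reconstruction of the call behind the output below):

big.iris[1:3, ]           # returns an ordinary R matrix
# big.iris$Sepal.Length   # error: use big.iris[, "Sepal.Length"] instead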
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,] 5.1 3.5 1.4 0.2 1
[2,] 4.9 3.0 1.4 0.2 1
[3,] 4.7 3.2 1.3 0.2 1
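The summaries below are consistent with calls like these (a plausible reconstruction; colmean() and the big.matrix summary() method come from biganalytics):

library(biganalytics)
table(big.iris[, "Petal.Width"])  # frequency table of a single column
colmean(big.iris)                 # mean of every column
summary(big.iris)                 # min, max, mean and NA count per column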
0.1 0.2 0.3 0.4 0.5 0.6 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5
5 29 7 7 1 1 7 3 5 13 8 12 4 2 12 5 6 6 3 8 3 3
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.843333 3.057333 3.758000 1.199333 2.000000
min max mean NAs
Sepal.Length 4.300000 7.900000 5.843333 0.000000
Sepal.Width 2.000000 4.400000 3.057333 0.000000
Petal.Length 1.000000 6.900000 3.758000 0.000000
Petal.Width 0.100000 2.500000 1.199333 0.000000
Species 1.000000 3.000000 2.000000 0.000000
The bigtabulate
package provides optimized routines for creating tables and splitting the rows of big.matrix objects.
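A one-way table of Sepal.Width (a plausible reconstruction):

library(bigtabulate)
bigtable(big.iris, ccols = "Sepal.Width")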
  2 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9   3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9   4 4.1 4.2 4.4
  1   3   4   3   8   5   9  14  10  26  11  13   6  12   6   4   3   6   2   1   1   1   1
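The two-way table below cross-tabulates Sepal.Length by Petal.Width; its counts match the first 20 rows of iris, so a plausible reconstruction is:

table(big.iris[1:20, "Sepal.Length"], big.iris[1:20, "Petal.Width"])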
0.1 0.2 0.3 0.4
4.3 1 0 0 0
4.4 0 1 0 0
4.6 0 1 1 0
4.7 0 1 0 0
4.8 1 1 0 0
4.9 1 1 0 0
5 0 2 0 0
5.1 0 1 2 0
5.4 0 1 0 2
5.7 0 0 1 1
5.8 0 1 0 0
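bigsplit() groups row indices by the values of a column; splitting on Species yields three groups of 50 row indices (a plausible reconstruction of the call behind the str() output below):

spl <- bigsplit(big.iris, ccols = "Species")  # list of row-index vectors
str(spl)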
List of 3
$ 1: int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
$ 2: int [1:50] 51 52 53 54 55 56 57 58 59 60 ...
$ 3: int [1:50] 101 102 103 104 105 106 107 108 109 110 ...
spe.mean.and.median <- Map(function(rows) c(mean(big.iris[rows, 1]), median(big.iris[rows, 1])), spl)
str(spe.mean.and.median)
List of 3
$ 1: num [1:2] 5.01 5
$ 2: num [1:2] 5.94 5.9
$ 3: num [1:2] 6.59 6.5
spe.summary <- Reduce(rbind, spe.mean.and.median)
dimnames(spe.summary) <- list(levels(iris$Species), c("mean", "median"))
spe.summary
mean median
setosa 5.006 5.0
versicolor 5.936 5.9
virginica 6.588 6.5
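Model fitting works on a big.matrix too: biganalytics wraps the biglm package so a linear model can be fit without loading the data into RAM. The output below is consistent with (a plausible reconstruction):

library(biganalytics)
fit <- biglm.big.matrix(Sepal.Length ~ Sepal.Width + Petal.Length,
                        data = big.iris)
summary(fit)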
Large data regression model: biglm(formula = formula, data = data, ...)
Sample size = 150
Coef (95% CI) SE p
(Intercept) 2.2491 1.7532 2.7451 0.2480 0
Sepal.Width 0.5955 0.4569 0.7342 0.0693 0
Petal.Length 0.4719 0.4377 0.5062 0.0171 0
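For comparison, the same model fit in memory with lm() gives identical coefficients:

summary(lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris))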
Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
Residuals:
     Min       1Q   Median       3Q      Max
-0.96159 -0.23489  0.00077  0.21453  0.78557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.24914 0.24797 9.07 7.04e-16 ***
Sepal.Width 0.59552 0.06933 8.59 1.16e-14 ***
Petal.Length 0.47192 0.01712 27.57 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3333 on 147 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.838
F-statistic: 386.4 on 2 and 147 DF,  p-value: < 2.2e-16
library(oem)

# big.oem() fits penalized regressions on a big.matrix directly
x <- deepcopy(big.iris, cols = 1:3)  # three numeric predictors
y <- rnorm(150)                      # a synthetic response for illustration
fit <- big.oem(x = x, y = y,
               penalty = c("lasso"))
fit$beta

# oem() itself expects an ordinary matrix, so x[, ] pulls the data into
# RAM; with grouped predictors a group lasso can be fit as well (here
# each of the three predictors forms its own group)
fit2 <- oem(x = x[, ], y = y,
            penalty = c("lasso", "grp.lasso"),
            groups = 1:3)
You can use bigmemory when your data are large, dense, numeric matrices; for sparse matrices, look at the bigsparser package instead. The underlying data structures are compatible with low-level linear algebra libraries for fast model fitting (C/C++ libraries can use them directly).
If you have different column types, you could try the ff package, which is similar to bigmemory but includes a data.frame-like data structure.
Finally, a big.matrix object is a data structure designed for random access: reads and writes go straight to the requested elements on disk, without scanning the whole file.