R: Using foreach() and sample() in a randomForest() call

Posted: 2021-01-22

I have a large dataframe (~700 n x 36000 p) and plan to conduct randomForest analyses in R. Because of the runtime burden of sending the full frame to randomForest (even with parallel computing and 512 GB RAM), I would like to send different random subsamples of the dataframe (~5% of p) to randomForest in many independent runs (Nruns).

For smaller dataframes, I have created a foreach loop that sends the entire dataframe to randomForest and returns a matrix of importance results of dim(p, Nruns), plus 3 additional rows containing extra information generated in each run. However, I am having trouble constructing the foreach() component of the script so that it sends a different subsample of the dataframe to randomForest for each run. The subsampling consists of two steps: first create a dataset balanced on outcome class by sampling rows (this part works), then select a subset of columns.

The desired result would still be a dataframe of dim(p+3, Nruns), but each column would contain results only for the variables that were randomly selected in the run represented by that column (i.e., there would be missing values for the variables not selected in that run).

When I submit the code below (using the fake data created below), I get the following error: "error calling combine function: "

Note, as indicated in the code, that if I exclude the step where random columns are selected but retain the step where the balancing is done, I do not get an error and the output is as expected (dim(p+3, Nruns), with non-zero values in all cells). So the problem is in the section of code where the column sampling is done.

I would like to know if anyone can suggest a remedy to the code below that will do a new random subsampling of columns (and rows) for each of 1:Nruns.
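As a hedged illustration (not the poster's script), here is a minimal sketch of one way the two-step subsampling could be arranged so that every run returns a vector of the same length, p + 3, with NA for variables that were not selected. Everything specific in it is an assumption made for illustration: the small fake data, the outcome column name `outcome`, the 25% column fraction, the choice of MeanDecreaseAccuracy as the importance measure, and the three extra rows `n_rows`, `n_cols`, and `oob_error`. It uses `%do%` so that it runs without a parallel backend; with `%dopar%` a backend would need to be registered and `.packages = "randomForest"` passed to `foreach()`.

```r
library(foreach)
library(randomForest)

set.seed(1)

# Hypothetical fake data: 60 rows, 40 predictors, unbalanced two-class outcome
n <- 60
p <- 40
dat <- data.frame(matrix(rnorm(n * p), nrow = n))
names(dat) <- paste0("V", seq_len(p))
dat$outcome <- factor(rep(c("a", "b"), times = c(40, 20)))

Nruns <- 5
col_frac <- 0.25                                     # assumed fraction of columns per run
pred_names <- setdiff(names(dat), "outcome")
extra_names <- c("n_rows", "n_cols", "oob_error")    # hypothetical 3 extra rows

results <- foreach(i = seq_len(Nruns), .combine = cbind) %do% {
  # Step 1: balance on outcome class by sampling the same number of rows per class
  n_min <- min(table(dat$outcome))
  rows <- unlist(lapply(split(seq_len(nrow(dat)), dat$outcome),
                        function(idx) sample(idx, n_min)))

  # Step 2: draw a fresh random subset of predictor columns for this run
  cols <- sample(pred_names, ceiling(col_frac * length(pred_names)))
  sub <- dat[rows, c(cols, "outcome")]

  rf <- randomForest(outcome ~ ., data = sub, ntree = 100, importance = TRUE)
  imp <- importance(rf, type = 1)                    # one-column matrix, rownames = variables

  # Full-length template: NA everywhere, then fill in only the selected variables
  out <- setNames(rep(NA_real_, length(pred_names) + 3),
                  c(pred_names, extra_names))
  out[rownames(imp)] <- imp[, 1]
  out["n_rows"] <- nrow(sub)
  out["n_cols"] <- length(cols)
  out["oob_error"] <- rf$err.rate[rf$ntree, "OOB"]
  out
}

dim(results)   # (p + 3) rows by Nruns columns
```

Because each iteration hands `cbind` a named vector of identical length and order, the combine step never has to reconcile results of different shapes. Whether this addresses the specific error above depends on the original code, which may fail for a different reason.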
