Fixed long standing bug in mtry when xvar.wt is specified as a real vector (not uniform, or integers). In such situations, mtry was not correctly being implemented. Generalized mtry to ytry which is now fully functional in the multivariate mixed case. Changed suggests from akima to interp. Updated homepage URL's. New home page with documentation and vignettes at https://www.randomforestsrc.org/.
Added a new way to deal with missing data in prediction where missing data is assigned randomly by daughter size. Use option na.action=“na.random”. Added enhanced capability to deal with test data with factors. Previously test data with factor levels not matching training data would throw an error. This then required users to apply hot encoding and other inferior workarounds. These workaround are no longer needed. Continued improvements to CPU times for training and prediction. Added a save.memory feature for big survival data to address slow training times that users were experiencing with large competing risk and right censored survival data. Added the ability for users to supply their own sigma matrix for the mahalanobis splitting rule. Added new features to subsample (used for VIMP confidence intervals). Users can now specify variables for joint VIMP confidence intervals and selectively plot confidence intervals using variables of their choosing. VIMP random now also uses daughter sizes as the secondary threshold after considering the primary vimp.threshold.
Miner changes to donttest examples that caused issues. Continued improvements to overall speed of the package.
Miner changes to donttest examples that caused issues. Continued improvements to overall speed of the package.
Minor changes to Rd files. Re-introduction of competing risk generalized logrank split rule (3.1.1 in the CR paper). This had been accidentally removed in CRAN build 2_9_0. Fix to DESCRIPTION URL per Prof. Ripley’s request.
Added the ability to get confidence intervals for out-of-bag error rates using the subsample function. Significant improvements in CPU times for restoration of models and prediction with new data. Fixed the ability to select between in situ ensemble calculation and post-forest ensemble calculation. This can be useful when doing these calculations in parallel causes significant contention for memory locations that are in critical OpenMP regions.
Fix to R_init per Prof. Ripley. Improved output from analysis, including more reported performance metrics. New get.brier.survival function for extracting Brier score in survival settings. Importance, when true, now defaults to anti. Fix to get.tree not playing well with seed for repeatability of importance.
Fix to missing data segfault related to GitHub issue 112. Mahalanobis splitting in multivariate regression settings when all outcomes are real-valued. This can have added benefit when outcomes are correlated in contrast to the default composite (independent) splitting rule. Improvements to the speed and output for missForest imputation. New home page with documentation and vignettes at https://luminwin.github.io/randomForestSRC
Fix to omp.h declaration before R headers per Prof. Ripley, in anticipation of clang 13.0.0.
Fixes to variable importance OpenMP threading inefficiencies related to GitHub issues 95,96,103. Variable importance is now repeatable via the seed argument, addressing GitHub issues 67, 69. Fix to new distance measure not being normalized, addressing GitHub issue 100 and fixed an issue related to factors in test data where levels were not matching traning data. Improvements to user-interface for partial.rfsrc(). Some improvements in speed and internal logic in the absence of missing data.
Implementation of rfsrc.anonymous() allowing one to save a model without the presence of training data, and subsequently predicting without it, as well. Case specific error and VIMP for classification and regression.
New additions to the package include sidClustering of unsupervised data using staggered interaction data (SID) (Mantero and Ishwaran, 2020). Implements the artificial two-class approach of Breiman (2003). Improved hold out VIMP calculated from the error rate of mini ensembles of trees (blocks of trees) grown with and without a variable. Applies to all families. Finally, visualize trees on your Safari or Google Chrome browser using the new get.tree function for extracting trees from a forest. Many thanks to @dbarg1 on GitHub for the initial prototype of this last function.
Bug fixes: Serious impute() issue introduced in the last build (Bug 59 on GitHub). Fix to compile time error on systems that do not support OpenMP. Here, the OpenMP code that was inadvertently exposed is now protected when OpenMP support is not detected. Also updated impute.rfsc.Rd with a salient example.
Bug fixes: Fix to Windows 10 OpenMP stack allocation error encountered when survival and large n are encountered (Bug 50 on GitHub). Fix to get.quantile.stat() (Bug 51 on GitHub).
Added new splitrules “auc” and “entropy” for classification. Improvements to quantile regression. Updates to holdout vimp. Some function names were changed as a general move towards name uniformity in the package. Updates to function imbalanced used for class imbalanced data.
Ensembles in regression now support Greenwald-Khanna approximate quantile queries via rfsrc(), predict.rfsrc() and the new wrapper quantileReg.rfsrc(). Related to this, a new split rule “quantile.regr” has been added. Specifications will be added to the GitHub page, shortly. Another new wrapper, imbalanced.rfsrc(), implements various solutions to the two-class imbalanced problem, including the newly proposed quantile-classifier approach of O’Brien and Ishwaran (2017). Also includes Breiman’s balanced random forests undersampling of the majority class. Performance is assesssed using the G-mean, but misclassification error can be requested. Also, the new parameter get.tree in predict.rfsrc() allows users to extract the ensembles for a single tree or subset of trees over the forest. Finally, the default nodesize for survival and competing risk has been changed to 15.
Three primary additions:
(1) Subsample Forests for VIMP Confidence Intervals: Uses subsampling to calculate confidence intervals and standard errors for VIMP (variable importance). Applies to all families.
(2) Tune Random Forest for the optimal mtry and nodesize parameters: Finds the optimal mtry and nodesize tuning parameter for a random forest using out-of-bag (OOB) error. Applies to all families.
(3) Fast approximate random forests: Uses subsampling with forest options set to encourage computational speed. Applies to all families.
Fix to predict() call not returning ensembles when y-vars not present. Sorry about that.
Serious improvements to OpenMP performance after addressing issues related to the blocking of threads during a number of calculations involving ensembles, importance, forest weights, and partial plots mostly in big-n data sets.
Addition of configure file to source package allowing more accessible OpenMP parallel execution on systems that support it.
Introduction of conditional quantiles for a regression forest. Applies to both univariate and multivariate forests. Can be used for both training and testing purposes and returns the conditional quantiles for the target outcomes, and conditional density, which can be used to calculate conditional moments, such as the mean and standard deviation.
Bug fixes to partial.rfsrc() on R and C side. Allowance of second order variable specification in this analysis. Conditional importance values in classification adjusted by a factor of exp(1). Bug fix to unsafe threading in LB-VIMP calculations.
Fix to custom splitting family verification and registration harness. Introduction of bootstrap=“by.user”. Fix to incorrect mapping of user specified time points to event times when ntime option is used. It is recommended that the use of this option be avoided. The effect of discretizing the time values compromises the ensembles. For best results, all event times must be used. This was and is the default behaviour. Fix to incorrect passing of time option parameter in plot.variable(). Introduction of partial.rfsrc() to allow direct access to partial ensembles. Added support for long vectors on native code side.
Fix to levels.count when ntree=1. Some n-based loop optimization. Significant improvements in CPU times for restore-predict modes, and plot.variable(). Consequent changes to forest object, and incompatibility with objects created with previous versions of the package. Default is now importance=FALSE in predict.rfsrc().
Bug fix to coerce.factor option via get.xvar.nlevels() and get.yvar.nlevels() resolved by sending in max instead of number of levels. Bug fix to VIMP that potentially occurs in OpenMP mode causing non-zero LB-VIMP. Methodological fix to in-node imputation. and removal of na.random. Consequent incompatibility with objects created with previous versions of the package. Fix to rfsrcSyn() bug pertaining to colnames of test set synthetic features. Introduction of sampsize, samptype, and case.wt to address imbalanced data sets. Continued improvements to CPU and memory performance in big-n, big-p, and big-ntree scenarios.
Change to GROW mode default importance=none and to allow importance=TRUE. Addition of user trace with time estimates. CPU usage - code optimization of ensemble calculations. CPU usage - code optimization of imputation. Fix to R-side parsing of ensembles in multivariate classification. Change to treat ordered factors under classification setting instead of regression.
Fix to factor coercion option in responses. Fix to R-side processing of err.rate and importance in multivariate families with classification. Update of OPENMP protocols per CRAN recommendation. Expansion of fast.restore option to omit performance on every tree, and update to associated Rd file.
Fix to bug in dimensioning of predict object in survival families. Added documentation for custom splitting.
Multivariate capabilities added. Custom splitting harness modifications. Redefinition of nodesize to allow terminal nodes less than said size, subject to the initial test for 2 x nodesize before the split, maximum depth, and purity. Various bug fixes.
Fix to donttest example in rfsrc.Rd, and other adjustments per CRAN packaging protocols.
Bug fix to duplicating missingness protocol when restoring a forest. Added fast.restore option to grow call. Change to pass through xvar.wt as entered by user. RAM profile reduction in vimp(). Added versioning checks of forest object, thanks to suggestions by John Ehrlinger. Bug fix to allow logical responses, treated as reals. User trace functionality restored.
Bug fix to daughter assignment in classification. Significant RAM optimization in all modes.
Addition of new function stat.split() for extracting information from tree node splitting-statistics. Added more functionality to rfsrcSyn() for fitting synthetic random forests.
Addition of rfsrcSyn() function to grow a synthetic random forest (RF) using RF machines as synthetic features. Applies only to regression and classification settings. Used for prediction only.
Fix to non-standard GCC errors and warnings. Fix to bug in split rules related to omission of missing individuals in the split statistic. Minor R-side fixes.
Significant improvements to CPU and RAM usage profiles in serial and OpenMP modes of execution. Proximity options allow inbag, OOB and all. VIMP implements sub-setting and conditional variable importance. NA options allow the split statistic to be based on non-missing values only. In addition it allows random assignment of missing values.
Modification of terminal node imputation protocol. We now assign all individuals the same value rather than sampling from the distribution. Implementation of split.null option. Implementation of unsupervised splitting for missing data in impute.rfsrc(). Modification of nimpute
1 protocols. In-bag, OOB, and all now depend on the mode. Reduction in impute memory footprint. Modification of proximity option to allow in-bag, OOB, and all. Fixed bug in predict involving manual formula calls. Fixed bug in find.interaction involving specifying covariate names. Changes relating to Undefined Behaviour Sanitizer.
Initial re-engineering of memory footprint for imputation. Performance enhancements to split rules.
Competing risks now implements two distinct splitting rules for identifying short term risks affecting the cause-specific hazard or long term predictions affecting the cumulative incidence function. The plot.variable function now returns, and can reuse, a plot.variable data structure object for user convenience. Thanks to John Ehrlinger for this improvement. Other minor bug fixes, and enhancements.
Fix to [S] missingness check when all status are non-censored. Fix to [S] summary imputation of time. Fix to variables used all.trees output. Fix to manual formula interface. Removed big.data option. Added ntime option for survival families.
Replaced ‘suggests multicore’ with ‘depends parallel’. Followed protocol in parallel package for controlling number of cores, via options(), and environment variables. Reduced [S] memory footprint by prematurely de-allocating terminal node information.