Learns a predictive imputer from training data for later use on new data.
If the training data contain missing values, the function first
imputes them using impute. It then fits one saved full-sweep
learner per selected target on the completed training data and reuses
those learners later to update missing values in new data without
refitting on the test set.
If the training data are complete and target.mode = "all",
the initial training-data imputation step is skipped and the
full-sweep learners are fit directly from the complete training data.
impute.learn.rfsrc(formula, data,
ntree = 100, nodesize = 1, nsplit = 10,
nimpute = 2, fast = FALSE, blocks,
mf.q, max.iter = 10, eps = 0.01,
ytry = NULL, always.use = NULL, verbose = TRUE,
...,
full.sweep.options = list(ntree = 100, nsplit = 10),
target.mode = c("missing.only", "all"),
deployment.xvars = NULL,
anonymous = TRUE,
learner.prefix = "impute.learner.",
learner.root = "learners",
out.dir = NULL,
wipe = TRUE,
keep.models = is.null(out.dir),
keep.ximp = FALSE,
save.on.fit = !is.null(out.dir))
save.impute.learn.rfsrc(object, path, wipe = TRUE, verbose = TRUE)
load.impute.learn.rfsrc(path, targets = NULL, lazy = TRUE, verbose = TRUE)
# S3 method for class 'impute.learn.rfsrc'
predict(object, newdata,
max.predict.iter = 3L,
eps = 1e-3,
targets = NULL,
restore.integer = TRUE,
cache.learners = c("session", "none", "all"),
verbose = TRUE,
...)
A symbolic model description. Can be omitted. The same
interpretation as in impute is used for the initial
training-data imputation stage. The saved full-sweep learner bank is
controlled by deployment.xvars, not by formula.
Training data. Variables that are not real-valued are coerced to factors before fitting when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are dropped before the training schema is recorded.
Arguments passed to
impute for the initial training-data imputation. The
argument full.sweep is controlled internally and should not
be supplied.
Controls the imputation engine used by impute. If
mf.q = 1, training uses standard missForest. If
mf.q > 1, training uses the multivariate missForest
generalization. If mf.q is omitted, the training imputation
follows the default behavior of impute.
Convergence threshold. In impute.learn this
controls the initial training-data imputation. In
predict.impute.learn it controls early stopping for the
prediction-time sweep.
For impute.learn, additional arguments passed to
impute. For predict.impute.learn, additional
arguments are currently ignored.
A list of options used when fitting
the full sweep after the training data have been imputed.
Recognized entries include ntree, nodesize,
nsplit, mtry, splitrule, bootstrap,
sampsize, samptype, perf.type, rfq,
save.memory, importance, and proximity.
Unknown entries are ignored with a warning.
Determines which variables receive a saved
full-sweep learner. The default "missing.only" saves
learners only for variables that had missing values in the training data.
The option "all" saves a learner for every variable. If the
training data are complete, target.mode = "all" must be
used.
Controls which predictors are assumed to be
available later when the saved imputer is used on new data. If
NULL, all columns except the target are used. If a character
vector, the same predictor set is used for all targets. If a named
list, each target can have its own predictor set. By default, all
non-target columns are eligible predictors, so users should exclude
outcomes, future information, identifiers, or any variables that
will not be available at deployment time.
If TRUE, uses rfsrc.anonymous when
fitting the full sweep. This usually reduces the size of the saved
object.
Names used when writing saved full-sweep learners to disk.
Optional output directory. If supplied and
save.on.fit = TRUE, the manifest and the saved full-sweep
learners are written to this directory during fitting. This requires
the fst package because learners are serialized with
fast.save.
If TRUE, removes an existing output directory
before writing a new one.
If TRUE, keeps the fitted full-sweep
learners in memory in the returned object. At least one storage mode
must be enabled: either keep.models = TRUE or
out.dir with save.on.fit = TRUE.
If TRUE, keeps the completed training data in
the returned object. This is not required for later prediction.
If TRUE and out.dir is supplied,
writes the imputer to disk during fitting.
An object returned by impute.learn or
load.impute.learn.
Directory containing a saved imputer. Save and load
operations require the fst package because learners are read
and written with fast.save and fast.load.
Optional subset of target variables to load or to update during prediction. Unknown names are ignored with a warning.
If TRUE, saved learners are loaded only when they
are needed. If FALSE, all saved learners are loaded at once.
New data to be imputed. Missing columns are added and
extra columns are dropped to match the training schema. Unseen
factor levels are converted to NA and then treated as missing
values during initialization and imputation.
Maximum number of full-sweep passes applied to
newdata.
If TRUE, integer columns in the
returned data are rounded and restored as integers. Factor columns
are always conformed back to the training schema. The package
operates on real-valued and factor variables; inputs that are not
real-valued are coerced to factors during preprocessing when
possible, otherwise an error is raised.
How saved learners are reused during
prediction. The default "session" loads each needed learner
once per call to predict. The option "none" reloads a
learner every time it is needed. The option "all" loads all
saved learners before prediction starts.
This function fits a predictive imputer in two stages.
The training data are first normalized to a data frame. Variables that are not real-valued are coerced to factors when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are removed before the training schema is stored.
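The normalization step can be pictured with base R. The sketch below is only an illustration under stated assumptions: the helper name normalize.training is hypothetical, "not real-valued" is approximated by "not numeric", and the package's internal preprocessing may differ in detail.

```r
## Hypothetical helper, not part of the package: mimics the documented
## normalization of training data using base R only.
normalize.training <- function(data) {
  data <- as.data.frame(data, stringsAsFactors = FALSE)
  ## coerce columns that are neither numeric nor factor to factors
  for (nm in names(data)) {
    if (!is.numeric(data[[nm]]) && !is.factor(data[[nm]])) {
      data[[nm]] <- factor(data[[nm]])
    }
  }
  ## drop rows and columns that are entirely missing
  keep.col <- colSums(!is.na(data)) > 0
  keep.row <- rowSums(!is.na(data)) > 0
  data[keep.row, keep.col, drop = FALSE]
}

## toy data: row 2 and column z are entirely missing
toy <- data.frame(x = c(1.5, NA, 3), g = c("a", NA, "b"), z = NA_real_,
                  stringsAsFactors = FALSE)
str(normalize.training(toy))
```

The surviving rows and columns define what the text calls the training schema.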
If the resulting training data contain missing values, the first
stage uses impute to complete the training data. The
imputation engine is chosen in exactly the same way as for
impute itself. In particular, mf.q = 1 gives standard
missForest, mf.q > 1 gives the multivariate
missForest generalization, and if mf.q is omitted the
default impute behavior is used. If the training data are
already complete and target.mode = "all", this initial
imputation step is skipped.
In the second stage, a full sweep is fit on the completed training
data. For each target selected by target.mode, rows where that
target was observed are used to fit a forest with that target on the
left-hand side and the selected deployment predictors on the
right-hand side. The saved learner bank therefore depends on
deployment.xvars. The formula argument affects the
initial training-data imputation step, but it does not define the
saved predictor bank for the later test-time sweep.
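As a rough picture of the second stage, the base R sketch below builds, for each target, the formula and the row subset described above. It does not fit any forest, and the object names (completed, was.observed, plan) are hypothetical and chosen only for illustration.

```r
## Illustrative only: per-target formula and training rows for the
## full sweep. No forests are fit here.
completed <- data.frame(Ozone   = c(41, 36, 12, 18),
                        Solar.R = c(190, 118, 149, 313),
                        Wind    = c(7.4, 8.0, 12.6, 11.5))

## TRUE where a training cell was observed before imputation
was.observed <- data.frame(Ozone   = c(TRUE, TRUE, FALSE, TRUE),
                           Solar.R = c(TRUE, FALSE, TRUE, TRUE),
                           Wind    = c(TRUE, TRUE, TRUE, TRUE))

## targets as they would be chosen under target.mode = "missing.only"
targets <- c("Ozone", "Solar.R")

plan <- lapply(targets, function(tg) {
  ## deployment.xvars = NULL: every non-target column is a predictor
  xvars <- setdiff(names(completed), tg)
  list(formula = reformulate(xvars, response = tg),
       rows    = which(was.observed[[tg]]))
})
names(plan) <- targets

plan$Ozone$formula
plan$Ozone$rows
```

Each entry of plan corresponds to one saved learner: the target on the left-hand side, the deployment predictors on the right, fit only on rows where the target was originally observed.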
By default, deployment.xvars = NULL allows every non-target
column to be used as a predictor. This is convenient, but it can also
introduce leakage if the training data include outcomes, future-only
variables, identifiers, or any fields that will not be available when
the learned imputer is applied to new data. Restrict
deployment.xvars when that is a concern.
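A minimal sketch of restricting the predictor bank with a named list follows. It assumes, purely for illustration, that Month will not be available at deployment time; how the package validates the list is not described here, so the list simply names every target that has missing values in training.

```r
library(randomForestSRC)

set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
aq$Month <- factor(aq$Month)

## Assumption for illustration: Month is unavailable at deployment,
## so it is excluded from every target's predictor set.
targets <- c("Ozone", "Solar.R")     # the columns with missing values
allowed <- setdiff(names(aq), "Month")
xvars <- lapply(targets, function(tg) setdiff(allowed, tg))
names(xvars) <- targets

fit <- impute.learn(
  data = aq,
  ntree = 25,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 25, nsplit = 5),
  deployment.xvars = xvars,
  verbose = FALSE
)
```

Building the list programmatically with setdiff keeps the exclusion rule in one place when many targets share the same restriction.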
When the imputer is saved to disk, each full-sweep learner is written
separately using fast.save. Loading uses fast.load. In
practice this gives a small manifest plus a directory of saved
learners. The fst package is therefore required for save and
load operations. The explicit save method can write learners either
from memory or by reloading them from an attached saved path.
Prediction starts by matching newdata to the training schema,
filling missing values with training means or modes, and then
applying one or more full-sweep passes. Only the targets selected by
target.mode are updated by saved learners.
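The schema matching and the mean or mode initialization can be sketched in base R as follows. The helper init.newdata is hypothetical and is not the package's code; it only illustrates adding missing columns, dropping extra ones, mapping unseen factor levels to NA, and filling NA from training summaries before any learner is applied.

```r
## Hypothetical helper, not the package's code: conform newdata to a
## training schema and fill missing cells with training means or modes.
init.newdata <- function(newdata, train) {
  out <- vector("list", ncol(train))
  names(out) <- names(train)
  for (nm in names(train)) {
    ## columns absent from newdata are added as all-missing
    x <- if (nm %in% names(newdata)) newdata[[nm]] else rep(NA, nrow(newdata))
    if (is.factor(train[[nm]])) {
      ## unseen levels become NA, then NA is replaced by the training mode
      x <- factor(as.character(x), levels = levels(train[[nm]]))
      top <- names(which.max(table(train[[nm]])))
      x[is.na(x)] <- top
    } else {
      x <- as.numeric(x)
      x[is.na(x)] <- mean(train[[nm]], na.rm = TRUE)
    }
    out[[nm]] <- x
  }
  ## columns of newdata that are not in the schema are dropped
  as.data.frame(out, stringsAsFactors = FALSE)
}

train <- data.frame(x = c(1, 2, 3), g = factor(c("a", "a", "b")))
new <- data.frame(g = c("b", "zzz", NA), extra = 1:3)
init.newdata(new, train)
```

In the documented workflow, the full-sweep passes then overwrite these initial fills for the selected targets.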
If target.mode = "missing.only", a variable that was complete
in training but has missing values in new data is initialized with its
training mean or mode but does not receive a model-based update. Use
target.mode = "all" if missing values may appear later in any
variable. Complete training data also require
target.mode = "all", because otherwise no variable has missing
values from which to determine the saved targets.
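A short sketch of the complete-training-data case, using the built-in iris data (which has no missing values) as an assumed stand-in for a real training set:

```r
library(randomForestSRC)

set.seed(101)
## iris has no missing values, so target.mode = "all" is needed
fit <- impute.learn(
  data = iris,
  target.mode = "all",
  full.sweep.options = list(ntree = 25, nsplit = 5),
  verbose = FALSE
)

## new data with missing values in variables that were complete in training
new <- iris[c(1, 51, 101), ]
new$Sepal.Length[2] <- NA
new$Species[3] <- NA

new.imp <- predict(fit, new, max.predict.iter = 2, verbose = FALSE)
new.imp
```

Because every variable has a saved learner, both the numeric and the factor gaps should receive model-based updates rather than only the initial mean or mode fill.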
impute.learn returns an object of class
c("impute.learn.rfsrc", "impute.learn"). The object
contains a manifest, optionally the fitted full-sweep learners,
optionally the completed training data, and optionally a path to the
saved imputer on disk.
load.impute.learn returns an object of the same class.
predict.impute.learn returns a data frame with imputed values
overlaid. An attribute named "impute.learn.info" contains
prediction-time diagnostics such as the number of sweep passes,
pass-difference history, caching mode, disk-load counts, schema
harmonization details, and any targets skipped because a learner was
unavailable or a prediction failed.
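The diagnostics can be retrieved with attr. The sketch below deliberately does not assume the names of individual entries inside the attribute and only lists its top-level structure.

```r
library(randomForestSRC)

set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
id <- sample(seq_len(nrow(aq)), 100)

fit <- impute.learn(data = aq[id, ], ntree = 25, mf.q = 1, max.iter = 5,
                    full.sweep.options = list(ntree = 25, nsplit = 5),
                    verbose = FALSE)
test.imp <- predict(fit, aq[-id, ], max.predict.iter = 2, verbose = FALSE)

## diagnostics travel with the returned data frame as an attribute
info <- attr(test.imp, "impute.learn.info")
str(info, max.level = 1)
```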
Stekhoven D.J. and Bühlmann P. (2012). MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118.
Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363–377.
impute.rfsrc, rfsrc, and
predict.rfsrc.
# \donttest{
## ------------------------------------------------------------
## small data example: uses missForest for impute engine
## ------------------------------------------------------------
set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
aq$Month <- factor(aq$Month)
id <- sample(1:nrow(aq), 100)
train <- aq[id, ]
test <- aq[-id, ]
fit <- impute.learn(
data = train,
ntree = 25,
mf.q = 1,
max.iter = 5,
full.sweep.options = list(ntree = 25, nsplit = 5)
)
test.imp <- predict(fit, test, max.predict.iter = 2, verbose = FALSE)
head(test.imp)
# }
if (FALSE) { # \dontrun{
## ------------------------------------------------------------
## Save the learned imputer to disk and load it later.
## This explicit save example writes learners kept in memory.
## Uses missForest for the impute engine.
## ------------------------------------------------------------
bundle.dir <- file.path(tempdir(), "aq.imputer")
fit <- impute.learn(
data = train,
ntree = 25,
mf.q = 1,
max.iter = 5,
full.sweep.options = list(ntree = 25, nsplit = 5),
keep.models = TRUE,
verbose = FALSE
)
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
unlink(bundle.dir, recursive = TRUE)
## ------------------------------------------------------------
## Challenging example with factors, uses save/reload
## ------------------------------------------------------------
## load pbc, convert everything to factors
data(pbc, package = "randomForestSRC")
dta <- data.frame(lapply(pbc, factor))
dta$days <- pbc$days
dta$status <- pbc$status
## split the data into unbalanced train/test data (25/75)
## the train/test data have the same levels, but different labels
idx <- sample(1:nrow(dta), round(nrow(dta) * .25))
train <- dta[idx,]
test <- dta[-idx,]
## even harder ... factor level not previously encountered in training
levels(test$stage) <- c(levels(test$stage), "fake")
test$stage[sample(seq_len(nrow(test)), 10)] <- "fake"
## train forest
fit <- suppressWarnings(impute.learn(Surv(days, status) ~ ., train, keep.models = TRUE))
## save/reload
bundle.dir <- file.path(tempdir(), "pbc.imputer")
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
print(summary(test.imp))
unlink(bundle.dir, recursive = TRUE)
} # }