Random forest imputation is typically fit and applied to the same
data set, treating imputation as a one-time task rather than a reusable
procedure. The new impute.learn() function in
randomForestSRC is built for a different situation, where
the imputer is learned once from a training sample and then reused to
fill in missing values for new observations at a later date. If the
training data are themselves incomplete, the method first imputes them
using the existing impute() engine. It then fits a single
sweep of target-specific random forests on the imputed training data and
saves only that sweep for later use. Because nothing from the earlier
iterative imputation of the training data needs to be retained, the
result is a lean, portable object that can be saved, reloaded, and
applied repeatedly without any refitting. The method also works when the
training data are already complete, and it lets the user specify which
variables will be available at test time. This vignette describes the
method, explains the R interface, and walks through
practical workflows for fitting, saving, loading, and applying the
learned imputer.
Random forest imputation has become a standard nonparametric tool for
mixed-type missing data. It is flexible, capable of capturing nonlinear
structure and interactions, and requires little in the way of
distributional assumptions. The classical missForest
algorithm of Stekhoven and Bühlmann (2012) and the random forest
missing-data algorithms studied by Tang and Ishwaran (2017) are
representative of this approach [1, 2]. In
both cases, an incomplete data set is iteratively imputed by fitting
forests and using those forests to update currently missing values.
That approach works well when the goal is to impute a data set
already in hand. In many practical settings, however, the analyst wants
an imputation rule that is estimated once from a training sample and
then reused on new observations as they arrive. This need arises
naturally in external validation studies and production deployment,
where new records must be imputed using only information learned from
the training data and the imputer itself must be easy to save and reload
without refitting. Recent work on this problem includes
missForestPredict, which extends missForest to
impute new observations at prediction time [3].
The function impute.learn() is designed for this same
setting but takes a different approach. If the training data contain
missing values, they are first imputed using impute(). The
function then fits one forest for each selected target variable and
saves only that final sweep of forests, discarding the earlier iterative
steps used to impute the training data. This is what sets the method
apart from missForestPredict, which saves the models
produced at every iteration and replays that full sequence when imputing
new observations [3]. Saving only the
final sweep keeps the learned imputer lean and quick to reload, which is
one of the main reasons it remains practical for routine use.
Three additional features are worth noting at the outset. First, the
method supports the full range of random forest imputation schemes
available in randomForestSRC, not just
missForest-style imputation. Second,
impute.learn() can be fit even when the training data have
no missing values, as long as target.mode = "all" is
specified, which lets the user build protection against future
missingness starting from a complete training sample. Third, the user
can restrict which predictors are permitted at test time through
deployment.xvars, keeping the learned imputer focused on
the variables that will actually be available when new data arrive.
Let $$ \mathscr{D}_{\mathrm{tr}} \;=\; \{{\bf x}_i=(X_{i1},\ldots,X_{ip})\}_{i=1}^n $$ denote the training data after coercion to a tabular form, and write $$ {\bf X}\,=\, (X_{ij})_{1\le i\le n,\,1\le j\le p} $$ for the associated training matrix. In the current implementation, each variable is either numeric or factor. Rows and columns that are entirely missing are removed before fitting. Write $M_{ij}\in\{0,1\}$ for the indicator that entry $X_{ij}$ is missing in the training data.
The learned imputer saves information from the training data that is
needed later, including variable type, factor levels when applicable,
and the original column order. Let
$$
\mathcal{T} \subseteq \{1,\ldots,p\}
$$
denote the set of target variables that receive saved forests. The
argument target.mode determines this set through
$$
\mathcal{T} \;=\;
\begin{cases}
\bigl\{\,j : \sum_{i=1}^{n} M_{ij} > 0 \,\bigr\}, & \texttt{target.mode} = \texttt{"missing.only"},\\[4pt]
\{1,\ldots,p\}, & \texttt{target.mode} = \texttt{"all"}.
\end{cases}
$$
Thus the default "missing.only" saves forests only for
variables that were missing in the training data, whereas
"all" saves one forest for every variable. When the
training data are complete, "all" is required.
For each target $j\in\mathcal{T}$, let
$$
\mathcal{P}(j) \subseteq \{1,\ldots,p\}\setminus\{j\}
$$
denote the set of predictors allowed for that target. By default, all
other variables are eligible, so that $\mathcal{P}(j)=\{1,\ldots,p\}\setminus\{j\}$. The argument deployment.xvars
allows the user to restrict these predictors to the variables that will
truly be available at test time.
Let $$ \mathscr{D}_{\mathrm{te}} \;=\; \{{\bf x}_i^{\mathrm{te}}=(X_{i1}^{\mathrm{te}},\ldots,X_{ip}^{\mathrm{te}})\}_{i=1}^m $$ denote later data to be imputed. The central task is to estimate the imputation rule from $\mathscr{D}_{\mathrm{tr}}$ once and then apply that same rule to $\mathscr{D}_{\mathrm{te}}$ without refitting any forest.
If the training data contain missing values, the first step uses
impute() to produce an imputed training matrix $$
\widetilde{{\bf X}}
\;=\;
\bigl(\widetilde{X}_{ij}\bigr)_{1\le i\le n,\,1\le j\le p}.
$$ The arguments mf.q, max.iter, and
formula retain the same meaning they have in
impute(), and formula affects only this first
stage. If the training data are already complete and
target.mode = "all", this stage is skipped and $\widetilde{{\bf X}}$ is simply the training
matrix itself.
Once $\widetilde{{\bf X}}$ is in hand, the method builds one supervised forest for each target variable $j\in\mathcal{T}$. The forest for target $j$ is trained only on rows where that variable was actually observed in the original training data,
$$
\mathcal{O}_j \;=\; \{\, i : M_{ij} = 0 \,\}.
$$
A forest for target $j$ is then fit on rows $\mathcal{O}_j$, using the completed predictors $\widetilde{X}_{i,\mathcal{P}(j)}$ and the observed target values $X_{ij}$: $$ \widehat{f}_j \;=\; \operatorname{RF}_j\Bigl( \bigl\{(\widetilde{X}_{i,\mathcal{P}(j)},X_{ij}) : i\in\mathcal{O}_j\bigr\} \Bigr), \qquad j\in\mathcal{T}. $$ Only one forest is produced and saved per target. This single-sweep design is a defining feature of the method. Because nothing from the earlier iterative imputation of the training data needs to be retained, the saved object stays lean and is straightforward to reload and reuse.
Before any predictions are made, the new data are matched to the training variables in four steps.
Missing columns are added and filled with
NA.
Extra columns not seen in training are dropped.
Column order is restored to match the training order.
Factor levels not seen in training are converted to
NA.
The last step is important. A factor label that did not appear in the training data is treated as missing rather than carried forward as an unrecognized level.
Once the columns are matched, missing entries are initialized with training means or modes. Integer-valued variables are restored as integers at the end. In this step, only cells that are missing in the new data are touched; observed values supplied by the user are left exactly as they are.
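As a rough illustration of the initialization rule only (this is a sketch, not the package's internal code), a numeric variable is started from the training mean and a factor from the training mode:

```r
## Sketch of mean/mode initialization of missing cells, using training
## summaries only. Illustrative; the package performs this internally.
init.fill <- function(x, train.x) {
  if (is.numeric(train.x)) {
    x[is.na(x)] <- mean(train.x, na.rm = TRUE)   ## numeric: training mean
  } else {
    tab <- table(train.x)
    x[is.na(x)] <- names(tab)[which.max(tab)]    ## factor: training mode
  }
  x
}
init.fill(c(1, NA, 3), train.x = c(2, 4, 6))     ## the NA becomes 4
```

Only cells that are NA after column matching would pass through such a rule; observed values are untouched.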
Let
$$
\mathcal{M}_j^{\mathrm{te}} \;=\; \{\, i : X_{ij}^{\mathrm{te}}\ \text{is missing after column matching} \,\}
$$
denote the set of rows in the new data for which target $j$ is missing after the column-matching step. Starting from the initialized matrix ${\bf X}^{(0)}$, the method sweeps repeatedly through the saved targets in their stored order. In each pass $t=1,2,\ldots$, for each target $j\in\mathcal{T}$, the saved forest fills in only the rows that are still missing,
$$
X^{(t)}_{ij} \;=\; \widehat{f}_j\bigl(X^{(t)}_{i,\mathcal{P}(j)}\bigr),
\qquad i\in\mathcal{M}_j^{\mathrm{te}}.
$$
Here ${\bf X}^{(t)}$ denotes the current working matrix within pass $t$, so updates made earlier in the pass are immediately available to targets visited later in the same pass.
When target.mode = "missing.only", only variables that
had missing values in the training data receive saved forests. A
variable that was fully observed in training but turns up missing in new
data will therefore keep its starting value throughout, with no further
adjustment. When future missingness could affect any variable,
target.mode = "all" is generally the safer choice.
After each pass, the procedure compares the newly imputed values
against those from the previous pass. For a numeric variable, the
comparison uses a scaled root mean squared difference computed over rows
that were missing at test time. For a factor variable, it counts the
fraction of labels that changed. These comparison scores are then
averaged across targets to give a single pass difference $\Delta^{(t)}$,
where targets with no missing rows in the new data are left out of the
average. The passes stop early when
$$
\Delta^{(t)} < \varepsilon,
$$
where $\varepsilon$ is the user-supplied value of eps, and otherwise continue
until max.predict.iter passes have been completed.
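One plausible form of the per-target comparison, shown here only as a sketch (the exact internal formula may differ), is:

```r
## Sketch of a per-target pass-to-pass comparison over test-missing rows.
## Numeric targets: root mean squared difference scaled by the spread of
## the new values; factor targets: fraction of labels that changed.
pass.diff <- function(old, new) {
  if (is.numeric(new)) {
    sqrt(mean((new - old)^2)) / max(sd(new), .Machine$double.eps)
  } else {
    mean(as.character(new) != as.character(old))
  }
}
pass.diff(c(1, 2, 3), c(1, 2, 3))    ## identical passes give 0
```

Averaging such scores over the saved targets yields the pass difference that is checked against eps.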
The learned imputer is returned as an object with three parts.
A small record of variable types, factor levels, targets, predictor sets, sweep order, starting values, and timing information.
Optionally, the saved forests held in memory.
Optionally, an on-disk form consisting of
manifest.rds plus one saved forest file for each
target.
At least one storage mode must be used. The forests can remain in
memory, be written to disk, or both. The on-disk form relies on the
package wrappers fast.save() and fast.load(),
so the fst package is required when disk storage is used.
Keeping the object small and fast to reload is a deliberate design goal,
since the whole point is that the learned imputer can be picked up and
reused with minimal overhead.
The learned test-time imputer is used through four principal functions.
fit <- impute.learn(...)
newdata.imp <- predict(fit, newdata = ...)
save.impute.learn(fit, path = ...)
load.fit <- load.impute.learn(path = ...)
The fitting and prediction interfaces follow the familiar
impute() workflow, but several arguments play a special
role for test-time use.
A typical fitting call has the form
fit <- impute.learn(
  data = train,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 100, nsplit = 10),
  target.mode = "all",
  deployment.xvars = NULL,
  anonymous = TRUE
)
The main fitting arguments are the following.
formula controls only the initial imputation of the
training data when the training data are incomplete.
deployment.xvars specifies which variables may be
used as predictors at test time.
target.mode determines which variables receive saved
forests. Use "all" whenever later missingness may occur in
any variable, and always when the training data are complete.
full.sweep.options sets the forest tuning values for
the saved test-time forests and not for the initial call to
impute().
anonymous, keep.models,
out.dir, and save.on.fit control object size
and where the saved forests are stored.
The distinction between formula and
deployment.xvars is worth spelling out. The formula can
draw on any variables that help impute the training data. The deployment
argument then states which variables are permitted to enter the saved
test-time forests. This makes it possible to use a rich formula during
training while still restricting prediction to the variables that will
actually be observed when new data arrive.
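As a minimal sketch of this pattern (the variable names here are placeholders, not part of the package), a rich survival formula can guide the initial training imputation while the saved test-time forests are limited to baseline fields:

```r
## Hypothetical call: the formula is used only while imputing the training
## data; deployment.xvars restricts the predictors saved for test time.
fit <- impute.learn(
  Surv(days, status) ~ .,              ## rich supervised training imputation
  data = train,                        ## assumed incomplete training frame
  target.mode = "all",
  deployment.xvars = c("age", "sex", "bili")  ## placeholder baseline set
)
```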
Once a learned imputer has been fit or reloaded, imputing new data is as simple as
newdata.imp <- predict(
  fit,
  newdata = test,
  max.predict.iter = 3,
  eps = 1e-3
)
The returned object is the imputed data frame itself. Additional information is attached as an attribute recording the pass history, the columns added or removed when the new data are matched to the training variables, unseen factor levels, forest load counts, and any per-target problems that arose during prediction.
When out.dir is supplied during fitting, the learned
imputer is written directly to disk. A fitted object held in memory can
also be saved at any later point.
save.impute.learn(fit, path = bundle.dir)
load.fit <- load.impute.learn(bundle.dir, lazy = TRUE)
The load function can also read only selected targets, which is useful when later work needs only a subset of variables to be updated.
The airquality data provides a small example with both
numeric variables and missing values. We keep five variables, convert
Month to a factor, split the rows into training and held
out sets, and learn the imputer from the training rows. Because the
training data are incomplete, impute.learn() first calls
impute() to impute the training sample and then fits the
saved test-time forests. Setting target.mode = "all" asks
the function to save a forest for every variable, including variables
that happen to be fully observed in this particular training split.
library(randomForestSRC)
set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
aq$Month <- factor(aq$Month)
id <- sample(seq_len(nrow(aq)), 100)
train <- aq[id, ]
test <- aq[-id, ]
fit <- impute.learn(
  data = train,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 25, nsplit = 5),
  target.mode = "all"
)
test.imp <- predict(fit, test, max.predict.iter = 2, verbose = FALSE)
info <- attr(test.imp, "impute.learn.info")
head(test.imp)
Ozone Solar.R Wind Temp Month
1 41 190 7.4 67 5
5 17 158 14.3 56 5
11 7 96 6.9 74 5
13 11 290 9.2 66 5
16 14 334 11.5 64 5
19 30 322 11.5 68 5
info$pass.diff
[1] 0.6191309 0.1669537
The object test.imp is the imputed held-out data. The
vector info$pass.diff reports the change from one
test-time pass to the next and gives a direct check that the sweep is
settling. The example shows the intended use of the method in its
simplest form: the imputer is learned once from the training rows (here
using missForest imputation, since mf.q = 1) and then
applied to new rows as they arrive.
The next example illustrates a feature that is easy to overlook. The
training data do not need to contain missing values. Here the complete
iris data are used to learn one forest for every variable,
and missing values are introduced only in the new data. Because the
training data are complete, target.mode = "all" is
required.
library(randomForestSRC)
train <- iris
test <- iris[1:12, ]
set.seed(7)
test$Sepal.Length[sample(seq_len(nrow(test)), 3)] <- NA
test$Species[sample(seq_len(nrow(test)), 2)] <- NA
fit <- impute.learn(
  data = train,
  target.mode = "all",
  full.sweep.options = list(ntree = 50, nsplit = 5)
)
test.imp <- predict(fit, test, max.predict.iter = 2, verbose = FALSE)
test.imp
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.100000 3.5 1.4 0.2 setosa
2 4.900000 3.0 1.4 0.2 setosa
3 4.656523 3.2 1.3 0.2 setosa
4 4.600000 3.1 1.5 0.2 setosa
5 5.000000 3.6 1.4 0.2 setosa
6 5.400000 3.9 1.7 0.4 setosa
7 4.950964 3.4 1.4 0.3 setosa
8 5.000000 3.4 1.5 0.2 setosa
9 4.400000 2.9 1.4 0.2 setosa
10 4.802320 3.1 1.5 0.1 setosa
11 5.400000 3.7 1.5 0.2 setosa
12 4.800000 3.4 1.6 0.2 setosa
In this example the initial training imputation step is skipped entirely and the function fits the saved forests directly from the complete training data. This example is for users who want to build protection against missingness that may arise only after the model has been trained.
The last example illustrates a more detailed workflow. The
pbc data are converted to a mainly factor representation,
the learned imputer is fit once, the object is written to disk, and it
is later reloaded with lazy loading. We also introduce a new level of
stage in the new data to show how unseen factor levels are
handled. The example further illustrates the distinction between
formula and deployment.xvars: the survival
formula is used while imputing the training data (supervised
imputation), but the saved test-time forests are restricted to a chosen
predictor set.
library(randomForestSRC)
data(pbc, package = "randomForestSRC")
dta <- data.frame(lapply(pbc, factor))
dta$days <- pbc$days
set.seed(1)
idx <- sample(seq_len(nrow(dta)), round(nrow(dta) * .25))
train <- dta[idx, ]
test <- dta[-idx, ]
xvars <- setdiff(names(train), c("days", "status"))
levels(test$stage) <- c(levels(test$stage), "fake")
test$stage[sample(seq_len(nrow(test)), 10)] <- "fake"
fit <- suppressWarnings(
  impute.learn(
    Surv(days, status) ~ .,
    train,
    deployment.xvars = xvars,
    keep.models = TRUE,
    target.mode = "all",
    anonymous = TRUE
  )
)
bundle.dir <- file.path(tempdir(), "pbc.imputer")
save.impute.learn(fit, bundle.dir, verbose = FALSE)
load.fit <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(
  load.fit,
  test,
  max.predict.iter = 2,
  verbose = FALSE
)
attr(test.imp, "impute.learn.info")$unseen.levels$stage
unlink(bundle.dir, recursive = TRUE)
The inserted level "fake" is converted to
NA when the new data are matched to the training factor
levels, after which it is imputed like any other missing value. Passing
lazy = TRUE to load.impute.learn() delays each
file read until the corresponding forest is actually needed during
prediction.
When out.dir is supplied during fitting, the learned
imputer can be written directly to disk. Alternatively, a fitted object
kept in memory can be saved later.
save.impute.learn(fit, path = bundle.dir)
load.fit <- load.impute.learn(bundle.dir, lazy = TRUE)
The load function can also read only selected targets. This is useful
when later work needs only a subset of variables to be updated. Combined
with cache.learners, this gives fine control over memory
use and file reads.
The learned imputer is straightforward to use, but several practical points are worth keeping in mind.
With the default "missing.only", only variables
that were missing in the training data are saved as targets. Those
variables can be updated at test time. A variable that was complete in
the training data is not a saved target under
"missing.only". If it is missing in new data, it is filled
with its stored starting value and is not updated further. Complete
training data therefore require
target.mode = "all".
The argument deployment.xvars determines which variables may be used as
predictors when updating a saved target, but it does not determine which
variables are themselves saved targets. A variable omitted from
deployment.xvars can still be updated at test time if it is
a saved target. Omitting a variable from deployment.xvars
only means that it is not used as a predictor for other targets. A
variable that is not a saved target receives only its stored starting
value when it is missing in new data, whether or not it appears in
deployment.xvars. It may still be used as a predictor for
other targets if deployment.xvars allows it.
Related to point 2: if deployment.xvars is left
unspecified, then by default every variable (excluding the target
itself) is used to impute each target. This can be too permissive when the training data
contain outcomes, information measured after baseline, identifiers, or
other fields that will not be available when new data arrive. On the
other hand, if deployment.xvars is chosen too narrowly,
fitting stops whenever a saved target is left with no
predictors.
Extra columns not seen in training play no role: they are removed from
newdata at prediction time. Rows, by contrast, are never dropped.
Suppose, for example, that a single new observation is missing in every
variable retained by the fitted imputer. That observation will still be
imputed. The method first fills every retained variable with its stored
starting value and then applies the saved forests to the saved targets
in the usual sweep order. Under target.mode = "all", every
retained variable is eligible for update. Under
"missing.only", only variables that were missing in the
training data are updated, and the remaining variables stay at their
starting values.
If the training data have no missing values and
target.mode = "all" is set, the method skips the initial
call to impute() and fits the saved forests directly from
the training data.
Only the final sweep of forests is saved; this is the main reason the object stays lean. The iterative steps used to impute the training data are not stored and do not need to be.
A level not present in training is converted to NA
before the first test-time pass.
The saved object consists of a small manifest.rds
file plus one forest file for each target. Saving and loading in this
format requires the fst package.
Only cells that are missing after the new data are matched to the training variables are ever updated during prediction.
Random forest imputation has long been treated as something you do to
a data set once and then move on. Recent work on
missForestPredict (Albu et al., 2025) showed that the same
idea can be extended to new observations arriving after training, and
impute.learn() pursues that goal within
randomForestSRC. The goal is to estimate an imputation rule
from training data once, save it in memory or on disk, and apply it
unchanged whenever new observations appear. No forest is refit at test
time. This is especially natural when model development and later use
are separated in time, as in external validation, prospective data
collection, or enterprise prediction pipelines.
What makes impute.learn() distinctive is that it saves
only one forest per target variable. When the training data themselves
contain missing values, the earlier iterations used to impute those data
are simply discarded. Only the final sweep is kept. This is a deliberate
departure from approaches that replay a full iterative sequence at test
time. Doing so reduces storage requirements and reload times, making the
imputer more practical for day-to-day work.
The method is also flexible about how much protection the user wants
against later missingness. If the training data are already complete,
impute.learn() can still be fit by setting
target.mode = "all", so that every retained variable
receives a saved forest and can later be updated if needed. More
generally, target.mode determines which variables are
eligible for model-based updating, whereas deployment.xvars
restricts which variables may be used as predictors in those updates.
This separation lets the user control both sides of the problem, namely
which variables may be imputed later and which accompanying information
may be used to impute them.
Because the saved forests are fit with the usual
randomForestSRC forest engine, the method also retains
familiar fitting controls. At prediction time the procedure remains
straightforward. New data are matched to the retained training
variables, missing values are started from stored training summaries,
and saved targets are then updated in the stored sweep order. Even
extreme cases behave predictably. A new observation that is entirely
missing is not discarded at test time. It is first filled with starting
values and then updated wherever a saved target forest is available.
Variables that are not saved targets remain at their starting
values.
The idea of a learned, reusable test-time imputer for random forests is an exciting new concept which we believe has much room to grow. Once an imputer is learned from one sample and carried forward to another, concerns such as distributional shift, early warning signs that the learned rule no longer fits the incoming data, and the tradeoff between accuracy, storage, and reload time all become genuine statistical problems in their own right. Although the present method does not resolve those questions, it provides a practical and transparent framework in which we can begin to address them.
Cite this vignette as
H. Ishwaran, M. Lu, and
U. B. Kogalur. 2026. “randomForestSRC: learned test-time imputation
vignette.” http://randomforestsrc.org/articles/testimput.html.
@misc{HemantTESTIM,
author = "Hemant Ishwaran and Min Lu and Udaya B. Kogalur",
title = {{randomForestSRC}: learned test-time imputation vignette},
year = {2026},
url = {http://randomforestsrc.org/articles/testimput.html},
howpublished = "\url{http://randomforestsrc.org/articles/testimput.html}",
note = "[accessed date]"
}