Title: | Cross-Validating Regression Models |
---|---|
Description: | Cross-validation methods of regression models that exploit features of various modeling functions to improve speed. Some of the methods implemented in the package are novel, as described in the package vignettes; for general introductions to cross-validation, see, for example, Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2021, ISBN 978-1-0716-1417-4, Secs. 5.1, 5.3), "An Introduction to Statistical Learning with Applications in R, Second Edition", and Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009, ISBN 978-0-387-84857-0, Sec. 7.10), "The Elements of Statistical Learning, Second Edition". |
Authors: | John Fox [aut] , Georges Monette [aut, cre] |
Maintainer: | Georges Monette <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.0.3 |
Built: | 2024-11-15 06:25:00 UTC |
Source: | https://github.com/gmonette/cv |
cv()
is a parallelized generic k-fold (including n-fold, i.e., leave-one-out)
cross-validation function, with a default method,
specific methods for linear and generalized-linear models that can be much
more computationally efficient, and a method for robust linear models.
There are also cv()
methods for mixed-effects models,
for model-selection procedures,
and for several models fit to the same data,
which are documented separately.
cv(model, data, criterion, k, reps = 1L, seed, ...) ## Default S3 method: cv( model, data = insight::get_data(model), criterion = mse, k = 10L, reps = 1L, seed = NULL, criterion.name = deparse(substitute(criterion)), details = k <= 10L, confint = n >= 400L, level = 0.95, ncores = 1L, type = "response", start = FALSE, model.function, ... ) ## S3 method for class 'lm' cv( model, data = insight::get_data(model), criterion = mse, k = 10L, reps = 1L, seed = NULL, details = k <= 10L, confint = n >= 400L, level = 0.95, method = c("auto", "hatvalues", "Woodbury", "naive"), ncores = 1L, ... ) ## S3 method for class 'glm' cv( model, data = insight::get_data(model), criterion = mse, k = 10L, reps = 1L, seed = NULL, details = k <= 10L, confint = n >= 400L, level = 0.95, method = c("exact", "hatvalues", "Woodbury"), ncores = 1L, start = FALSE, ... ) ## S3 method for class 'rlm' cv(model, data, criterion, k, reps = 1L, seed, ...) ## S3 method for class 'cv' print(x, digits = getOption("digits"), ...) ## S3 method for class 'cv' summary(object, digits = getOption("digits"), ...) ## S3 method for class 'cvList' print(x, ...) ## S3 method for class 'cvList' summary(object, ...) ## S3 method for class 'cv' plot(x, y, what = c("CV criterion", "coefficients"), ...) ## S3 method for class 'cvList' plot( x, y, what = c("adjusted CV criterion", "CV criterion"), confint = TRUE, ... ) cvInfo(object, what, ...) ## S3 method for class 'cv' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name"), ... ) ## S3 method for class 'cvModList' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name"), ... ) ## S3 method for class 'cvList' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name"), ... ) ## S3 method for class 'cv' as.data.frame( x, row.names = NULL, optional = TRUE, rows = c("cv", "folds"), columns = c("criteria", "coefficients"), ... ) ## S3 method for class 'cvList' as.data.frame(x, row.names = NULL, optional = TRUE, ...) ## S3 method for class 'cvDataFrame' print(x, digits = getOption("digits") - 2L, ...) ## S3 method for class 'cvDataFrame' summary( object, formula, subset = NULL, fun = mean, include = c("cv", "folds", "all"), ... )
cv(model, data, criterion, k, reps = 1L, seed, ...) ## Default S3 method: cv( model, data = insight::get_data(model), criterion = mse, k = 10L, reps = 1L, seed = NULL, criterion.name = deparse(substitute(criterion)), details = k <= 10L, confint = n >= 400L, level = 0.95, ncores = 1L, type = "response", start = FALSE, model.function, ... ) ## S3 method for class 'lm' cv( model, data = insight::get_data(model), criterion = mse, k = 10L, reps = 1L, seed = NULL, details = k <= 10L, confint = n >= 400L, level = 0.95, method = c("auto", "hatvalues", "Woodbury", "naive"), ncores = 1L, ... ) ## S3 method for class 'glm' cv( model, data = insight::get_data(model), criterion = mse, k = 10L, reps = 1L, seed = NULL, details = k <= 10L, confint = n >= 400L, level = 0.95, method = c("exact", "hatvalues", "Woodbury"), ncores = 1L, start = FALSE, ... ) ## S3 method for class 'rlm' cv(model, data, criterion, k, reps = 1L, seed, ...) ## S3 method for class 'cv' print(x, digits = getOption("digits"), ...) ## S3 method for class 'cv' summary(object, digits = getOption("digits"), ...) ## S3 method for class 'cvList' print(x, ...) ## S3 method for class 'cvList' summary(object, ...) ## S3 method for class 'cv' plot(x, y, what = c("CV criterion", "coefficients"), ...) ## S3 method for class 'cvList' plot( x, y, what = c("adjusted CV criterion", "CV criterion"), confint = TRUE, ... ) cvInfo(object, what, ...) ## S3 method for class 'cv' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name"), ... ) ## S3 method for class 'cvModList' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name"), ... ) ## S3 method for class 'cvList' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name"), ... ) ## S3 method for class 'cv' as.data.frame( x, row.names = NULL, optional = TRUE, rows = c("cv", "folds"), columns = c("criteria", "coefficients"), ... ) ## S3 method for class 'cvList' as.data.frame(x, row.names = NULL, optional = TRUE, ...) ## S3 method for class 'cvDataFrame' print(x, digits = getOption("digits") - 2L, ...) ## S3 method for class 'cvDataFrame' summary( object, formula, subset = NULL, fun = mean, include = c("cv", "folds", "all"), ... )
model |
a regression model object (see Details). |
data |
data frame to which the model was fit (not usually necessary). |
criterion |
cross-validation criterion ("cost" or lack-of-fit) function of form |
k |
perform k-fold cross-validation (default is |
reps |
number of times to replicate k-fold CV (default is |
seed |
for R's random number generator; optional, if not supplied a random seed will be selected and saved; not needed for n-fold cross-validation. |
... |
to match generic; passed to |
criterion.name |
a character string giving the name of the CV criterion function
in the returned |
details |
if |
confint |
if |
level |
confidence level (default |
ncores |
number of cores to use for parallel computations
(default is |
type |
for the default method, value to be passed to the
|
start |
if |
model.function |
a regression function, typically for a new |
method |
computational method to apply to a linear (i.e., |
x |
a |
digits |
significant digits for printing,
default taken from the |
object |
an object to summarize or a |
y |
to match the |
what |
for For Partial matching is supported, so, e.g., |
row.names |
optional row names for the result,
defaults to |
optional |
to match the |
rows |
the rows of the resulting data frame to retain: setting
|
columns |
the columns of the resulting data frame to retain:
setting |
formula |
of the form |
subset |
a subsetting expression; the default ( |
fun |
summary function to apply, defaulting to |
include |
which rows of the |
The default cv()
method uses update()
to refit the model
to each fold, and should work if there are appropriate update()
and predict()
methods, and if the default method for GetResponse()
works or if a GetResponse()
method is supplied.
The "lm"
and "glm"
methods can use much faster computational
algorithms, as selected by the method
argument. The linear-model
method accommodates weighted linear models.
For both classes of models, for the leave-one-out (n-fold) case, fitted values
for the folds can be computed from the hat-values via
method="hatvalues"
without refitting the model;
for GLMs, this method is approximate, for LMs it is exact.
Again for both classes of models, when more than one case is omitted
in each fold, fitted values may be obtained without refitting the
model by exploiting the Woodbury matrix identity via method="Woodbury"
.
As for hatvalues, this method is exact for LMs and approximate for
GLMs.
The default for linear models is method="auto"
,
which is equivalent to method="hatvalues"
for n-fold cross-validation
and method="Woodbury"
otherwise; method="naive"
refits
the model via update()
and is generally much slower. The
default for generalized linear models is method="exact"
,
which employs update()
. This default is conservative, and
it is usually safe to use method="hatvalues"
for n-fold CV
or method="Woodbury"
for k-fold CV.
There is also a method for robust linear models fit by
rlm()
in the MASS package (to avoid
inheriting the "lm"
method for which the default "auto"
computational method would be inappropriate).
For additional details, see the "Cross-validating regression models"
vignette (vignette("cv", package="cv")
).
cv()
is designed to be extensible to other classes of regression
models; see the "Extending the cv package" vignette
(vignette("cv-extend", package="cv")
).
The cv()
methods return an object of class "cv"
, with the CV criterion
("CV crit"
), the bias-adjusted CV criterion ("adj CV crit"
),
the criterion for the model applied to the full data ("full crit"
),
the confidence interval and level for the bias-adjusted CV criterion ("confint"
),
the number of folds ("k"
), and the seed for R's random-number
generator ("seed"
). If details=TRUE
, then the returned object
will also include a "details"
component, which is a list of two
elements: "criterion"
, containing the CV criterion computed for the
cases in each fold; and "coefficients"
, regression coefficients computed
for the model with each fold deleted. Some methods may return a
subset of these components and may add additional information.
If reps
> 1
, then an object of class "cvList"
is returned,
which is literally a list of "cv"
objects.
cv(default)
: "default"
method.
cv(lm)
: "lm"
method.
cv(glm)
: "glm"
method.
cv(rlm)
: "rlm"
method (to avoid inheriting the "lm"
method).
print(cv)
: print()
method for "cv"
objects.
summary(cv)
: summary()
method for "cv"
objects.
plot(cv)
: plot()
method for "cv"
objects.
as.data.frame(cv)
: as.data.frame()
method for "cv"
objects.
print(cvList)
: print()
method for "cvList"
objects.
summary(cvList)
: summary()
method for "cvList"
objects.
plot(cvList)
: plot()
method for "cvList"
objects.
cvInfo()
: extract information from a "cv"
object.
as.data.frame(cvList)
: as.data.frame()
method for "cvList"
objects.
print(cvDataFrame)
: print()
method for "cvDataFrame"
objects.
summary(cvDataFrame)
: summary()
method for "cvDataFrame"
objects.
cv.merMod
, cv.function
,
cv.modList
.
if (requireNamespace("ISLR2", quietly=TRUE)){ withAutoprint({ data("Auto", package="ISLR2") m.auto <- lm(mpg ~ horsepower, data=Auto) cv(m.auto, k="loo") summary(cv(m.auto, k="loo")) summary(cv.auto <- cv(m.auto, seed=1234)) compareFolds(cv.auto) plot(cv.auto) plot(cv.auto, what="coefficients") summary(cv.auto.reps <- cv(m.auto, seed=1234, reps=3)) cvInfo(cv.auto.reps, what="adjusted CV criterion") plot(cv.auto.reps) plot(cv(m.auto, seed=1234, reps=10, confint=TRUE)) D.auto.reps <- as.data.frame(cv.auto.reps) head(D.auto.reps) summary(D.auto.reps, mse ~ rep + fold, include="folds") summary(D.auto.reps, mse ~ rep + fold, include = "folds", subset = fold <= 5) # first 5 folds summary(D.auto.reps, mse ~ rep, include="folds") summary(D.auto.reps, mse ~ rep, fun=sd, include="folds") }) } else { cat("\n install 'ISLR2' package to run these examples\n") } if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Mroz", package="carData") m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial) summary(cv.mroz <- cv(m.mroz, criterion=BayesRule, seed=123)) cvInfo(cv.mroz) cvInfo(cv.mroz, "adjusted") cvInfo(cv.mroz, "confint") data("Duncan", package="carData") m.lm <- lm(prestige ~ income + education, data=Duncan) m.rlm <- MASS::rlm(prestige ~ income + education, data=Duncan) summary(cv(m.lm, k="loo", method="Woodbury")) summary(cv(m.rlm, k="loo")) }) } else { cat("\n install 'carData' package to run these examples\n") }
if (requireNamespace("ISLR2", quietly=TRUE)){ withAutoprint({ data("Auto", package="ISLR2") m.auto <- lm(mpg ~ horsepower, data=Auto) cv(m.auto, k="loo") summary(cv(m.auto, k="loo")) summary(cv.auto <- cv(m.auto, seed=1234)) compareFolds(cv.auto) plot(cv.auto) plot(cv.auto, what="coefficients") summary(cv.auto.reps <- cv(m.auto, seed=1234, reps=3)) cvInfo(cv.auto.reps, what="adjusted CV criterion") plot(cv.auto.reps) plot(cv(m.auto, seed=1234, reps=10, confint=TRUE)) D.auto.reps <- as.data.frame(cv.auto.reps) head(D.auto.reps) summary(D.auto.reps, mse ~ rep + fold, include="folds") summary(D.auto.reps, mse ~ rep + fold, include = "folds", subset = fold <= 5) # first 5 folds summary(D.auto.reps, mse ~ rep, include="folds") summary(D.auto.reps, mse ~ rep, fun=sd, include="folds") }) } else { cat("\n install 'ISLR2' package to run these examples\n") } if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Mroz", package="carData") m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial) summary(cv.mroz <- cv(m.mroz, criterion=BayesRule, seed=123)) cvInfo(cv.mroz) cvInfo(cv.mroz, "adjusted") cvInfo(cv.mroz, "confint") data("Duncan", package="carData") m.lm <- lm(prestige ~ income + education, data=Duncan) m.rlm <- MASS::rlm(prestige ~ income + education, data=Duncan) summary(cv(m.lm, k="loo", method="Woodbury")) summary(cv(m.rlm, k="loo")) }) } else { cat("\n install 'carData' package to run these examples\n") }
The cv()
"function"
method
is a general function to cross-validate a model-selection procedure,
such as the following:
selectStepAIC()
is a procedure that applies the stepAIC()
model-selection function in the MASS package; selectTrans()
is a procedure
for selecting predictor and response transformations in regression, which
uses the powerTransform()
function in the
car package; selectTransAndStepAIC()
combines predictor and response
transformations with predictor selection; and selectModelList()
uses cross-validation to select a model from a list of models created by
models()
and employs (meta) cross-validation to assess the predictive
accuracy of this procedure.
## S3 method for class ''function'' cv( model, data, criterion = mse, k = 10L, reps = 1L, seed = NULL, working.model = NULL, y.expression = NULL, confint = n >= 400L, level = 0.95, details = k <= 10L, save.model = FALSE, ncores = 1L, ... ) selectStepAIC( data, indices, model, criterion = mse, AIC = TRUE, details = TRUE, save.model = FALSE, ... ) selectTrans( data, indices, details = TRUE, save.model = FALSE, model, criterion = mse, predictors, response, family = c("bcPower", "bcnPower", "yjPower", "basicPower"), family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"), rounded = TRUE, ... ) selectTransStepAIC( data, indices, details = TRUE, save.model = FALSE, model, criterion = mse, predictors, response, family = c("bcPower", "bcnPower", "yjPower", "basicPower"), family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"), rounded = TRUE, AIC = TRUE, ... ) selectModelList( data, indices, model, criterion = mse, k = 10L, k.meta = k, details = k <= 10L, save.model = FALSE, seed = FALSE, quietly = TRUE, ... ) compareFolds(object, digits = 3, ...) ## S3 method for class 'cvSelect' coef(object, average, NAs = 0, ...) ## S3 method for class 'cvSelect' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name", "selected model"), ... )
## S3 method for class ''function'' cv( model, data, criterion = mse, k = 10L, reps = 1L, seed = NULL, working.model = NULL, y.expression = NULL, confint = n >= 400L, level = 0.95, details = k <= 10L, save.model = FALSE, ncores = 1L, ... ) selectStepAIC( data, indices, model, criterion = mse, AIC = TRUE, details = TRUE, save.model = FALSE, ... ) selectTrans( data, indices, details = TRUE, save.model = FALSE, model, criterion = mse, predictors, response, family = c("bcPower", "bcnPower", "yjPower", "basicPower"), family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"), rounded = TRUE, ... ) selectTransStepAIC( data, indices, details = TRUE, save.model = FALSE, model, criterion = mse, predictors, response, family = c("bcPower", "bcnPower", "yjPower", "basicPower"), family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"), rounded = TRUE, AIC = TRUE, ... ) selectModelList( data, indices, model, criterion = mse, k = 10L, k.meta = k, details = k <= 10L, save.model = FALSE, seed = FALSE, quietly = TRUE, ... ) compareFolds(object, digits = 3, ...) ## S3 method for class 'cvSelect' coef(object, average, NAs = 0, ...) ## S3 method for class 'cvSelect' cvInfo( object, what = c("CV criterion", "adjusted CV criterion", "full CV criterion", "confint", "SE", "k", "seed", "method", "criterion name", "selected model"), ... )
model |
a regression model object fit to data, or for the
|
data |
full data frame for model selection. |
criterion |
a CV criterion ("cost" or lack-of-fit) function. |
k |
perform k-fold cross-validation (default is 10); |
reps |
number of times to replicate k-fold CV (default is |
seed |
for R's random number generator; not used for n-fold cross-validation.
If not explicitly set, a seed is randomly generated and saved to make the results
reproducible. In some cases, for internal use only, |
working.model |
a regression model object fit to data, typically
to begin a model-selection process; for use with |
y.expression |
normally the response variable is found from the
|
confint |
if |
level |
confidence level (default |
details |
if |
save.model |
save the model that's selected using the full data set
(default, |
ncores |
number of cores to use for parallel computations
(default is |
... |
for |
indices |
indices of cases in data defining the current fold. |
AIC |
if |
predictors |
character vector of names of the predictors in the model to transform; if missing, no predictors will be transformed. |
response |
name of the response variable; if missing, the response won't be transformed. |
family |
transformation family for the predictors, one of
|
family.y |
transformation family for the response,
with |
rounded |
if |
k.meta |
the number of folds for meta CV; defaults
to the value of |
quietly |
if |
object |
an object of class |
digits |
significant digits for printing coefficients
(default |
average |
if supplied, a function, such as |
NAs |
values to substitute for |
what |
the information to extract from a |
The model-selection function supplied as the procedure
(for cvSelect()
)
or model
(for cv()
) argument
should accept the following arguments:
data
set to the data
argument to cvSelect()
or cv()
.
indices
the indices of the rows of data
defining the current fold; if missing,
the model-selection procedure is applied to the full data
.
to be passed via ...
from cvSelect()
or cv()
.
procedure()
or model()
should return a list with the following
named elements: fit.i
, the vector of predicted values for the cases in
the current fold computed from the model omitting these cases;
crit.all.i
, the CV criterion computed for all of the cases using
the model omitting the current fold; and (optionally) coefficients
,
parameter estimates from the model computed omitting the current fold.
When the indices
argument is missing, procedure()
returns the cross-validation criterion for all of the cases based on
the model fit to all of the cases.
For examples of model-selection functions for the procedure
argument, see the code for selectStepAIC()
,
selectTrans()
, and selectTransAndStepAIC()
.
For additional information, see the "Cross-validating model selection"
vignette (vignette("cv-select", package="cv")
)
and the "Extending the cv package" vignette
(vignette("cv-extend", package="cv")
).
An object of class "cvSelect"
,
inheriting from class "cv"
, with the CV criterion
("CV crit"
), the bias-adjusted CV criterion ("adj CV crit"
),
the criterion for the model applied to the full data ("full crit"
),
the confidence interval and level for the bias-adjusted CV criterion ("confint"
),
the number of folds ("k"
), the seed for R's random-number
generator ("seed"
), and (optionally) a list of coefficients
(or, in the case of selectTrans()
, estimated transformation
parameters, and in the case of selectTransAndStepAIC()
, both regression coefficients
and transformation parameters) for the selected models
for each fold ("coefficients"
).
If reps
> 1
, then an object of class c("cvSelectList", "cvList")
is returned,
which is literally a list of c("cvSelect", "cv")
objects.
cv(`function`)
: cv()
method for applying a model
model-selection (or specification) procedure.
selectStepAIC()
: select a regression model using the
stepAIC()
function in the MASS package.
selectTrans()
: select transformations of the predictors and response
using powerTransform()
in the car package.
selectTransStepAIC()
: select transformations of the predictors and response,
and then select predictors.
selectModelList()
: select a model using (meta) CV.
compareFolds()
: print the coefficients from the selected models
for the several folds.
coef(cvSelect)
: extract the coefficients from the selected models
for the several folds and possibly average them.
stepAIC
, bcPower
,
powerTransform
, cv
.
if (requireNamespace("ISLR2", quietly=TRUE)){ withAutoprint({ data("Auto", package="ISLR2") m.auto <- lm(mpg ~ . - name - origin, data=Auto) cv(selectStepAIC, Auto, seed=123, working.model=m.auto) cv(selectStepAIC, Auto, seed=123, working.model=m.auto, AIC=FALSE, k=5, reps=3) # via BIC }) } else { cat("\n install the 'ISLR2' package to run these examples\n") } if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Prestige", package="carData") m.pres <- lm(prestige ~ income + education + women, data=Prestige) cvt <- cv(selectTrans, data=Prestige, working.model=m.pres, seed=123, predictors=c("income", "education", "women"), response="prestige", family="yjPower") cvt compareFolds(cvt) coef(cvt, average=median, NAs=1) # NAs not really needed here cv(m.pres, seed=123) }) } else { cat("install the 'carData' package to run these examples\n") } if (requireNamespace("ISLR2", quietly=TRUE)){ withAutoprint({ Auto$year <- as.factor(Auto$year) Auto$origin <- factor(Auto$origin, labels=c("America", "Europe", "Japan")) rownames(Auto) <- make.names(Auto$name, unique=TRUE) Auto$name <- NULL m.auto <- lm(mpg ~ . , data=Auto) cvs <- cv(selectTransStepAIC, data=Auto, seed=76692, working.model=m.auto, criterion=medAbsErr, predictors=c("cylinders", "displacement", "horsepower", "weight", "acceleration"), response="mpg", AIC=FALSE) cvs compareFolds(cvs) }) } data("Duncan", package="carData") m1 <- lm(prestige ~ income + education, data=Duncan) m2 <- lm(prestige ~ income + education + type, data=Duncan) m3 <- lm(prestige ~ (income + education)*type, data=Duncan) summary(cv.sel <- cv(selectModelList, data=Duncan, seed=5963, working.model=models(m1, m2, m3), save.model=TRUE)) # meta CV cvInfo(cv.sel, "selected model")
if (requireNamespace("ISLR2", quietly=TRUE)){ withAutoprint({ data("Auto", package="ISLR2") m.auto <- lm(mpg ~ . - name - origin, data=Auto) cv(selectStepAIC, Auto, seed=123, working.model=m.auto) cv(selectStepAIC, Auto, seed=123, working.model=m.auto, AIC=FALSE, k=5, reps=3) # via BIC }) } else { cat("\n install the 'ISLR2' package to run these examples\n") } if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Prestige", package="carData") m.pres <- lm(prestige ~ income + education + women, data=Prestige) cvt <- cv(selectTrans, data=Prestige, working.model=m.pres, seed=123, predictors=c("income", "education", "women"), response="prestige", family="yjPower") cvt compareFolds(cvt) coef(cvt, average=median, NAs=1) # NAs not really needed here cv(m.pres, seed=123) }) } else { cat("install the 'carData' package to run these examples\n") } if (requireNamespace("ISLR2", quietly=TRUE)){ withAutoprint({ Auto$year <- as.factor(Auto$year) Auto$origin <- factor(Auto$origin, labels=c("America", "Europe", "Japan")) rownames(Auto) <- make.names(Auto$name, unique=TRUE) Auto$name <- NULL m.auto <- lm(mpg ~ . , data=Auto) cvs <- cv(selectTransStepAIC, data=Auto, seed=76692, working.model=m.auto, criterion=medAbsErr, predictors=c("cylinders", "displacement", "horsepower", "weight", "acceleration"), response="mpg", AIC=FALSE) cvs compareFolds(cvs) }) } data("Duncan", package="carData") m1 <- lm(prestige ~ income + education, data=Duncan) m2 <- lm(prestige ~ income + education + type, data=Duncan) m3 <- lm(prestige ~ (income + education)*type, data=Duncan) summary(cv.sel <- cv(selectModelList, data=Duncan, seed=5963, working.model=models(m1, m2, m3), save.model=TRUE)) # meta CV cvInfo(cv.sel, "selected model")
cv()
methods for mixed-effect models of class "merMod"
, fit
by the lmer()
and glmer()
functions
in the lme4 package; for models of class "lme"
fit by the lme()
function in the nlme
package; and for models of class "glmmTMB"
fit by the
glmmTMB()
function in the glmmTMB package.
## S3 method for class 'merMod' cv( model, data = insight::get_data(model), criterion = mse, k = NULL, reps = 1L, seed, details = NULL, ncores = 1L, clusterVariables, ... ) ## S3 method for class 'lme' cv( model, data = insight::get_data(model), criterion = mse, k = NULL, reps = 1L, seed, details = NULL, ncores = 1L, clusterVariables, ... ) ## S3 method for class 'glmmTMB' cv( model, data = insight::get_data(model), criterion = mse, k = NULL, reps = 1L, seed, details = NULL, ncores = 1L, clusterVariables, ... )
## S3 method for class 'merMod' cv( model, data = insight::get_data(model), criterion = mse, k = NULL, reps = 1L, seed, details = NULL, ncores = 1L, clusterVariables, ... ) ## S3 method for class 'lme' cv( model, data = insight::get_data(model), criterion = mse, k = NULL, reps = 1L, seed, details = NULL, ncores = 1L, clusterVariables, ... ) ## S3 method for class 'glmmTMB' cv( model, data = insight::get_data(model), criterion = mse, k = NULL, reps = 1L, seed, details = NULL, ncores = 1L, clusterVariables, ... )
model |
a mixed-effects model object for which a |
data |
data frame to which the model was fit (not usually necessary). |
criterion |
cross-validation ("cost" or lack-of-fit) criterion function of form |
k |
perform k-fold cross-validation; |
reps |
number of times to replicate k-fold CV (default is |
seed |
for R's random number generator; optional, if not supplied a random seed will be selected and saved; not needed for n-fold cross-validation. |
details |
if |
ncores |
number of cores to use for parallel computations
(default is |
clusterVariables |
a character vector of names of the variables defining clusters for a mixed model with nested or crossed random effects; if missing, cross-validation is performed for individual cases rather than for clusters. |
... |
for |
For mixed-effects models, cross-validation can be done by "clusters" or by individual observations. If the former, predictions are based only on fixed effects; if the latter, predictions include the random effects (i.e., are the best linear unbiased predictors or "BLUPS").
The methods cv.merMod()
, cv.lme()
, and cv.glmmTMB()
,
return objects of class "cv"
, or,
if reps > 1
, of class "cvList"
(see cv()
).
cv(merMod)
: cv()
method for lmer()
and
glmer()
models from the lme4 package.
cv(lme)
: cv()
method for lme()
models from the nlme package.
cv(glmmTMB)
: cv()
method for glmmTMB()
models from the glmmTMB package.
library("lme4") # from ?lmer: (fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)) summary(cv(fm1, clusterVariables="Subject")) # LOO CV of clusters summary(cv(fm1, seed=447)) # 10-fold CV of cases summary(cv(fm1, clusterVariables="Subject", k=5, seed=834, reps=3)) # 5-fold CV of clusters, repeated 3 times library("nlme") # from ?lme (fm2 <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1)) summary(cv(fm2)) # LOO CV of cases summary(cv(fm2, clusterVariables="Subject", k=5, seed=321)) # 5-fold CV of clusters library("glmmTMB") # from ?glmmTMB (m1 <- glmmTMB(count ~ mined + (1|site), zi=~mined, family=poisson, data=Salamanders)) summary(cv(m1, seed=97816, k=5, clusterVariables="site")) # 5-fold CV of clusters summary(cv(m1, seed=34506, k=5)) # 5-fold CV of cases
library("lme4") # from ?lmer: (fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)) summary(cv(fm1, clusterVariables="Subject")) # LOO CV of clusters summary(cv(fm1, seed=447)) # 10-fold CV of cases summary(cv(fm1, clusterVariables="Subject", k=5, seed=834, reps=3)) # 5-fold CV of clusters, repeated 3 times library("nlme") # from ?lme (fm2 <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1)) summary(cv(fm2)) # LOO CV of cases summary(cv(fm2, clusterVariables="Subject", k=5, seed=321)) # 5-fold CV of clusters library("glmmTMB") # from ?glmmTMB (m1 <- glmmTMB(count ~ mined + (1|site), zi=~mined, family=poisson, data=Salamanders)) summary(cv(m1, seed=97816, k=5, clusterVariables="site")) # 5-fold CV of clusters summary(cv(m1, seed=34506, k=5)) # 5-fold CV of cases
A cv()
method for an object of class "modlist"
,
created by the models()
function. This cv()
method simplifies
the process of cross-validating several models on the same set of CV folds
and may also be used for meta CV, where CV is used to select one from among
several models. models()
performs some
"sanity" checks, warning if the models are of different classes, and
reporting an error if they are fit to apparently different data sets or
different response variables.
## S3 method for class 'modList' cv( model, data, criterion = mse, k, reps = 1L, seed, quietly = TRUE, meta = FALSE, ... ) models(...) ## S3 method for class 'cvModList' print(x, ...) ## S3 method for class 'cvModList' summary(object, ...) ## S3 method for class 'cvModList' plot( x, y, spread = c("range", "sd"), confint = TRUE, xlab = "", ylab, main, axis.args = list(labels = names(x), las = 3L), col = palette()[2L], lwd = 2L, grid = TRUE, ... ) ## S3 method for class 'cvModList' as.data.frame(x, row.names = NULL, optional = TRUE, ...)
## S3 method for class 'modList' cv( model, data, criterion = mse, k, reps = 1L, seed, quietly = TRUE, meta = FALSE, ... ) models(...) ## S3 method for class 'cvModList' print(x, ...) ## S3 method for class 'cvModList' summary(object, ...) ## S3 method for class 'cvModList' plot( x, y, spread = c("range", "sd"), confint = TRUE, xlab = "", ylab, main, axis.args = list(labels = names(x), las = 3L), col = palette()[2L], lwd = 2L, grid = TRUE, ... ) ## S3 method for class 'cvModList' as.data.frame(x, row.names = NULL, optional = TRUE, ...)
model |
a list of regression model objects,
created by |
data |
(required) the data set to which the models were fit. |
criterion |
the CV criterion ("cost" or lack-of-fit) function, defaults to
|
k |
the number of CV folds; may be omitted, in which case the value
will depend on the default for the |
reps |
number of replications of CV for each model (default is 1). |
seed |
(optional) seed for R's pseudo-random-number generator, to be used to create the same set of CV folds for all of the models; if omitted, a seed will be randomly generated and saved. Not used for leave-one-out CV. |
quietly |
if |
meta |
if |
... |
for For For the For the |
x |
an object of class |
object |
an object to summarize. |
y |
the name of the element in each |
spread |
if |
confint |
if |
xlab |
label for the x-axis (defaults to blank). |
ylab |
label for the y-axis (if missing, a label is constructed). |
main |
main title for the graph (if missing, a label is constructed). |
axis.args |
a list of arguments for the |
col |
color for the line and points, defaults to the second
element of the color palette; see |
lwd |
line width for the line (defaults to 2). |
grid |
if |
row.names |
optional row names for the result,
defaults to |
optional |
to match the |
models()
returns a "modList"
object, the
cv()
method for which returns a "cvModList"
object,
or, when meta=TRUE
, an object of class c("cvSelect", "cv")
.
cv(modList)
: cv()
method for "modList"
objects.
models()
: create a list of models.
print(cvModList)
: print()
method for "cvModList"
objects.
summary(cvModList)
: summary()
method for "cvModList"
objects.
plot(cvModList)
: plot()
method for "cvModList"
objects.
as.data.frame(cvModList)
: as.data.frame()
method for
"cvModList"
objects.
cv
, cv.merMod
,
selectModelList
.
if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Duncan", package="carData") m1 <- lm(prestige ~ income + education, data=Duncan) m2 <- lm(prestige ~ income + education + type, data=Duncan) m3 <- lm(prestige ~ (income + education)*type, data=Duncan) (cv.models <- cv(models(m1=m1, m2=m2, m3=m3), data=Duncan, seed=7949, reps=5)) D.cv.models <- as.data.frame(cv.models) head(D.cv.models) summary(D.cv.models, criterion ~ model + rep, include="folds") plot(cv.models) (cv.models.ci <- cv(models(m1=m1, m2=m2, m3=m3), data=Duncan, seed=5963, confint=TRUE, level=0.50)) # nb: n too small for accurate CIs plot(cv.models.ci) (cv.models.meta <- cv(models(m1=m1, m2=m2, m3=m3), data=Duncan, seed=5963, meta=TRUE, save.model=TRUE)) cvInfo(cv.models.meta, "selected model") }) } else { cat("install the 'carData' package to run these examples\n") }
if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Duncan", package="carData") m1 <- lm(prestige ~ income + education, data=Duncan) m2 <- lm(prestige ~ income + education + type, data=Duncan) m3 <- lm(prestige ~ (income + education)*type, data=Duncan) (cv.models <- cv(models(m1=m1, m2=m2, m3=m3), data=Duncan, seed=7949, reps=5)) D.cv.models <- as.data.frame(cv.models) head(D.cv.models) summary(D.cv.models, criterion ~ model + rep, include="folds") plot(cv.models) (cv.models.ci <- cv(models(m1=m1, m2=m2, m3=m3), data=Duncan, seed=5963, confint=TRUE, level=0.50)) # nb: n too small for accurate CIs plot(cv.models.ci) (cv.models.meta <- cv(models(m1=m1, m2=m2, m3=m3), data=Duncan, seed=5963, meta=TRUE, save.model=TRUE)) cvInfo(cv.models.meta, "selected model") }) } else { cat("install the 'carData' package to run these examples\n") }
These functions are primarily useful for writing methods for the
cv()
generic function. They are used internally in the package
and can also be used for extensions (see the vignette "Extending the cv package,
vignette("cv-extend", package="cv")
).
cvCompute( model, data = insight::get_data(model), criterion = mse, criterion.name, k = 10L, reps = 1L, seed, details = k <= 10L, confint, level = 0.95, method = NULL, ncores = 1L, type = "response", start = FALSE, f, fPara = f, locals = list(), model.function = NULL, model.function.name = NULL, ... ) cvMixed( model, package, data = insight::get_data(model), criterion = mse, criterion.name, k, reps = 1L, confint, level = 0.95, seed, details, ncores = 1L, clusterVariables, predict.clusters.args = list(object = model, newdata = data), predict.cases.args = list(object = model, newdata = data), fixed.effects, ... ) cvSelect( procedure, data, criterion = mse, criterion.name, model, y.expression, k = 10L, confint = n >= 400, level = 0.95, reps = 1L, save.coef, details = k <= 10L, save.model = FALSE, seed, ncores = 1L, ... ) folds(n, k) fold(folds, i, ...) ## S3 method for class 'folds' fold(folds, i, ...) ## S3 method for class 'folds' print(x, ...) GetResponse(model, ...) ## Default S3 method: GetResponse(model, ...) ## S3 method for class 'merMod' GetResponse(model, ...) ## S3 method for class 'lme' GetResponse(model, ...) ## S3 method for class 'glmmTMB' GetResponse(model, ...) ## S3 method for class 'modList' GetResponse(model, ...)
cvCompute( model, data = insight::get_data(model), criterion = mse, criterion.name, k = 10L, reps = 1L, seed, details = k <= 10L, confint, level = 0.95, method = NULL, ncores = 1L, type = "response", start = FALSE, f, fPara = f, locals = list(), model.function = NULL, model.function.name = NULL, ... ) cvMixed( model, package, data = insight::get_data(model), criterion = mse, criterion.name, k, reps = 1L, confint, level = 0.95, seed, details, ncores = 1L, clusterVariables, predict.clusters.args = list(object = model, newdata = data), predict.cases.args = list(object = model, newdata = data), fixed.effects, ... ) cvSelect( procedure, data, criterion = mse, criterion.name, model, y.expression, k = 10L, confint = n >= 400, level = 0.95, reps = 1L, save.coef, details = k <= 10L, save.model = FALSE, seed, ncores = 1L, ... ) folds(n, k) fold(folds, i, ...) ## S3 method for class 'folds' fold(folds, i, ...) ## S3 method for class 'folds' print(x, ...) GetResponse(model, ...) ## Default S3 method: GetResponse(model, ...) ## S3 method for class 'merMod' GetResponse(model, ...) ## S3 method for class 'lme' GetResponse(model, ...) ## S3 method for class 'glmmTMB' GetResponse(model, ...) ## S3 method for class 'modList' GetResponse(model, ...)
model |
a regression model object. |
data |
data frame to which the model was fit (not usually necessary,
except for |
criterion |
cross-validation criterion ("cost" or lack-of-fit) function of form |
criterion.name |
a character string giving the name of the CV criterion function
in the returned |
k |
perform k-fold cross-validation (default is |
reps |
number of times to replicate k-fold CV (default is |
seed |
for R's random number generator; optional, if not supplied a random seed will be selected and saved; not needed for n-fold cross-validation. |
details |
if |
confint |
if |
level |
confidence level (default |
method |
computational method to apply; use by some |
ncores |
number of cores to use for parallel computations
(default is |
type |
used by some |
start |
used by some |
f |
function to be called by |
fPara |
function to be called by |
locals |
a named list of objects that are required in the local environment
of |
model.function |
a regression function, typically for a new |
model.function.name |
the quoted name of the regression function, e.g.,
|
... |
to match generic; passed to |
package |
the name of the package in which mixed-modeling function (or functions) employed resides; used to get the namespace of the package. |
clusterVariables |
a character vector of names of the variables defining clusters for a mixed model with nested or crossed random effects; if missing, cross-validation is performed for individual cases rather than for clusters |
predict.clusters.args |
a list of arguments to be used to predict
the whole data set from a mixed model when performing CV on clusters;
the first two elements should be
|
predict.cases.args |
a list of arguments to be used to predict
the whole data set from a mixed model when performing CV on cases;
the first two elements should be
|
fixed.effects |
a function to be used to compute fixed-effect
coefficients for cluster-based CV when |
procedure |
a model-selection procedure function (see Details). |
y.expression |
normally the response variable is found from the
|
save.coef |
save the coefficients from the selected models? Deprecated
in favor of the |
save.model |
save the model that's selected using the full data set. |
n |
number of cases, for constructed folds. |
folds |
an object of class |
i |
a fold number for an object of class |
x |
a |
The utility functions return various kinds of objects:
cvCompute()
returns an object of class "cv"
, with the CV criterion
("CV crit"
), the bias-adjusted CV criterion ("adj CV crit"
),
the criterion for the model applied to the full data ("full crit"
),
the confidence interval and level for the bias-adjusted CV criterion ("confint"
),
the number of folds ("k"
), and the seed for R's random-number
generator ("seed"
). If details=TRUE
, then the returned object
will also include a "details"
component, which is a list of two
elements: "criterion"
, containing the CV criterion computed for the
cases in each fold; and "coefficients"
, regression coefficients computed
for the model with each fold deleted. Some cv()
methods calling cvCompute()
may return a subset of these components and may add additional information.
If reps
> 1
, then an object of class "cvList"
is returned,
which is literally a list of "cv"
objects.
cvMixed()
also returns an object of class "cv"
or
"cvList"
.
cvSelect
returns an object of class
"cvSelect"
inheriting from "cv"
, or an object of
class "cvSelectList"
inheriting from "cvList"
.
folds()
returns an object of class folds, for which
there are fold()
and print()
methods.
GetResponse()
returns the (numeric) response variable
from the model.
The supplied default
method returns the model$y
component
of the model object, or, if model
is an S4 object, the result
returned by the get_response()
function in
the insight package. If this result is NULL
, the result of
model.response(model.frame(model))
is returned, checking in any case whether
the result is a numeric vector.
There are also "lme"
, "merMod"
and "glmmTMB"
methods that convert factor
responses to numeric 0/1 responses, as would be appropriate
for a generalized linear mixed model with a binary response.
cvCompute()
: used internally by cv()
methods (not for direct use);
exported to support new cv()
methods.
cvMixed()
: used internally by cv()
methods
for mixed-effect models (not for direct use);
exported to support new cv()
methods.
cvSelect()
: used internally by cv()
methods for
cross-validating a model-selection procedure; may also be called
directly for this purpose, but use via cv()
is preferred.
cvSelect()
is exported primarily to support new model-selection procedures.
folds()
: used internally by cv()
methods (not for direct use).
fold()
: to extract a fold from a "folds"
object.
fold(folds)
: fold()
method for "folds"
objects.
print(folds)
: print()
method for "folds"
objects.
GetResponse()
: function to return the response variable
from a regression model.
GetResponse(default)
: default method.
GetResponse(merMod)
: "merMod"
method.
GetResponse(lme)
: "lme"
method.
GetResponse(glmmTMB)
: "glmmTMB"
method.
GetResponse(modList)
: "modList"
method.
fit <- lm(mpg ~ gear, mtcars) GetResponse(fit) set.seed(123) (ffs <- folds(n=22, k=5)) fold(ffs, 2)
fit <- lm(mpg ~ gear, mtcars) GetResponse(fit) set.seed(123) (ffs <- folds(n=22, k=5)) fold(ffs, 2)
Compute cost functions (cross-validation criteria) for fitted regression models.
mse(y, yhat) rmse(y, yhat) medAbsErr(y, yhat) BayesRule(y, yhat) BayesRule2(y, yhat)
mse(y, yhat) rmse(y, yhat) medAbsErr(y, yhat) BayesRule(y, yhat) BayesRule2(y, yhat)
y |
response |
yhat |
fitted value |
Cost functions (cross-validation criteria) are meant to measure lack-of-fit. Several cost functions are provided:
mse()
returns the mean-squared error of prediction for
a numeric response variable y
and predictions yhat
; and
rmse()
returns the root-mean-squared error and is just the
square-root of mse()
.
medAbsErr()
returns the median absolute error of prediction for a numeric
response y
and predictions yhat
.
BayesRule()
and BayesRule2()
report the proportion
of incorrect predictions for a dichotomous response variable y
, assumed
coded (or coercible to) 0
and 1
. The yhat
values are
predicted probabilities and are rounded to 0 or 1. The distinction
between BayesRule()
and BayesRule2()
is that the former
checks that the y
values are all either 0
or 1
and that the yhat
values are all between 0 and 1, while
the latter doesn't and is therefore faster.
In general, cost functions should return a single numeric
value measuring lack-of-fit. mse()
returns the mean-squared error;
rmse()
returns the root-mean-squared error;
medAbsErr()
returns the median absolute error;
and BayesRule()
and
BayesRule2()
return the proportion of misclassified cases.
mse()
: Mean-square error.
rmse()
: Root-mean-square error.
medAbsErr()
: Median absolute error.
BayesRule()
: Bayes Rule for a binary response.
BayesRule2()
: Bayes rule for a binary response (without bounds checking).
if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Duncan", package="carData") m.lm <- lm(prestige ~ income + education, data=Duncan) mse(Duncan$prestige, fitted(m.lm)) data("Mroz", package="carData") m.glm <- glm(lfp ~ ., data=Mroz, family=binomial) BayesRule(Mroz$lfp == "yes", fitted(m.glm)) }) } else { cat("\n install 'carData' package to run these examples\n") }
if (requireNamespace("carData", quietly=TRUE)){ withAutoprint({ data("Duncan", package="carData") m.lm <- lm(prestige ~ income + education, data=Duncan) mse(Duncan$prestige, fitted(m.lm)) data("Mroz", package="carData") m.glm <- glm(lfp ~ ., data=Mroz, family=binomial) BayesRule(Mroz$lfp == "yes", fitted(m.glm)) }) } else { cat("\n install 'carData' package to run these examples\n") }
This data set appears in Table 3.1 of Diggle, Liang, and Zeger (1994).
data("Pigs", package = "cv")
data("Pigs", package = "cv")
A data frame with 432 rows and 3 columns.
Pig id number, 1–48.
Week number, 1–9.
Weight in kg.
P. J. Diggle, K.-Y. Liang, and S. L. Zeger, Analysis of Longitudinal Data (Oxford, 1994).
library("lme4") m.p <- lmer(weight ~ week + (1 | id) + (1 | week), data=Pigs, REML=FALSE, control=lmerControl(optimizer="bobyqa")) summary(m.p) cv(m.p, clusterVariables=c("id", "week"), k=10, seed=8469)
library("lme4") m.p <- lmer(weight ~ week + (1 | id) + (1 | week), data=Pigs, REML=FALSE, control=lmerControl(optimizer="bobyqa")) summary(m.p) cv(m.p, clusterVariables=c("id", "week"), k=10, seed=8469)