Commit b955da99 authored by Dirk Eddelbuettel

Import Upstream version 1.7-16

parent 7aa70fc3
Package: mgcv
Version: 1.7-13
Version: 1.7-16
Author: Simon Wood <>
Maintainer: Simon Wood <>
Title: GAMs with GCV/AIC/REML smoothness estimation and GAMMs by PQL
Title: Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness
Description: Routines for GAMs and other generalized ridge regression
with multiple smoothing parameter selection by GCV, REML or
UBRE/AIC. Also GAMMs by REML or PQL. Includes a gam() function.
UBRE/AIC. Also GAMMs. Includes a gam() function.
Priority: recommended
Depends: R (>= 2.14.0), stats, graphics
Imports: nlme, methods, Matrix
......@@ -13,6 +14,6 @@ Suggests: nlme (>= 3.1-64), splines, Matrix, parallel
LazyLoad: yes
ByteCompile: yes
License: GPL (>= 2)
Packaged: 2012-01-21 16:33:06 UTC; sw283
Packaged: 2012-04-30 07:09:06 UTC; sw283
Repository: CRAN
Date/Publication: 2012-01-22 09:59:42
Date/Publication: 2012-04-30 08:22:16
5d5d72ee6d284c96b4525e9eb748bc0f *DESCRIPTION
29e13076f8e7c500f10e2b64b0821984 *NAMESPACE
ecfb144fb5214dde68dffac22f219a1f *R/bam.r
9f562aa60504f1265daa8ff8095b6333 *DESCRIPTION
50152051f123a389d421aa3130dce252 *NAMESPACE
cf4210b25d2ece355a79cb8ed5e4455a *R/bam.r
f4f5c9bb8776c2248e088c9ee3517208 *R/fast-REML.r
b160632e8f38fa99470e2f8cba82a495 *R/gam.fit3.r
902657a0ee2dedc3fdfa501bf3b37c5b *R/gam.sim.r
e137c06cabb48551c18cf0cc3512d297 *R/gamm.r
c61836edb704dbd7b718c754d714a291 *R/mgcv.r
bf70158e37e33ea1136efdeab97569f8 *R/plots.r
d20083082ba7e1bf361ac7d404efd8a3 *R/smooth.r
fe9745f610246ee1f31eb915ca0d76a9 *R/sparse.r
76637934ae66a4b74a0637e698f71469 *changeLog
ad57e83090b4633ee50041fd3571c016 *R/gamm.r
a7d790a4fe2640fd69646e1dcf161d80 *R/mgcv.r
5c5f68e76697c356b95e084bee5d7776 *R/plots.r
bb2b4220a103364afc87249157a040b7 *R/smooth.r
fb66d6c18398411a99ffcb788b854f13 *R/sparse.r
020a4b9253d806cb55b0521412715dc7 *changeLog
e468195a83fab90da8e760c2c3884bd3 *data/columb.polys.rda
40874e3ced720a596750f499ded8a60a *data/columb.rda
88d77139cc983317b6acd8c5f1252ab9 *gnugpl2.txt
......@@ -18,10 +19,10 @@ f693920e12f8a6f2b6cab93648628150 *index
c51c9b8c9c73f81895176ded39b91394 *man/
612ab6354541ebe38a242634d73b66ba *man/Tweedie.Rd
6d711de718a09e1e1ae2a6967abade33 *man/anova.gam.Rd
c4d1ad309698994e7c1c75f7db294a58 *man/bam.Rd
fa6e8f98dc01de508e95d1008ae84d60 *man/bam.Rd
b385d6d5419d0d6aefe03af1a79d5c4e *man/bam.update.Rd
4e925cb579f4693d1b8ec2d5092c0b37 *man/cSplineDes.Rd
9753a8051d9b495d9855537f3f26f491 *man/choose.k.Rd
9b4d616d1b6c4a46ca77d16cded3f806 *man/choose.k.Rd
c03748964ef606621418e428ae49b103 *man/columb.Rd
4196ba59f1fa8449c9cd0cab8a347978 *man/concurvity.Rd
f764fb7cb9e63ff341a0075a3854ab5d *man/exclude.too.far.Rd
......@@ -29,20 +30,20 @@ f764fb7cb9e63ff341a0075a3854ab5d *man/exclude.too.far.Rd
44ad0563add1c560027d502ce41483f5 *man/
75373268c1203ee110e1eede633752aa *man/fixDependence.Rd
9ac808f5a2a43cf97f24798c0922c9bf *man/formXtViX.Rd
34308f4ada8e2aca9981a17794dac30b *man/formula.gam.Rd
bb099e6320a6c1bd79fe4bf59e0fde08 *man/formula.gam.Rd
6f405acde2d7b6f464cf45f5395113ba *man/full.score.Rd
c7f0549fe7b9da0624417e33ed92344d *man/gam.Rd
aeb7ec80d75244bc4f2f2fd796f86efd *man/gam.check.Rd
847599e287ecf79fbb7be2cb06d72742 *man/gam.control.Rd
39b4fa3782cc33a445d34a9b26df44f8 *man/gam.Rd
69c6ef61a3cfc397cbeacd21e5e6cc9b *man/gam.check.Rd
96c9417e4ac5d79ec9ed3f363adfc4e9 *man/gam.control.Rd
fd98327327ba74bb1a61a6519f12e936 *man/gam.convergence.Rd
58ab3b3d6f4fd0d008d73c3c4e6d3305 *man/
32b5cd1b6f63027150817077f3914cf4 *man/gam.fit3.Rd
21339a5d1eb8c83679dd9022ab682b5e *man/gam.fit3.Rd
dd35a8a851460c2d2106c03d544c8241 *man/gam.models.Rd
468d116a2ef9e60f683af48f4f100ef5 *man/gam.outer.Rd
7e5ba69a44bc937ddca04e4f153c7975 *man/gam.selection.Rd
e969287d1a5c281faa7eb6cfce31a7c5 *man/gam.outer.Rd
96676186808802344a99f9d3170bf775 *man/gam.selection.Rd
76651917bd61fc6bc447bbb40b887236 *man/gam.side.Rd
78588cf8ed0af8eca70bba3bbed64dbe *man/gam.vcomp.Rd
278e0b3aa7baa44dfb96e235ceb07f4c *man/gam2objective.Rd
a66a814cc4c6f806e824751fda519ae0 *man/gam2objective.Rd
4d5b3b1266edc31ce3b0e6be11ee9166 *man/gamObject.Rd
0ac5fb78c9db628ce554a8f68588058c *man/gamSim.Rd
6078c49c55f4e7ce20704e4fbe3bba8a *man/gamm.Rd
......@@ -55,30 +56,29 @@ aba56a0341ba9526a302e39d33aa9042 *man/interpret.gam.Rd
58e73ac26b93dc9d28bb27c8699e12cf *man/linear.functional.terms.Rd
5de18c3ad064a5bda4f9027d9455170a *man/logLik.gam.Rd
611f5f6acac9c5f40869c01cf7f75dd3 *man/ls.size.Rd
8ef61987727e1b857edf3a366d21b66c *man/magic.Rd
c4d7e46cead583732e391d680fecc572 *man/magic.Rd
496388445d8cde9b8e0c3917cbe7461d *man/
5c55658a478bd34d66daad46e324d7f4 *man/mgcv-FAQ.Rd
904b19ba280010d85d59a4b21b6d2f94 *man/mgcv-package.Rd
196ad09f09d6a5a44078e2282eb0a56f *man/mgcv.Rd
bb420a39f1f8155f0084eb9260fad89c *man/mgcv.control.Rd
d564d1c5b2f780844ff10125348f2e2c *man/mgcv-FAQ.Rd
41df245a5821b3964db4c74b1930c0fe *man/mgcv-package.Rd
18a9858b6f3ffde288b0bf9e1a5da2f6 *man/model.matrix.gam.Rd
3edd2618dcb4b366eeb405d77f3f633c *man/mono.con.Rd
bc9b89db7e7ff246749551c16f5f1f07 *man/mono.con.Rd
3a4090ac778273861d97077681a55df2 *man/mroot.Rd
8aea04d0764d195409da798b33516051 *man/negbin.Rd
41de8762baab4fc0cf1224df168520fe *man/
dffa2d51c704c610088fa02d7220b05e *man/notExp.Rd
150d7f8a427117353c5c2e466ff0bfae *man/notExp2.Rd
95b3e6686e9557b3278e21e350704ce9 *man/
3720c8867aa31d7705dae102eeaa2364 *man/pcls.Rd
19939543d691f128e84d86fb5423541e *man/pcls.Rd
717d796acbaab64216564daf898b6d04 *man/pdIdnot.Rd
8c0f8575b427f30316b639a326193aeb *man/pdTens.Rd
b388d29148264fd3cd636391fde87a83 *man/pen.edf.Rd
de454d1dc268bda008ff46639a89acec *man/place.knots.Rd
ced71ada93376fcdffa28ad08009cf49 *man/plot.gam.Rd
84d54e8081b82cb8d96a33de03741843 *man/plot.gam.Rd
3d1484b6c3c2ea93efe41f6fc3801b8d *man/polys.plot.Rd
fdd6b7e03fde145e274699fe9ea8996c *man/predict.gam.Rd
afca36f5b1a5d06a7fcab2eaaa029e7e *man/predict.bam.Rd
df63d7045f83a1dc4874fcac18a2303c *man/predict.gam.Rd
a594eb641cae6ba0b83d094acf4a4f81 *man/print.gam.Rd
d837c87f037760c81906a51635476298 *man/qq.gam.Rd
5311a1e83ae93aef5f9ae38f7492536a *man/qq.gam.Rd
f77ca1471881d2f93c74864d076c0a0e *man/rTweedie.Rd
827743e1465089a859a877942ba2f4a9 *man/random.effects.Rd
37669f97e17507f3ae2d6d1d74feb9d7 *man/residuals.gam.Rd
......@@ -97,12 +97,12 @@ d202c6718fb1138fdd99e6102250aedf *man/
8672633a1fad8df3cb1f53d7fa883620 *man/smooth.construct.tensor.smooth.spec.Rd
4b9bd43c3acbab6ab0159d59967e19db *man/
1de9c315702476fd405a85663bb32d1c *man/smooth.terms.Rd
6aa3bcbd3198d2bbc3b9ca12c9c9cd7e *man/smoothCon.Rd
0d12daea17e0b7aef8ab89b5f801adf1 *man/smoothCon.Rd
5ae47a140393009e3dba7557af175170 *man/sp.vcov.Rd
83bd8e097711bf5bd0fff09822743d43 *man/spasm.construct.Rd
a17981f0fa2a6a50e637c98c672bfc45 *man/step.gam.Rd
700699103b50f40d17d3824e35522c85 *man/step.gam.Rd
dd54c87fb87c284d3894410f50550047 *man/summary.gam.Rd
22b571cbc0bd1e31f195ad927434c27e *man/t2.Rd
7f383eaaca246c8bf2d5b74d841f7f8a *man/t2.Rd
04076444b2c99e9287c080298f9dc1d7 *man/te.Rd
c3c23641875a293593fe4ef032b44aae *man/
fbd45cbb1931bdb5c0de044e22fdd028 *man/uniquecombs.Rd
......@@ -117,20 +117,18 @@ becbe3e1f1588f7292a74a97ef07a9ae *po/R-de.po
1a4a267ddcb87bb83f09c291d3e97523 *po/fr.po
813514ea4e046ecb4563eb3ae8aa202a *po/mgcv.pot
cd54024d76a9b53dc17ef26323fc053f *src/Makevars
a25e39145f032e8e37433651bba92ddf *src/gcv.c
2798411be2cb3748b8bd739f2d2016ee *src/gcv.h
d40012dcda1a10ee535a9b3de9b46c19 *src/gdi.c
94a2bcbb75cc60e8460e72ed154678c9 *src/gdi.c
49af97195accb65adc75620183d39a4c *src/general.h
da280ee5538a828afde0a4f6c7b8328a *src/init.c
8b37eb0db498a3867dc83364dc65f146 *src/magic.c
6f301e977834b4743728346184ea11ba *src/init.c
7f9fcb495707a003817e78f4802ceeba *src/magic.c
066af9db587e5fe6e5cc4ff8c09ae9c2 *src/mat.c
d21847ac9a1f91ee9446c70bd93a490a *src/matrix.c
54ce9309b17024ca524e279612a869d6 *src/matrix.h
08c94a2af4cd047ecd79871ecbafe33a *src/mgcv.c
99204b3b20c2e475d9e14022e0144804 *src/mgcv.h
2a1c4f1c10510a4338e5cc34defa65f6 *src/misc.c
de0ae24ea5cb533640a3ab57e0383595 *src/matrix.c
0f8448f67d16668f9027084a2d9a1b52 *src/matrix.h
6a9f57b44d2aab43aa32b01ccb26bd6e *src/mgcv.c
c62652f45ad1cd3624a849005858723a *src/mgcv.h
fcbe85d667f8c7818d17509a0c3c5935 *src/misc.c
7e0ba698a21a01150fda519661ef9857 *src/qp.c
cd563899be5b09897d1bf36a7889caa0 *src/qp.h
e9cab4a461eb8e086a0e4834cbf16f30 *src/sparse-smooth.c
3a251ecac78b25c315de459cd2ba0b04 *src/tprs.c
d0531330f4c1209a1cdd7a75b1854724 *src/tprs.h
985ef1e19c7b5d97b8e29ed78e709fc5 *src/tprs.c
5352d5d2298acd9b03ee1895933d4fb4 *src/tprs.h
......@@ -12,7 +12,7 @@ export(anova.gam, bam, bam.update, concurvity, cSplineDes,
magic,, mgcv, mgcv.control, model.matrix.gam,
magic,, model.matrix.gam,
mono.con, mroot, negbin,,
......@@ -733,7 +733,7 @@ smooth2random.tensor.smooth <- function(object,vnames,type=1) {
## first sort out the re-parameterization...
sum.S <- object$S[[1]]/mean(abs(object$S[[1]]))
null.rank <- ncol(object$margin[[1]]$X)-object$margin[[1]]$rank ## null space rank
bs.dim <- object$margin[[1]]$bs.dim
bs.dim <- ncol(object$margin[[1]]$X)
if (length(object$S)>1) for (l in 2:length(object$S)) {
sum.S <- sum.S + object$S[[l]]/mean(abs(object$S[[l]]))
dfl <- ncol(object$margin[[l]]$X) ## actual df of term (`df' may not be set by constructor)
......@@ -163,8 +163,88 @@ qq.gam <- function(object, rep=0, level=.9,s.rep=10,
k.check <- function(b,subsample=5000,n.rep=400) {
## function to check k in a gam fit...
## does a randomization test looking for evidence of residual
## pattern attributable to covariates of each smooth.
m <- length(b$smooth)
if (m==0) return(NULL)
rsd <- residuals(b)
ve <- rep(0,n.rep)
p.val<-v.obs <- kc <- edf<- rep(0,m)
snames <- rep("",m)
n <- nrow(b$model)
if (n>subsample) { ## subsample to avoid excessive cost
ind <- sample(1:n,subsample)
modf <- b$model[ind,]
rsd <- rsd[ind]
} else modf <- b$model
nr <- length(rsd)
for (k in 1:m) { ## work through smooths
dat <-$smooth[[k]],modf,NULL)$data)
snames[k] <- b$smooth[[k]]$label
ind <- b$smooth[[k]]$first.para:b$smooth[[k]]$last.para
kc[k] <- length(ind)
edf[k] <- sum(b$edf[ind])
nc <- b$smooth[[k]]$dim
ok <- TRUE
for (j in 1:nc) if (is.factor(dat[[j]])) ok <- FALSE
if (!is.null(attr(dat[[1]],"matrix"))) ok <- FALSE
if (!ok) {
p.val[k] <- v.obs[k] <- NA ## can't do this test with summation convention/factors
} else { ## normal term
if (nc==1) { ## 1-D term
e <- diff(rsd[order(dat[,1])])
v.obs[k] <- mean(e^2)/2
for (i in 1:n.rep) {
e <- diff(rsd[sample(1:nr,nr)]) ## shuffle
ve[i] <- mean(e^2)/2
}
p.val[k] <- mean(ve<v.obs[k])
v.obs[k] <- v.obs[k]/mean(rsd^2)
} else { ## multi-D
if (!is.null(b$smooth[[k]]$margin)) { ## tensor product (have to consider scaling)
## get the scale factors...
beta <- coef(b)[ind]
f0 <- PredictMat(b$smooth[[k]],dat)%*%beta
gr.f <- rep(0,ncol(dat))
for (i in 1:nc) {
datp <- dat;dx <- diff(range(dat[,i]))/1000
datp[,i] <- datp[,i] + dx
fp <- PredictMat(b$smooth[[k]],datp)%*%beta
gr.f[i] <- mean(abs(fp-f0))/dx
}
for (i in 1:nc) { ## rescale distances
dat[,i] <- dat[,i] - min(dat[,i])
dat[,i] <- gr.f[i]*dat[,i]/max(dat[,i])
}
}
nn <- 3
ni <- mgcv:::nearest(nn,as.matrix(dat))$ni
e <- rsd - rsd[ni[,1]]
for (j in 2:nn) e <- c(e,rsd-rsd[ni[,j]])
v.obs[k] <- mean(e^2)/2
for (i in 1:n.rep) {
rsdr <- rsd[sample(1:nr,nr)] ## shuffle
e <- rsdr - rsdr[ni[,1]]
for (j in 2:nn) e <- c(e,rsdr-rsdr[ni[,j]])
ve[i] <- mean(e^2)/2
}
p.val[k] <- mean(ve<v.obs[k])
v.obs[k] <- v.obs[k]/mean(rsd^2)
}
}
}
k.table <- cbind(kc,edf,v.obs, p.val)
dimnames(k.table) <- list(snames, c("k\'","edf","k-index", "p-value"))
k.table
} ## end of k.check
gam.check <- function(b,,
## arguments passed to qq.gam() {w/o warnings !}:
rep=0, level=.9, rl.col=2, rep.col="gray80", ...)
# takes a fitted gam object and produces some standard diagnostic plots
......@@ -183,7 +263,7 @@ gam.check <- function(b,,
hist(resid,xlab="Residuals",main="Histogram of residuals",...)
plot(fitted(b), napredict(b$na.action, b$y),
xlab="Fitted Values",ylab="Response",main="Response vs. Fitted Values",...)
if (!(b$method%in%c("GCV","GACV","UBRE","REML","ML","P-ML","P-REML"))) { ## gamm `gam' object
if (!(b$method%in%c("GCV","GACV","UBRE","REML","ML","P-ML","P-REML","fREML"))) { ## gamm `gam' object
......@@ -219,6 +299,13 @@ gam.check <- function(b,,
## now check k
kchck <- k.check(b,subsample=k.sample,n.rep=k.rep)
if (!is.null(kchck)) {
cat("Basis dimension (k) checking results. Low p-value (k-index<1) may\n")
cat("indicate that k is too low, especially if edf is close to k\'.\n\n")
## } else plot(linpred,resid,xlab="linear predictor",ylab="residuals",...)
} ## end of gam.check
......@@ -119,7 +119,7 @@ kd.vis <- function(X,cex=.5) {
nearest <- function(k,X,get.a=FALSE,balanced=FALSE, {
nearest <- function(k,X, = FALSE,get.a=FALSE,balanced=FALSE, {
## The rows of X contain coordinates of points.
## For each point, this routine finds its k nearest
## neighbours, returning a list of 2, n by k matrices:
......@@ -133,8 +133,14 @@ nearest <- function(k,X,get.a=FALSE,balanced=FALSE, {
## for neighbours chosen to be on either side of the box in each
## direction in this case k>2*ncol(X). These neighbours are only used
## if closer than*max(k nearest distances).
## indicates that neighbours must have distances greater
## than zero...
if (balanced) <- TRUE
if ( {
Xu <- uniquecombs(X);ind <- attr(Xu,"index") ## Xu[ind,] == X
} else { Xu <- X; ind <- 1:nrow(X)}
if (k>nrow(Xu)) stop("not enough unique values to find k nearest")
nobs <- length(ind)
n <- nrow(Xu)
d <- ncol(Xu)
......@@ -154,7 +160,8 @@ nearest <- function(k,X,get.a=FALSE,balanced=FALSE, {
rind <- 1:nobs
rind[ind] <- 1:nobs
ni <- matrix(rind[oo$ni+1],n,k)[ind,]
if (get.a) a=oo$a[ind] else a <- NULL
** denotes quite substantial/important changes
*** denotes really big changes
* There was an uninitialized variable bug in the 1.7-14 re-written "cr" basis
code for the case k=3. Fixed.
* gam.check modified slightly so that k test only applied to smooths of
numeric variables, not factors.
* Several packages had documentation linking to the 'mgcv' function
help page (now removed), when a link to the package was meant. An alias
has been added to mgcv-package.Rd to fix/correct these links.
** predict.bam now added as a wrapper for predict.gam, allowing parallel
** bam now has method="fREML" option which uses faster REML optimizer:
can make a big difference on parameter rich models.
* bam can now use a cross product and Choleski based method to accumulate
the required model matrix factorization. Faster, but less stable than
the QR based default.
* bam can now obtain starting values using a random sub sample of the data.
Useful for seriously large datasets.
* check of adequacy of basis dimensions added to gam.check
* magic can now deal with model matrices with more columns than rows.
* p-value reference distribution approximations improved.
* bam returns objects of class "bam" inheriting from "gam"
* bam now uses newdata.guaranteed=TRUE option when predicting as part
of model matrix decomposition accumulation. Speeds things up.
* More efficient `sweep and drop' centering constraints added as default for
bam. Constraint null space unchanged, but computation is faster.
* Underlying "cr" basis code re-written for greater efficiency.
* routine mgcv removed, it now being many years since there has been any
reason to use it. C source code heavily pruned as a result.
* coefficient name generation moved from estimate.gam to gam.setup.
* smooth2random.tensor.smooth had a bug that could produce a nonsensical
penalty null space rank and an error, in some cases (e.g. "cc" basis)
causing te terms to fail in gamm. Fixed.
* minor change to te constructor. Any unpenalized margin now has
corresponding penalty rank dropped along with penalty.
* Code for handling sp's fixed at exactly zero was badly thought out, and
could easily fail. Fixed.
* TPRS prediction code made more efficient, partly by use of BLAS. Large
dataset setup also made more efficient using BLAS.
* smooth.construct.tensor.smooth.spec now handles marginals with factor
arguments properly (there was a knot generation bug in this case)
* bam now uses LAPACK version of qr, for model matrix QR, since it's
faster and uses BLAS.
** The Lanczos routine in mat.c was using a stupidly inefficient check for
......@@ -15,9 +15,10 @@ for large datasets. \code{bam} can also compute on a cluster set up by the \link
na.action=na.omit, offset=NULL,method="REML",control=list(),
na.action=na.omit, offset=NULL,method="fREML",control=list(),
%- maybe also `usage' for other objects documented here.
......@@ -60,7 +61,8 @@ included in \code{formula}: this conforms to the behaviour of
\item{method}{The smoothing parameter estimation method. \code{"GCV.Cp"} to use GCV for unknown scale parameter and
Mallows' Cp/UBRE/AIC for known scale. \code{"GACV.Cp"} is equivalent, but using GACV in place of GCV. \code{"REML"}
for REML estimation, including of unknown scale, \code{"P-REML"} for REML estimation, but using a Pearson estimate
of the scale. \code{"ML"} and \code{"P-ML"} are similar, but using maximum likelihood in place of REML. }
of the scale. \code{"ML"} and \code{"P-ML"} are similar, but using maximum likelihood in place of REML. Default
\code{"fREML"} uses fast REML computation.}
\item{control}{A list of fit control parameters to replace defaults returned by
\code{\link{gam.control}}. Any control parameters not supplied stay at their default values.}
......@@ -116,6 +118,14 @@ single machine). See details and example code.
\item{gc.level}{to keep the memory footprint down, it helps to call the garbage collector often, but this takes
a substantial amount of time. Setting this to zero means that garbage collection only happens when R decides it should. Setting to 2 gives frequent garbage collection. 1 is in between.}
\item{use.chol}{By default \code{bam} uses a very stable QR update approach to obtaining the QR decomposition
of the model matrix. For well conditioned models an alternative accumulates the crossproduct of the model matrix
and then finds its Choleski decomposition, at the end. This is somewhat more efficient, computationally.}
\item{samfrac}{For generalized additive models with very large sample sizes, the number of iterations needed for the model fit can
be reduced by first fitting a model to a random sample of the data, and using the results to supply starting values. This initial fit is run with sloppy convergence tolerances, so is typically very low cost. \code{samfrac} is the sampling fraction to use. 0.1 is often reasonable. }
\item{...}{further arguments for
passing on e.g. to \code{} (such as \code{mustart}). }
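The trade-off described for \code{use.chol} can be sketched in base R. This is an illustration of the idea only, not \code{bam}'s implementation: the simulated matrix \code{X}, the block split and the variable names are all assumptions made for the example. Route (a) updates a triangular factor block by block using QR; route (b) accumulates the crossproduct and takes one Choleski decomposition at the end. Because forming the crossproduct squares the condition number of the problem, route (b) is cheaper but less stable.

```r
## Sketch: two routes to the triangular factor R of a model matrix X,
## processing X in row blocks as one might for a large dataset.
set.seed(2)
n <- 1000; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  ## simulated, well conditioned
blocks <- split(seq_len(n), rep(1:4, each = n / 4))  ## four row blocks

## (a) QR updating: stack the current R on the next block and re-factor
R.qr <- NULL
for (ind in blocks) {
  R.qr <- qr.R(qr(rbind(R.qr, X[ind, , drop = FALSE])))
}

## (b) accumulate X'X block by block, one Choleski decomposition at the end
XtX <- matrix(0, p, p)
for (ind in blocks) XtX <- XtX + crossprod(X[ind, , drop = FALSE])
R.chol <- chol(XtX)

## the two factors agree up to the signs of their rows
max.diff <- max(abs(abs(R.qr) - abs(R.chol)))
print(max.diff)
```

For a well conditioned \code{X} such as this one the two factors coincide to rounding error; the difference only matters as the columns of \code{X} approach collinearity.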
......@@ -184,18 +194,18 @@ The negbin family is only supported for the *known theta* case.
\code{\link{linear.functional.terms}}, \code{\link{s}},
\code{\link{te}} \code{\link{predict.gam}},
\code{\link{plot.gam}}, \code{\link{summary.gam}}, \code{\link{gam.side}},
\code{\link{gam.selection}},\code{\link{mgcv}}, \code{\link{gam.control}}
\code{\link{gam.selection}}, \code{\link{gam.control}}
\code{\link{gam.check}}, \code{\link{linear.functional.terms}} \code{\link{negbin}}, \code{\link{magic}},\code{\link{vis.gam}}
## following is not *very* large, for obvious reasons...
## Some moderately large examples...
dat <- gamSim(1,n=15000,dist="normal",scale=20)
bs <- "ps";k <- 20
dat <- gamSim(1,n=100000,dist="normal",scale=20)
bs <- "cr";k <- 20
b <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
plot(b,pages=1,rug=FALSE) ## plot smooths, but not rug
plot(b,pages=1,rug=FALSE,seWithMean=TRUE) ## `with intercept' CIs
......@@ -206,7 +216,7 @@ summary(ba)
## A Poisson example...
dat <- gamSim(1,n=15000,dist="poisson",scale=.1)
dat <- gamSim(1,n=35000,dist="poisson",scale=.1)
system.time(b1 <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
......@@ -227,13 +237,16 @@ system.time(b2 <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
system.time(b2 <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
fv <- predict(b2,cluster=cl) ## parallel prediction
if (!is.null(cl)) stopCluster(cl)
## Sparse smoothers example...
b3 <- bam(y ~ te(x0,x1,bs="ps",k=10,np=FALSE)+s(x2,bs="ps",k=30)+
## Sparse smoother example...
dat <- gamSim(1,n=10000,dist="poisson",scale=.1)
system.time( b3 <- bam(y ~ te(x0,x1,bs="ps",k=10,np=FALSE)+
......@@ -42,6 +42,10 @@ doing this, then \code{k} was large enough. (Change in the smoothness selection
and/or the effective degrees of freedom, when \code{k} is increased, provide the obvious
numerical measures for whether the fit has changed substantially.)
\code{\link{gam.check}} runs a simple simulation based check on the basis dimensions, which can
help to flag up terms for which \code{k} is too low. Grossly too small \code{k}
will also be visible from partial residuals available with \code{\link{plot.gam}}.
One scenario that can cause confusion is this: a model is fitted with
\code{k=10} for a smooth term, and the EDF for the term is estimated as 7.6,
some way below the maximum of 9. The model is then refitted with \code{k=20}
......@@ -68,14 +72,17 @@ Wood, S.N. (2006) Generalized Additive Models: An Introduction with R. CRC.
## Simulate some data ....
dat <- gamSim(1,n=400,scale=2)
## fit a GAM with quite low `k'
plot(b,pages=1,residuals=TRUE) ## hint of a problem in s(x2)
## Economical tactic (see below for more obvious approach)....
## the following suggests a problem with s(x2)
## Another approach (see below for more obvious method)....
## check for residual pattern, removable by increasing `k'
## typically `k', below, should be substantially larger than
## the original `k', but certainly less than n/2.
......@@ -87,6 +94,10 @@ gam(rsd~s(x1,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x2,k=40,bs="cs"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
## refit...
b <- gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=20)+s(x3,k=6),data=dat)
gam.check(b) ## better
## similar example with multi-dimensional smooth
b1 <- gam(y~s(x0)+s(x1,x2,k=15)+s(x3),data=dat)
rsd <- residuals(b1)
......@@ -94,6 +105,8 @@ gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x1,x2,k=100,bs="ts"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam.check(b1) ## shows same problem
## and a `te' example
b2 <- gam(y~s(x0)+te(x1,x2,k=4)+s(x3),data=dat)
rsd <- residuals(b2)
......@@ -101,10 +114,15 @@ gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~te(x1,x2,k=10,bs="cs"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam.check(b2) ## shows same problem
## same approach works with other families in the original model
dat <- gamSim(1,n=400,scale=.25,dist="poisson")
rsd <- residuals(bp)
gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x1,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
......@@ -26,7 +26,13 @@ Smooth terms are specified by expressions of the form: \cr
where \code{x1}, \code{x2}, etc. are the covariates which the smooth
is a function of, and \code{k} is the dimension of the basis used to
represent the smooth term. If \code{k} is not
specified then basis specific defaults are used.
specified then basis specific defaults are used. Note that these defaults are
essentially arbitrary, and it is important to check that they are not so
small that they cause oversmoothing (too large just slows down computation).
Sometimes the modelling context suggests sensible values for \code{k}, but if not,
informal checking is easy: see \code{\link{choose.k}} and \code{\link{gam.check}}.
\code{fx} is used to indicate whether or not this term should be unpenalized,
and therefore have a fixed number of degrees of freedom set by \code{k}
(almost always \code{k-1}). \code{bs} indicates the basis to use for the smooth:
......@@ -25,14 +25,13 @@ differences are (i) that by default estimation of the
degree of smoothness of model terms is part of model fitting, (ii) a
Bayesian approach to variance estimation is employed that makes for easier
confidence interval calculation (with good coverage probabilities), (iii) that the model
can depend on any (bounded) linear functional of smooth terms,
and, (iv) the parametric part of the model can be penalized, (v) simple random effects can be incorporated, and
can depend on any (bounded) linear functional of smooth terms, (iv) the parametric part of the model can be penalized, (v) simple random effects can be incorporated, and
(vi) the facilities for incorporating smooths of more than one variable are
different: specifically there are no \code{lo} smooths, but instead (a) \code{s}
terms can have more than one argument, implying an isotropic smooth and (b) \code{te} smooths are
provided as an effective means for modelling smooth interactions of any
number of variables via scale invariant tensor product smooths. If you want
a clone of what S-PLUS provides use \link[gam]{gam} from package \code{gam}.
number of variables via scale invariant tensor product smooths. See \link[gam]{gam}
from package \code{gam}, for GAMs via the original Hastie and Tibshirani approach.
For very large datasets see \code{\link{bam}}, for mixed GAM see \code{\link{gamm}} and \code{\link{random.effects}}.
......@@ -279,7 +278,11 @@ generalized additive mixed models. Biometrics 62(4):1025-1036
Wood S.N. (2006b) Generalized Additive Models: An Introduction with R. Chapman
and Hall/CRC Press.
Wood, S.N. (2006c) On confidence intervals for generalized additive models based on penalized regression splines. Australian and New Zealand Journal of Statistics. 48(4): 445-464.
Wood S.N., F. Scheipl and J.J. Faraway (2012) Straightforward intermediate rank tensor product smoothing
in mixed models. Statistics and Computing.
Marra, G and S.N. Wood (2012) Coverage Properties of Confidence Intervals for Generalized Additive
Model Components. Scandinavian Journal of Statistics, 39(1), 53-74.
Key Reference on GAMs and related models:
......@@ -327,6 +330,10 @@ Wahba (e.g. 1990) and Gu (e.g. 2002).
\section{WARNINGS }{
The default basis dimensions used for smooth terms are essentially arbitrary, and
it should be checked that they are not too small. See \code{\link{choose.k}} and
You must have more unique combinations of covariates than the model has total
parameters. (Total parameters is sum of basis dimensions plus sum of non-spline
terms less the number of spline terms).
......@@ -344,18 +351,21 @@ an infinite range on the linear predictor scale.
\code{\link{linear.functional.terms}}, \code{\link{s}},
\code{\link{te}} \code{\link{predict.gam}},
\code{\link{plot.gam}}, \code{\link{summary.gam}}, \code{\link{gam.side}},
\code{\link{gam.selection}},\code{\link{mgcv}}, \code{\link{gam.control}}
\code{\link{gam.selection}}, \code{\link{gam.control}}
\code{\link{gam.check}}, \code{\link{linear.functional.terms}} \code{\link{negbin}}, \code{\link{magic}},\code{\link{vis.gam}}
set.seed(0) ## simulate some data...
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b,pages=1,residuals=TRUE) ## show partial residuals
plot(b,pages=1,seWithMean=TRUE) ## `with intercept' CIs
## run some basic model checks, including checking
## smoothing basis dimensions...
## same fit in two parts .....
G <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),fit=FALSE,data=dat)
......@@ -3,11 +3,13 @@
\title{Some diagnostics for a fitted gam model}
\description{ Takes a fitted \code{gam} object produced by \code{gam()} and produces some diagnostic information
about the fitting procedure and results. The default is to produce 4 residual
plots, and some information about the convergence of the smoothness selection optimization.
plots, some information about the convergence of the smoothness selection optimization, and to run
diagnostic tests of whether the basis dimension choices are adequate.
rep=0, level=.9, rl.col=2, rep.col="gray80", \dots)
......@@ -15,6 +17,8 @@ gam.check(b,,
\item{}{If you want old fashioned plots, exactly as in Wood, 2006, set to \code{TRUE}.}
\item{type}{type of residuals, see \code{\link{residuals.gam}}, used in
all plots.}
\item{k.sample}{Above this k testing uses a random sub-sample of data.}
\item{k.rep}{how many re-shuffles to do to get p-value for k testing.}
\item{rep, level, rl.col, rep.col}{
arguments passed to \code{\link{qq.gam}()} when \code{} is
false, see there.}
......@@ -23,13 +27,19 @@ gam.check(b,,
\value{A vector of reference quantiles for the residual distribution, if these can be computed.}
\details{ This function plots 4 standard diagnostic plots, and some other
convergence diagnostics. Usually the 4 plots are various residual plots. The
printed information relates to the optimization used to select smoothing
parameters. For the default optimization methods the information is summarized in a
\details{ This function plots 4 standard diagnostic plots, some smoothing parameter estimation
convergence information and the results of tests which may indicate if the smoothing basis dimension
for a term is too low.
Usually the 4 plots are various residual plots. For the default optimization methods the convergence information is summarized in a
readable way, but for other optimization methods, whatever is returned by way of
convergence diagnostics is simply printed.
The test of whether the basis dimension for a smooth is adequate is based on computing an estimate of the residual variance
based on differencing residuals that are near neighbours according to the (numeric) covariates of the smooth. This estimate divided by the residual variance is the \code{k-index} reported. The further below 1 this is, the more likely it is that there is missed pattern left in the residuals. The \code{p-value} is computed by simulation: the residuals are randomly re-shuffled \code{k.rep} times to obtain the null distribution of the differencing variance estimator, if there is no pattern in the residuals. For models fitted to more than \code{k.sample} data, the tests are based on \code{k.sample} randomly sampled data. Low p-values may indicate that the basis dimension, \code{k}, has been set too low, especially if the reported \code{edf} is close to \code{k\'}, the maximum possible EDF for the term. Note the disconcerting fact that if the test statistic itself is based on random resampling and the null is true, then the associated p-values will of course vary widely from one replicate to the next. Currently smooths of factor variables are not supported and will give an \code{NA} p-value.
Doubling a suspect \code{k} and re-fitting is sensible: if the reported \code{edf} increases substantially then you may have been missing something in the first fit. Of course p-values can be low for reasons other than a too low \code{k}. See \code{\link{choose.k}} for fuller discussion.
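The one-dimensional case of the test described above can be sketched in base R. It mirrors the 1-D branch of \code{k.check} in this commit, but the function name \code{k_index_1d} and the simulated residuals are assumptions made for illustration; it is not a replacement for \code{gam.check}.

```r
## Sketch of the 1-D k-index: a differencing variance estimate over
## covariate-ordered residuals, relative to the mean squared residual.
## k_index_1d is a hypothetical helper, not part of mgcv.
k_index_1d <- function(x, rsd, n.rep = 400) {
  nr <- length(rsd)
  ## variance estimate from differences of residuals adjacent in x
  v.obs <- mean(diff(rsd[order(x)])^2) / 2
  ## null distribution: the same estimator after shuffling the residuals
  ve <- replicate(n.rep, mean(diff(rsd[sample(nr)])^2) / 2)
  c("k-index" = v.obs / mean(rsd^2), "p-value" = mean(ve < v.obs))
}

set.seed(1)
x <- runif(400)
noise <- rnorm(400)                                  ## no pattern left in residuals
patterned <- sin(6 * pi * x) + rnorm(400, sd = 0.3)  ## smooth signal missed by the fit
res.noise <- k_index_1d(x, noise)
res.pattern <- k_index_1d(x, patterned)
print(rbind(noise = res.noise, patterned = res.pattern))
```

Residuals with leftover smooth structure give a k-index well below 1 and a small p-value, because neighbouring residuals are more alike than shuffled ones. Pure noise gives a k-index near 1 and a p-value that varies from run to run.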
The QQ plot produced is usually created by a call to \code{\link{qq.gam}}, and plots deviance residuals
against approximate theoretical quantiles of the deviance residual distribution, according to the fitted model.
If this looks odd then investigate further using \code{\link{qq.gam}}. Note that residuals for models fitted to binary data contain very little
......@@ -40,6 +50,9 @@ to be useful in this case.
N.H. Augustin, E-A Sauleau, S.N. Wood (2012) On quantile quantile plots for generalized linear models
Computational Statistics & Data Analysis.
Wood S.N. (2006) Generalized Additive Models: An Introduction with R. Chapman
and Hall/CRC Press.
......@@ -50,7 +63,7 @@ and Hall/CRC Press.
\author{ Simon N. Wood \email{}}
\seealso{ \code{\link{choose.k}}, \code{\link{gam}}, \code{\link{mgcv}}, \code{\link{magic}}}
\seealso{ \code{\link{choose.k}}, \code{\link{gam}}, \code{\link{magic}}}
......@@ -44,8 +44,7 @@ the number of halvings to try before giving up.}
\item{rank.tol}{The tolerance used to estimate the rank of the fitting
problem, for methods which deal with rank deficient cases (basically all
except those based on \code{\link{mgcv}}).}
\item{nlm}{list of control parameters to pass to \code{\link{nlm}} if this is
used for outer estimation of smoothing parameters (not default). See details.}
......@@ -135,7 +135,7 @@ by the R core (see \code{\link{}} for further credits).
\seealso{\code{\link{}}, \code{\link{gam}}, \code{\link{mgcv}}, \code{\link{magic}}}
\seealso{\code{\link{}}, \code{\link{gam}}, \code{\link{magic}}}
\keyword{models} \keyword{smooth} \keyword{regression}%-- one or more ...