File: regularization.Rmd

package info (click to toggle)
r-cran-openmx 2.21.1%2Bdfsg-1
links: PTS, VCS
area: main
in suites: bookworm
size: 14,412 kB
sloc: cpp: 36,577; ansic: 13,811; fortran: 2,001; sh: 1,440; python: 350; perl: 21; makefile: 5
file content (126 lines) | stat: -rw-r--r-- 3,774 bytes
parent folder | download | duplicates (4)
---
title: "Regularized SEM"
author: "Joshua Pritikin and Ross Jacobucci"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{Regularization}
  %\usepackage[UTF-8]{inputenc}
---

```{r, include = FALSE}
is_CRAN <- !identical(Sys.getenv("NOT_CRAN"), "true")
if (!is_CRAN) {
   options(mc.cores = parallel::detectCores())
} else {
  knitr::opts_chunk$set(eval = FALSE)
  knitr::knit_hooks$set(evaluate.inline = function(x, envir) x)
}
```
### Simulate Data

We will first simulate data

```{r}
library(OpenMx)

manifests <- c(paste0('x',1:8), paste0('y',1:5))

set.seed(12)

sim.mod <- mxModel(
  "sim", type="RAM", manifestVars = manifests, latentVars = 'f1',
  mxPath(paste0('x',1:8), 'f1', values=c(0,0,0,0,0,.2,.5,.8),
         labels=paste0('c',1:8)),
  mxPath('f1', paste0('y',1:5), values=1),
  mxPath(paste0('x',1:8), arrows=2, connect = "unique.bivariate",
         values=rnorm(8*7/2, sd=.2)),
  mxPath(paste0('x',1:8), arrows=2, values=1),
  mxPath(paste0('y',1:5), arrows=2, values=1),
  mxPath('f1', arrows=2, values=1, free=FALSE),
  mxPath('one', manifests),
  mxPath('one', 'f1', free=FALSE))

dat.sim = mxGenerateData(sim.mod, nrows = 100)

```

### Run the Model

And then run the model so we can better see the structure.

```{r}
run.mod <- mxModel(sim.mod, mxData(dat.sim, type="raw"))

fit <- mxRun(run.mod)
#summary(fit)
summary(fit)$parameters[1:8,]
```



### Regularize

One of the difficult pieces in using regularization is that the penalty has to be calibrated for each particular problem. In running this code, I first tested the default, but this was too small given that there were some parameters > 0.4. After plotting this initial run, I saw that some parameters weren't penalized enough, therefore, I increased the penalty step to 1.2 and with 41 different values. This tested a model that had most estimates as zero. In some cases, it isn't necessary to test a sequence of penalties that would set "large" parameters to zero, as either the model could fail to converge then, or there is not uncertainty about those parameters inclusion.

```{r,results='hide'}
regFit <- mxPenaltySearch(mxModel(
  fit, mxPenaltyLASSO(paste0('c',1:8),"lasso",lambda.step=1.2),
  mxMatrix('Full',1,1,free=TRUE,values=0, labels="lambda")))

```
A status code 6 warning is issued because the parameters affected
by regularization have relatively large gradients.

```{r}
round(regFit$output$gradient, 2)
```


The gradient check is only done for the model with the best EBIC.
Next, we can get a summary of the models tested to check if there
were any optimization failures.

```{r}

detail <- regFit$compute$steps$PS$output$detail
table(detail$statusCode)

```

Looks good. We can also look at summaries of some of the results,

```{r}

range(detail$lambda)
range(detail$EBIC)
best <- which(detail$EBIC == min(detail$EBIC))
detail[best, 'lambda']

```

## Plot the parameter trajectories

```{r,fig.width=5,fig.height=5}

library(reshape2)
library(ggplot2)

est <- detail[,c(paste0('c',1:8), 'lambda')]
ggplot(melt(est, id.vars = 'lambda')) +
  geom_line(aes(x=lambda, y=value, color=variable)) +
  geom_vline(aes(xintercept=coef(regFit)['lambda']),
             linetype="dashed", alpha=.5)

```

Here we can see that we used a large enough penalty to set most parameter estimates to zero. The best fitting model is indicated by the dashed lines.

OpenMx uses EBIC to choose a final model. See what the best fitting parameter estimates are.

```{r}
summary(regFit)$parameters[1:8,]
```

In this final model, we set the regression paths for x1, x2, x3, x4, x5, and x6 to zero. We also correctly identify x7 and x8 as true paths. Compare these results with the maximum likelihood estimates.