March 2015 – ahoi data

Drawing path diagrams of structural equation models (SEM) for publication

2015-03-20 by Niels

Structural equation models are visualised with path diagrams. They are an important means of giving your audience easier access to the equation system that represents the theory you want to test. A path diagram is a kind of flow chart that uses arrows to show direct and indirect causal links between your exogenous and endogenous variables, as well as between your latent and observed variables. Because structural equation models can become complex and contain many parameters describing the relationships between observed and latent variables, visualising them properly is an important step. The automatically produced path diagrams are often good enough while you work out your model, but they're not polished enough for publication. In this post, I'll show a selection of tools and their output.

There are many software solutions for structural equation modeling: LISREL, AMOS, Mplus, Stata, SAS, EQS, the R packages sem, OpenMx and lavaan, and the standalone tool Onyx, just to name the most popular ones. Most of them have a built-in way to visualize their models. AMOS is a special case, because the model itself is specified by drawing a path diagram. Onyx can do this, too. This can make things easy, especially for beginners. You can sometimes find these AMOS path diagrams published in articles.

In my experience the other SEM tools (LISREL, Mplus, Stata) don't produce very appealing diagrams, especially if your model is a little bigger. The R packages make noticeably better attempts at visualising structural equation models. As a third option, you can use ordinary graphics software and type in the parameter estimates by hand. It seems to me that, at this point, this produces the highest-quality path diagrams.

Path diagrams consist of rectangles for observed variables, ellipses for latent variables, curves with arrowheads on both ends for correlations and, most importantly, straight lines with an arrowhead on one end for paths that link a predicting and a predicted variable. Here is an example of what it could look like:

In the rest of this blog entry, I will show you examples of path diagrams:

1. Commercial software

2. R packages

3. External graphics software

1. Solutions for automatic SEM diagrams (commercial software)

2. Built-in solutions for SEM-diagrams (R-packages)
There are several R-packages for SEM-analysis. The fit-objects of these packages can be visualized. This list is not complete.

lavaan:

For lavaan, the best way to get path diagrams would be the semPlot package by Sacha Epskamp (Project Homepage). Examples can be found here.

I don't have much experience with the semPlot package, but I think it offers a fast and good solution for CFA path diagrams or small SEM path diagrams. Bigger path diagrams will need more work. Here's a little example for a two-factor CFA:

library(semPlot)

# fit is a fitted lavaan model object
pathdiagram <- semPaths(fit, whatLabels = "std", intercepts = FALSE, style = "lisrel",
                        nCharNodes = 0,
                        nCharEdges = 0,
                        curveAdjacent = TRUE, title = TRUE, layout = "tree2", curvePivot = TRUE)
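The call above expects a fitted model called fit. As a minimal sketch of where such an object could come from, the following fits a two-factor CFA to the HolzingerSwineford1939 example data that ships with lavaan. The choice of data set and the factor names "visual" and "textual" are assumptions for illustration only, not the model used in this post.

```r
library(lavaan)

# Two-factor CFA on lavaan's built-in example data (illustrative choice)
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'
fit <- cfa(model, data = HolzingerSwineford1939)

# Standardized loadings, i.e. the values that whatLabels = "std"
# asks semPaths() to print on the edges
std <- standardizedSolution(fit)
std[std$op == "=~", c("lhs", "rhs", "est.std")]
```

The resulting fit object can then be handed to semPaths() as shown above.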

sem:

For the sem package by John Fox, there is a function named "pathDiagram()", which produces Graphviz/dot code that can be imported into Graphviz. The dot code is a description that defines the latent and manifest variables as nodes and the connections between them as edges of a diagram.
The semPlot package also supports the sem package.
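To give a feeling for what such dot code looks like, here is a small base-R sketch that writes a dot description from an edge list. It does not reproduce the output of pathDiagram(); the edge list, the loadings and the object names (edges, latent, dot) are made up for illustration.

```r
# Hypothetical edge list: one latent factor "F" with three indicators
edges <- data.frame(
  from  = c("F", "F", "F"),
  to    = c("x1", "x2", "x3"),
  label = c("0.70", "0.65", "0.80"),   # made-up loadings
  stringsAsFactors = FALSE
)
latent <- "F"

# Nodes: ellipses for latent variables, boxes for observed variables
nodes <- unique(c(edges$from, edges$to))
node_lines <- sprintf("  %s [shape = %s];", nodes,
                      ifelse(nodes %in% latent, "ellipse", "box"))

# Edges: one directed arrow per path, labelled with its estimate
edge_lines <- sprintf("  %s -> %s [label = \"%s\"];",
                      edges$from, edges$to, edges$label)

dot <- c("digraph sem {", node_lines, edge_lines, "}")
writeLines(dot)
```

The printed text can be saved to a .dot file and rendered with Graphviz.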

OpenMx: OpenMx is a free SEM software that runs in R. Exporting the model to dot code and plotting it with Graphviz is the recommended workflow.

Onyx: Onyx by Andreas Brandmaier is a free standalone SEM-tool. It offers an Amos-like graphical interface to specify the model and is capable of importing OpenMX-Code, but not lavaan-code.

Screenshots can be found here.
UPDATE:
Andreas Brandmaier wrote an experimental R package that connects his SEM tool Onyx with R. It can be found here: https://github.com/brandmaier/onyxR. I haven't tried it yet, but it seems to take models from lavaan or OpenMx (R) and tries to generate path diagrams from them. If this works, this blogpost is complete and will be rewritten shortly.

DiagrammeR: Twitter user @timelyportfolio (thank you!) recommended the R package DiagrammeR by Richard Iannone. It doesn't import fitted models from SEM packages, but its strengths are an easy syntax and a fast-growing feature list. I think it's well worth a try, because the path diagrams are not as hard to make as with Graphviz, but are still reproducible.

UPDATE: Richard Iannone produced this example for me on stackoverflow

devtools::install_github("rich-iannone/DiagrammeR")
library(DiagrammeR)

grViz("
digraph SEM {

graph [layout = neato,
       overlap = true,
       outputorder = edgesfirst]

node [shape = rectangle]

a [pos = '-4,1!', label = 'e1', shape = circle]
b [pos = '-3,1!', label = 'ind_1']
c [pos = '-3,0!', label = 'ind_2']
d [pos = '-3,-1!', label = 'ind_3']
e [pos = '-1,0!', label = 'latent a', shape = ellipse]
f [pos = '1,0!', label = 'latent b', shape = ellipse]
g [pos = '1,1!', label = 'e6', shape = circle]
h [pos = '3,1!', label = 'ind_4']
i [pos = '3,-1!', label = 'ind_5']
j [pos = '4,1!', label = 'e4', shape = circle]
k [pos = '4,-1!', label = 'e5', shape = circle]

a->b
e->b [label = '0.6']
e->c [label = '0.6']
e->d [label = '0.6']

e->f [label = '0.321', headport = 'w']
g->f [tailport = 's', headport = 'n']

d->c [dir = both]

f->h [label = '0.6', tailport = 'ne', headport = 'w']
f->i [label = '0.6']

j->h
k->i

}
")

This produces this path-diagram:

update on DiagrammeR for SEM
Recently Tristan Mahr blogged a proof of concept showing that it's possible to convert a lavaan data frame into node and edge data frames for DiagrammeR. Wow, I'm really curious whether this approach will be pursued any further.
Here is the link: https://rpubs.com/tjmahr/sem_diagrammer
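The basic idea of such a conversion can be sketched in base R. This is not Tristan Mahr's code and it does not produce DiagrammeR's own data frame format. The column names lhs, op, rhs and est mirror what lavaan's parameterEstimates() returns, while the table itself is typed in by hand with made-up estimates.

```r
# Hand-typed stand-in for a lavaan parameter table (estimates are made up)
pars <- data.frame(
  lhs = c("latent_a", "latent_a", "latent_b", "latent_b", "x1"),
  op  = c("=~", "=~", "=~", "~", "~~"),
  rhs = c("x1", "x2", "y1", "latent_a", "x2"),
  est = c(0.6, 0.7, 0.8, 0.3, 0.1),
  stringsAsFactors = FALSE
)

# Regressions (~) point from predictor (rhs) to outcome (lhs);
# loadings (=~) point from the latent variable (lhs) to the indicator (rhs);
# covariances (~~) become two-headed arrows
is_reg <- pars$op == "~"
edges <- data.frame(
  from  = ifelse(is_reg, pars$rhs, pars$lhs),
  to    = ifelse(is_reg, pars$lhs, pars$rhs),
  dir   = ifelse(pars$op == "~~", "both", "forward"),
  label = sprintf("%.2f", pars$est),
  stringsAsFactors = FALSE
)

# Every variable that defines a measurement model is latent
latents    <- unique(pars$lhs[pars$op == "=~"])
node_names <- unique(c(pars$lhs, pars$rhs))
nodes <- data.frame(
  name  = node_names,
  shape = ifelse(node_names %in% latents, "ellipse", "rectangle"),
  stringsAsFactors = FALSE
)

nodes
edges
```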

The psychologist Andrey Lovakov also did an example for a SEM pathdiagram with DiagrammeR: https://github.com/lovakov/Lecturers-Org-Commitment/blob/master/Figure%201

another update on pathdiagrams in R
Stas Kolenikov from the University of Missouri published another example of SEM path diagrams on his website http://staskolenikov.net/graphviz_sem.html. Instead of DiagrammeR he uses Graphviz. One problem he encountered concerns displaying covariances as curved two-sided arrows. It's possible to do this, but as he writes, "their aesthetic appeal is probably not that great".

3. other / graphics software (selection)
If you want to use Graphviz or TikZ, you'll get very good-looking diagrams, but you'll also have to learn a description language first (for Graphviz, the "dot language"). If you have to make a lot of diagrams it can be worth learning, but for my purposes it's kind of overkill.
Here are some Graphviz-Examples: pathdiagram with Graphviz

This leads us to "normal" multi-purpose graphics software. Drawing the graphs with an office suite is pretty straightforward and self-explanatory. On the other hand, I wouldn't trust an office suite to keep everything in its place when I move the graph around in a document.
Inkscape is a tool that's often mentioned by SEM analysts. At the moment, I'm giving yEd a try, which seems to be easy to use and to produce good-looking graphs quickly. Dia could also be an alternative, but I haven't tried it yet.

Request for tips
I'm really looking for best practices in drawing path diagrams for structural equation models. Please leave a comment if you know another tool that isn't listed, or if you have a workflow that can be adopted by others. I think there's a gap between working-state path diagrams and diagrams suitable for publication.

How to apply survey weights in structural equation modeling (SEM) with lavaan.

2015-03-17 by Niels

The R package lavaan is my favourite tool for fitting structural equation models (SEM). Its biggest advantages: it's free, it's open source and its range of functions is growing steadily.
Before lavaan, I used Mplus, which still has the widest functionality of all SEM tools and is the most sophisticated software for latent variable modeling. The Muthéns and their Mplus team offer incredibly good support and documentation. The only problem is that the software isn't free, and without a license you can't get any of the support.
For me, one drawback of lavaan is that it can't fit latent class models or mixture models ...yet! Yves Rosseel is planning to add this in the next two years.

lavaan stands for "latent variable analysis". The package is available via CRAN and has a good tutorial on the lavaan project homepage. Models are specified via syntax. Thankfully, the lavaan syntax is kept pretty simple. At least, it's a lot easier than the syntax of LISREL, the first and original SEM software. But it's not as easy as drawing a path model in AMOS, the SPSS module. Anyway, once you get to slightly more complex models, you'll find working with syntax a lot more efficient. If you don't like working with syntax, I recommend having a look at Onyx, a graphical interface for structural equation modeling by Andreas Brandmaier. It's a free tool in which you can draw your SEM as a path diagram and generate the lavaan syntax from it.
But when you fit SEM models, the syntax will be the least complicated thing you have to learn, so I don't think that will be a problem at all.

Install lavaan
If you want to use survey weights, you have to install lavaan, the survey package and lavaan.survey. lavaan is the package used for modeling, and the survey package converts your data into a survey-design object. After you have fitted the model as a lavaan fit object and generated a survey-design object from your data, these two objects are passed to the lavaan.survey function, which calculates the weighted model.

First, you install the packages:

#Install lavaan
install.packages("lavaan", dependencies=TRUE)
library(lavaan)

#install lavaan.survey
install.packages("lavaan.survey")
library(lavaan.survey)

#Install survey-package
install.packages("survey")
library(survey)

Generate the survey-design object
After the packages and the data are loaded, a svydesign object is generated from our data. It's no surprise that "id=~ID" uses the column "ID" of the data frame as the id variable. "weights=~weight_trunc" defines the column that holds the survey weights, and "data=data" selects the data frame.

library("survey") # load survey package
data <- read.csv(file = "data.csv", header = TRUE, sep = ",") # read data

# if necessary - recode missing value "9" to NA
# (note: this hits every column, including ID and the weights)
data[data == 9] <- NA

# generate survey-design object
svy.df <- svydesign(id = ~ID,
                    weights = ~weight_trunc,
                    data = data)
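To check that the weights really end up in the design object, it can help to compare a weighted and an unweighted mean. The following sketch uses a tiny invented data frame (toy) with the same column names as above. svymean() from the survey package should return the same value as base R's weighted.mean().

```r
library(survey)

# Invented toy data, only for illustration
toy <- data.frame(
  ID           = 1:6,
  x            = c(2, 4, 4, 5, 1, 3),
  weight_trunc = c(0.5, 1.5, 1.0, 2.0, 0.5, 0.5)
)

toy.design <- svydesign(ids = ~ID, weights = ~weight_trunc, data = toy)

svy.mean   <- svymean(~x, toy.design)             # design-based (weighted) mean
plain.mean <- mean(toy$x)                         # unweighted mean
check.mean <- weighted.mean(toy$x, toy$weight_trunc)

svy.mean
plain.mean
check.mean
```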

Specifying the model
I'll use a simple structural equation model with two latent variables, measured by three and two indicator variables. The exogenous latent variable "latent_a" is measured by F09_a to F09_c, the endogenous latent variable "latent_b" is measured by F12_a and F12_b. The variable "latent_b" is regressed on (predicted by) "latent_a".

library(lavaan)
model_1 <- '# measurement model
              latent_a =~ F09_a + F09_b + F09_c
              latent_b =~ F12_a + F12_b 

             # regressions
              latent_b ~ latent_a
            '

lavaan.fit <- sem(model_1, 
                     data=data,                      
                     estimator="MLR", # robust fit / when you have missing data
                     missing = "ml",               #fiml for missing data
                     mimic="Mplus")

#you can run the model (unweighted) at this point and inspect it
summary(lavaan.fit,fit.measures=TRUE, standardized=TRUE)

Normally, I would use MLM as the estimator to get robust estimates (robust against non-normality of the endogenous variables), but in this case I chose MLR, because FIML is not available with MLM.
FIML (the Full Information Maximum Likelihood algorithm, selected with missing="ml") is regarded as equally efficient as multiple imputation in handling item nonresponse. But it can be a good idea to do multiple imputation anyway, because bootstrapping the standard errors is only available with the ML estimator. On the other hand, it's an advantage that FIML doesn't require you to model missingness explicitly, because it uses the SEM you have already specified.

When using the lavaan.survey package, you can't use FIML (yet). If you have missing values, you have to do a multiple imputation of your data, and instead of MLR lavaan.survey uses MLM as its default.

Fitting the model
When the model is fitted with lavaan.survey, the covariance matrix is estimated from the svyvar object generated by the survey package. The lavaan model is then fitted to this weighted covariance matrix with the MLM estimator. MLM is not compatible with missing="fiml", so if your data has missing values you have to do multiple imputation first and pass your imputed data frames as a list to svydesign(), so that they become a survey-design object which can be used as data in lavaan.survey. The resulting parameters, fit indices and statistics will be adjusted for the sampling design. Also, if MLM is used, the chi-square (likelihood-ratio) test statistic will be transformed to a Satorra-Bentler corrected chi-square. [This information stems from the lavaan.survey documentation.] In lavaan, you can choose the form of your output. Because I worked a lot with Mplus, I prefer the Mplus output.

library(lavaan.survey)

# Fit the model using weighted data (by passing the survey-design object svy.df we generated above)
survey.fit <- lavaan.survey(lavaan.fit,
                            svy.df,
                            estimator = "ML")

# inspect output
summary(survey.fit,
        fit.measures = TRUE,
        standardized = TRUE,
        rsquare = TRUE)

# if you're interested in descriptive statistics
# you can access the missing data patterns of the unweighted fit
inspect(lavaan.fit, 'patterns')

# and the coverage of the covariance matrix (like in Mplus)
inspect(lavaan.fit, 'coverage')
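To get an intuition for what a weighted covariance matrix is, here is a base-R sketch with cov.wt() on simulated data. This is only an illustration of the idea and not the exact computation that the survey package or lavaan.survey performs. The data and the weights (x, w) are simulated.

```r
set.seed(42)
n <- 200

# Simulated indicators and hypothetical survey weights
x <- matrix(rnorm(n * 2), ncol = 2, dimnames = list(NULL, c("y1", "y2")))
w <- runif(n, min = 0.5, max = 2)

unweighted <- cov(x)
weighted   <- cov.wt(x, wt = w / sum(w))$cov   # weights rescaled to sum to 1

# With equal weights, cov.wt() gives back the ordinary covariance matrix
equal.w <- cov.wt(x, wt = rep(1 / n, n))$cov

unweighted
weighted
```

The two matrices differ slightly because observations with larger weights contribute more to the weighted version.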

Results
I wouldn't have expected that using weights in a SEM analysis with lavaan would be so easy.
Here are the fit-indices of the weighted SEM.

lavaan (0.5-17) converged normally after  24 iterations

  Number of observations                           577

  Estimator                                         ML
  Minimum Function Test Statistic               11.664
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.020

Model test baseline model:

  Minimum Function Test Statistic              955.394
  Degrees of freedom                                10
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.992
  Tucker-Lewis Index (TLI)                       0.980

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3675.100
  Loglikelihood unrestricted model (H1)      -3669.268

  Number of free parameters                         16
  Akaike (AIC)                                7382.200
  Bayesian (BIC)                              7451.926
  Sample-size adjusted Bayesian (BIC)         7401.132

Root Mean Square Error of Approximation:

  RMSEA                                          0.058
  90 Percent Confidence Interval          0.021  0.097
  P-value RMSEA <= 0.05                          0.314

Standardized Root Mean Square Residual:

  SRMR                                           0.022

...and so on. I don't show the full results.

It's common to show the parameter estimates in a path diagram. In my next blogging session I'll demonstrate how to draw path diagrams of a lavaan model with semPlot (Project Homepage).

Recoding all vectors in a dataframe at once (R)

2015-03-11 by Niels

This post is just a little reminder for myself, because I had to look this up a few times.
Version 1

#Before:
[1] 5 4 3 5 3 4

#After:
[1] Stimme gar nicht zu  
Stimme eher nicht zu 
Teils / teils       
Stimme gar nicht zu  
Teils / teils        
Stimme eher nicht zu

I use "recode" from the car package to recode the variables. The function is called inside "apply", so it recodes all vectors in the data frame.

Here's how it's done:

df<- apply(df, 2, function(x) {
  x <- car::recode(x,"1='Stimme voll und ganz zu'; 2='Stimme eher zu';3='Teils / teils';4='Stimme eher nicht zu';5='Stimme gar nicht zu';9=NA"); x
  })
df<-as.data.frame(df)
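If you'd rather avoid the dependency on car, a similar result can be sketched in base R with factor(). This is an alternative technique, not the approach used above. Every value that is not among the levels 1 to 5 (here the missing code 9) becomes NA automatically. The example data and the name agree.labels are made up.

```r
# Made-up example data with the missing code 9
df <- data.frame(item1 = c(5, 4, 3, 9),
                 item2 = c(1, 2, 9, 5))

agree.labels <- c("Stimme voll und ganz zu", "Stimme eher zu", "Teils / teils",
                  "Stimme eher nicht zu", "Stimme gar nicht zu")

# df[] keeps the data frame structure while every column is replaced
df[] <- lapply(df, factor, levels = 1:5, labels = agree.labels)
df
```

Unlike the apply version, the columns end up as factors with a fixed level order, which is handy for plotting.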

Version 2
If you only want to recode some of the variables in your data frame, you can collect their names in a vector and loop over it with a for-loop.

library(car)
var.list<-c("var1","var3","var5")
for (v in var.list)
  df[[v]]<-recode(df[[v]], "1=5;2=4;4=2;5=1")

Version 3
If the values just have to be reversed, there's an even simpler way:

recode.list<-c("variable1","variable2") 
df[recode.list] <- 6 - df[recode.list]
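A quick worked example shows why this works: on a 5-point scale, 6 - x maps 1 to 5, 2 to 4 and so on, and missing values stay missing. In general the constant is the scale minimum plus the scale maximum. The data below are invented for illustration.

```r
# Invented example data on a 1-5 scale
df <- data.frame(variable1 = c(1, 2, 5),
                 variable2 = c(3, 4, NA),
                 other     = c(1, 1, 1))

recode.list <- c("variable1", "variable2")
df[recode.list] <- 6 - df[recode.list]   # 6 = scale minimum (1) + maximum (5)
df
```

Only the listed columns are reversed; the column "other" is left untouched.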

Diverging stacked barchart for plotting likert Items

2015-03-09 by Niels

Questionnaires in the social sciences often include rating items to measure the variability of people's attitudes towards something. Respondents are given a statement and report how much they agree or disagree on a 5- or 7-point scale. A set of such rating items can be combined into a Likert scale. It's also common to build an index value for the respondents if the items meet certain quality criteria. There's still some controversy about whether it's adequate to calculate means from ranked (ordinal) data like Likert items. Most researchers think it's appropriate if the scale has at least 5 points and the variable can be considered an ordinal measure of a continuous attitude.

Anyway. Visualization of the data is always a good starting point. For this purpose, there are a lot of R packages, like the HH package with its likert function, the likert package by Jason Bryer and, last but not least, the sjp.likert function by Daniel Lüdecke, which would be my favourite.

All these packages produce sophisticated and very appealing plots. Under the hood, the HH package uses lattice, while the likert and sjPlot packages are built on ggplot2. I tried the HH package, but as a ggplot2 user I realized it would take me too long to figure out the little details. The other two packages could do what I want, but they both need raw data (SPSS-like) and can't work with already aggregated data. They also each have their own way of dealing with the "neutral" category of the items.
Long story short, I decided to use ggplot2 directly instead of packages built on ggplot2 that have developed a lot of complexity of their own.

The Plot
This plot is a small example. If the code seems too messy to you, or you think the plot can be improved: I'm always interested in how to make things better, so please leave a comment.
For example, one could object that the x-axis isn't meaningful, because the neutral category should not be split into a negative and a positive half like this. So perhaps the vertical line and the x-axis labels should be removed. On the other hand, the likert function of the HH package does it the same way. It would be possible to add percentage values inside the stacked bars, but I think that would be too much. I decided to make a stacked frequency table with the sjPlot package to complement my Likert plot.

And this is the code I've written:

library("plyr")
library("dplyr")
library("ggplot2")
library("stringr") # provides str_wrap(), used below

# example data
Variable<-c("1","1","1","1","1","2","2","2","2","2","3","3","3","3","3","4","4","4","4","4")
level<-c(5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1)
perc_w<-c(3.70,11.80,10.10,25.80,38.60,2.00,16.90,13.25,28.80,25.80,1.80,6.50,9.35,33.60,39.40,3.50,12.40,14.10,34.80,21.10)
df<-data.frame(Variable,level,perc_w)
df$perc_w<-as.numeric(df$perc_w)
df$level<-as.factor(df$level)

# item text
items<-c("~ It´s not known, if climate change is real",
         "~ In my opinion, the risks of climate change are exaggerated by activists",
         "~ Climate change is not as dangerous as it is claimed", 
         "~ I´m convinced that we can handle climate change")

df$Variable<-as.character(df$Variable)
df$Variable[df$Variable==1]<-items[1]
df$Variable[df$Variable==2]<-items[2]
df$Variable[df$Variable==3]<-items[3]
df$Variable[df$Variable==4]<-items[4]
df$Variable<-as.ordered(df$Variable)

# calculate halves of the neutral category
df.split <-df %>% filter(level==3) %>% mutate(perc_w=as.numeric(perc_w/2)) 

# replace old neutral-category
df<-df %>% filter(!level==3)   
df<-full_join(df,df.split) %>% arrange(level)  %>% arrange(desc(Variable)) 


#split dataframe
df1<-df %>% filter(level == 3 | level== 2 | level==1) 
df2<-df %>% filter(level == 5 | level== 4 | level==3) %>% mutate(perc_w = perc_w *-1)

# automatic line break
df1$Variable  <-str_wrap(df1$Variable, width = 41) 
df2$Variable  <-str_wrap(df2$Variable, width = 41) 

# reorder factor "Variable"
df1$Variable   <- factor(df1$Variable, levels=rev(unique(df1$Variable)))
df2$Variable   <- factor(df2$Variable, levels=rev(unique(df2$Variable)))

#Plot  
p<-ggplot() +
  geom_bar(data=df1, aes(x = Variable, y=perc_w, fill = level, order = -as.numeric(level)),position="stack", stat="identity") +
  geom_bar(data=df2, aes(x = Variable, y=perc_w, fill = level, order = as.numeric(level)),position="stack", stat="identity") +
  geom_hline(yintercept = 0, color =c("black"))+
  theme_bw() + 
  coord_flip() +
  guides(fill=guide_legend(title="",reverse=TRUE)) +
  scale_fill_brewer(palette="Blues", name="",labels=c("--","-","0","+","++")) +
  labs(title=expression(atop(bold("Attitudes towards climate change"),
                             atop(italic("Some roughly translated items"),""))),
       y="percentages",x="") +  
  theme(legend.position="top",
        axis.ticks = element_blank(), 
        plot.title = element_text(size=25),
        axis.title.y=element_text(size=16),
        axis.text.y=element_text(size=13),
        axis.title.x=element_text(size=16),
        axis.text.x=element_text(size=13),
        legend.title=element_text(size=14),
        legend.text=element_text(size=12)    
  )
p
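The splitting of the neutral category can be checked by hand for the first item of the example data. This is just the arithmetic behind the plot, written as a small base-R sketch; the object names (perc, left, right) are mine.

```r
# Percentages of item 1 from the example data, named by response level
perc <- c("5" = 3.70, "4" = 11.80, "3" = 10.10, "2" = 25.80, "1" = 38.60)

half.neutral <- perc[["3"]] / 2

# Bar segment left of zero: levels 5 and 4 plus half of the neutral category
left  <- -(perc[["5"]] + perc[["4"]] + half.neutral)

# Bar segment right of zero: levels 2 and 1 plus the other half
right <- perc[["2"]] + perc[["1"]] + half.neutral

c(left = left, right = right, total.width = right - left)
```

The total width of the bar equals the sum of all five percentages, so nothing is lost by splitting the neutral category.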
