ahoi data – analysis, visualisation and playing around with data

Multiple Imputation in R. How to impute data with MICE for lavaan.

2017-01-14 by Niels

Missing data is unavoidable in most empirical work. This can be a problem for any statistical analysis that needs data to be complete. Structural equation modeling and confirmatory factor analysis are such methods that rely on a complete dataset. The following post will give an overview on the background of missing data analysis, how the missingness can be investigated, how the R-package MICE for multiple imputation is applied and how imputed data can be given to the lavaan-package for confirmatory factor analysis.

If you are in a hurry and already know the background of multiple imputation, jump to: How to use multiple imputation with lavaan

What kinds of missing data are there?
There are two types of missingness: Unit nonresponse concerns cases in the sample, that didn´t respond to the survey at all, or – more generally spoken – the failure to obtain measurements for a sampled unit. Item nonresponse occurs, when a person leaves out particular items in the survey, or – more generally spoken – particular measurements of a sampled unit are missing. Here, we will focus on item nonresponse.

Why is it important?
The topic of missing data itself is still often missing in the curriculum of statistics for social sciences and sociology. Also in practical research a lot of studies don´t show transparently how they handled missing data. But there would be a lot reason to pay more attention to this issue. As an example, Ranjit Lall examined how political science studies dealed with missing data and found out, that 50 % had their key results „disappear“ after he re-analysed them with a proper way to handle the missingness: How multiple Imputation makes a difference. Most of these studies used listwise deletion, because it once was a standard way to deal with missings and still is in many software packages. For example, the statistic software SPSS still doesn´t offer multiple imputation (only single imputation with EM-algorithm, that doesn´t incorporate uncertainty and should only be used with a trivial amount of missingness of < 5 %). Regarding the state of the art right now, any researcher should take the following in consideration:

DON´T (bad practice)
In listwise deletion every observation (every row in the dataset respectively every person in the survey) that has at least one missing value will be dropped completely out of the analysis. Only complete cases are analysed. Another way is pairwise deletion, which often is used for correlations. Here, all cases without missings in the analysed variables are included. The problem is, that if you run a correlation of variable a and variable b, and a correlation of variable a and variable c, your results can be based on a different amount of cases (N).
Listwise and pairwise deletion are problematic in multiple ways: both reduce your samplesize and your statistical power decreases. Other studies acknowledge this problem and replace missing values with the mean value of the remaining datapoints (mean value replacement). This is problematic as well, because your standard deviation increases and your results become biased as well. It is still more accepted than listwise or pairwise deletion and has the convenience of having a single dataset for analysis.

DO: (state of the art)
The state of the Art methods of dealing with missing data (at least in structural equation modeling) are multiple imputation as well as full information maximum likelihood (FIML). In FIML no data is imputed. Instead, an algorithm is used in your analysis (i.e. regression, structural equation modelling) that estimates your model and the missing values in one step, based on your model and all observed data in your sample. FIML should not be confused with EM-Imputation.
In multiple imputation each missing value is replaced (imputed) multiple times through a specified algorithm, that uses the observed data of every unit to find a plausible value for the missing cell. Every time a missing value is replaced through an estimated value, some uncertainty/randomness is introduced. This way, each of the resulting datasets differs a little bit, which brings the advantage of a more adequate estimation of variances.

How to use multiple imputation in practice
It is the decision of the researcher how many times the cells with missing data are imputed. There are rules of thumb and simulation studies to guide this decision. Often a minimum of 5 imputed datasets is enough, but some researchers think it should depend on the amount of missingness. At some point a greater number of imputation becomes obsolete. Depending on the number of variables and number of observations and the speed of your computer, it can take some hours to complete the calculations.

Multiple Imputation needs multivariate normality of the data and the missings ´should at least be MAR (missing at random). Simulation studies showed, that deviation of multivariate normality is not too problematic and even if the data is not MAR, multiple imputation showed itself as robust. Especially in comparison to listwise or pairwise deletion, multiple imputation produces more adequate results in spite of erroneous assumption of MAR or multivariate normality.

There are a lot of tools to do multiple imputation: Here is a list of multiple imputation software. The standalone Software NORM now also has an R-package NORM for R (package). Another R-package worth mentioning is Amelia (R-package). Now, we turn to the R-package MICE („multivariate imputation by chained equations“) which offers many functions to generate imputed datasets based on your missing data. MICE uses the pmm algorithm which stands for predictive mean modeling that produces good results with non-normal data. To be precise: Which algorithm is used for imputation depends on the variable and the decision of the analyst. We´ll come back to this later.

Three types of missingness
Before you start with imputing your data, you should check the type of missingness in your data. This is kind of a paradox. How can you say what pattern the missingness has, if you don´t know which values are missing? If you knew, you wouldn´t need to impute them, right? The Values could be missing just at random. Or it could be, that people with specific values on the variable in question chose to decline the answer. People with extreme values could be underrepresentated. It is our goal to make it plausible, that our missing Items are at least „missing at random“ (MAR).

There are three possible patterns of missingness:
– MCAR (Missing completely at random)
– MAR (Missing at random)
– NMAR (Not missing at random)

What are the reasons of missing data in particular cells of the dataset? It could happen in (manual) data entry or when people miss a question, because they were distracted. But it could also be, that a person refuses to answer a question, doesn´t have the knowledge, cognitive abilities or motivation to answer it, or the question itself is unclear. It is especially problematic if missing values are related to the (unobserved) value of the person in this variable. A typical example would be, that people refuse to answer questions on their income if it exceeds a certain amount. Or if you ask for the number of sex partners a person had and people with high numbers don´t answer it. In this case your data is not missing at random.
If your data is missing completely at random, there is no correlation between the missingness and the value the person would have, if there was a datapoint. To find out if your data is MCAR there is a statistical test called „little´s mcar test“, which tests the null hypothesis that data is completely missing at random. So you want it to be nonsignificant. Problem is, that it’s an omnibus test. It doesn’t tell you for each variable if its missingness is mcar, but only for a set of variables. A part of your data might be mcar, but another part not. The little-MCAR-Test will only test all data and discard MCAR. Also, it has assumptions like normality, so if your data doesn’t meet them, the test might tell you it’s not mcar even if it is. The “MissMech” package in R has tests to show if assumptions are met. Little´s MCAR-test is part of the „BaylorEdPsych“ Package. Please notice, that a maximum of 50 variables can be tested at once. I quess that this is an arbitrary value and it just doesn´t make sense, to perform Little´s MCAR-Test on more variables, because it would be most likely to become significant.

There is critic towards the naming of the missingness-patterns. If missings are random, they are random. There is no sense in saying they miss „completely at random“. That´s why some people argue, that MCAR should just be named MAR. MAR on the other hand should be called MCAR, but with the letters staying for Missing CONDITIONALLY at random, because that´s what MAR (in its original meaning) is about. But, that critic won´t change the differentiation of MCAR, MAR and NMAR because they are already a scientific convention.

Let´s do a „Little´s test“ on MCAR:

#--------------------------------------------------
# Little-test
#--------------------------------------------------
install.packages('BaylorEdPsych', dependencies=TRUE)
library(BaylorEdPsych)

# read example data
data(EndersTable1_1)

# run MCAR test
test_mcar<-LittleMCAR(EndersTable1_1) 

# print p-value of mcar-test
print(test_mcar$p.value)

As a result we get

print(test_mcar$p.value)
[1] 0.01205778

which means, that the result is significant. The null-hypotheses, that our data is mcar, is rejected. Data is mcar if p > 0.05. There is a possibility, that the test failed, because the data are not normal and homoscedastic. We test this:

install.packages("MissMech")
library("MissMech")

#test of normality and homoscedasticity
out<-TestMCARNormality(EndersTable1_1)
print(out)

The Output:

Call:
TestMCARNormality(data = EndersTable1_1)

Number of Patterns:  2 

Total number of cases used in the analysis:  17 

 Pattern(s) used:
          IQ   JP   WB   Number of cases
group.1    1   NA    1                 8
group.2    1    1    1                 9


    Test of normality and Homoscedasticity:
  -------------------------------------------

Hawkins Test:

    P-value for the Hawkins test of normality and homoscedasticity:  0.4088643 

    There is not sufficient evidence to reject normality
    or MCAR at 0.05 significance level

So the results of the test of MCAR for homogenity of covariances show us, that mcar was not rejected because of non-normality or heteroscedasticity. If the Hawkins-test becomes significant, the „MissMech“-package performs a nonparametric test on homoscedasticity. This way, it can show through the method of elimination if non-normality or heteroscedasticity is a problem.

OK. Back to our patterns of missingness: Our data is not MCAR, but that´s not too bad, because we only need our data to be MAR (Missing at random). MAR isn’t testable like mcar. If your data isn’t MCAR you can try to make plausible that your data is MAR through visualisation of the missingness pattern. Or you can show that missingness depends on other variables (like gender or sth else). If you find out that this is the case, you can include them as auxiliary variables in your imputation model. It´s best to have side-variables like socio-demographics from register-data that can be used to show if they are relevant for missingness.
You can create a dummy variable for missingness and use a t-test or chi-squre test to look for differences in other variables depending on the dummy variable groups.
If your data is NMAR (Not missing at random) you cannot ignore the missings and imputation is not an option. You then have to find a way of analysing your data adquately.

Visualisation of missing data patterns
First, we inspect the amount of missingness for every variable in our dataset.

library("dplyr")
#First: Check your missings:
# Proportion of Missingness
propmiss <- function(dataframe) {
  m <- sapply(dataframe, function(x) {
    data.frame(
      nmiss=sum(is.na(x)), 
      n=length(x), 
      propmiss=sum(is.na(x))/length(x)
    )
  })
  d <- data.frame(t(m))
  d <- sapply(d, unlist)
  d <- as.data.frame(d)
  d$variable <- row.names(d)
  row.names(d) <- NULL
  d <- cbind(d[ncol(d)],d[-ncol(d)])
  return(d[order(d$propmiss), ])
}

miss_vars<-propmiss(EndersTable1_1)
miss_vars_mean<-mean(miss_vars$propmiss)
miss_vars_ges<- miss_vars  %>% arrange(desc(propmiss)) 

plot1<-ggplot(miss_vars_ges,aes(x=reorder(variable,propmiss),y=propmiss*100)) + 
  geom_point(size=3) +
  coord_flip() + 
  theme_bw() + xlab("") +ylab("Missingness per variable") +
  theme(panel.grid.major.x=element_blank(),
        panel.grid.minor.x=element_blank(),
        panel.grid.major.y=element_line(colour="grey60",linetype="dashed")) +
  ggtitle("Percentage of missingness")
plot1

There is no general rule on how much missing data is acceptable. It depends on your research context and samplesize. Sometimes 20 % shouldn´t be exceeded, sometimes more than 40 % missings are not tolerable and sometimes 5 % missings is too much. You should check all cases with the most amount of missingness, if the person did the survey conscientious and if its data does add value to the quality of your dataset.
I usually inspect amount of missingness per variable and per person. Often more than 90 % of participants have less then 10 % missings, but two or three cases have as much as 50 % missings. Concerning the variables, you should check every variable with more than 5 % missingness. Did you have a neutral category? Was the question problematic? Too personal? Too difficult? Questions like this normally are answered in a pretest.

Now, we´ll use the VIM package to visualize missings and if there are any patterns.

install.packages("VIM", dependencies = TRUE)
install.packages("VIMGUI", dependencies = TRUE)
library("VIM")
library("VIMGUI")
VIMGUI()

# If you don´t like to use the GUI because of reproducibility, you can also use the console:
aggr(EndersTable1_1, numbers=TRUE, prop=TRUE, combined=TRUE, sortVars=FALSE, vscale = 1)

After we chose our dataframe from the environment, VIM gives us some plots to visualise our data:

Aggregation Plot

marginplot

Visualisiations like these show you, if there are a lot of different missing data patterns (~ random) or if there is some kind of systematics. The MICE-package can show missingness patterns as well:

install.packages("mice")
library(mice)
md.pattern(EndersTable1_1)

  IQ WB JP   
9  1  1  1  0
8  1  1  0  1
1  1  0  1  1
2  1  0  0  2
   0  3 10 13

If you can make it plausible your data is mcar (non-significant little test) or mar, you can use multiple imputation to impute missing data. Generate multiple imputed data sets (depending on the amount of missings), do the analysis for every dataset and pool the results according to rubins rules.

How to use MICE for multiple imputation
With MICE you can build an imputation model that is tailored for your dataset. At first this can be a little overwhelming, so we start easy. Just use "mice()" with your dataframe and use the defaults of the package.

imp <- mice(EndersTable1_1)
imp
summary(imp)



> imp <- mice(EndersTable1_1)
 iter imp variable
  1   1  JP  WB
  1   2  JP  WB
  1   3  JP  WB
  1   4  JP  WB
  1   5  JP  WB
  2   1  JP  WB
  2   2  JP  WB
  2   3  JP  WB
  2   4  JP  WB
  2   5  JP  WB
  3   1  JP  WB
  3   2  JP  WB
  3   3  JP  WB
  3   4  JP  WB
  3   5  JP  WB
  4   1  JP  WB
  4   2  JP  WB
  4   3  JP  WB
  4   4  JP  WB
  4   5  JP  WB
  5   1  JP  WB
  5   2  JP  WB
  5   3  JP  WB
  5   4  JP  WB
  5   5  JP  WB
> imp
Multiply imputed data set
Call:
mice(data = EndersTable1_1)
Number of multiple imputations:  5
Missing cells per column:
IQ JP WB 
 0 10  3 
Imputation methods:
   IQ    JP    WB 
   "" "pmm" "pmm" 
VisitSequence:
JP WB 
 2  3 
PredictorMatrix:
   IQ JP WB
IQ  0  0  0
JP  1  0  1
WB  1  1  0
Random generator seed value:  NA

MICE generates 5 imputated datasets using an algorithm called "predictive mean matching" (pmm), because all data are "numeric" in this case. Pmm has the advantage of finding robust values if the data don´t follow a normal distribution.
If there was binary data like a factor with 2 levels MICE would have chosen "logistic regression imputation (logreg). If there was an unordered factor with more than 2 levels, MICE would have used "polytomous regression imputation for unordered categorical data" (polyreg). And if there were missings in a variable with more than 2 ordered levels, MICE would have used "proportional odds model" (polr).

There are many other algorithms for imputation that can be specified:

#Built-in elementary imputation methods are:

pmm
Predictive mean matching (any)

norm
Bayesian linear regression (numeric)

norm.nob
Linear regression ignoring model error (numeric)

norm.boot
Linear regression using bootstrap (numeric)

norm.predict
Linear regression, predicted values (numeric)

mean
Unconditional mean imputation (numeric)

2l.norm
Two-level normal imputation (numeric)

2l.pan
Two-level normal imputation using pan (numeric)

2lonly.mean
Imputation at level-2 of the class mean (numeric)

2lonly.norm
Imputation at level-2 by Bayesian linear regression (numeric)

2lonly.pmm
Imputation at level-2 by Predictive mean matching (any)

quadratic
Imputation of quadratic terms (numeric)

logreg
Logistic regression (factor, 2 levels)

logreg.boot
Logistic regression with bootstrap

polyreg
Polytomous logistic regression (factor, >= 2 levels)

polr
Proportional odds model (ordered, >=2 levels)

lda
Linear discriminant analysis (factor, >= 2 categories)

cart
Classification and regression trees (any)

rf
Random forest imputations (any)

ri
Random indicator method for nonignorable data (numeric)

sample
Random sample from the observed values (any)

fastpmm
Experimental: Fast predictive mean matching using C++ (any)

You can decide for each of your variables which imputation-algorithm is used. First you should make sure, every variable has the right type:

str(EndersTable1_1)

> str(EndersTable1_1)
'data.frame':	20 obs. of  3 variables:
 $ IQ: int  78 84 84 85 87 91 92 94 94 96 ...
 $ JP: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WB: int  13 9 10 10 NA 3 12 3 13 NA ...

In this case every variable has the type integer. Just as an example, we assume that variable "WB" is an ordered factor.

EndersTable1_1$WB<-as.factor(EndersTable1_1$WB)
str(EndersTable1_1)

data.frame':	20 obs. of  3 variables:
 $ IQ: int  78 84 84 85 87 91 92 94 94 96 ...
 $ JP: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WB: Factor w/ 8 levels "3","6","9","10",..: 7 3 4 4 NA 1 6 1 7 NA ...

Now we can use the argument "method = c('','pmm','polr')" in the mice()-call to specify the imputation algorithm for each variable.

As a default MICE also uses every variable in the dataset to estimate the missing values. This is usually called a "massive imputation". This can also be problematic, because variables that don´t correlate with the variable, that will be imputed, it only adds noise to the estimation. Leaving such a variable out of the imputation model can improve data quality. There is an easy way to build a "predictor matrix" using quickpred():

predictormatrix<-quickpred(EndersTable1_1, 
                           include=c("IQ"),
                           exclude=NULL,
                           mincor = 0.1)

Here, i force MICE to include the Variable "IQ" in the predictor matrix. No variable is excluded a priori, but with "mincor = 0.1" i decide to only use variables as predictor in the imputation model, that are correlated with at least r=0.1 with the target-variable. Variables that are very weakly correlated are now left out. Also, your estimates can be biased if you include too many auxiliary variables.

Now comes an example for a more tailored imputation model. It is really just a simple demonstration. The imputation model should always be specifically be made for your dataset. First we build a predictormatrix, then we make sure, every variable is of the right type and then, we let mice generate 10 imputed datasets based on the algorithms we specified in the "method = " argument.

set.seed(121012)
predictormatrix<-quickpred(EndersTable1_1, 
                           include=c("IQ"),
                           exclude=NULL,
                           mincor = 0.1)
str(EndersTable1_1)
EndersTable1_1<-as.data.frame(lapply(EndersTable1_1,as.numeric))
EndersTable1_1$WB<-as.factor(EndersTable1_1$WB)
str(EndersTable1_1)

imp_gen <- mice(data=EndersTable1_1,
                 predictorMatrix = predictormatrix,
                 method = c('pmm','pmm','polr'),
                 m=10,
                 maxit=5,            
                 diagnostics=TRUE,
                 MaxNWts=3000)

Now, we inspect the imputed values and save the imputed datasets in one file.

# Check plausibility of the results
#Variable JP
imp_gen$imp$JP
nrow(imp$imp$JP)

#Variable WB
imp_gen$imp$WB
nrow(imp$imp$WB)


# bring your imputed data in long format (first colum ",.imp" is the number of imputation, the second column ".id" is the id of case)
imp_data<-mice::<-complete(imp_gen,"long",inc=FALSE)  

# save data
write.table(comp10_neu,file="/imp_test.csv",sep=";")

How to use Multiple Imputation with lavaan
There are three ways to use multiple imputation in lavaan. The first (i) uses runMI() to do the multiple imputation and the model estimation in one step. The second (ii) does the multiple imputation with mice() first and then gives the multiply imputed data to runMI() which does the model estimation based on this data. Since both ways use runMI() they run the analysis multiple times for each imputed dataset and then use rubins rules to pool the results. Here is a diagram, showing the principle:

The third way (iii) uses the lavaan.survey()-package. In this example we don´t specify any sampling design or survey weight, but if you need to, it is possible. Here, you first use mice() to do the multiple imputation (if you use a survey weight, be sure to include it in the model) and then pass the imputed data to the survey-package and generate a svydesign()-object. This svydesign()-object can itself be passed to lavaan.survey, together with the lavaan-model. The way Lavaan.survey() uses multiple imputed data differs from runMI(). Here, not the results for each dataset are pooled after analysis, but the datasets are pooled first (to be precise: the variance-covariance are first calculated, taking account of the sampling design and then the matrices are pooled, which are the data basis for model estimation) and then only one dataset is analysed. The results can differ somewhat, but tend to be the same. Of course, you only use lavaan.survey() if you need to incorporate weights or a sampling design. It is evidend, that it will give more adequate results than using runMI() and omitting the weights, even though the pooling does not happen in the typical order.

Example:

#--------------------------
# Setting up packages
#--------------------------
install.packages("semTools","lavaan")
install.packages("survey")
install.packages("lavaan.survey")
install.packages("mitools")
install.packages("mice")
library("survey")
library("mice")
library("mitools")
library("semTools")
library("lavaan")
library("lavaan.survey")

#--------------------------
# Setting up example data and model
#--------------------------

# Create data with missings
set.seed(20170110)
HSMiss <- HolzingerSwineford1939[,paste("x", 1:9, sep="")]
randomMiss <- rbinom(prod(dim(HSMiss)), 1, 0.1)
randomMiss <- matrix(as.logical(randomMiss), nrow=nrow(HSMiss))
HSMiss[randomMiss] <- NA

# lavaan model
HS.model <- ' visual  =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed   =~ x7 + x8 + x9 '

#------------------------------------------------------------------
# Variant 1: Imputation and model estimation with runmI
#-------------------------------------------------------------------

# run lavaan and imputation in one step
out1 <- runMI(HS.model, 
              data=HSMiss,
              m = 5, 
              miPackage="mice",
              fun="cfa",
              meanstructure = TRUE)
summary(out1)
fitMeasures(out1, "chisq")

At the moment you´ll get a warning:

** WARNING ** lavaan (0.5-22) model has NOT been fitted
** WARNING ** Estimates below are simply the starting values

You should just ignore it, because those warnings are side-effects from semtools and don´t have any meaning.

#------------------------------------------------------------------
# Variant 2: Imputation in step 1 and model estimation in step 2 with runMI
#-------------------------------------------------------------------

# impute data first
HSMiss_imp<-mice(HSMiss, m = 5)
mice.imp <- NULL
for(i in 1:5) mice.imp[[i]] <- complete(HSMiss_imp, action=i, inc=FALSE)  

# run lavaan with previously imputed data using runMI
out2 <- runMI(HS.model, 
              data=mice.imp,
              fun="cfa",
              meanstructure = TRUE)
summary(out2)
fitMeasures(out2, "chisq")

Here, we did the multiple imputation with mice() first and then passed the data to runMI(). In the first model we had a chisq of 73.841 and now chisq is 78.752. This might be, because of different imputation models.

#------------------------------------------------------------------
# Variant 3: Imputation in step 1 and model estimation in step 2 with lavaan.survey (but without weights)
#-------------------------------------------------------------------

# take previously imputed data from variant 2 and convert it to svydesign-object
mice.imp2<-lapply(seq(HSMiss_imp$m),function(im) complete(HSMiss_imp,im))
mice.imp2<-mitools::imputationList(mice.imp2) 
svy.df_imp<-survey::svydesign(id=~1,weights=~1,data=mice.imp2)                    #survey-Objekt erstellen 


# fit model with lavaan.survey
lavaan_fit_HS.model<-cfa(HS.model, meanstructure = TRUE)
out3<-lavaan.survey(lavaan_fit_HS.model, svy.df_imp)
summary(out3)
fitMeasures(out3, "chisq")

In this last model, the chisq is 96.748, which is somewhat higher than in model 1 (chisq = 73.841) or 2 (chisq = 78.752). That is due to the different pooling strategies. But, as i said before, using lavaan.survey() without weights or else, does not make sense. And if you need weights, using runMI() is no option.

Of course, you can also use the FIML-Method and just use the dataset with the missings. FIML does not work with lavaan.survey(), only with lavaan().

#------------------------------------------------------------------
# Variant 4: Use FIML (Full Information Maximum Likelihood) instead of multiple imputation
#-------------------------------------------------------------------

# fit model lavaan using FIML 
out4<-cfa(HS.model, data=HSMiss, missing="FIML", meanstructure = TRUE)
summary(out4)
fitMeasures(out4, "chisq")

The chisquare is 76.13, which isn´t very different from the first to methods.

FIML is definitely easier to apply than multiple imputation, because you don´t have to work out an imputation model. On the other hand, you can´t specify an imputation model, which could come handy if your data is MAR and you want to include certain auxiliary variables. Also, if you decide to use lavaan.survey, you cannot use FIML, because it only supports multiply imputed data.

Whereabouts of observations between multiple latent class models. Supplementary plot for LCA with poLCA.

2015-08-18 by Niels

I got the idea for the following plot and some of the code from a Stackoverflow question, where User D.L. Dahly tried to show how observations in „a model with class=(i) are distributed by the model with class = (i+1)“. I contribute through the idea of not using igraph, but the DiagrammeR-package, which generates an appealing plot with little code.

The plot tries to visualize how classifications of observations (persons) in a latent class analysis change over a sequence of LC-models with growing number of classes. I ran five models with one to five classes. The plot starts on top with the loglinear independence model that only has one class. The sample then splits in the 2-class LCA in a class with 146 and a class of 436 observations. Ellipse two and three are the classes 1 and 2 from the latent class model with two classes. In the next line of ellipses (four,five and six) you find the classes 1,2 and 3 of the latent class model with three classes. Ellipses seven, eight, nine, ten are classes 1,2,3 and 4 from the 4-class latent class model. The thickness of the ellipses and the arrows is according to the amount of observations.

Here is the R-code for it:


# first: estimate 5 latent class models 
f<-with(mydata, cbind(var1:varx)~1)
lc1<-poLCA(f, data=mydata, nclass=1, na.rm = FALSE, nrep=30, maxiter=3000) #Loglinear independence model.
lc2<-poLCA(f, data=mydata, nclass=2, na.rm = FALSE, nrep=30, maxiter=3000)
lc3<-poLCA(f, data=mydata, nclass=3, na.rm = FALSE, nrep=30, maxiter=3000)
lc4<-poLCA(f, data=mydata, nclass=4, na.rm = FALSE, nrep=30, maxiter=3000) 
lc5<-poLCA(f, data=mydata, nclass=5, na.rm = FALSE, nrep=30, maxiter=3000)

#---------------------------------
# PLOT
#---------------------------------
library("DiagrammeR")
library("V8")
# This code stems from D.L. Dahly 

# build dataframe with predicted class for each observation
x1<-rep(1, nrow(lc1$predclass))        
x2<-lc2$predclass
x3<-lc3$predclass
x4<-lc4$predclass
x5<-lc5$predclass
results <- cbind(x1, x2, x3, x4, x5)
results <-as.data.frame(results)
results

# avoid double naming of classes (because each LCA named their classes 1,2,...,k)
N<-ncol(results) 
n<-0
for(i in 2:N) {
  results[,i]<- (results[,i])+((i-1)+n)
  n<-((i-1)+n)
}

# Make a data frame for the edges and counts
# cross-tabulations and their frequencies
g1<-plyr::count(results,c("x1","x2"))
g2<-plyr::count(results,c("x2","x3"))
colnames(g2)<-c("x1","x2","freq")
g3<-plyr::count(results, c("x3","x4"))
colnames(g3)<-c("x1", "x2","freq")
g4<-plyr::count(results,c("x4","x5"))
colnames(g4)<-c("x1","x2","freq")
edges<-rbind(g1,g2,g3,g4)

# Make a data frame for the class sizes
h1<-plyr::count(results,c("x1"))
h2<-plyr::count(results,c("x2"))
colnames(h2)<-c("x1","freq")
h3<- plyr::count(results,c("x3"))
colnames(h3)<-c("x1","freq")
h4<-plyr::count(results,c("x4"))
colnames(h4)<-c("x1","freq")
h5<-plyr::count(results,c("x5"))
colnames(h5)<-c("x1", "freq")
nodes<-rbind(h1,h2,h3,h4,h5)

Now, we use the data from edges and counts, as well as class sizes in DiagrammeR:

#dataframe for nodes - columns: node, label, type, attributes (like color and stuff)
colnames(nodes)<-c("node","label")

#scale nodes
nodes <- scale_nodes(nodes_df = nodes,
                     to_scale = nodes$label,
                     node_attr = "penwidth",
                     range = c(2, 5))

#dataframe for edges - columns: edge from, edge to, label, relationship, attributes 
colnames(edges)<-c("from", "to", "label")
edges$relationship<-c("given_to")

#scale edges
edges <- scale_edges(edges_df = edges,
                     to_scale = edges$label,
                     edge_attr = "penwidth",
                     range = c(1, 5))

nodes <- scale_nodes(nodes_df = nodes,
                     to_scale = nodes$penwidth,
                     node_attr = "alpha:fillcolor",
                     range = c(5, 90))

nodes
nodes$label2<-nodes$label
nodes$label<-paste0(nodes$node)

# Additional label outside of the ellipses
# nodes$label<-paste0(nodes$node, "',xlabel=","'",nodes$label2) 

# Group-number
#nodes$xlabel<-paste0("(n=",nodes$label2,")")

#plot stuff 
lca_graph<-create_graph(nodes,
                        edges,
                        node_attrs = c("fontname = Helvetica",
                                       "color = darkgrey",
                                       "style = filled",
                                       "fillcolor = lightgrey",
                                       "alpha_fillcolor = 0.5"),
                        edge_attrs = c("fontname = Helvetica",
                                       "fontsize=10"),
                        graph_attrs=c("layout=dot",
                                      "overlap = false",
                                      "fixedsize = true",
                                      "directed=TRUE"))
                                      
render_graph(lca_graph)

That´s it. DiagrammeR uses an algorithm to avoid overlapping. I tried some improvements of the plot, but decided to stick with this solution, because it´s already pretty nice. The only thing i miss in the plot are class-sizes. I tried to attach them with the „xlabel“-attribute in DiagrammeR, but the plot became to messy. You can try it yourself, by uncommenting this part:

nodes$label<-paste0(nodes$node, "',xlabel=","'",nodes$label2)

But i didn´t like it much.

Example for a latent class analysis with the poLCA-package in R

2015-08-13 by Niels

When you work with R for some time, you really start to wonder why so many R packages have some kind of pun in their name. Intended or not, the poLCA package is one of them. Today i´ll give a glimpse on this package, which doesn´t have to do anything with dancing or nice dotted dresses.

This article is kind of a draft and will be revised anytime.

The „poLCA“-package has its name from „Polytomous Latent Class Analysis“. Latent class analysis is an awesome and still underused (at least in social sciences) statistical method to identify unobserved groups of cases in your data. Polytomous latent class analysis is applicable with categorical data. The unobserved (latent) variable could be different attitude-sets of people which lead to certain response patterns in a survey. In marketing or market research latent class analysis could be used to identify unobserved target-groups with different attitude structures on the basis of their buy-decisions. The data would be binary (not bought, bought) and depending on the products you could perhaps identify a class which chooses the cheapest, the most durable, the most environment friendly, the most status relevant […] product. The latent classes are assumed to be nominal. The poLCA package is not capable of sampling weights, yet.

By the way: There is also another package for latent class models called „lcmm“ and another one named „mclust„.

What does a latent class analysis try to do?
A latent class model uses the different response patterns in the data to find similar groups. It tries to assign groups that are „conditional independent“. That means, that inside of a group the correlations between the variables become zero, because the group membership explains any relationship between the variables.

Latent class analysis is different from latent profile analysis, as the latter uses continous data and the former can be used with categorical data.
Another important aspect of latent class analysis is, that your elements (persons, observations) are not assigned absolutely, but on probability. So you get a probability value for each person to be assigned to group 1, group 2, […], group k.

Before you estimate your LCA model you have to choose how many groups you want to have. You aim for a small number of classes, so that the model is still adequate for the data, but also parsimonious.
If you have a theoretically justified number of groups (k) you expect in your data, you perhaps only model this one solution. A typical assumption would be a group that is pro, one group contra and one group neutral to an object. Another, more exploratory, approach would be to compare multiple models – perhaps one with 2 groups, one with 3 groups, one with 4 groups – and compare these models against each other. If you choose this second way, you can decide to take the model that has the most plausible interpretation. Additionally you could compare the different solutions by BIC or AIC information criteria. BIC is preferred over AIC in latent class models, but usually both are used. A smaller BIC is better than a bigger BIC. Next to AIC and BIC you also get a Chi-Square goodness of fit.
I once asked Drew Linzer, the developer of PoLCA, if there would be some kind of LMR-Test (like in MPLUS) implemented anytime. He said, that he wouldn´t rely on statistical criteria to decide which model is the best, but he would look which model has the most meaningful interpretation and has a better answer to the research question.

Latent class models belong to the family of (finite) mixture models. The parameters are estimated by the EM-Algorithm. It´s called EM, because it has two steps: An „E“stimation step and a „M“aximization step. In the first one, class-membership probabilities are estimated (the first time with some starting values) and in the second step those estimates are altered to maximize the likelihood-function. Both steps are iterative and repeated until the algorithm finds the global maximum. This is the solution with the highest possible likelihood. That´s why starting values in latent class analysis are important. I´m a social scientist who applies statistics, not a statistician, but as far as i understand this, depending on the starting values the algorithm can stop at point where one (!) local maximum is reached, but it might not be the „best“ local maximum (the global maximum) and so the algorithm perhaps should´ve run further. If you run the estimation multiple times with different starting values and it always comes to the same solution, you can be pretty sure that you found the global maximum.

data preparation
Latent class models don´t assume the variables to be continous, but (unordered) categorical. The variables are not allowed to contain zeros, negative values or decimals as you can read in the poLCA vignette. If your variables are binary 0/1 you should add 1 to every value, so they become 1/2. If you have NA-values, you have to recode them to a new category. Rating Items with values from 1-5 should could be added a value 6 from the NAs.

mydata[is.na(mydata)] <- 6

Running LCA models
First you should install the package and define a formula for the model to be estimated.

install.packages("poLCA")
library("poLCA")

# By the way, for all examples in this article, you´ll need some more packages:
library("reshape2")
library("plyr")
library("dplyr")
library("poLCA")
library("ggplot2")
library("ggparallel")
library("igraph")
library("tidyr")
library("knitr")

# these are the defaults of the poLCA command
poLCA(formula, data, nclass=2, maxiter=1000, graphs=FALSE, tol=1e-10, na.rm=TRUE, probs.start=NULL, nrep=1, verbose=TRUE, calc.se=TRUE)

#estimate the model with k-classes
k<-3
lc<-poLCA(f, data, nclass=k, nrep=30, na.rm=FALSE, Graph=TRUE)

The following code stems from this article. It runs a sequence of models with two to ten groups. With nrep=10 it runs every model 10 times and keeps the model with the lowest BIC.

# select variables
mydata <- data %>% dplyr::select(F29_a,F29_b,F29_c,F27_a,F27_b,F27_e,F09_a, F09_b, F09_c)

# define function
f<-with(mydata, cbind(F29_a,F29_b,F29_c,F27_a,F27_b,F27_e,F09_a, F09_b, F09_c)~1) #

#------ run a sequence of models with 1-10 classes and print out the model with the lowest BIC
max_II <- -100000
min_bic <- 100000
for(i in 2:10){
  lc <- poLCA(f, mydata, nclass=i, maxiter=3000, 
              tol=1e-5, na.rm=FALSE,  
              nrep=10, verbose=TRUE, calc.se=TRUE)
  if(lc$bic < min_bic){
    min_bic <- lc$bic
    LCA_best_model<-lc
  }
}    	
LCA_best_model

You´ll get the standard-output for the best model from the poLCA-package:

Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 

$F29_a
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0413 0.2978 0.1638 0.2487 0.1979 0.0428 0.0078
class 2:  0.0000 0.0429 0.0674 0.3916 0.4340 0.0522 0.0119
class 3:  0.0887 0.5429 0.2713 0.0666 0.0251 0.0055 0.0000

$F29_b
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0587 0.2275 0.1410 0.3149 0.1660 0.0697 0.0222
class 2:  0.0000 0.0175 0.0400 0.4100 0.4249 0.0724 0.0351
class 3:  0.0735 0.4951 0.3038 0.0669 0.0265 0.0271 0.0070

$F29_c
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0371 0.2082 0.1022 0.1824 0.3133 0.1365 0.0202
class 2:  0.0000 0.0086 0.0435 0.3021 0.4335 0.1701 0.0421
class 3:  0.0815 0.4690 0.2520 0.0903 0.0984 0.0088 0.0000

$F27_a
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.7068 0.2373 0.0248 0.0123 0.0000 0.0188 0.0000
class 2:  0.6914 0.2578 0.0128 0.0044 0.0207 0.0085 0.0044
class 3:  0.8139 0.1523 0.0110 0.0000 0.0119 0.0000 0.0108

$F27_b
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.6198 0.1080 0.0426 0.0488 0.1226 0.0582 0.0000
class 2:  0.6336 0.1062 0.0744 0.0313 0.1047 0.0370 0.0128
class 3:  0.7185 0.1248 0.0863 0.0158 0.0325 0.0166 0.0056

$F27_e
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.6595 0.1442 0.0166 0.0614 0.0926 0.0062 0.0195
class 2:  0.6939 0.1474 0.0105 0.0178 0.0725 0.0302 0.0276
class 3:  0.7869 0.1173 0.0375 0.0000 0.0395 0.0000 0.0188

$F09_a
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)
class 1:  0.8325 0.1515 0.0000 0.0000 0.0160 0.0000
class 2:  0.1660 0.3258 0.2448 0.1855 0.0338 0.0442
class 3:  0.1490 0.2667 0.3326 0.1793 0.0575 0.0150

$F09_b
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)
class 1:  0.8116 0.1594 0.0120 0.0069 0.0000 0.0101
class 2:  0.0213 0.3210 0.4000 0.2036 0.0265 0.0276
class 3:  0.0343 0.3688 0.3063 0.2482 0.0264 0.0161

$F09_c
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)
class 1:  0.9627 0.0306 0.0067 0.0000 0.0000 0.0000
class 2:  0.1037 0.4649 0.2713 0.0681 0.0183 0.0737
class 3:  0.1622 0.4199 0.2338 0.1261 0.0258 0.0322

Estimated class population shares 
 0.2792 0.4013 0.3195 
 
Predicted class memberships (by modal posterior prob.) 
 0.2738 0.4055 0.3206 
 
========================================================= 
Fit for 3 latent classes: 
========================================================= 
number of observations: 577 
number of estimated parameters: 155 
residual degrees of freedom: 422 
maximum log-likelihood: -6646.732 
 
AIC(3): 13603.46
BIC(3): 14278.93
G^2(3): 6121.357 (Likelihood ratio/deviance statistic) 
X^2(3): 8967872059 (Chi-square goodness of fit)

Generate table showing fitvalues of multiple models
Now i want to build a table for comparison of various model-fit values like this:

Modell	 log-likelihood	 resid. df	 BIC	 aBIC	 cAIC	 likelihood-ratio	 Entropy
Modell 1	 -7171.940	 526	 14668.13	 14046.03	 14719.13	 7171.774	 -
Modell 2	 -6859.076	 474	 14373.01	 14046.03	 14476.01	 6546.045	 0.86
Modell 3	 -6646.732	 422	 14278.93	 13786.87	 14433.93	 6121.357	 0.879
Modell 4	 -6528.791	 370	 14373.66	 13716.51	 14580.66	 5885.477	 0.866
Modell 5	 -6439.588	 318	 14525.86	 13703.64	 14784.86	 5707.070	 0.757
Modell 6	 -6366.002	 266	 14709.29	 13721.99	 15020.29	 5559.898	 0.865

This table was build by the following code:

#select data
mydata <- data %>% dplyr::select(F29_a,F29_b,F29_c,F27_a,F27_b,F27_e,F09_a, F09_b, F09_c)

# define function
f<-with(mydata, cbind(F29_a,F29_b,F29_c,F27_a,F27_b,F27_e,F09_a, F09_b, F09_c)~1) 

## models with different number of groups without covariates:
set.seed(01012)
lc1<-poLCA(f, data=mydata, nclass=1, na.rm = FALSE, nrep=30, maxiter=3000) #Loglinear independence model.
lc2<-poLCA(f, data=mydata, nclass=2, na.rm = FALSE, nrep=30, maxiter=3000)
lc3<-poLCA(f, data=mydata, nclass=3, na.rm = FALSE, nrep=30, maxiter=3000)
lc4<-poLCA(f, data=mydata, nclass=4, na.rm = FALSE, nrep=30, maxiter=3000) 
lc5<-poLCA(f, data=mydata, nclass=5, na.rm = FALSE, nrep=30, maxiter=3000)
lc6<-poLCA(f, data=mydata, nclass=6, na.rm = FALSE, nrep=30, maxiter=3000)

# generate dataframe with fit-values

results <- data.frame(Modell=c("Modell 1"),
                      log_likelihood=lc1$llik,
                      df = lc1$resid.df,
                      BIC=lc1$bic,
                      ABIC=  (-2*lc1$llik) + ((log((lc1$N + 2)/24)) * lc1$npar),
                      CAIC = (-2*lc1$llik) + lc1$npar * (1 + log(lc1$N)), 
                      likelihood_ratio=lc1$Gsq)
results$Modell<-as.integer(results$Modell)
results[1,1]<-c("Modell 1")
results[2,1]<-c("Modell 2")
results[3,1]<-c("Modell 3")
results[4,1]<-c("Modell 4")
results[5,1]<-c("Modell 5")
results[6,1]<-c("Modell 6")

results[2,2]<-lc2$llik
results[3,2]<-lc3$llik
results[4,2]<-lc4$llik
results[5,2]<-lc5$llik
results[6,2]<-lc6$llik

results[2,3]<-lc2$resid.df
results[3,3]<-lc3$resid.df
results[4,3]<-lc4$resid.df
results[5,3]<-lc5$resid.df
results[6,3]<-lc6$resid.df

results[2,4]<-lc2$bic
results[3,4]<-lc3$bic
results[4,4]<-lc4$bic
results[5,4]<-lc5$bic
results[6,4]<-lc6$bic

results[2,5]<-(-2*lc2$llik) + ((log((lc2$N + 2)/24)) * lc2$npar) #abic
results[3,5]<-(-2*lc3$llik) + ((log((lc3$N + 2)/24)) * lc3$npar)
results[4,5]<-(-2*lc4$llik) + ((log((lc4$N + 2)/24)) * lc4$npar)
results[5,5]<-(-2*lc5$llik) + ((log((lc5$N + 2)/24)) * lc5$npar)
results[6,5]<-(-2*lc6$llik) + ((log((lc6$N + 2)/24)) * lc6$npar)

results[2,6]<- (-2*lc2$llik) + lc2$npar * (1 + log(lc2$N)) #caic
results[3,6]<- (-2*lc3$llik) + lc3$npar * (1 + log(lc3$N))
results[4,6]<- (-2*lc4$llik) + lc4$npar * (1 + log(lc4$N))
results[5,6]<- (-2*lc5$llik) + lc5$npar * (1 + log(lc5$N))
results[6,6]<- (-2*lc6$llik) + lc6$npar * (1 + log(lc6$N))

results[2,7]<-lc2$Gsq
results[3,7]<-lc3$Gsq
results[4,7]<-lc4$Gsq
results[5,7]<-lc5$Gsq
results[6,7]<-lc6$Gsq

Now i calculate the Entropy (a pseudo-r-squared) for each solution. I took the idea from Daniel Oberski´s Presentation on LCA.

entropy<-function (p) sum(-p*log(p))

results$R2_entropy
results[1,8]<-c("-")

error_prior<-entropy(lc2$P) # class proportions model 2
error_post<-mean(apply(lc2$posterior,1, entropy),na.rm = TRUE)
results[2,8]<-round(((error_prior-error_post) / error_prior),3)

error_prior<-entropy(lc3$P) # class proportions model 3
error_post<-mean(apply(lc3$posterior,1, entropy),na.rm = TRUE)
results[3,8]<-round(((error_prior-error_post) / error_prior),3)

error_prior<-entropy(lc4$P) # class proportions model 4
error_post<-mean(apply(lc4$posterior,1, entropy),na.rm = TRUE)
results[4,8]<-round(((error_prior-error_post) / error_prior),3)

error_prior<-entropy(lc5$P) # class proportions model 5
error_post<-mean(apply(lc5$posterior,1, entropy),na.rm = TRUE)
results[5,8]<-round(((error_prior-error_post) / error_prior),3)

error_prior<-entropy(lc6$P) # class proportions model 6
error_post<-mean(apply(lc6$posterior,1, entropy),na.rm = TRUE)
results[6,8]<-round(((error_prior-error_post) / error_prior),3)

# combining results to a dataframe
colnames(results)<-c("Model","log-likelihood","resid. df","BIC","aBIC","cAIC","likelihood-ratio","Entropy")
lca_results<-results

# Generate a HTML-TABLE and show it in the RSTUDIO-Viewer (for copy & paste) 
view_kable <- function(x, ...){
  tab <- paste(capture.output(kable(x, ...)), collapse = '\n')
  tf <- tempfile(fileext = ".html")
  writeLines(tab, tf)
  rstudio::viewer(tf)
}
view_kable(lca_results, format = 'html', table.attr = "class=nofluid")

# Another possibility which is prettier and easier to do:
install.packages("ztable")
ztable::ztable(lca_results)

Elbow-Plot
Sometimes, an Elbow-Plot (or Scree-Plot) can be used to see, which solution is parsimonius and has good fit-values. You can get it with this ggplot2-code i wrote:

# plot 1

# Order categories of results$model in order of appearance
install.packages("forcats")
library("forcats")
results$model < - as_factor(results$model) 

#convert to long format
results2<-tidyr::gather(results,Kriterium,Guete,4:7)
results2

#plot
fit.plot<-ggplot(results2) + 
  geom_point(aes(x=Model,y=Guete),size=3) +
  geom_line(aes(Model, Guete, group = 1)) +
  theme_bw()+
  labs(x = "", y="", title = "") + 
  facet_grid(Kriterium ~. ,scales = "free") +
  theme_bw(base_size = 16, base_family = "") +   
  theme(panel.grid.major.x = element_blank() ,
        panel.grid.major.y = element_line(colour="grey", size=0.5),
        legend.title = element_text(size = 16, face = 'bold'),
        axis.text = element_text(size = 16),
        axis.title = element_text(size = 16),
        legend.text=  element_text(size=16),
        axis.line = element_line(colour = "black")) # Achsen etwas dicker

# save 650 x 800
fit.plot

Inspect population shares of classes
If you are interested in the population-shares of the classes, you can get them like this:

round(colMeans(lc$posterior)*100,2)
[1] 27.92 40.13 31.95

or you inspect the estimated class memberships:

table(lc$predclass)
  1   2   3 
158 234 185 

round(prop.table(table(lc$predclass)),4)*100
   1     2     3 
27.38 40.55 32.06

Ordering of latent classes
Latent classes are unordered, so which latent class becomes number one, two, three... is arbitrary. The latent classes are supposed to be nominal, so there is no reason for one class to be the first. You can order the latent classes if you must. There is a function for manually reordering the latent classes: poLCA.reorder()
First, you run a LCA-model, extract the starting values and run the model again, but this time with a manually set order.

#extract starting values from our previous best model (with 3 classes)
probs.start<-lc3$probs.start

#re-run the model, this time with "graphs=TRUE"
lc<-poLCA(f, mydata, nclass=3, probs.start=probs.start,graphs=TRUE, na.rm=TRUE, maxiter=3000)

# If you don´t like the order, reorder them (here: Class 1 stays 1, Class 3 becomes 2, Class 2 becomes 1)
new.probs.start<-poLCA.reorder(probs.start, c(1,3,2))

#run polca with adjusted ordering
lc<-poLCA(f, mydata, nclass=3, probs.start=new.probs.start,graphs=TRUE, na.rm=TRUE)
lc

Now you have reordered your classes. You could save these starting values, if you want to recreate the model anytime.

saveRDS(lc$probs.start,"/lca_starting_values.RData")

Plotting

This is the poLCA-standard Plot for conditional probabilites, which you get if you add "graph=TRUE" to the poLCA-call.

It´s in a 3D-style which is not really my taste. I found some code at dsparks on github or this Blog, that makes very appealing ggplot2-plots and did some little adjustments.

lcmodel <- reshape2::melt(lc$probs, level=2)
zp1 <- ggplot(lcmodel,aes(x = L1, y = value, fill = Var2))
zp1 <- zp1 + geom_bar(stat = "identity", position = "stack")
zp1 <- zp1 + facet_grid(Var1 ~ .) 
zp1 <- zp1 + scale_fill_brewer(type="seq", palette="Greys") +theme_bw()
zp1 <- zp1 + labs(x = "Fragebogenitems",y="Anteil der Item-\nAntwortkategorien", fill ="Antwortkategorien")
zp1 <- zp1 + theme( axis.text.y=element_blank(),
                    axis.ticks.y=element_blank(),                    
                    panel.grid.major.y=element_blank())
zp1 <- zp1 + guides(fill = guide_legend(reverse=TRUE))
print(zp1)

If you want to compare the items directly:

zp2 <- ggplot(lcmodel,aes(x = Var1, y = value, fill = Var2))
zp2 <- zp2 + geom_bar(stat = "identity", position = "stack")
zp2 <- zp2 + facet_wrap(~ L1)
zp2 <- zp2 + scale_x_discrete("Fragebogenitems", expand = c(0, 0))
zp2 <- zp2 + scale_y_continuous("Wahrscheinlichkeiten \nder Item-Antwortkategorien", expand = c(0, 0))
zp2 <- zp2 + scale_fill_brewer(type="seq", palette="Greys") +
  theme_bw()
zp2 <- zp2 + labs(fill ="Antwortkategorien")
zp2 <- zp2 + theme( axis.text.y=element_blank(),
                    axis.ticks.y=element_blank(),                    
                    panel.grid.major.y=element_blank()#,
                    #legend.justification=c(1,0), 
                    #legend.position=c(1,0)
)
zp2 <- zp2 + guides(fill = guide_legend(reverse=TRUE))
print(zp2)

How to plot correlations of rating items with R

2015-07-14 by Niels

Especially when you´re starting a SEM-analysis you should have a look at your observed variables and the correlations between them. As usual with R, there´s ~~an app~~ a package for that. It´s called corrplot and seems to rely on base R graphics.

library("corrplot")
data(mtcars)
M <- cor(mtcars)

## different color scale and methods to display corr-matrix
corrplot(M, method="number", col="black", cl.pos="n")

Pretty basic, huh? If you would rather have a good looking table, try the mighty sjPlot-package. It uses spearman-correlation as default.

sjPlot::sjt.corr(mtcars, 
                 pvaluesAsNumbers=FALSE,
                 triangle="lower",
                 stringDiagonal=c(1,1,1,1,1,1,1,1,1,1),
                 CSS=list(css.thead="border-top:double black; font-weight:normal; font-size:0.9em;",
                          css.firsttablecol="font-weight:normal; font-size:0.9em;"))

Also thanks to the sjPlot-package from Daniel Lüdecke you only need this code:

sjp.corr(mtcars)

to produce this plot:

As usual, i´ve had my own thoughts on how a perfect correlation matrix for my sociological survey-data should look like. This was my idea: In the lower triangle it contains a jittered scatterplot of the responses. A scatterplot of rating-items with 5 categories does not work without jitter. Diagonally are the univariate distributions of the items as barplot, with one bar for each category. In the upper triangle you find the correlation (spearman) and p-value.

library("ggplot2")
library("plyr")
library("dplyr")
library("gtable")
library("sjPlot")

#pass the items as a dataframe to the function
dat<- yourdata %>% select(youritem1,youritem2...)

#------------
# Number of items, generate labels, and set size of text for correlations and item labels
n <- dim(dat)[2]
labels <- paste0("Item ", 1:n)
sizeItem = 16
sizeCor = 4


## List of scatterplots
scatter <- list()

for (i in 2:n) {
  for (j in 1:(i-1)) {
    
    # Data frame 
    df.point <- na.omit(data.frame(cbind(x = dat[ , j], y = dat[ , i])))
    
    # Plot
    p <- ggplot(df.point, aes(x, y)) +
      geom_jitter(size = .7, position = position_jitter(width = .2, height= .2)) +
      stat_smooth(method="lm", colour="black") +
      theme_bw() + theme(panel.grid = element_blank())
    
    name <- paste0("Item", j, i)
    scatter[[name]] <- p
  } }


## List of bar plots
bar <- list()
for(i in 1:n) {
  
  # Data frame
  bar.df <- as.data.frame(table(dat[ , i], useNA = "no"))
  names(bar.df) <- c("x", "y")
  
  # Plot
  p <- ggplot(bar.df) + 
    geom_bar(aes(x = x, y = y), stat = "identity", width = 0.6) +
    theme_bw() +  theme(panel.grid = element_blank()) +
    ylim(0, max(bar.df$y*1.05)) 
  
  name <- paste0("Item", i)
  bar[[name]] <- p
}


## List of tiles
tile <- list()

for (i in 1:(n-1)) {
  for (j in (i+1):n) {
    
    # Data frame 
    df.point <- na.omit(data.frame(cbind(x = dat[ , j], y = dat[ , i]))) # na.omit returns the object with incomplete cases removed
   
    x = df.point[, 1]
    y = df.point[, 2]
    correlation = cor.test(x, y,method="spearman",use="pairwise") #unter getOption("na.action") nachsehen ==> na.action = na.omit
    cor <- data.frame(estimate = correlation$estimate,
                      statistic = correlation$statistic,
                      p.value = correlation$p.value)
    cor$cor = paste0("r = ", sprintf("%.2f", cor$estimate), "\n", 
                    # "t = ", sprintf("%.2f", cor$statistic), "\n",
                     "p = ", sprintf("%.3f", cor$p.value))
    
    
    # Plot
    p <- ggplot(cor, aes(x = 1, y = 1)) +
      geom_tile(aes(fill = estimate)) +
      geom_text(aes(x = 1, y = 1, label = cor),
                colour = "White", size = sizeCor, show_guide = FALSE) +
      theme_bw() + theme(panel.grid = element_blank()) +
      scale_fill_gradient2(limits = c(-1,1), midpoint=0,low=("black"),mid="grey",high=("black")) #Neu hinzugefügt
    
    name <- paste0("Item", j, i)
    tile[[name]] <- p
  } }


# Convert the ggplots to grobs, 
# and select only the plot panels
barGrob <- llply(bar, ggplotGrob)
barGrob <- llply(barGrob, gtable_filter, "panel")

scatterGrob <- llply(scatter, ggplotGrob)
scatterGrob <- llply(scatterGrob, gtable_filter, "panel")

tileGrob <- llply(tile, ggplotGrob)
tileGrob <- llply(tileGrob, gtable_filter, "panel")


## Set up the gtable layout
gt <- gtable(unit(rep(1, n), "null"), unit(rep(1, n), "null"))


## Add the plots to the layout
# Bar plots along the diagonal
for(i in 1:n) {
  gt <- gtable_add_grob(gt, barGrob[[i]], t=i, l=i)
}

# Scatterplots in the lower half
k <- 1
for (i in 2:n) {
  for (j in 1:(i-1)) {
    gt <- gtable_add_grob(gt, scatterGrob[[k]], t=i, l=j)
    k <- k+1
  } }

# Tiles in the upper half
k <- 1
for (i in 1:(n-1)) {
  for (j in (i+1):n) {
    gt <- gtable_add_grob(gt, tileGrob[[k]], t=i, l=j)
    k <- k+1
  } }


# Add item labels
gt <- gtable_add_cols(gt, unit(1.5, "lines"), 0)
gt <- gtable_add_rows(gt, unit(1.5, "lines"), 2*n)

for(i in 1:n) {
  textGrob <- textGrob(labels[i], gp = gpar(fontsize = sizeItem)) 
  gt <- gtable_add_grob(gt, textGrob, t=n+1, l=i+1)
}

for(i in 1:n) {
  textGrob <- textGrob(labels[i], rot = 90, gp = gpar(fontsize = sizeItem)) 
  gt <- gtable_add_grob(gt, textGrob, t=i, l=1)
}


# Add small gap between the panels
for(i in n:1) gt <- gtable_add_cols(gt, unit(0.2, "lines"), i)
for(i in (n-1):1) gt <- gtable_add_rows(gt, unit(0.2, "lines"), i)


# Add chart title
gt <- gtable_add_rows(gt, unit(1.5, "lines"), 0)
textGrob <- textGrob("Korrelationsmatrix", gp = gpar(fontface = "bold", fontsize = 16)) 
gt <- gtable_add_grob(gt, textGrob, t=1, l=3, r=2*n+1)


# Add margins to the whole plot
for(i in c(2*n+1, 0)) {
  gt <- gtable_add_cols(gt, unit(.75, "lines"), i)
  gt <- gtable_add_rows(gt, unit(.75, "lines"), i)
}


# Draw it
grid.newpage()
grid.draw(gt)

Kudos to Stackoverflow User Sany Muspratt for helping me to figure this out.

And here it is in its full beauty (and grey for cheaper printing):

Plot with background in ggplot2: Visualising line-ups from Hurricane-festival 1997 – 2015

2015-06-04 by Niels

The Hurricane Festival is taking place again this june. It could be interesting to have a look on its development over the years. In this case, the amount of bands for each year.

First, i gathered some data from Wikipedia and put it in a csv-file. You can access the data here:Hurricane Festival Bands 1997-2015.

A simple barplot is the best way to plot this data. But to make it a little more appealing, i want to use a custom font and a wallpaper from the Hurricane Festival website. But before i start plotting, i need to get the data in shape.

#These packages will be needed
library("dplyr")
library("tidyr")
library("ggplot2")
library("jpeg")
library("grid")
library("extrafont")

# read in the data
hurricane<-read.csv("Gesamt_1997_2015.csv", header=F, sep=";")
colnames(hurricane)<-c("bands","year")

With some Dplyr-magic we aggregate the count of bands per year:

# Group by year and count the number of bands in each year
plot.df<- hurricane %>% group_by(year) %>% summarise(count=n())
plot.df$year<-as.factor(plot.df$year)

I want to use the font „Open Sans“. You can download the font here Open Sans. Then you have to import it in R using the „extrafont“-package.

font_import(paths="/Open_Sans/")
loadfonts(device="win") #otherwise i get errors on my windows-PC
fonts()

As the title of this blog-entry suggests, i also want to use jpg-background for my graph. I downloaded the wallpaper from the Hurricane Festival from here http://www.hurricane.de/de/interaktiv/downloads/.

Now we can plot the data:

# Import the Wallpaper
img <- readJPEG("wallpaper-hurricane-1920-1080.jpg")

# start plotting
plot<-ggplot(plot.df,aes(x=year,y=count)) + 
  annotation_custom(rasterGrob(img, width=unit(1,"npc"), height=unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf) +
  scale_y_continuous(expand=c(0,0), limits = c(0,max(plot.df$count)*1.05))+
  geom_bar(stat="identity",fill="white",width=0.8)+ 
  geom_text(aes(label=plot.df$count), vjust=1.5,colour="black") +
  theme_bw() +
  theme(text=element_text(family="Open Sans"),
        plot.title = element_text(size = rel(1.5), face = "bold", vjust = 1.5),
        axis.line=element_blank(),
        axis.text.y=element_blank(),
        #axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("How many Bands had each Hurricane-Festival in the years 1997-2015")+
  labs(x="@Niels_Bremen")
plot

With „alpha=0.85“ the bars become a little bit transparent, so you can see a bit more of the background-image.

plot<-ggplot(plot.df,aes(x=year,y=count)) + 
  annotation_custom(rasterGrob(img, width=unit(1,"npc"), height=unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf) +
   scale_y_continuous(expand=c(0,0), limits = c(0,max(plot.df$count)*1.05))+
  geom_bar(stat="identity",fill="white",width=0.8, alpha=0.85)+ 
  geom_text(aes(label=plot.df$count), vjust=1.5,colour="black") +
  theme_bw() +
  theme(text=element_text(family="Open Sans"),
        plot.title = element_text(size = rel(1.5), face = "bold", vjust = 1.5),
        axis.line=element_blank(),
        axis.text.y=element_blank(),
        #axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("How many Bands had each Hurricane-Festival in the years 1997-2015?")+
  labs(x="@Niels_Bremen")
plot

By the way, this is what the plot would look like with ggplot2-defaults:

Not that bad, for just one line of code.

ggplot(plot.df,aes(x=year,y=count)) +geom_bar(stat="identity")

Anyway, here are some further graphics:
Here, i tried to give one point for each time a band has played at hurricane-festival. I tried to use an icon (png-file) of a hand instead of a point, but haven´t figured out how to do it.

#only the Top29 Bands
plot.df2<-count.bands %>%
  filter(min_rank(desc(count)) <= 29) %>%
  arrange(desc(count))

#Preparing data for the plot
plot.df3 <- data.frame(band = rep(plot.df2$bands, plot.df2$count),
                      count = unlist(lapply(plot.df2$count, seq_len)))

#load font
font_import(paths="e:/Blog/Hurricane/Open_Sans/")
loadfonts(device="win")
fonts()

#background
img <- readJPEG("e:/Blog/Hurricane/wallpaper-hurricane-800x450.jpg")

#plotting
ggplot(plot.df3, aes(x = count, y=reorder(band,count))) +
  annotation_custom(rasterGrob(img, width=unit(1,"npc"), height=unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(colour="white") + 
  scale_x_continuous(limits=c(1, 7), breaks=seq(1,7, by=1)) +
  theme_bw()  +
  ggtitle("Most frequent bands at Hurricane-Festival (1997-2015)")+
  theme(text=element_text(family="Open Sans"),
        plot.title = element_text(size = rel(1.5), face = "bold", vjust = 1.5),
        axis.title.y=element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.x=element_blank())

As a last plot, i did a wordcloud from all Bands that ever played at Hurricane.

library(wordcloud)
library(RColorBrewer)

#Plot and save wordcloud image
png('e:/wordcloud_hurricane.png', width=1200,height=1200,res=260)
wordcloud(count.bands$bands, count.bands$count2,
          scale = c(1.4, .2),
          min.freq=1,
          max.words=Inf, 
          random.order=FALSE, 
          rot.per=0, 
          colors="Darkgreen",
          bg = "transparent")
dev.off()

Drawing path diagrams of structural equation models (SEM) for publication

2015-03-20 by Niels

Visualisation of structural equation models is done with path diagrams. They are an important means to give your audience an easier access to the equation system, that represents the theory you want to test. A path diagram is kind of like a flow-chart that uses arrows to show direct and indirect causal links between your exogenous and endogenous variables, as well as your latent and your observed variables. As structural equation models can become complex and contain a lot of parameters to describe the relationships between observed and latent variables, it´s an important step to visualize them properly. The automatically produced path-diagrams are often good enough as you work out your model, but they´re not polished enough for publication. In this post, i´ll show a selection of tools and their output.

There are many software solutions to do structural equation modeling. LISREL, AMOS, MPLUS, STATA, SAS, EQS and the R-packages sem, OpenMX, lavaan, Onyx – just to name the most popular ones. Most of these solutions have a built-in possibility to visualize their models. AMOS is a special case, because the modeling is done via drawing path diagrams. Onyx can do this, too. This can make it easy, especially for beginners. Sometimes you can find these AMOS path diagrams beeing published in articles.

In my experience the other SEM-tools (LISREL,MPLUS,STATA) don´t produce very appealing diagrams. Especially if your model is a little bigger. When it comes to the R-packages, there are significantly better attempts to generate visualisations of structural equation models. As a third solution, you can just use usual graphics software and type parameter-estimates by hand. It seems to me, that – at this point – this will generate the highest quality path diagrams.

Path diagrams consist of rectangles for observed variables, ellipses for latent variables, curves with arrow-heads on both sides for correlations and most important: straight lines with arrow-heads on one end as paths, that link a predicting and a predicted variable. Here is an example of what it could look like:

In the rest of this blog entry, i will show you examples of path diagrams:

1. Commercial Software

2. R-Packages

3. Extern graphic software

1. Solutions for automatical SEM-diagrams (commercial software)

2. Built-in solutions for SEM-diagrams (R-packages)
There are several R-packages for SEM-analysis. The fit-objects of these packages can be visualized. This list is not complete.

lavaan:

For lavaan, the best way to get path diagrams would be the semPlot-package by Sascha Epskamp (Project Homepage). Examples can be found here.

I don´t have much experience with the semPlot-package, but i think it´s offers a fast and good solution for CFA-pathdiagrams or small SEM-pathdiagram. Bigger pathdiagrams will need more work. Here´s a little example for a two-factor CFA:

pathdiagram<-semPaths(fit,whatLabels="std", intercepts=FALSE, style="lisrel",
                       nCharNodes=0, 
                       nCharEdges=0,
                       curveAdjacent = TRUE,title=TRUE, layout="tree2",curvePivot=TRUE)

sem:

For the sem-package by John Fox , there is a function named „pathDiagram()“, which produces graphviz/dot-code that can be imported in graphviz. The dot-code is a description, that defines the latent and manifest variables as nodes and the interconnections as edges of a diagram.
The semPlot-package also supports the sem-package.

OpenMX: For OpenMX, a free SEM-software that can be run via R. Exporting the model to dot-code and plotting it with graphviz is the recommended workflow.

Onyx: Onyx by Andreas Brandmaier is a free standalone SEM-tool. It offers an Amos-like graphical interface to specify the model and is capable of importing OpenMX-Code, but not lavaan-code.

Screenshots can be found here.
UPDATE:
Andreas Brandmaier wrote an experimental R-package that connects his SEM-Tool Onyx with R. It can be found here https://github.com/brandmaier/onyxR. I haven´t tried it yet, but it seems to take models from lavaan or OpenMX (R) and tries to generate pathdiagrams from it. If this works, this blogpost is complete and will be rewritten shortly.

DiagrammeR: Twitter-user @timelyportfolio (thank you!) recommended me the R-Package DiagrammeR by Richard Iannone. It doesn´t import fit-models from SEM-packages, but has it´s strengths in an easy syntax and fastly growing feature-list. I think, it´s very worthy to give it a try, because the path diagrams are not as hard to do as with graphviz but also reproducible.

UPDATE: Richard Iannone produced this example for me on stackoverflow

devtools::install_github("rich-iannone/DiagrammeR")
library(DiagrammeR)

grViz("
digraph SEM {

graph [layout = neato,
       overlap = true,
       outputorder = edgesfirst]

node [shape = rectangle]

a [pos = '-4,1!', label = 'e1', shape = circle]
b [pos = '-3,1!', label = 'ind_1']
c [pos = '-3,0!', label = 'ind_2']
d [pos = '-3,-1!', label = 'ind_3']
e [pos = '-1,0!', label = 'latent a', shape = ellipse]
f [pos = '1,0!', label = 'latent b', shape = ellipse]
g [pos = '1,1!', label = 'e6', shape = circle]
h [pos = '3,1!', label = 'ind_4']
i [pos = '3,-1!', label = 'ind_5']
j [pos = '4,1!', label = 'e4', shape = circle]
k [pos = '4,-1!', label = 'e5', shape = circle]

a->b
e->b [label = '0.6']
e->c [label = '0.6']
e->d [label = '0.6']

e->f [label = '0.321', headport = 'w']
g->f [tailport = 's', headport = 'n']

d->c [dir = both]

f->h [label = '0.6', tailport = 'ne', headport = 'w']
f->i [label = '0.6']

j->h
k->i

}
"

This produces this path-diagram:

update on DiagrammeR for SEM
Recently Tristan Mahr blogged his proof-of-concept that it´s possible to convert a lavaan-dataframe into node and edge dataframes for DiagrammeR. Wow, i´m really curious if this approach will be pursued any further.
Here is the link: https://rpubs.com/tjmahr/sem_diagrammer

The psychologist Andrey Lovakov also did an example for a SEM pathdiagram with DiagrammeR: https://github.com/lovakov/Lecturers-Org-Commitment/blob/master/Figure%201

another update on pathdiagrams in R
Stas Kolenikov from the University of Missouri did another example for SEM-pathdiagrams in R on his website http://staskolenikov.net/graphviz_sem.html. Instead of DiagrammeR he uses Graphviz. A problem he encountered concerns displaying covariances by curved two-sided arrows. It´s possible to do this, but as he writes „their aesthetic appeal is probably not that great“.

3. other / graphics software (selection)
If you want to use Graphviz or Tikz, you´ll get to very good looking diagrams, but you´ll also have to learn the „dot language“. If you have to do a lot of diagrams it can be worth learning it, but for my purposes, it´s kind of overkill.
Here are some Graphviz-Examples: pathdiagram with Graphviz

This leads us to „normal“ multi-purpose graphics software. Doing the graphs with an office-suite is pretty straightforward and selfexplaining. On the other hand, i wouldn´t trust office that everything stays in its place, when i move it around in a document.
Inkscape is a tool, that´s often mentioned by SEM-analysts. At the moment, i´m giving yed a try, which seems to be easy and produce quick and good looking graphs. Dia could also be an alternative, but i haven´t tried it, yet.

request for tipps
I´m really looking out for best practices in drawing path diagrams for structural equation models. Please leave a comment, if you know another tool, that isn´t listed, or if you have a workflow, that can be adapted by others. I think there´s a gap between working-state path-diagrams and diagrams suitable for publication.

How to apply survey weights in structural equation modeling (SEM) with lavaan.

2015-03-17 by Niels

The R-Package lavaan is my favourite tool for fitting structural equation models (SEM). Its biggest advantages: It´s free, it´s open source and its range of functions is growing steadily.
Before lavaan, i used MPLUS, which still has the widest functionality of all SEM-Tools and is the most sophisticated software for latent variable modeling. The Muthéns and their MPLUS-team offer incredibly good support and documentation. The only problem is, that the software isn´t free and without a license you can´t get any of the support.
For me, one drawback of lavaan is, that it can´t model latent class models or mixture models …yet! Yves Rosseel is planning to add this in the next two years.

lavaan stands for „latent variable analysis“. The package is available via CRAN and has a good tutorial on the lavaan project homepage. Models are specified via syntax. Thankfully, the lavaan-syntax is kept pretty simple. At least, it´s a lot easier than the LISREL-syntax (the first, and original SEM-software). But it´s not as easy as drawing a path-model in AMOS, the SPSS-module. Anyway, once you get to a little more complex models, you´ll find working with syntax a lot more efficient. If you don´t like working with syntax, i recommend having a look at Onyx – a graphical interface for structural equation modeling by Andreas Brandmaier. It´s a free tool in which you can draw your SEM as a path diagram and generate the lavaan-syntax from it.
But, when you do SEM-models the syntax will be the least complicated thing you had to learn, so i don´t think that will be a problem at all.

Install lavaan
If you want to use survey weights, you have to install lavaan, the survey package and lavaan.survey. Lavaan is the package used for modeling and the survey-package converts your data into an survey-design-object. After you specified the model in a lavaan fit object and you have generated a survey-design-object from your data, these two objects are passed to the lavaan.survey function, which will calculate the weighted model.

First, you install the packages:

#Install lavaan
install.packages("lavaan", dependencies=TRUE)
library(lavaan)

#install lavaan.survey
install.packages("lavaan.survey")
library(lavaan.survey)

#Install survey-package
install.packages("survey")
library(survey)

Generate the survey-design object
After the packages and the data are loaded, a svydesign-object is generated from our data. It´s not a suprise, that with „id=~ID“ the column „ID“ in the dataframe will be used as id-variable. With „weights= ~weights_trunc“ the column which holds the survey-weights is defined and with „data=data“ the dataframe is chosen.

library("survey") #load survey package 
data<- read.csv(file = "data.csv", header=T, sep=",") #read data

#if necessary - recode missing value "9" to NA
df[df== 9] <- NA	

#generate survey-design object
svy.df<-svydesign(id=~ID, 
                  weights=~weight_trunc,
                  data=data)

Specifying the model
I´ll use a simple structural equation model with two latent variables, measured by three and two indicator-variables. The exogenous latent variable „latent_a“ is measured by x1-x3, the endogenous latent variable „latent_b“ is measured by y1-y2. The variable „latent_b“ is regressed on (predicted by) „latent_a“.

library(lavaan)
model_1 <- '# measurement model
              latent_a =~ F09_a + F09_b + F09_c
              latent_b =~ F12_a + F12_b 

             # regressions
              latent_b ~ latent_a
            '

lavaan.fit <- sem(model_1, 
                     data=data,                      
                     estimator="MLR", # robust fit / when you have missing data
                     missing = "ml",               #fiml for missing data
                     mimic="Mplus")

#you can run the model (unweighted) at this point and inspect it
summary(lavaan.fit,fit.measures=TRUE, standardized=TRUE)

Normally, i would use MLM as estimator to get robust estimates (robust against non-normality of the endogenous variable), but in this case i chose MLR, because FIML is not available with MLM.
FIML (Full Information Maximum Likelihood algorithm- defined with missing=“ml“) is regarded as equally efficiant to multiple imputation in handling item-nonresponse. But, it can be a good idea to do multiple imputation anyway, because bootstrapping the standard errors is only available with ML-estimator. On the other Hand, it´s an advantage that with FIML it´s not necessary to explicitly model missingess, because FIML uses the already specified SEM.

When using the lavaan.survey-package, you can´t use fiml (yet). You have to do a multiple imputation for your data, if you have missings, and instead of MLR lavan.survey uses MLM as default.

Fitting the model
When the model is fitted with lavaan.survey, the covariance-matrix will be estimated using the svyvar-object generated by the survey-package . The lavaan model uses this weighted covariance-matrix with the MLM-estimator to fit the model. MLM is not compatible with missing=“fiml“, so if your data has missings you have to do multiple imputation first and pass your imputed dataframes as a list to the svydesign-package so it becomes a svy.design-object which can be used as data in lavaan.survey. The resulting parameters, fit indices and statistics will be adjusted for the sampling design. Also, if MLM is used, the chi-square (likelihood-ratio) test-statistic will be transformed to a Satorra-Bentler corrected chi-square. [This information stems from the lavaan.survey documentation]. In lavaan, you can choose the form of your output. Because i worked a lot with MPLUS, i prefer the MPLUS-Output.

library(lavaan.survey)

#Fit the model using weighted data (by passing the survey-design object we generated above)
survey.fit <- lavaan.survey(lavaan.fit, 
                            survey.design, 
                            estimator="ML") 

#inspect output
summary(survey.fit,
        fit.measures=TRUE, 
        standardized=TRUE,
        rsquare=TRUE)

# if you´re interested in descriptive statistics
# you can access the missing data patterns 
inspect(fit, 'patterns') 

# and the coverage of the covariance matrix (like in MPLUS)
inspect(fit, 'coverage')

Results
I wouldn´t have expected that using weights in a SEM-analysis with lavaan is so easy to accomplish.
Here are the fit-indices of the weighted SEM.

lavaan (0.5-17) converged normally after  24 iterations

  Number of observations                           577

  Estimator                                         ML
  Minimum Function Test Statistic               11.664
  Degrees of freedom                                 4
  P-value (Chi-square)                           0.020

Model test baseline model:

  Minimum Function Test Statistic              955.394
  Degrees of freedom                                10
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.992
  Tucker-Lewis Index (TLI)                       0.980

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3675.100
  Loglikelihood unrestricted model (H1)      -3669.268

  Number of free parameters                         16
  Akaike (AIC)                                7382.200
  Bayesian (BIC)                              7451.926
  Sample-size adjusted Bayesian (BIC)         7401.132

Root Mean Square Error of Approximation:

  RMSEA                                          0.058
  90 Percent Confidence Interval          0.021  0.097
  P-value RMSEA <= 0.05                          0.314

Standardized Root Mean Square Residual:

  SRMR                                           0.022

…and so on. I don´t show the whole results.

It´s common to show the parameter-estimates in a path-diagram. In my next blogging-session i´ll demonstrate how to draw path diagrams of a lavaan-model with SEMPLOT (Project Homepage).

Recoding all vectors in a dataframe at once (R)

2015-03-11 by Niels

This post is just a little reminder for myself, because i had to look this up a few times.
Version 1

#Before:
[1] 5 4 3 5 3 4

#After:
[1] Stimme gar nicht zu  
Stimme eher nicht zu 
Teils / teils       
Stimme gar nicht zu  
Teils / teils        
Stimme eher nicht zu

I use „recode“ from the car package, to recode the variables. This function will be used with „apply“ so it will recode all vectors in the dataframe.

Here´s how it´s done:

df<- apply(df, 2, function(x) {
  x <- car::recode(x,"1='Stimme voll und ganz zu'; 2='Stimme eher zu';3='Teils / teils';4='Stimme eher nicht zu';5='Stimme gar nicht zu';9=NA"); x
  })
df<-as.data.frame(df)

Version 2
If you only want to recode some of the variables in you dataframe, you can define them in a list and use this list in a for-loop.

library(car)
var.list<-c("var1","var3","var5")
for (v in var.list)
  df[[v]]<-recode(df[[v]], "1=5;2=4;4=2;5=1")

Version 3
If the values just have to be reversed, there´s an even simpler way:

recode.list<-c("variable1","variable2") 
df[recode.list] <- 6 - df[recode.list]

Diverging stacked barchart for plotting likert Items

2015-03-09 by Niels

Questionnaires in the social sciences often include rating items to measure the variability of peoples´ attitudes towards something. Respondents are given a statement and have to report how much they agree or disagree on a 5- or 7-point-scale. A set of rating-items like these can be combined to a likert-scale. It´s also common to build an index-value for the respondents, if the items meet certain criteria of quality. There´s still some controversy, if it´s adequate to use ranked (ordinal) data like likert-items to calculate means. Most researchers think it´s approriate if the scale has at least 5 points and the variable can be considered as an ordinal measure of a continuous attitude.

Anyway. Visualization of the data ist always a good starting point. For this purpose, there are a lot of R-Packages like the HH-Package with its Likert-Function, or the likert-package from Jason Bryer and last, but not least: The sjp.likert-Function from Daniel Lüdecke, which would be my favourite.

All these packages produce sophisticated and very appealing plots. Under its hood, the HH-package uses lattice and the likert and sjPlot package are build on ggplot2. I tried HH-package, but as a ggplot2-user i realized, it would take me too long to figure out the little details. The other two packages could do what i want, but they both need raw-data (SPSS-like) and can´t work with already aggregated data. Both also have distinct kinds of dealing with the „neutral“-category of the items.
Long story short, i decided to use ggplot2 directly instead of using packages build on ggplot2 that have developed a lot of complexity on their own.

The Plot
This plot is a small example. If the code seems too messy to you, or you think the plot can be improved: i´m always interested in how to make things better, please leave a comment.
For example, one could criticize, that the x-axis isn´t meaningful, because of the neutral-category should not be splitted in negative/positive like this. So perhaps, the vertical line and the x-axis-labels should be removed. On the other hand, the HH-Plot likert-function does it the same way. It would be possible to add percentage-values inside the stacked bars, but i think that would be too much. I decided, to make a stacked-frequency table with the sjPlot-Package to complement my likert-plot.

And this is the code, i´ve written:

library("plyr")
library("dplyr")
library("ggplot2")

# example data
Variable<-c("1","1","1","1","1","2","2","2","2","2","3","3","3","3","3","4","4","4","4","4")
level<-c(5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1)
perc_w<-c(3.70,11.80,10.10,25.80,38.60,2.00,16.90,13.25,28.80,25.80,1.80,6.50,9.35,33.60,39.40,3.50,12.40,14.10,34.80,21.10)
df<-data.frame(Variable,level,perc_w)
df$perc_w<-as.numeric(df$perc_w)
df$level<-as.factor(df$level)

# item text
items<-c("~ It´s not known, if climate change is real",
         "~ In my opinion, the risks of climate change are exaggerated by activists",
         "~ Climate change is not as dangerous as it is claimed", 
         "~ I´m convinced that we can handle climate change")

df$Variable<-as.character(df$Variable)
df$Variable[df$Variable==1]<-items[1]
df$Variable[df$Variable==2]<-items[2]
df$Variable[df$Variable==3]<-items[3]
df$Variable[df$Variable==4]<-items[4]
df$Variable<-as.ordered(df$Variable)

# calculate halves of the neutral category
df.split <-df %>% filter(level==3) %>% mutate(perc_w=as.numeric(perc_w/2)) 

# replace old neutral-category
df<-df %>% filter(!level==3)   
df<-full_join(df,df.split) %>% arrange(level)  %>% arrange(desc(Variable)) 


#split dataframe
df1<-df %>% filter(level == 3 | level== 2 | level==1) 
df2<-df %>% filter(level == 5 | level== 4 | level==3) %>% mutate(perc_w = perc_w *-1)

# automatic line break
df1$Variable  <-str_wrap(df1$Variable, width = 41) 
df2$Variable  <-str_wrap(df2$Variable, width = 41) 

# reorder factor "Variable"
df1$Variable   <- factor(df1$Variable, levels=rev(unique(df1$Variable)))
df2$Variable   <- factor(df2$Variable, levels=rev(unique(df2$Variable)))

#Plot  
p<-ggplot() +
  geom_bar(data=df1, aes(x = Variable, y=perc_w, fill = level, order = -as.numeric(level)),position="stack", stat="identity") +
  geom_bar(data=df2, aes(x = Variable, y=perc_w, fill = level, order = as.numeric(level)),position="stack", stat="identity") +
  geom_hline(yintercept = 0, color =c("black"))+
  theme_bw() + 
  coord_flip() +
  guides(fill=guide_legend(title="",reverse=TRUE)) +
  scale_fill_brewer(palette="Blues", name="",labels=c("--","-","0","+","++")) +
  labs(title=expression(atop(bold("Attitudes towards climate change"),
                             atop(italic("Some roughly translated items"),""))),
       y="percentages",x="") +  
  theme(legend.position="top",
        axis.ticks = element_blank(), 
        plot.title = element_text(size=25),
        axis.title.y=element_text(size=16),
        axis.text.y=element_text(size=13),
        axis.title.x=element_text(size=16),
        axis.text.x=element_text(size=13),
        legend.title=element_text(size=14),
        legend.text=element_text(size=12)    
  )
p

R-Tipps: Dplyr in Funktionen nutzen

2015-01-27 by Niels

Seit es das R-Package dplyr gibt, sind viele Aufgaben bei der Datenbearbeitung einfacher geworden. Soll diesselbe Umformung für viele Variablen durchgeführt werden, kann es eine Menge Tipparbeit sparen, wenn man den Code in eine Funktion packt.

Bei der Arbeit mit dplyr ist es wichtig, dass man plyr – sofern man es benötigt – grundsätzlich als erstes läd. Andernfalls treten mit Sicherheit an irgendeinem Punkt der Arbeit Fehlermeldungen auf. Summarise() gibt es beispielsweise in plyr, aber auch in dplyr, was zu Konflikten wie diesem führt: „Error in n() : This function should not be called directly„.

# Installieren der nötigen Pakete
install.packages("plyr")
install.packages("dplyr")
install.packages("tidyr")
install.packages("lazyeval") #oder: devtools::install_github("hadley/lazyeval")
library("plyr")
library("dplyr")
library("tidyR")
library("lazyeval")

Was macht dplyr so besonders?
Dplyr-Code ist einfach lesbar, logisch und arbeitet sequenziell, da es den Pipe-Operator %>% aus aus dem MagrittR-Package nutzt.
Durch „%>%“ (sprich: then) können Manipulationen in einer intuitiven Reihenfolge ausgeführt werden. Während früher eine Funktion umständlich mit verschachtelten Klammern zu schreiben war, kann man mit dplyr einfach der Intuition folgen und einen Schritt nach dem nächsten machen.

Ein einfaches Beispiel: Ich möchte alle Personen eines Datensatzes entsprechend der Kategorien einer Variable in Gruppen einteilen (z.B. Geschlecht, Einkommensstufe, Postleitzahl) und dann die Anzahl der Personen je Kategorie bestimmen. Ich wähle den dataframe aus und gebe ihn mit dem Pipe-Operator (%>%) an die group_by() Funktion von dplyr weiter. Anschließend wird der gruppierten Datensatz an die summarise()-Funktion weitergegeben, die dann mit n() die Fälle zählt. Das Ergebnis wird als dataframe-Objekt „Beispiel“ abgelegt.

Durch group_by() werden Gruppen erzeugt und mittels summarise() werden die Daten je Gruppe so aggregiert, wie ich angebe. Hier wird nur n() je Gruppe gezählt.

Beispiel<-dataframe %>% 
    group_by(variable) %>% 
    summarise(
      freq = n()
     )

Dies ist natürlich nur Minimalbeispiel. Um einen Eindruck davon zu bekommen, was dplyr alles kann, empfehle ich folgenden Überblick:
Data Wrangling with dplyr und tidyr Cheat Sheet.
95% aller Aufgaben, die das Umformen von Daten, erstellen neuer Variablen, Gruppieren von Daten, Auswählen von Fällen, Variablen, Werten, oder Zusammenfügen von Datensätzen erfordern, können damit erledigt werden. Und: Alle diese Schritte können miteinander kombiniert werden.

dplyr in Funktionen
Versucht man den obigen dplyr-code in eine Funktion zu schreiben, stößt man auf eine Fehlermeldung. Warum? Innerhalb von Funktionen muss eine andere Version der dplyr-Funktionen verwendet werden: Standard evaluation (SE)

Anstatt summarise() ==> summarise_()
Anstatt mutate() ==> mutate_()
Anstatt filter() ==> filter_()
usw.

Zusätzlich zu diesen SE-Versionen von dplyr-Funktionen ist die Übergabe der Input-Objekte an die Funktion leicht unterschiedlich. Entweder müssen die Objekte

als Formel „~ Objekt“
als „quote(Objekt)“
oder als String mit Anführungszeichen “ ‚Objekt‘ „

eingefügt werden. Hadley Wickham, der Programmierer des Pakets, empfiehlt die erste Möglichkeit.

Als Funktion sieht der obige dplyr-code so aus:

test.function <- function(dataframe, variable){
  dataframe %>% 
    group_by_(variable) %>% 
    summarise_(
      freq = ~n()
)
}
#Aufrufen der Funktion für den test_dataframe und die Variable gender
test<-test.function(test_dataframe, ~gender)

Die Funktion „test.function(x,y) kann zwei Input-Objekte annehmen, für die sie den dplyr-code durchführt und das Ergebnis als „test“-dataframe abspeichert.

Komplexeres Beispiel
Es ist möglich, ganze Auswertungsprozeduren als Funktion zusammenzufassen. Der folgende Code führt eine Gewichtung durch, bildet Anteilswerte der Antworten und generiert einen Plot der Daten.

#Funktion zur Berechnung der gewichteten Anteilswerte
weighting.function <- function(dataframe, variable){
  dataframe %>% 
    group_by_(variable) %>% 
    summarise_(
      freq = ~n(),
      freq_weighted = ~sum(weight)                    
    ) %>%
    mutate_(
      perc=~freq/sum(freq)*100,
      perc_weighted=~freq_weighted/sum(freq_weighted)*100
    ) %>%
    gather(key=Gewichtung,value=Wert,perc:perc_weighted)
}

#Ausführen der Funktion
gender_w<-weighting.function(datensatz, ~gender)

# Plotten der Daten mit ggplot2
library("ggplot2")
ggplot(gender_w) + geom_bar(aes(x=gender,y=Wert), stat="identity") + facet_grid(~Gewichtung) +scale_x_discrete(labels=c("männlich","weiblich","NA"))+ ggtitle("Anteilswerte Geschlecht - gewichtet und ungewichtet")

Das Ergebnis:
An diesem Plot sieht man, dass vor der Gewichtung der Anteil der Männer etwas höher war, während bei den gewichteten Daten der Frauenanteil etwas erhöht wird. Da es in der Sozialstruktur Deutschlands etwas mehr Frauen als Männer gibt (hauptsächlich wegen der höheren Lebenserwartung), spiegeln die gewichteten Daten die Grundgesamtheit etwas besser wider.

Hier gibt es Weitere Informationen zu standard evaluation in dplyr
Und über die R-Console gelangt man an die vignette: vignette(„tidy-data“).