When you work with R for some time, you really start to wonder why so many R packages have some kind of pun in their name. Intended or not, the **poLCA** package is one of them. Today I'll give a glimpse of this package, which has nothing to do with dancing or nice dotted dresses.

This article is a draft and may be revised at any time.

The poLCA package takes its name from "Polytomous Latent Class Analysis". Latent class analysis is an awesome and still underused (at least in the social sciences) statistical method for identifying unobserved groups of cases in your data. Polytomous latent class analysis is applicable to categorical data. The unobserved (latent) variable could be different attitude sets of people, which lead to certain response patterns in a survey. In marketing or market research, latent class analysis could be used to identify unobserved target groups with different attitude structures on the basis of their buying decisions. The data would be binary (not bought, bought), and depending on the products you could perhaps identify a class that chooses the cheapest, the most durable, the most environmentally friendly, the most status-relevant […] product. The latent classes are assumed to be nominal. The poLCA package does not support sampling weights yet.

By the way: there are also other packages for latent class models, such as lcmm and mclust.

**What does a latent class analysis try to do?**

A latent class model uses the different response patterns in the data to find similar groups. It tries to assign cases to groups in such a way that the variables are "conditionally independent": within a group, the correlations between the variables become zero, because the group membership explains any relationship between the variables.

Latent class analysis is different from latent profile analysis: the latter uses continuous data, whereas the former is used with categorical data.

Another important aspect of latent class analysis is that your elements (persons, observations) are not assigned absolutely, but probabilistically. So each person gets a probability of being assigned to group 1, group 2, […], group k.

Before you estimate your LCA model, you have to choose how many classes you want. You aim for a small number of classes, so that the model is still adequate for the data, but also parsimonious.

If you have a theoretically justified number of groups (k) you expect in your data, you perhaps model only that one solution. A typical assumption would be one group that is pro, one that is contra, and one that is neutral towards an object. Another, more exploratory approach is to fit multiple models – perhaps one with 2 groups, one with 3, one with 4 – and compare them against each other. If you choose this second way, you can take the model with the most plausible interpretation. Additionally, you can compare the solutions by the BIC or AIC information criteria. BIC is preferred over AIC in latent class models, but usually both are reported. A smaller BIC is better than a bigger BIC. Next to AIC and BIC you also get a chi-square goodness-of-fit statistic.

I once asked Drew Linzer, the developer of poLCA, whether some kind of LMR test (like in Mplus) would be implemented at some point. He said that he wouldn't rely on statistical criteria to decide which model is best; rather, he would look at which model has the most meaningful interpretation and best answers the research question.

Latent class models belong to the family of (finite) mixture models. The parameters are estimated by the EM algorithm. It's called EM because it has two steps: an "E"xpectation step and a "M"aximization step. In the first, class-membership probabilities are computed (the first time with some starting values), and in the second step those estimates are altered to maximize the likelihood function. The two steps alternate until the algorithm converges on a maximum of the likelihood function. Ideally, this is the solution with the highest possible likelihood: the global maximum. That's why starting values are important in latent class analysis. I'm a social scientist who applies statistics, not a statistician, but as far as I understand it, depending on the starting values the algorithm can stop at one (!) local maximum that is not the "best" local maximum (the global maximum), so the algorithm perhaps should have run further. If you run the estimation multiple times with different starting values and it always arrives at the same solution, you can be pretty sure that you found the global maximum.
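In poLCA, the `nrep` argument handles these multiple restarts for you: the model is re-estimated from several random sets of starting values and only the run with the highest log-likelihood is kept. A minimal sketch, using the small `values` example dataset that ships with poLCA:

```r
library(poLCA)

# 'values' is a small example dataset included in poLCA
# (four dichotomous items A-D, coded 1/2)
data(values)
f <- cbind(A, B, C, D) ~ 1

# nrep = 10: estimate from 10 different random starting values,
# keep the run with the highest log-likelihood
lc <- poLCA(f, values, nclass = 2, nrep = 10, verbose = FALSE)
lc$llik  # log-likelihood of the best run
```

If repeated calls like this keep returning the same maximum log-likelihood, the solution is very likely the global maximum.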

**Data preparation**

Latent class models don't assume the variables to be continuous, but (unordered) categorical. The variables are not allowed to contain zeros, negative values, or decimals, as you can read in the poLCA vignette. If your variables are binary 0/1, you should add 1 to every value, so they become 1/2. If you have NA values, you have to recode them into a new category. For rating items with values from 1 to 5, the NAs could become a new category 6:

```r
mydata[is.na(mydata)] <- 6
```
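The whole recoding step can be sketched with a small made-up dataframe (hypothetical variable names; the key point is that poLCA requires positive integer codes):

```r
# made-up binary purchase data with a missing value
df <- data.frame(bought_a = c(0, 1, NA, 1),
                 bought_b = c(1, 0, 0, 1))

df[] <- lapply(df, function(x) x + 1)  # shift 0/1 to 1/2
df[is.na(df)] <- 3                     # give the NAs their own category
df
```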

**Running LCA models**

First you should install the package and define a formula for the model to be estimated.

```r
install.packages("poLCA")
library("poLCA")

# By the way, for all examples in this article, you'll need some more packages:
library("reshape2")
library("plyr")
library("dplyr")
library("ggplot2")
library("ggparallel")
library("igraph")
library("tidyr")
library("knitr")

# these are the defaults of the poLCA command
poLCA(formula, data, nclass=2, maxiter=1000, graphs=FALSE, tol=1e-10,
      na.rm=TRUE, probs.start=NULL, nrep=1, verbose=TRUE, calc.se=TRUE)

# estimate the model with k classes
k <- 3
lc <- poLCA(f, data, nclass=k, nrep=30, na.rm=FALSE, graphs=TRUE)
```

The following code stems from this article. It runs a sequence of models with two to ten classes. With nrep=10, each model is estimated 10 times from different starting values and the run with the highest log-likelihood is kept; the loop then retains the model with the lowest BIC.

```r
# select variables
mydata <- data %>% dplyr::select(F29_a, F29_b, F29_c, F27_a, F27_b, F27_e,
                                 F09_a, F09_b, F09_c)

# define formula
f <- with(mydata, cbind(F29_a, F29_b, F29_c, F27_a, F27_b, F27_e,
                        F09_a, F09_b, F09_c) ~ 1)

#------ run a sequence of models with 2-10 classes and print out the model with the lowest BIC
max_II <- -100000
min_bic <- 100000
for(i in 2:10){
  lc <- poLCA(f, mydata, nclass=i, maxiter=3000,
              tol=1e-5, na.rm=FALSE, nrep=10, verbose=TRUE, calc.se=TRUE)
  if(lc$bic < min_bic){
    min_bic <- lc$bic
    LCA_best_model <- lc
  }
}
LCA_best_model
```

You'll get the poLCA package's standard output for the best model:

```
Conditional item response (column) probabilities,
 by outcome variable, for each class (row)

$F29_a
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0413 0.2978 0.1638 0.2487 0.1979 0.0428 0.0078
class 2:  0.0000 0.0429 0.0674 0.3916 0.4340 0.0522 0.0119
class 3:  0.0887 0.5429 0.2713 0.0666 0.0251 0.0055 0.0000

$F29_b
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0587 0.2275 0.1410 0.3149 0.1660 0.0697 0.0222
class 2:  0.0000 0.0175 0.0400 0.4100 0.4249 0.0724 0.0351
class 3:  0.0735 0.4951 0.3038 0.0669 0.0265 0.0271 0.0070

$F29_c
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.0371 0.2082 0.1022 0.1824 0.3133 0.1365 0.0202
class 2:  0.0000 0.0086 0.0435 0.3021 0.4335 0.1701 0.0421
class 3:  0.0815 0.4690 0.2520 0.0903 0.0984 0.0088 0.0000

$F27_a
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.7068 0.2373 0.0248 0.0123 0.0000 0.0188 0.0000
class 2:  0.6914 0.2578 0.0128 0.0044 0.0207 0.0085 0.0044
class 3:  0.8139 0.1523 0.0110 0.0000 0.0119 0.0000 0.0108

$F27_b
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.6198 0.1080 0.0426 0.0488 0.1226 0.0582 0.0000
class 2:  0.6336 0.1062 0.0744 0.0313 0.1047 0.0370 0.0128
class 3:  0.7185 0.1248 0.0863 0.0158 0.0325 0.0166 0.0056

$F27_e
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)  Pr(7)
class 1:  0.6595 0.1442 0.0166 0.0614 0.0926 0.0062 0.0195
class 2:  0.6939 0.1474 0.0105 0.0178 0.0725 0.0302 0.0276
class 3:  0.7869 0.1173 0.0375 0.0000 0.0395 0.0000 0.0188

$F09_a
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)
class 1:  0.8325 0.1515 0.0000 0.0000 0.0160 0.0000
class 2:  0.1660 0.3258 0.2448 0.1855 0.0338 0.0442
class 3:  0.1490 0.2667 0.3326 0.1793 0.0575 0.0150

$F09_b
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)
class 1:  0.8116 0.1594 0.0120 0.0069 0.0000 0.0101
class 2:  0.0213 0.3210 0.4000 0.2036 0.0265 0.0276
class 3:  0.0343 0.3688 0.3063 0.2482 0.0264 0.0161

$F09_c
           Pr(1)  Pr(2)  Pr(3)  Pr(4)  Pr(5)  Pr(6)
class 1:  0.9627 0.0306 0.0067 0.0000 0.0000 0.0000
class 2:  0.1037 0.4649 0.2713 0.0681 0.0183 0.0737
class 3:  0.1622 0.4199 0.2338 0.1261 0.0258 0.0322

Estimated class population shares
 0.2792 0.4013 0.3195

Predicted class memberships (by modal posterior prob.)
 0.2738 0.4055 0.3206

=========================================================
Fit for 3 latent classes:
=========================================================
number of observations: 577
number of estimated parameters: 155
residual degrees of freedom: 422
maximum log-likelihood: -6646.732

AIC(3): 13603.46
BIC(3): 14278.93
G^2(3): 6121.357 (Likelihood ratio/deviance statistic)
X^2(3): 8967872059 (Chi-square goodness of fit)
```

**Generate a table showing fit values of multiple models**

Now I want to build a table comparing various model fit values, like this:

```
Modell    log-likelihood  resid. df  BIC       aBIC      cAIC      likelihood-ratio  Entropy
Modell 1  -7171.940       526        14668.13  14046.03  14719.13  7171.774          -
Modell 2  -6859.076       474        14373.01  14046.03  14476.01  6546.045          0.86
Modell 3  -6646.732       422        14278.93  13786.87  14433.93  6121.357          0.879
Modell 4  -6528.791       370        14373.66  13716.51  14580.66  5885.477          0.866
Modell 5  -6439.588       318        14525.86  13703.64  14784.86  5707.070          0.757
Modell 6  -6366.002       266        14709.29  13721.99  15020.29  5559.898          0.865
```

This table was built by the following code:

```r
# select data
mydata <- data %>% dplyr::select(F29_a, F29_b, F29_c, F27_a, F27_b, F27_e,
                                 F09_a, F09_b, F09_c)

# define formula
f <- with(mydata, cbind(F29_a, F29_b, F29_c, F27_a, F27_b, F27_e,
                        F09_a, F09_b, F09_c) ~ 1)

## models with different numbers of classes, without covariates:
set.seed(01012)
lc1 <- poLCA(f, data=mydata, nclass=1, na.rm=FALSE, nrep=30, maxiter=3000) # log-linear independence model
lc2 <- poLCA(f, data=mydata, nclass=2, na.rm=FALSE, nrep=30, maxiter=3000)
lc3 <- poLCA(f, data=mydata, nclass=3, na.rm=FALSE, nrep=30, maxiter=3000)
lc4 <- poLCA(f, data=mydata, nclass=4, na.rm=FALSE, nrep=30, maxiter=3000)
lc5 <- poLCA(f, data=mydata, nclass=5, na.rm=FALSE, nrep=30, maxiter=3000)
lc6 <- poLCA(f, data=mydata, nclass=6, na.rm=FALSE, nrep=30, maxiter=3000)

# generate a dataframe with fit values
results <- data.frame(Modell           = c("Modell 1"),
                      log_likelihood   = lc1$llik,
                      df               = lc1$resid.df,
                      BIC              = lc1$bic,
                      ABIC             = (-2 * lc1$llik) + ((log((lc1$N + 2) / 24)) * lc1$npar),
                      CAIC             = (-2 * lc1$llik) + lc1$npar * (1 + log(lc1$N)),
                      likelihood_ratio = lc1$Gsq)
results$Modell <- as.character(results$Modell)

results[2,1] <- "Modell 2"
results[3,1] <- "Modell 3"
results[4,1] <- "Modell 4"
results[5,1] <- "Modell 5"
results[6,1] <- "Modell 6"

results[2,2] <- lc2$llik
results[3,2] <- lc3$llik
results[4,2] <- lc4$llik
results[5,2] <- lc5$llik
results[6,2] <- lc6$llik

results[2,3] <- lc2$resid.df
results[3,3] <- lc3$resid.df
results[4,3] <- lc4$resid.df
results[5,3] <- lc5$resid.df
results[6,3] <- lc6$resid.df

results[2,4] <- lc2$bic
results[3,4] <- lc3$bic
results[4,4] <- lc4$bic
results[5,4] <- lc5$bic
results[6,4] <- lc6$bic

# aBIC
results[2,5] <- (-2 * lc2$llik) + ((log((lc2$N + 2) / 24)) * lc2$npar)
results[3,5] <- (-2 * lc3$llik) + ((log((lc3$N + 2) / 24)) * lc3$npar)
results[4,5] <- (-2 * lc4$llik) + ((log((lc4$N + 2) / 24)) * lc4$npar)
results[5,5] <- (-2 * lc5$llik) + ((log((lc5$N + 2) / 24)) * lc5$npar)
results[6,5] <- (-2 * lc6$llik) + ((log((lc6$N + 2) / 24)) * lc6$npar)

# cAIC
results[2,6] <- (-2 * lc2$llik) + lc2$npar * (1 + log(lc2$N))
results[3,6] <- (-2 * lc3$llik) + lc3$npar * (1 + log(lc3$N))
results[4,6] <- (-2 * lc4$llik) + lc4$npar * (1 + log(lc4$N))
results[5,6] <- (-2 * lc5$llik) + lc5$npar * (1 + log(lc5$N))
results[6,6] <- (-2 * lc6$llik) + lc6$npar * (1 + log(lc6$N))

results[2,7] <- lc2$Gsq
results[3,7] <- lc3$Gsq
results[4,7] <- lc4$Gsq
results[5,7] <- lc5$Gsq
results[6,7] <- lc6$Gsq
```

Now I calculate the entropy (a pseudo-R-squared) for each solution. I took the idea from Daniel Oberski's presentation on LCA.

```r
entropy <- function(p) sum(-p * log(p))

results$R2_entropy <- NA   # becomes column 8 of the results dataframe
results[1,8] <- "-"        # a 1-class model has no meaningful entropy R2

error_prior <- entropy(lc2$P) # class proportions, model 2
error_post  <- mean(apply(lc2$posterior, 1, entropy), na.rm = TRUE)
results[2,8] <- round(((error_prior - error_post) / error_prior), 3)

error_prior <- entropy(lc3$P) # class proportions, model 3
error_post  <- mean(apply(lc3$posterior, 1, entropy), na.rm = TRUE)
results[3,8] <- round(((error_prior - error_post) / error_prior), 3)

error_prior <- entropy(lc4$P) # class proportions, model 4
error_post  <- mean(apply(lc4$posterior, 1, entropy), na.rm = TRUE)
results[4,8] <- round(((error_prior - error_post) / error_prior), 3)

error_prior <- entropy(lc5$P) # class proportions, model 5
error_post  <- mean(apply(lc5$posterior, 1, entropy), na.rm = TRUE)
results[5,8] <- round(((error_prior - error_post) / error_prior), 3)

error_prior <- entropy(lc6$P) # class proportions, model 6
error_post  <- mean(apply(lc6$posterior, 1, entropy), na.rm = TRUE)
results[6,8] <- round(((error_prior - error_post) / error_prior), 3)

# rename the columns
colnames(results) <- c("Model","log-likelihood","resid. df","BIC",
                       "aBIC","cAIC","likelihood-ratio","Entropy")
lca_results <- results

# Generate an HTML table and show it in the RStudio viewer (for copy & paste)
view_kable <- function(x, ...){
  tab <- paste(capture.output(kable(x, ...)), collapse = '\n')
  tf  <- tempfile(fileext = ".html")
  writeLines(tab, tf)
  rstudio::viewer(tf)
}
view_kable(lca_results, format = 'html', table.attr = "class=nofluid")

# Another possibility, which is prettier and easier:
install.packages("ztable")
ztable::ztable(lca_results)
```

**Elbow-Plot**

Sometimes an elbow plot (or scree plot) can be used to see which solution is parsimonious and has good fit values. You can get it with this ggplot2 code I wrote:

```r
# plot 1
# Order the categories of results$Model in order of appearance
install.packages("forcats")
library("forcats")
results$Model <- as_factor(results$Model)

# convert to long format
results2 <- tidyr::gather(results, Kriterium, Guete, 4:7)
results2

# plot
fit.plot <- ggplot(results2) +
  geom_point(aes(x = Model, y = Guete), size = 3) +
  geom_line(aes(Model, Guete, group = 1)) +
  theme_bw() +
  labs(x = "", y = "", title = "") +
  facet_grid(Kriterium ~ ., scales = "free") +
  theme_bw(base_size = 16, base_family = "") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = "grey", size = 0.5),
        legend.title = element_text(size = 16, face = 'bold'),
        axis.text    = element_text(size = 16),
        axis.title   = element_text(size = 16),
        legend.text  = element_text(size = 16),
        axis.line    = element_line(colour = "black"))  # slightly thicker axes

# save 650 x 800
fit.plot
```

**Inspect population shares of classes**

If you are interested in the population shares of the classes, you can get them like this:

```r
round(colMeans(lc$posterior)*100,2)
[1] 27.92 40.13 31.95
```

or you inspect the estimated class memberships:

```r
table(lc$predclass)

  1   2   3
158 234 185

round(prop.table(table(lc$predclass)),4)*100

    1     2     3
27.38 40.55 32.06
```

**Ordering of latent classes**

Latent classes are unordered, so which latent class becomes number one, two, three… is arbitrary. The latent classes are supposed to be nominal, so there is no reason for one class to be the first. Still, you can order the latent classes if you must: there is a function for manually reordering them, poLCA.reorder().

First, you run an LCA model, extract the starting values, and run the model again, this time with a manually set order.

```r
# extract the starting values from our previous best model (with 3 classes)
probs.start <- lc3$probs.start

# re-run the model, this time with "graphs=TRUE"
lc <- poLCA(f, mydata, nclass=3, probs.start=probs.start, graphs=TRUE,
            na.rm=TRUE, maxiter=3000)

# If you don't like the order, reorder the classes
# (here: class 1 stays 1, class 3 becomes 2, class 2 becomes 3)
new.probs.start <- poLCA.reorder(probs.start, c(1,3,2))

# run poLCA with the adjusted ordering
lc <- poLCA(f, mydata, nclass=3, probs.start=new.probs.start, graphs=TRUE, na.rm=TRUE)
lc
```

Now you have reordered your classes. You can save these starting values if you want to recreate the model later.

```r
saveRDS(lc$probs.start, "lca_starting_values.rds")
```

**Plotting**

This is the standard poLCA plot for conditional probabilities, which you get if you add "graphs=TRUE" to the poLCA call.

It's in a 3D style, which is not really to my taste. I found some code from dsparks on GitHub (and on this blog) that makes very appealing ggplot2 plots, and made some small adjustments.

```r
lcmodel <- reshape2::melt(lc$probs, level=2)
zp1 <- ggplot(lcmodel, aes(x = L1, y = value, fill = Var2))
zp1 <- zp1 + geom_bar(stat = "identity", position = "stack")
zp1 <- zp1 + facet_grid(Var1 ~ .)
zp1 <- zp1 + scale_fill_brewer(type="seq", palette="Greys") + theme_bw()
zp1 <- zp1 + labs(x = "Questionnaire items",
                  y = "Share of item\nresponse categories",
                  fill = "Response categories")
zp1 <- zp1 + theme(axis.text.y = element_blank(),
                   axis.ticks.y = element_blank(),
                   panel.grid.major.y = element_blank())
zp1 <- zp1 + guides(fill = guide_legend(reverse=TRUE))
print(zp1)
```

If you want to compare the items directly:

```r
zp2 <- ggplot(lcmodel, aes(x = Var1, y = value, fill = Var2))
zp2 <- zp2 + geom_bar(stat = "identity", position = "stack")
zp2 <- zp2 + facet_wrap(~ L1)
zp2 <- zp2 + scale_x_discrete("Questionnaire items", expand = c(0, 0))
zp2 <- zp2 + scale_y_continuous("Probabilities of the item\nresponse categories", expand = c(0, 0))
zp2 <- zp2 + scale_fill_brewer(type="seq", palette="Greys") + theme_bw()
zp2 <- zp2 + labs(fill = "Response categories")
zp2 <- zp2 + theme(axis.text.y = element_blank(),
                   axis.ticks.y = element_blank(),
                   panel.grid.major.y = element_blank()#,
                   #legend.justification = c(1,0),
                   #legend.position = c(1,0)
                   )
zp2 <- zp2 + guides(fill = guide_legend(reverse=TRUE))
print(zp2)
```

Joachim

Amazing! Thanks, this is really helpful!

Eiko Fried

Very cool, including the ggplot2 hack, although I think the 3d plots are (considering they are 3d plots) reasonably interpretable actually.

Niels

Thank you very much for your comment! Of course you're right, the default plot is fine. It's just that I usually try to avoid 3D visualisations, because they might bias the perception of the bars' heights.

David Henderson

Amazing stuff! Thanks for this – will help me a lot! I especially like the ggplot!

Lian

Thanks for this code, it’s super helpful!

Ana

Thank you so much! Greetings from Mexico

Ana

Do you know how to add a variable that affects the manifest variables in the poLCA package? Is it possible?

Niels

Hey Ana,

sorry for my late reply. Yes, it's possible to estimate mixture models with covariates, or a mixture regression model, in poLCA. In the example above I define the model like this:

```r
# define formula
f <- with(mydata, cbind(var1, var2, var3, ...) ~ 1)
```

But you can also add predictor-variables:

```r
f <- with(mydata, cbind(var1, var2, var3, ...) ~ predictor1 + predictor2 * predictor3)
```

```r
lca_model <- poLCA(f, data=data, nclass = 4, nrep = 10)
```

You can find more information in the poLCA vignette (https://cran.r-project.org/web/packages/poLCA/poLCA.pdf), or in the book "Latent Variable Modeling with R" by W. Holmes Finch and Brian F. French.

I hope this gives you a starting point =) Best regards

Niels

Ana

Thank you so much.

I thought that I needed covLCA for what I am asking: something that affects the manifest variables but not the latent variable. I am building something across time, and I need to make a group or something for the years of the survey (because I have different sample sizes). But covLCA did not work. I am still looking.

Best!

Niels

Are you trying to do a latent class growth analysis? I think the OpenMx package can do this.

Simon

This is super helpful for a beginner. Thank you Niels!

Btw, for those who wish to play around with this code: because the first column of results2 is named "Model", the code in the elbow plot needs to refer to "Model" (not "Modell").

Niels

Hi Simon,

thank you for your comment! You're right, I changed it to "Model" now. I'm glad you found it useful. Best, Niels

dian

Dear Simon,

I am analyzing data with poLCA right now, and your article is really precious.

But I have a problem when I try to imitate the elbow plot: mine has 10 classes, and Model 10 comes right after Model 1, so the models do not appear in order in the plot (1, 10, 2, 3, 4, 5, 6, 7, 8, 9). Do you have any suggestion how to fix this problem?

thanks a lot..

Dian

Niels

Hi Dian,

could you post your code somewhere? Then I would have a look at where it went wrong.

Best

Niels

dian

Hi Niels,

sorry I called you Simon..

exactly as your example:

thank you,

dian–

Niels

Hey Dian,

that problem occurred because "results2$Model" was of type character and not a factor. Thanks to Hadley Wickham's forcats package, we can convert it to a factor whose levels are in order of appearance.

```r
# Order the categories in order of appearance
install.packages("forcats")
library("forcats")
results$Model <- as_factor(results$Model)

# convert the dataframe to long format
results2 <- tidyr::gather(results, Kriterium, Guete, 4:7)
results2
```

And then you can just plot it.

dian

Wow, that helps me a lot!

thanks again.

By the way, is it normal to get NaN for the entropy? I only got values for model 2 and model 3 (0.991 and 0.998), but not for the other models.

best,

dian

Niels

I don't think that's normal.

I tried it with the example data I just posted and get entropy results for models 2-10. If your code is exactly the same as mine, I think there's some data-related problem. Perhaps having a look at lc4$P (the class proportions) and lc4$posterior will help to figure this out.

best

dian

Hi Niels,

```r
> lc2$P
[1] 0.8564319 0.1435681
> lc3$P
[1] 0.4719811 0.1431164 0.3849026
> lc4$P
[1] 0.36169381 0.49220317 0.04778167 0.09832134
> lc5$P
[1] 0.04779091 0.09832134 0.46795171 0.35875165 0.02718438
```

I don't know where the problem comes from, because classes 2 and 3 got results.

best,

dian

dian

Sorry, one more thing:

you used the package "rstudio" to make the copy-able lca_results table, but here in R 3.2.4 this package no longer exists. Do you have an alternative?

thanks a lot!

Niels

I just tested ztable with R and it worked for me.

```r
install.packages("ztable")
library("ztable")
ztable(yourdataframe)
```

Sofiati dian

Thank A lot Niels,

Best,

Dian

Daniel E. Weeks

Thank you for sharing this helpful code.

This line:

lcmodel <- reshape2::melt(lc$probs)

should be:

lcmodel <- reshape2::melt(lc$probs, level=2)

in order to maintain the class labels.

And in this part:

```r
results <- data.frame(Modell=c("Modell 1"),
                      log_likelihood=lc1$llik,
                      df = lc1$resid.df,
                      BIC=lc1$bic,
                      ABIC= (-2*lc2$llik) + ((log((lc2$N + 2)/24)) * lc2$npar),
                      CAIC = (-2*lc1$llik) + lc1$npar * (1 + log(lc1$N)),
                      likelihood_ratio=lc1$Gsq)
```

the ABIC value should be computed using lc1 values instead of lc2 values.

Niels

Wow! Thank you, Daniel, this error completely slipped through! I copied and pasted that line from another part of my code.

best

Niels

Fizaine Florian

Hello Niels,

Thank you very much for sharing your knowledge on this blog. It helps me to save time.

Now I am wondering whether it is possible to introduce both categorical (dichotomous) and continuous variables into a latent class analysis. If I understood your draft correctly, poLCA cannot do this.

Best,

Florian

Niels

Hi Florian,

you're right, poLCA can't use continuous data. It's only suitable for latent class analysis (observed categorical variables, unobserved categorical variable), not latent profile analysis (observed continuous variables, unobserved categorical variable). Latent profile analysis has the same aim as latent class analysis: finding unobserved segments of cases. But in contrast to latent class analysis, the observed values aren't categorical but continuous.

Of course, if you want to stay with poLCA, you could cut your continuous data into categories of equal size, but that would mean a loss of information. So I think you would have to switch to mclust or another package that is capable of that.
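To illustrate the cut() idea with made-up numbers (three equal-width categories, coded 1-3 as poLCA requires):

```r
x <- c(1.2, 3.5, 2.8, 9.1, 5.5, 7.0)          # some continuous variable
x_cat <- cut(x, breaks = 3, labels = FALSE)   # integer bin codes 1..3
x_cat                                         # 1 1 1 3 2 3
```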

This paper might be a good starting point: Oberski: Mixture models: latent profile and latent class analysis

Malte Hückstädt

Hello Niels!

Thanks for the fabulous code, very helpful for R novices :)

One question: would it be possible to also draw the respective percentage shares into the ggplot plots? I've been trying for hours but can't get any further.

Could you perhaps help me, building on your syntax?

Best regards and thanks in advance,

Malte

Niels

Hi Malte!

Thanks for the praise. Sure, the values can be displayed too, although it can get quite cluttered. Personally, I use a supplementary table for the exact values. But to your question: I'm only on my phone right now, so I can't be precise, but `+ geom_text(aes(label=value), colour="red")` could point you in the right direction. You'll probably still have to play with vjust or hjust to get the positioning right. Let me know whether it worked; otherwise we'll have another look at it together. Best regards!

Malte Hückstädt

Hi Niels! Many thanks for the quick reply.

Yes, it actually worked right away! However, the labels still get a bit jumbled (as you suspected). But it's going very much in the right direction. Maybe I can manage the rest on my own now.

Could you perhaps also tell me, when you get the chance, how you built the "supplementary table"? Do you have a suitable syntax I could orient myself by?

Maybe we should switch to email from here. I really don't want to clutter up your comment section.

Best regards,

Malte

Niels

OK, if you have Twitter, you could also message me there. As the table, I use the poLCA standard table of conditional probabilities (i.e. the probability that an arbitrary person in a class chose a certain response category of an item) and simply put it, as a dataframe, into ztable::ztable(table). Normally I then copy it into Excel for formatting, since I'm currently working with Word.

Malte

Hello Niels, one more question about one of your syntax lines.

Through the format conversion and the selection of the "probs" (lcmodel <- reshape2::melt(lc$probs, level=2)), my number of cases shrinks by more than half.

Could that have something to do with the class aggregation?

Best regards + thanks,

Malte

PS: I contacted you on Twitter. If you prefer, we can simply continue there.

Cecilia

Do you mind posting the example dataset used here? Might help repeat what you’ve done step-by-step. Thanks

SIVA

Hi Neils,

wonderful code!!! Thanks 🙂

I want to identify, from the LCA model output, which observation/record belongs to which class.

How do I do that?

thanks again

Niels

You can access this information with `lc$predclass`, where lc is the LCA model you estimated via poLCA(). BUT: be aware that LCA assigns classes probabilistically. Every observation has a probability of belonging to each of the classes, which you can inspect with `lc$posterior`. `lc$predclass` just assigns the class with the highest probability. Bonus info: that's why, if you calculate the shares of the predicted classes (`prop.table(table(lc$predclass))`), they will differ from the estimated population shares (`colMeans(lc$posterior)*100`).

SIVA

thanks Niels!!!

I appreciate your help:).

Can you tell me how to validate an LCA model?

I have split the original data 80:20 (train:test)…

…and built the model on the train data.

Which function in poLCA serves this purpose?

Also, what would ensure that the classes formed are stable?

Niels

Sorry for my late reply. In my blog post I recommend estimating each model multiple times with different starting values, so that you can be pretty sure the algorithm found the best solution. The precision of the classification can be inspected through the entropy statistic: it is near zero if the classification is no better than random, with 1 being the opposite. Which model describes your data best depends on model fit criteria and on whether the classes make sense (interpretation).

Cross-validation could be used for your problem as well: if the model you estimated on your randomly selected training data achieves a comparably good fit on the test data, the model should be appropriate there too. You could also have a look at two-fold (or k-fold) cross-validation. Would love to hear feedback on how you solved your problem.
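A rough sketch of that train/test idea (hedged: poLCA has no built-in validation function, so this uses poLCA.posterior(), which computes posterior class probabilities for new cases given an already estimated model; shown on poLCA's own small `values` dataset):

```r
library(poLCA)

data(values)                  # small example dataset shipped with poLCA
f <- cbind(A, B, C, D) ~ 1

set.seed(123)
idx   <- sample(nrow(values), size = floor(0.8 * nrow(values)))
train <- values[idx, ]        # 80% training split
test  <- values[-idx, ]       # 20% held out

# estimate the model on the training data only
lc_train <- poLCA(f, train, nclass = 2, nrep = 10, verbose = FALSE)

# posterior class probabilities for the held-out cases
post_test <- poLCA.posterior(lc_train, y = as.matrix(test))

# if the class structure is stable, class shares in the test data
# should come out close to the training estimates lc_train$P
colMeans(post_test)
```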

Chao Liu

Hi Niels,

Thank you for your wonderful code; it has been very helpful for me! I do have a question: in your post I assume you used observed categorical variables in the analysis, so what about unobserved variables?

For example, I have multiple categorical outcome variables (y1, y2, y3, …, y27) and I want to group them into 6 factors (f1, f2, …, f6). I know I can create sum or average scores of these outcome variables, but that would make me lose the variances. So how can poLCA handle these unobserved factors? Can you provide some code?

Thank you so much in advance!

Chao

Niels

Hi Chao,

I'm not sure I understood your question completely. A latent class analysis tries to find subtypes of related cases, in such a way that the assigned group explains the correlations among the observed variables. While an exploratory factor analysis tries to find *variables* that belong together (i.e. you have a scale of 10 variables and PCA finds 2 principal components explaining X% of the variance), LCA tries to find *cases* that belong together. I just remembered this website, which explains how LCA compares to other methods: http://www.john-uebersax.com/stat/faq.htm#otherm

Perhaps I don't get this right, but to me this sounds more like a factor analysis approach, where you make average scores of all variables of one factor. In latent class analysis you have the conditional class probabilities, where each observation has a probability of belonging to each group. Perhaps you can help me understand your goal a bit better 🙂

Chao Liu

Hi Niels,

Thank you for your response, and sorry for any confusion. What I am trying to do is more like a two-step procedure: first a factor analysis, and second a latent class analysis on the factors derived in the first step. So I wonder if there is any way I can take the factors derived from the factor analysis and put them into the latent class analysis. I thought about using lavaan to run a CFA first and taking the resulting factors from lavaan to run a latent class analysis in poLCA, but I couldn't figure out how. I hope this makes sense to you. Maybe I totally misunderstood what LCA is capable of doing, but does it have to run on manifest variables?

Thanks again for your help!

Chao

Niels

Hi Chao,

Thanks, that cleared things up for me. First: yes, LCA runs on categorical manifest variables and tries to find the values of a categorical latent variable.

Considering your goal, I have some ideas, but they are more or less guesses about what you could do, and I'm not sure if they are methodologically adequate. OK, so you want to reduce the data through factor analysis and then run an LCA to see the structure of your cases. The problem is that factors are continuous and an LCA uses categorical data, so a latent profile analysis would be more adequate: in latent profile analysis the observed data is continuous and the latent variable is categorical. poLCA is only capable of LCA. But you could also use cut() and make categories from the continuous data, losing some information, of course.

In the first step, you would fit the cfa() model, then you would save the predicted factor scores (perhaps like this: https://groups.google.com/d/msg/lavaan/E4NPoUiKsks/5IYLv5ggAAAJ), make them categorical, and then run the LCA on them. But I'm not sure about the interpretation of the results.
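A hedged sketch of this two-step idea (all model and variable names here are illustrative placeholders, df stands for your own data frame; this assumes the lavaan and poLCA packages are installed):

```r
library(lavaan)
library(poLCA)

# step 1: fit the CFA and extract the continuous factor scores
fit <- cfa("f1 =~ y1 + y2 + y3
            f2 =~ y4 + y5 + y6
            f3 =~ y7 + y8 + y9", data = df)
scores <- lavPredict(fit)

# step 2: categorize the scores (losing some information, as noted) ...
dat <- data.frame(f1 = cut(scores[, "f1"], breaks = 3, labels = FALSE),
                  f2 = cut(scores[, "f2"], breaks = 3, labels = FALSE),
                  f3 = cut(scores[, "f3"], breaks = 3, labels = FALSE))

# ... and run the LCA on the categorized scores
# (cut(..., labels = FALSE) yields the 1, 2, ... integer coding poLCA expects)
lc <- poLCA(cbind(f1, f2, f3) ~ 1, dat, nclass = 2)
```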

Chao Liu

Hi Niels,

Thank you for your suggestion! I will try the approach you provided. Just a quick follow-up question: my observed variables are categorical and I guess factor analysis will make them continuous in the end?

Chao

Niels

Hi, as you surely know, factor analysis normally needs continuous data. But in the social sciences it is often OK to use items that have equidistant scales with at least 5 categories (Likert-type items). There are also other assumptions, like normality. Latent class analysis has the advantage of being a nonparametric method without such assumptions. TL;DR: in a way your categorical data will lead to continuous factors, yeah 🙂

Best,

Niels

Chao Liu

Thank you Niels!

Best,

Chao

Sofiati Dian

Hi Niels,

Recently I tried to run an LCA in Latent Gold, and there are several outputs, such as the p-value (measuring the difference between the model and the data in a goodness-of-fit test, such as the likelihood ratio), that are not produced in the poLCA output.

Also, for example, when I perform a 2-class analysis and look at the lc2$predcell results, I see the observed vs. expected value for each combination of variables (i.e. F09_a (1), F09_b (1), F09_c (1), F27_a (1), F27_e (1), F27_e (1), F29_a (1), F29_b (1), F29_c (1)), but not the two additional pieces of information that Latent Gold provides: the probability of that combination belonging to class 1 or class 2. Do you know how to produce this probability value?

Thank you very much, I learned a lot from this article.

Dian

Niels

Hi Dian, I'm afraid poLCA doesn't do likelihood ratio tests. I know Mplus and Latent Gold have them to compare models, but poLCA lacks this function. I guess it can be done somehow, but I haven't seen it on the internet yet. Sorry.
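One note on the second part of Dian's question (the class probability for a given predcell response pattern): poLCA does ship a helper for this, poLCA.posterior(). A sketch, assuming lc2 is the fitted 2-class model and using one illustrative response pattern of the nine variables:

```r
library(poLCA)

# posterior class probabilities for one response pattern:
# returns a vector with one probability per latent class
poLCA.posterior(lc2, y = c(1, 1, 1, 1, 1, 1, 1, 1, 1))
```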

Ozzy

Thank you Niels! Any suggestions on multilevel multinomial logistic regression models: is there an R package that can be used to build a two-level logistic regression model?

J

Niels

Hi Ozzy, you should check out the glmer() function from the lme4 package. It's capable of that kind of model (as long as you're not modeling any latent variables). Does this help?
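A minimal sketch of such a two-level model with glmer() (df, y, x, and the grouping variable school are placeholder names, not from the original question):

```r
library(lme4)

# two-level binary logistic regression:
# fixed effect for x, random intercept per school
m <- glmer(y ~ x + (1 | school), data = df, family = binomial)
summary(m)
```

Note that glmer() covers binary/binomial outcomes; a genuinely multinomial outcome would need a different package.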

Hanan

Hey Niels,

This is really awesome!

I did a latent class analysis, which gave the best fit for 3 classes. Now I want to use these classes and run a multinomial logistic regression. I read that I have to use the flexmix package, but I am not sure how that works.

Do you have any tips?

Or is the poLCA package sufficient to do a regression analysis with the 3 classes by adding the covariates?

I want to add demographic variables to the classes, so I can see which X variable predicts membership of a certain class.

Also, I have to compare the classes with each other. How do I do that?

Best,

Hanan

Niels

Hi,

here's what the poLCA reference manual says about it (https://cran.r-project.org/web/packages/poLCA/poLCA.pdf):

"The term 'latent class regression' (LCR) can have two meanings. In this package, LCR models refer to latent class models in which the probability of class membership is predicted by one or more covariates. However, in other contexts, LCR is also used to refer to regression models in which the manifest variable is partitioned into some specified number of latent classes as part of estimating the regression model. It is a way to simultaneously fit more than one regression to the data when the latent data partition is unknown. The flexmix function in package flexmix will estimate this other type of LCR model. Because of these terminology issues, the LCR models this package estimates are sometimes termed 'latent class models with covariates' or 'concomitant-variable latent class analysis,' both of which are accurate descriptions of this model."
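In code, the first meaning (the one poLCA implements) just means putting covariates on the right-hand side of the formula. A sketch with illustrative variable names:

```r
library(poLCA)

# class membership is predicted by the covariates after the ~
f <- cbind(y1, y2, y3, y4) ~ sex + age
lcr <- poLCA(f, data = df, nclass = 3)
lcr$coeff   # multinomial logit coefficients for the covariates
```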

Hanan

Thank you so much for the link.

I was thinking, instead of doing a regression, I could simply compute two-way tables summarizing the class membership probabilities per covariate category (e.g., for males and females, for educational levels, for age groups).

Do you have an example of how to do that?
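Such a table can be built directly from the posterior matrix by averaging the class probabilities within each covariate category. A self-contained sketch with mock data (with a fitted model you would use lc$posterior and a real covariate from your data instead):

```r
# mock posterior matrix: 10 observations, 2 classes, rows sum to 1
set.seed(1)
posterior <- matrix(runif(20), ncol = 2)
posterior <- posterior / rowSums(posterior)
sex <- rep(c("male", "female"), each = 5)   # mock covariate

# mean class membership probability per covariate category
aggregate(as.data.frame(posterior), by = list(sex = sex), FUN = mean)
```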

Ghanendra

Very detailed explanation.

Just wanted to know: where exactly do we assign the classes to the individual observations (response level)?

Niels

You can find this information in lc$predclass. It's a "vector of predicted class memberships, by modal assignment" (poLCA reference manual: https://cran.r-project.org/web/packages/poLCA/poLCA.pdf).
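To illustrate what "modal assignment" means, here is the same operation done by hand on a mock posterior matrix: each observation is assigned to the class with the highest posterior probability.

```r
# mock posterior matrix: one row per observation, one column per class
posterior <- rbind(c(0.8, 0.2),
                   c(0.3, 0.7),
                   c(0.5, 0.5))

# index of the largest entry in each row ("first" breaks ties deterministically)
max.col(posterior, ties.method = "first")   # -> 1 2 1
```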

Ozzy

Hi Niels,

Any suggestion on how to compute variable-specific entropy contributions, in order to identify the relatively more informative indicators in forming the latent classes?

Thank you!

Niels

Mhm, interesting… do you have a source where you read about this analysis step? You could compare the entropy of two models, where one model omits a variable. But my approach is mostly to have a latent class model that makes sense from a theoretical point of view, and criteria like maximization of entropy are of less importance to me. This seems to be different in your case, but I'm afraid I'm not of use here.
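Not variable-specific, but possibly useful for the model-comparison approach just mentioned: the overall relative entropy can be computed from the posterior matrix in a few lines of base R, using the common normalization 1 - sum(-p*log(p)) / (N*log(K)).

```r
# relative entropy of a posterior matrix (rows must sum to 1):
# 1 for perfectly separated classes, 0 for a completely uninformative posterior
relative_entropy <- function(posterior) {
  p <- pmax(posterior, 1e-12)   # guard against log(0)
  1 - sum(-p * log(p)) / (nrow(posterior) * log(ncol(posterior)))
}

relative_entropy(rbind(c(1, 0), c(0, 1)))          # near 1 (perfect separation)
relative_entropy(matrix(0.5, nrow = 2, ncol = 2))  # 0 (no separation)
```

Applied to two fitted models (e.g. relative_entropy(lc$posterior)), one with and one without a given indicator, this gives the kind of comparison described above.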

zhang

Hi Niels,

Can I have your email address? I am very interested in your code and want to collaborate with you on a paper. I am a clinician interested in using poLCA to classify my heterogeneous patient population, and I found your code very helpful.

I am looking forward to your positive response.

Mahmoud Slim

Thanks a lot Niels! Great work!

Ethan Chen

Hi Niels,

Thanks for such a useful article! I am trying to implement the latent instrumental variable method (e.g., Ebbes et al. 2005). I want to partition a discrete variable X into two parts: (1) the unobserved discrete variable Z that separates the sample into m groups, where m > 1; and (2) the error term e. I am wondering if poLCA is the right package to use (or perhaps flexmix?). Could you please also share your thoughts on how to get the predicted value Z hat and the residuals e hat for each observation? Thanks a lot.

Niels

Hi Ethan,

sorry for my late answer. I'm not familiar with the instrumental variable method in latent class analysis, sorry. Also, there usually are no error terms in LCA. I skimmed the paper by Ebbes et al. 2005 that you mentioned. It's on regression with latent variables (SEM). If you want to use instrumental variables in a SEM, you'll find material online.

But perhaps I didn't get your question right 😉 A good idea could be to post it on stackexchange.com; I'm sure somebody can help you there. But I can already tell you that poLCA is probably not the right tool for your task. Sorry.

olgun


Hi Niels, I am doing an LCA with poLCA, but the analysis has not finished after three days (it has not found the best model yet), and occasionally it gives the following error: "ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND". So I cancelled the process at 35 latent classes. I am analyzing 16 variables (all of them categorical) and 36036 rows of data. When I tested variable importance for the 16 variables with the Boruta package, all 16 variables came out as important, so I used all 16 variables in the LCA with poLCA. Which path should I follow? Should I use another clustering method, such as k-modes, for clustering the categorical variables in this dataset? I use 500 iterations and nrep=10 model estimations for up to 50 classes. The R script I use to find the best model is as follows:

min_bic <- 100000   # initialize before the loop (reconstructed; the posted code was cut off)
for(i in 2:50){
  lc <- poLCA(f, data, nclass = i, maxiter = 500,
              tol = 1e-5, na.rm = FALSE,
              nrep = 10, verbose = TRUE, calc.se = TRUE)
  if(lc$bic < min_bic){
    min_bic <- lc$bic
    LCA_best_model <- lc
  }
}

and the last model output is:

Fit for 35 latent classes:

number of observations: 36036

number of estimated parameters: 2029

residual degrees of freedom: 34007

maximum log-likelihood: -482547.1

AIC(35): 969152.2

BIC(35): 986383

G^2(35): 233626.8 (Likelihood ratio/deviance statistic)

X^2(35): 906572555 (Chi-square goodness of fit)

ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND


Niels

Hey olgun, thank you for reading my blog 🙂 OK, my first thought was whether you really need models with up to 50 latent classes. Even though I don't know the variables, that seems like overkill. If you're doing the analysis to reduce complexity, you should use fewer classes and look for a solution that is still interpretable. But of course I don't know the specific task you're trying to accomplish. Still: try fewer classes and give poLCA more iterations. More iterations mean longer processing time, but that's the only way for the algorithm to reach a maximum of the likelihood, which is needed for a stable result. You can also reduce nrep to 5 to save some of the time you invest in the iterations. Best of luck! Niels

Jur

Hi Niels, thanks for your clear explanation of the poLCA package! I have a question regarding the model specification. In my model I have 4 manifest variables, so my formula would be:

f <- cbind(mv1, mv2, mv3, mv4) ~ 1. This works, but I think that the first variable contributes more to the segmentation than the last variable. Is there a way to increase or decrease the weight of the separate variables? For example: mv1*5 or mv4*0.5. Thanks in advance!

Niels

I'm sorry, I don't know. But if you find an answer to your question, perhaps others would be thankful if you posted it here. Best!

Kevin Wolff

Hi Niels,

This is a great page, and your assistance above is generous and helpful to all of us who are new to this. The automation of the results is especially awesome.

I apologize if this is a silly question, but I am a novice at LCA and have been asked to present Lo-Mendell-Rubin likelihood ratio tests. I didn't see a reference to them on your page. Is it possible to obtain these results from poLCA? Or is there a computation I can do using the obtained results? I see a lot of resources to compute these LR tests in Mplus, but was hoping to use what I learned from you above.

Thanks for any input/advice.

Best,

Kevin

Niels

Hi Kevin,

thanks for your kind words ☺️ I should do a remake of that blog post, because the code can be made much better ;). Your question isn't silly at all! I had the same question, and as you say, many other tools that can do LCA provide this test. The problem is: poLCA doesn't. I'm not sure if there is any other R package that can do it. Mplus and Latent Gold certainly provide the test. poLCA isn't developed any further, so if you don't code the test yourself with help from a statistics textbook, I'm afraid you have to switch.

Kevin Wolff

Niels,

Thanks for saving me some time trying to find a way to do it with poLCA. I appreciate your help and timely response. Again, great post for novices. Keep up the good work!

KW