binGroup2: Identification and Estimation using Group Testing

Methods for the group testing identification and estimation problems.

Details

Methods for identification of positive items in group testing designs: Operating characteristics (e.g., expected number of tests) are calculated for commonly used hierarchical and array-based algorithms. Optimal testing configurations for an algorithm can be found as well. Please see Hitt et al. (2019) for specific details.

Methods for estimation and inference for proportions in group testing designs: For estimating one proportion or the difference of proportions, confidence interval methods are included that account for different pool sizes. Functions for hypothesis testing of proportions, calculation of power, and calculation of the expected width of confidence intervals are also included. Furthermore, regression methods and simulation of group testing data are implemented for simple pooling (Dorfman testing with or without retests), halving, and array testing designs.

The binGroup2 package is based upon the binGroup package that was originally designed for the group testing estimation problem. Over time, additional functions for estimation and for the group testing identification problem were included. Due to the diverse styles resulting from these additions, we have created binGroup2 as a way to unify functions in a coherent structure and incorporate additional functions for identification. The binGroup2 package provides all the main functionality from the binGroup package, and can be used in place of the binGroup package. The name “binGroup” originates from the assumption in basic estimation for group testing that the number of positive groups has a binomial distribution. While more advanced estimation methods no longer make this assumption, we continue with the binGroup name for consistency.

Bilder (2019a,b) provide introductions to group testing. These papers and additional details about group testing are available at http://chrisbilder.com/grouptesting/.

This research was supported by the National Institutes of Health under grant R01 AI121351.

Identification

The binGroup2 package focuses on the group testing identification problem using hierarchical and array-based group testing algorithms.

The OTC1 function implements a number of group testing algorithms, described in Hitt et al. (2019), which calculate the operating characteristics and find the optimal testing configuration over a range of possible initial group sizes and/or testing configurations (sets of subsequent group sizes). The OTC2 function does the same with a multiplex assay that tests for two diseases.

The operatingCharacteristics1 (opChar1) and operatingCharacteristics2 (opChar2) functions calculate operating characteristics for a specified testing configuration with assays that test for one and two diseases, respectively.

These functions allow the sensitivity and specificity to differ across stages of testing. This means that the accuracy of the diagnostic test can differ for stages in a hierarchical testing algorithm or between row/column testing and individual testing in an array testing algorithm.

Estimation

The binGroup2 package also provides functions for estimation and inference for proportions in group testing designs.

The propCI function calculates the point estimate and confidence intervals for a single proportion from group testing data. The propDiffCI function does the same for the difference of proportions. A number of confidence interval methods are available for groups of equal or different sizes.

The gtWidth function calculates the expected width of confidence intervals in group testing. The gtTest function calculates p-values for hypothesis tests of single proportions. The gtPower function calculates power to reject a hypothesis.

The designPower function iterates either the number of groups or group size in a one-parameter group testing design until a pre-specified power level is achieved. The designEst function finds the optimal group size corresponding to the minimal mean-squared error of the point estimator.

The gtReg function implements regression methods and the gtSim function simulates group testing data for simple pooling, halving, and array testing designs.

References

Altman, D., Bland, J. (1994). “Diagnostic tests 1: Sensitivity and specificity.” BMJ, 308, 1552.

Altman, D., Bland, J. (1994). “Diagnostic tests 2: Predictive values.” BMJ, 309, 102.

Biggerstaff, B. (2008). “Confidence intervals for the difference of proportions estimated from pooled samples.” Journal of Agricultural, Biological, and Environmental Statistics, 13, 478–496.

Bilder, C., Tebbs, J., Chen, P. (2010). “Informative retesting.” Journal of the American Statistical Association, 105, 942–955.

Bilder, C., Tebbs, J., McMahan, C. (2019). “Informative group testing for multiplex assays.” Biometrics, 75, 278–288.

Bilder, C. (2019a). “Group Testing for Estimation.” Wiley StatsRef: Statistics Reference Online.

Bilder, C. (2019b). “Group Testing for Identification.” Wiley StatsRef: Statistics Reference Online.

Bilder, C., Iwen, P., Abdalhamid, B., Tebbs, J., McMahan, C. (2020). “Tests in short supply? Try group testing.” Significance, 17, 15.

Black, M., Bilder, C., Tebbs, J. (2012). “Group testing in heterogeneous populations by using halving algorithms.” Journal of the Royal Statistical Society. Series C: Applied Statistics, 61, 277–290.

Black, M., Bilder, C., Tebbs, J. (2015). “Optimal retesting configurations for hierarchical group testing.” Journal of the Royal Statistical Society. Series C: Applied Statistics, 64, 693–710.

Graff, L., Roeloffs, R. (1972). “Group testing in the presence of test error; an extension of the Dorfman procedure.” Technometrics, 14, 113–122.

Hepworth, G. (1996). “Exact confidence intervals for proportions estimated by group testing.” Biometrics, 52, 1134–1146.

Hepworth, G., Biggerstaff, B. (2017). “Bias correction in estimating proportions by pooled testing.” Journal of Agricultural, Biological, and Environmental Statistics, 22, 602–614.

Hitt, B., Bilder, C., Tebbs, J., McMahan, C. (2019). “The objective function controversy for group testing: Much ado about nothing?” Statistics in Medicine, 38, 4912–4923.

Hou, P., Tebbs, J., Wang, D., McMahan, C., Bilder, C. (2021). “Array testing with multiplex assays.” Biostatistics, 21, 417–431.

Malinovsky, Y., Albert, P., Roy, A. (2016). “Reader reaction: A note on the evaluation of group testing algorithms in the presence of misclassification.” Biometrics, 72, 299–302.

McMahan, C., Tebbs, J., Bilder, C. (2012a). “Informative Dorfman Screening.” Biometrics, 68, 287–296.

McMahan, C., Tebbs, J., Bilder, C. (2012b). “Two-Dimensional Informative Array Testing.” Biometrics, 68, 793–804.

Schaarschmidt, F. (2007). “Experimental design for one-sided confidence intervals or hypothesis tests in binomial group testing.” Communications in Biometry and Crop Science, 2, 32–40. ISSN 1896-0782.

Swallow, W. (1985). “Group testing for estimating infection rates and probabilities of disease transmission.” Phytopathology, 75, 882–889.

Tebbs, J., Bilder, C. (2004). “Confidence interval procedures for the probability of disease transmission in multiple-vector-transfer designs.” Journal of Agricultural, Biological, and Environmental Statistics, 9, 75–90.

Vansteelandt, S., Goetghebeur, E., Verstraeten, T. (2000). “Regression models for disease prevalence with diagnostic tests on pools of serum samples.” Biometrics, 56, 1126–1133.

Verstraeten, T., Farah, B., Duchateau, L., Matu, R. (1998). “Pooling sera to reduce the cost of HIV surveillance: a feasibility study in a rural Kenyan district.” Tropical Medicine & International Health, 3, 747–750.

Xie, M. (2001). “Regression analysis of group testing samples.” Statistics in Medicine, 20, 1957–1969.

Author

Maintainer: Brianna Hitt brianna.hitt@afacademy.af.edu (ORCID)

Authors:

Christopher Bilder (ORCID)
Frank Schaarschmidt (ORCID)
Brad Biggerstaff (ORCID)
Christopher McMahan (ORCID)
Joshua Tebbs (ORCID)

Other contributors:

Boan Zhang [contributor]
Michael Black [contributor]
Peijie Hou [contributor]
Peng Chen [contributor]
Minh Nguyen [contributor]

Examples


# 1) Identification using hierarchical and array-based
#   group testing algorithms with an assay that tests
#   for one disease.

# 1.1) Find the optimal testing configuration over a
#   range of initial group sizes, using informative
#   three-stage hierarchical testing, where
#   p denotes the overall prevalence of disease (mean
#    parameter of a beta distribution);
#   Se denotes the sensitivity of the diagnostic test;
#   Sp denotes the specificity of the diagnostic test;
#   group.sz denotes the range of initial pool sizes
#   for consideration;
#   obj.fn specifies the objective functions for which
#   to find results; and
#   alpha is the heterogeneity level.
set.seed(1002)
results1 <- OTC1(algorithm = "ID3", p = 0.025, Se = 0.95,
                 Sp = 0.95, group.sz = 3:20,
                 obj.fn = "ET", alpha = 2)
#> Initial Group Size = 3
#> Initial Group Size = 4
#> Initial Group Size = 5
#> Initial Group Size = 6
#> Initial Group Size = 7
#> Initial Group Size = 8
#> Initial Group Size = 9
#> Initial Group Size = 10
#> Initial Group Size = 11
#> Initial Group Size = 12
#> Initial Group Size = 13
#> Initial Group Size = 14
#> Initial Group Size = 15
#> Initial Group Size = 16
#> Initial Group Size = 17
#> Initial Group Size = 18
#> Initial Group Size = 19
#> Initial Group Size = 20
#> 
#>  Number of minutes running:  0.02 
#>  
summary(results1)
#> 
#> Algorithm: Informative three-stage hierarchical testing 
#> 
#> Optimal testing configuration:
#>    Stage 1 Stage 2
#> ET      17 6,4,4,3
#> 
#> Expected number of tests:
#>    E(T)  Value
#> ET 4.14 0.2436
#> 
#> E(T) denotes the expected number of tests.
#> Value denotes the objective function value per individual.
#> 
#> Overall accuracy of the algorithm:
#>       PSe    PSp   PPPV   PNPV
#> ET 0.8574 0.9961 0.8493 0.9963
#> 
#> PSe denotes the pooling sensitivity.
#> PSp denotes the pooling specificity.
#> PPPV denotes the pooling positive predictive value.
#> PNPV denotes the pooling negative predictive value.

# 1.2) Find the optimal testing configuration using
#   non-informative array testing without master pooling.
# The sensitivity and specificity differ for row/column
#   testing and individual testing.
results2 <- OTC1(algorithm = "A2", p = 0.05,
                 Se = c(0.95, 0.99), Sp = c(0.95, 0.98),
                 group.sz = 3:15, obj.fn = "ET")
#> Row/Column Size = 3, Array Size = 9
#> Row/Column Size = 4, Array Size = 16
#> Row/Column Size = 5, Array Size = 25
#> Row/Column Size = 6, Array Size = 36
#> Row/Column Size = 7, Array Size = 49
#> Row/Column Size = 8, Array Size = 64
#> Row/Column Size = 9, Array Size = 81
#> Row/Column Size = 10, Array Size = 100
#> Row/Column Size = 11, Array Size = 121
#> Row/Column Size = 12, Array Size = 144
#> Row/Column Size = 13, Array Size = 169
#> Row/Column Size = 14, Array Size = 196
#> Row/Column Size = 15, Array Size = 225
#> 
#>  Number of minutes running:  0.07 
#>  
summary(results2)
#> 
#> Algorithm: Non-informative array testing without master pooling 
#> 
#> Optimal testing configuration:
#>    Row/column size Array size
#> ET              10        100
#> 
#> Expected number of tests:
#>     E(T)  Value
#> ET 38.52 0.3852
#> 
#> E(T) denotes the expected number of tests.
#> Value denotes the objective function value per individual.
#> 
#> Overall accuracy of the algorithm:
#>       PSe    PSp   PPPV   PNPV
#> ET 0.8943 0.9971 0.9411 0.9945
#> 
#> PSe denotes the pooling sensitivity.
#> PSp denotes the pooling specificity.
#> PPPV denotes the pooling positive predictive value.
#> PNPV denotes the pooling negative predictive value.

# 1.3) Calculate the operating characteristics using
#   informative two-stage hierarchical (Dorfman) testing,
#   implemented via the pool-specific optimal Dorfman
#   (PSOD) method described in McMahan et al. (2012a).
# Hierarchical testing configurations are specified by
#   a matrix in the hier.config argument. The rows of
#   the matrix correspond to the stages of the
#   hierarchical testing algorithm, the columns
#   correspond to the individuals to be tested, and the
#   cell values correspond to the group number of each
#   individual at each stage.
config.mat <- matrix(data = c(rep(1, 5), rep(2, 4), 3, 1:10),
                     nrow = 2, ncol = 10, byrow = TRUE)
set.seed(8791)
results3 <- opChar1(algorithm = "ID2", p = 0.02,
                    Se = 0.95, Sp = 0.99,
                    hier.config = config.mat, alpha = 0.5)
#> 
#>  Number of minutes running:  0 
#>  
summary(results3)
#> 
#> Algorithm: Informative two-stage hierarchical testing 
#> 
#> Testing configuration:
#> Block size: 10
#> Group sizes: 5,4,1
#> 
#> Expected number of tests: 3.57
#> Expected number of tests per individual: 0.3566
#> 
#> Accuracy for individuals:
#>       PSe    PSp   PPPV   PNPV Individuals
#> 1  0.9025 0.9997 0.6329 0.9999           1
#> 2  0.9025 0.9997 0.8426 0.9998           2
#> 3  0.9025 0.9997 0.9190 0.9997           3
#> 4  0.9025 0.9998 0.9562 0.9994           4
#> 5  0.9025 0.9998 0.9758 0.9991           5
#> 6  0.9025 0.9991 0.9291 0.9987           6
#> 7  0.9025 0.9991 0.9535 0.9980           7
#> 8  0.9025 0.9992 0.9708 0.9971           8
#> 9  0.9025 0.9993 0.9839 0.9956           9
#> 10 0.9500 0.9900 0.8843 0.9960          10
#> 
#> Overall accuracy of the algorithm:
#>      PSe    PSp   PPPV   PNPV
#> 1 0.9202 0.9986 0.9306 0.9984
#> 
#> PSe denotes the pooling sensitivity.
#> PSp denotes the pooling specificity.
#> PPPV denotes the pooling positive predictive value.
#> PNPV denotes the pooling negative predictive value.

# 1.4) Calculate the operating characteristics using
#   non-informative four-stage hierarchical testing.
config.mat <- matrix(data = c(rep(1, 15), rep(c(1, 2, 3), each = 5),
                              rep(1, 3), rep(2, 2),
                              rep(3, 3), rep(4, 2),
                              rep(5, 4), 6, 1:15),
                     nrow = 4, ncol = 15, byrow = TRUE)
results4 <- opChar1(algorithm = "D4", p = 0.008,
                    Se = 0.96, Sp = 0.98,
                    hier.config = config.mat,
                    a = c(1, 4, 6, 9, 11, 15))
#> 
#>  Number of minutes running:  0 
#>  
summary(results4)
#> 
#> Algorithm: Non-informative four-stage hierarchical testing 
#> 
#> Testing configuration:
#> Stage 1: 15
#> Stage 2: 5,5,5
#> Stage 3: 3,2,3,2,4,1
#> 
#> Expected number of tests: 1.91
#> Expected number of tests per individual: 0.1272
#> 
#> Accuracy for individuals:
#>      PSe    PSp   PPPV   PNPV Individuals
#> 1 0.8493 0.9997 0.9596 0.9988         1,6
#> 2 0.8493 0.9998 0.9784 0.9988         4,9
#> 3 0.8493 0.9996 0.9416 0.9988          11
#> 4 0.8847 0.9994 0.9202 0.9991          15
#> 
#> Overall accuracy of the algorithm:
#>      PSe    PSp   PPPV   PNPV
#> 1 0.8517 0.9997 0.9568 0.9988
#> 
#> PSe denotes the pooling sensitivity.
#> PSp denotes the pooling specificity.
#> PPPV denotes the pooling positive predictive value.
#> PNPV denotes the pooling negative predictive value.


# 2) Identification using hierarchical and array-based
#   group testing algorithms with a multiplex assay that
#   tests for two diseases.

# 2.1) Find the optimal testing configuration using
#   non-informative two-stage hierarchical testing, given
#   p.vec, a vector of overall joint probabilities of disease;
#   Se, a vector of sensitivity values for each disease; and
#   Sp, a vector of specificity values for each disease.
# Se and Sp can also be specified as a matrix, where one
#   value is specified for each disease at each stage of
#   testing.
results5 <- OTC2(algorithm = "D2",
                 p.vec = c(0.90, 0.04, 0.04, 0.02),
                 Se = c(0.99, 0.99), Sp = c(0.99, 0.99),
                 group.sz = 3:20)
#> Initial Group Size = 3
#> Initial Group Size = 4
#> Initial Group Size = 5
#> Initial Group Size = 6
#> Initial Group Size = 7
#> Initial Group Size = 8
#> Initial Group Size = 9
#> Initial Group Size = 10
#> Initial Group Size = 11
#> Initial Group Size = 12
#> Initial Group Size = 13
#> Initial Group Size = 14
#> Initial Group Size = 15
#> Initial Group Size = 16
#> Initial Group Size = 17
#> Initial Group Size = 18
#> Initial Group Size = 19
#> Initial Group Size = 20
#> 
#>  Number of minutes running:  0 
#>  
summary(results5)
#> 
#> Algorithm: Non-informative two-stage hierarchical testing 
#> 
#> Optimal testing configuration:
#>    Stage 1
#> ET       4
#> 
#> Expected number of tests:
#>    E(T)  Value
#> ET 2.42 0.6045
#> 
#> E(T) denotes the expected number of tests.
#> Value denotes the objective function value per individual.
#> 
#> Overall accuracy of the algorithm:
#>      PSe    PSp   PPPV   PNPV
#> 1 0.9845 0.9969 0.9525 0.9990
#> 2 0.9845 0.9969 0.9525 0.9990
#> 
#> PSe denotes the pooling sensitivity.
#> PSp denotes the pooling specificity.
#> PPPV denotes the pooling positive predictive value.
#> PNPV denotes the pooling negative predictive value.

# 2.2) Calculate the operating characteristics for
#   informative five-stage hierarchical testing, given
#   alpha.vec, a vector of shape parameters for the
#   Dirichlet distribution;
#   Se, a matrix of sensitivity values; and
#   Sp, a matrix of specificity values.
Se <- matrix(data = rep(0.95, 10), nrow = 2, ncol = 5, byrow = TRUE)
Sp <- matrix(data = rep(0.99, 10), nrow = 2, ncol = 5, byrow = TRUE)
config.mat <- matrix(data = c(rep(1, 24), rep(1, 18),
                              rep(2, 6), rep(1, 9),
                              rep(2, 9), rep(3, 4), 4, 5,
                              rep(1, 6), rep(2, 3),
                              rep(3, 5), rep(4, 4),
                              rep(5, 3), 6, rep(NA, 2),
                              1:21, rep(NA, 3)),
                     nrow = 5, ncol = 24, byrow = TRUE)
results6 <- opChar2(algorithm = "ID5",
                    alpha = c(18.25, 0.75, 0.75, 0.25),
                    Se = Se, Sp = Sp,
                    hier.config = config.mat)
#> 
#>  Number of minutes running:  0 
#>  
summary(results6)
#> 
#> Algorithm: Informative five-stage hierarchical testing 
#> 
#> Testing configuration:
#> Stage 1: 24
#> Stage 2: 18,6
#> Stage 3: 9,9,4,1,1
#> Stage 4: 6,3,5,4,3,1
#> 
#> Expected number of tests: 12.41
#> Expected number of tests per individual: 0.5169
#> 
#> Disease 1 accuracy for individuals:
#>   PSe PSp PPPV PNPV Individuals
#> 1  NA  NA   NA   NA        <NA>
#> 
#> Disease 2 accuracy for individuals:
#>   PSe PSp PPPV PNPV Individuals
#> 1  NA  NA   NA   NA        <NA>
#> 
#> Overall accuracy of the algorithm:
#>      PSe    PSp   PPPV   PNPV
#> 1 0.8808 0.9978 0.9584 0.9933
#> 2 0.8950 0.9978 0.9437 0.9956
#> 
#> PSe denotes the pooling sensitivity.
#> PSp denotes the pooling specificity.
#> PPPV denotes the pooling positive predictive value.
#> PNPV denotes the pooling negative predictive value.

# 3) Estimation of the overall disease prevalence and
#   calculation of confidence intervals.

# 3.1) Suppose 3 groups out of 24 test positively.
#   Each group has a size of 7.
propCI(x = 3, m = 7, n = 24, ci.method = "CP")
#> 
#> 95 percent Clopper-Pearson confidence interval:
#>  [ 0.003838, 0.05432 ]
#> Point estimate (Maximum Likelihood): 0.0189
propCI(x = 3, m = 7, n = 24, ci.method = "Blaker")
#> 
#> 95 percent Blaker confidence interval:
#>  [ 0.005056, 0.0513 ]
#> Point estimate (Maximum Likelihood): 0.0189
propCI(x = 3, m = 7, n = 24, ci.method = "score")
#> 
#> 95 percent Wilson score confidence interval:
#>  [ 0.006325, 0.05164 ]
#> Point estimate (Maximum Likelihood): 0.0189
propCI(x = 3, m = 7, n = 24, ci.method = "soc")
#> 
#> 95 percent Second-Order Corrected confidence interval:
#>  [ 0.003365, 0.04737 ]
#> Point estimate (Maximum Likelihood): 0.0189

# 3.2) Consider the following situation:
#   0 out of 5 groups test positively with groups
#   of size 1 (individual testing),
#   0 out of 5 groups test positively with groups of size 5,
#   1 out of 5 groups test positively with groups of size 10,
#   2 out of 5 groups test positively with groups of size 50
propCI(x = c(0, 0, 1, 2), m = c(1, 5, 10, 50),
       n = c(5, 5, 5, 5), pt.method = "Gart",
       ci.method = "skew-score")
#> 
#> 95 percent Skew-Corrected score (Gart) confidence interval:
#>  [ 0.002806, 0.02879 ]
#> Point estimate (Gart's Correction): 0.01027

# 4) Estimate a group testing regression model.

# 4.1) Fit a group testing regression model with
#   simple pooling using the "hivsurv" dataset.
data(hivsurv)
fit1 <- gtReg(type = "sp",
              formula = groupres ~ AGE + EDUC.,
              data = hivsurv, groupn = gnum,
              sens = 0.9, spec = 0.9, method = "Xie")
#> 
#>  Number of minutes running: 0 
#>  
summary(fit1)
#> 
#> Call:
#> gtReg(type = "sp", formula = groupres ~ AGE + EDUC., data = hivsurv, 
#>     groupn = gnum, sens = 0.9, spec = 0.9, method = "Xie")
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.1729  -0.9406  -0.8281   1.3386   1.6497  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)  
#> (Intercept) -3.11976    1.84816  -1.688   0.0914 .
#> AGE         -0.05692    0.07777  -0.732   0.4642  
#> EDUC.        0.82833    0.50717   1.633   0.1024  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#>     Null deviance: 154.3  on 85  degrees of freedom
#> Residual deviance: 109.5  on 83  degrees of freedom
#> AIC: 115.5
#> 
#> Number of iterations in EM: 43
#> 

# 4.2) Simulate data for the halving protocol, and
#   fit a group testing regression model.
set.seed(46)
gt.data <- gtSim(type = "halving", par = c(-6, 0.1),
                 gshape = 17, gscale = 1.4,
                 size1 = 1000, size2 = 5,
                 sens = 0.95, spec = 0.95)
fit2 <- gtReg(type = "halving", formula = gres ~ x,
              data = gt.data, groupn = groupn,
              subg = subgroup, retest = retest,
              sens = 0.95, spec = 0.95,
              start = c(-6, 0.1), trace = TRUE)
#> beta is -5.729041 0.09475251 	diff is 0.05247494 
#> beta is -5.681919 0.09362653 	diff is 0.0118833 
#> beta is -5.673643 0.09340603 	diff is 0.002355171 
#> beta is -5.672209 0.09336529 	diff is 0.0004361546 
#> beta is -5.671964 0.09335803 	diff is 7.774958e-05 
#> 
#>  Number of minutes running: 0 
#>  
summary(fit2)
#> 
#> Call:
#> gtReg(type = "halving", formula = gres ~ x, data = gt.data, groupn = groupn, 
#>     subg = subgroup, retest = retest, sens = 0.95, spec = 0.95, 
#>     start = c(-6, 0.1), trace = TRUE)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.9189  -0.6950  -0.6348  -0.5744   1.9966  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -5.67196    0.80874  -7.013 2.33e-12 ***
#> x            0.09336    0.02866   3.258  0.00112 ** 
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> AIC: 365.9
#> 
#> Number of iterations in EM: 5
#>