better convergence tests #120

Open
bbolker opened this issue Sep 5, 2013 · 34 comments
@bbolker commented Sep 5, 2013

In Banff, Rune commented that he had a suite of convergence tests for GLMMs in the ordinal package. We should think about adapting/stealing these; also worth looking at the KKT criterion calculators from John Nash's packages (not sure which package they're in -- I don't think they're in optimx any more).

On a related topic, this test for misspecification -- http://dx.doi.org/10.1016/j.csda.2008.02.033 looks interesting.

@runehaubo

In case you want to do more in this respect, I have pasted in the relevant code from ordinal below (for context see pkg/ordinal/R/convergence.R on R-forge). This function is implemented for CLMs without random effects - implementation for CLMMs is still on my to-do list. That would require dealing with the issues related to theta-values on the boundary and checking positive definiteness of the individual random-effects variance-covariance matrices.

Note that negative convergence values indicate convergence failure and positive values indicate several levels of success. For instance, res$code == 1L means that the Hessian is singular (not positive definite), so we found an optimum; the likelihood value is correct and can be used in LR tests etc., but the parameters are not uniquely determined.

Note also that the Theta.ok argument is specific to cumulative link models (has to do with the order of the thresholds when nominal effects are at work).

Feel free to use this for inspiration or copy what you feel is relevant. Comments and critique are of course more than welcome.

Cheers,
Rune

conv.check <-
    function(fit, control=NULL, Theta.ok=NULL, tol=sqrt(.Machine$double.eps), ...)
### Compute variance-covariance matrix and check convergence along the
### way.
### fit: clm-object or the result of clm.fit.NR()
###
### Return: list with elements
### vcov, code, cond.H, and messages
{
    if(missing(control))
        control <- fit$control
    if(is.null(control))
        stop("'control' not supplied - cannot check convergence")

    if(!is.null(control$tol))
        tol <- control$tol
    if(tol < 0)
        stop(gettextf("numerical tolerance is %g, expecting non-negative value",
                      tol), call.=FALSE)
### FIXME: test this.
    H <- fit$Hessian
    g <- fit$gradient
    max.grad <- max(abs(g))
    cov <- array(NA_real_, dim=dim(H), dimnames=dimnames(H))
    cond.H <- NA_real_
    res <- list(vcov=cov, code=integer(0L), cond.H=cond.H,
                messages=character(0L))
    class(res) <- "conv.check"
    if(is.list(code <- fit$convergence))
        code <- code[[1L]]
    mess <-
        switch(as.character(code),
               "0" = "Absolute and relative convergence criteria were met",
               "1" = "Absolute convergence criterion was met, but relative criterion was not met",
               "2" = "iteration limit reached",
               "3" = "step factor reduced below minimum",
               "4" = "maximum number of consecutive Newton modifications reached")
    res <- c(res, alg.message=mess)
    ## }
    if(max.grad > control$gradTol) {
        res$code <- -1L
        res$messages <-
            gettextf("Model failed to converge with max|grad| = %g (tol = %g)",
                     max.grad, control$gradTol)
        return(res)
    }
    evd <- eigen(H, symmetric=TRUE, only.values=TRUE)$values
    negative <- sum(evd < -tol)
    if(negative) {
        res$code <- -2L
        res$messages <-
            gettextf(paste("Model failed to converge:",
                           "degenerate Hessian with %d negative eigenvalues"),
                     negative)
        return(res)
    }
    if(!is.null(Theta.ok) && !Theta.ok) {
        res$code <- -3L
        res$messages <-
            "not all thresholds are increasing: fit is invalid"
        return(res)
    }
    zero <- sum(abs(evd) < tol)
    ch <- try(chol(H), silent=TRUE)
    if(zero || inherits(ch, "try-error")) {
        res$code <- 1L
        res$messages <-
            "Hessian is numerically singular: parameters are not uniquely determined"
        return(res)
    }
    ## Add condition number to res:
    res$cond.H <- max(evd) / min(evd)
    ## Compute Newton step:
    step <- c(backsolve(ch, backsolve(ch, g, transpose=TRUE)))
    ## Compute var-cov:
    res$vcov[] <- chol2inv(ch)
    if(max(abs(step)) > control$relTol) {
        res$code <- c(res$code, 1L)
        corDec <- as.integer(min(cor.dec(step)))
        res$messages <-
            c(res$messages,
              gettextf("some parameters may have only %d correct decimals",
                       corDec))
    }
    if(max(evd) * tol > 1) {
        res$code <- c(res$code, 2L)
        res$messages <-
            c(res$messages,
              paste("Model is nearly unidentifiable: ",
                    "very large eigenvalue",
                    "\n - Rescale variables?", sep=""))
    }
    if((min(evd) / max(evd)) < tol) {
        res$code <- c(res$code, 3L)
        if(!5L %in% res$code) {
            res$messages <-
                c(res$messages,
                  paste("Model is nearly unidentifiable: ",
                        "large eigenvalue ratio",
                        "\n - Rescale variables?", sep=""))
        }
    }
    if(!length(res$code)) {
        res$code <- 0L
        res$messages <- "successful convergence"
    }
    res
}

print.conv.check <-
    function(x, action=c("warn", "silent", "stop", "message"), ...)
{
    action <- match.arg(action)
    if(x$code == 0L || action == "silent") return(invisible())

    Text <- paste("(", x$code[1L], ") ", x$messages[1L], sep="")
    if(!is.null(alg.text <- x$alg.message))
        Text <- paste(Text, "\nIn addition:", alg.text)
    switch(action,
           "stop" = stop(Text, call.=FALSE),
           "warn" = warning(Text, call.=FALSE),
           "message" = message(Text))
}

@bbolker commented Oct 1, 2013

Might also be worth linking to http://www.inside-r.org/packages/cran/brglm/docs/separation.detection, or considering its strategy (from http://article.gmane.org/gmane.comp.lang.r.general/300335).

@bbolker commented Nov 27, 2013

Various rambling thoughts:

  • I think it makes sense to go ahead and import genD from the numDeriv package: this will add one more dependency, but it's a fairly lightweight and widely used package anyway (a rough sketch of how this could be used follows this list). We could write our own gradient and Hessian functions, as Rune did, but I'd rather not re-invent the wheel: Rune, do you have some evidence that it would be worthwhile?
  • I'm trying to figure out the control interface. It doesn't make sense to calculate the derivatives unless we have to, but once we do it seems sensible to use them both for convergence checks and to calculate the standard errors on the beta parameters (for dealing with issue Incorrect standard error of beta-hat #47). My first thought was a calc_deriv flag in [g]lmerControl (unfortunately, we already have both underscore-separated (restart_edge) and dot-separated (check.nobs.vs.nlev) flags in [g]lmerControl: we could rename restart_edge to restart.edge, keeping the former as a deprecated synonym for now ...), which would turn on both convergence testing and Hessian-based SE estimates, but probably they should be two separate logical flags -- say check.conv and hessian.stderr? (If either is TRUE then we compute derivatives in optwrap and return them as part of the attributes, although we would want to disable the derivative computation in the first, nAGQ=0 phase of GLMM fitting ...)
  • I propose that we add the derivatives (gradient + Hessian), in the format returned by numDeriv::genD, as a third component of the @devcomp slot (which already contains $cmp, a vector of individual numeric scalars such as PWRSS and REML, and $dims, a vector of integer scalars and binary flags).
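
To make the first bullet concrete, here is a rough sketch (not lme4 code; the sleepstudy example is just a placeholder) of how numDeriv could supply the gradient and Hessian of the deviance function with respect to theta, using the lFormula/mkLmerDevfun construction that appears later in this thread. genD would bundle both derivatives in one call; the separate grad/hessian calls are shown only for clarity.

library(lme4)
library(numDeriv)

## rebuild the (theta-only) LMM deviance function outside the fitted model
lmod   <- lFormula(Reaction ~ Days + (Days | Subject), sleepstudy)
devfun <- do.call(mkLmerDevfun, lmod)

fm    <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
theta <- fm@theta                      # covariance parameters at convergence

g <- grad(devfun, theta)               # finite-difference gradient
H <- hessian(devfun, theta)            # finite-difference Hessian

max(abs(g))                            # absolute gradient criterion
max(abs(solve(H, g)))                  # Newton-step ("relative") criterion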

@dmbates commented Nov 27, 2013

Yesterday I pushed a new version of lmer.tex to our private repository. In there I describe the calculations for the analytic gradient for the profiled log-likelihood or REML criterion. In general this is not a big win because it involves either solving q by q systems multiple times or taking the dreaded inversion of the sparse Cholesky factor. However, in the case of a single random-effects term or nested grouping factors, it is feasible to do this.

It seems that cases of poor convergence are often associated with vector-valued random effects. I'll take a look at whether it is reasonable to evaluate the analytic gradient for the case of a single, vector-valued random-effects term.

@dmbates commented Nov 27, 2013

I think that evaluation of the "difficult" term in the expression for the gradient will be relatively easy in the case of a single, vector-valued, random-effects term if L and ZtZ are packed into 3-dimensional arrays. The point is that L is block-diagonal.

I managed to work it out in the LMMVector1 representation in the Julia MixedModels package and it does give the correct values

julia> gradient(f, th)
10-element Array{Float64,1}:
 -14835.2  
    740.121
    481.964
   1068.77 
 -12533.1  
   1007.36 
   1801.64 
 -12864.3  
    422.39 
 -13846.4  

julia> res - syr2k('L','N',1.,Ztr,fm1.u)/scale(fm1,true)
4x4 Array{Float64,2}:
 -14835.2        -2.92906e-5      -0.000865546       0.0115895 
    740.12   -12533.0             -0.0199377        -0.00443789
    481.963    1007.36        -12864.3              -0.0188079 
   1068.78     1801.64           422.39         -13846.5       

This is the gradient at the starting estimates for ML in the simulated example sent by Laura Bringmann. An individual evaluation of the objective (negative twice the log-likelihood) takes about 0.5 sec on my desktop so the central finite-difference gradient takes close to 9 seconds.

julia> @time f(th)
elapsed time: 0.466127221 seconds (63218368 bytes allocated)
-19516.97535188828

julia> @time gradient(f, th);
elapsed time: 8.667017285 seconds (1264367056 bytes allocated)

julia> @time gradient(f, th, :forward);
elapsed time: 4.763594825 seconds (695402176 bytes allocated)

I think the analytic gradient evaluation will be much faster than the finite-difference evaluation. I'll code it up now.

@dmbates commented Nov 27, 2013

I got it coded up in Julia for the particular case of a single, vector-valued random-effects term. At the ML estimates for the simulated example sent by Laura Bringmann I get

julia> @time grad(fm1)    # analytic gradient
elapsed time: 0.111372923 seconds (20682456 bytes allocated)
10-element Array{Float64,1}:
 -0.0601314 
 -0.0343705 
 -0.00175826
 -0.0214015 
 -0.00521561
 -0.0271165 
 -0.0339831 
  0.0374451 
 -0.0356073 
  0.0844693 

julia> @time gradient(f, th)     # central finite-differences
elapsed time: 8.697922726 seconds (1264367056 bytes allocated)
10-element Array{Float64,1}:
 -0.0601344  
 -0.0354681  
 -0.000934209
 -0.023643   
 -0.00521386 
 -0.0267069  
 -0.0379575  
  0.0374439  
 -0.0370824  
  0.0849209

One of the gradient-based optimizers from NLopt does a great job on this problem now.

julia> dd = DataFrame(read_rda("/var/tmp/dd.rda")["dd"]);

julia> fm1 = lmm(:(Y ~ X1 + X2 + X3 + (X1 + X2 + X3|indiv)), dd);

julia> fit(fm1,true,:LD_MMA)
f_1: -19516.97535, [1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0]
f_2: -39241.77318, [1.9,-0.9,-0.9,-0.9,1.9,-0.9,-0.9,1.9,-0.9,1.9]
f_3: -55518.44113, [2.8,0.0,0.0,0.0,2.8,0.0,0.0,2.8,0.0,2.8]
f_4: -58876.02262, [3.58235,-0.358756,-0.28832,-0.421558,3.56465,-0.411242,-0.509618,3.56746,-0.26706,3.57527]
f_5: -60690.71466, [4.47944,0.362396,0.429531,0.318362,4.45232,0.283756,0.163937,4.45616,0.388081,4.46659]
f_6: -61952.03137, [5.55944,-0.246584,-0.178461,-0.289818,5.53232,-0.318704,-0.442857,5.53616,-0.210779,5.54659]
f_7: -62586.2446, [6.85544,0.101953,0.194459,-0.50237,6.82832,-0.612018,-0.85953,6.83216,0.143818,6.84259]
f_8: -62897.37305, [8.41064,-0.206747,-0.114241,0.0268298,8.38352,-0.0828178,-0.33033,8.38736,-0.164882,8.39779]
f_9: -63004.91419, [10.2769,-0.422837,-0.330331,-0.34361,10.2498,-0.453258,-0.70077,10.2536,-0.380972,10.264]
f_10: -62986.31918, [12.1431,-0.60186,-0.422365,-0.709167,11.8857,-0.814802,-1.07121,12.1198,-0.529517,12.1303]
f_11: -62992.28445, [12.1431,-0.520373,-0.350615,-0.627009,11.2443,-0.727098,-1.00885,11.9484,-0.439437,12.1087]
f_12: -63002.29961, [11.8773,-0.445522,-0.332703,-0.477038,10.5258,-0.574723,-0.869848,11.3596,-0.390401,11.6991]
f_13: -63008.64808, [11.2363,-0.425757,-0.330592,-0.371123,10.288,-0.476764,-0.743155,10.6242,-0.382067,10.973]
f_14: -63009.0832, [11.1644,-0.427679,-0.331244,-0.385482,10.2641,-0.489801,-0.767321,10.339,-0.383067,10.6628]
f_15: -63009.18824, [11.1915,-0.429101,-0.33164,-0.396257,10.2742,-0.499401,-0.785592,10.4396,-0.38369,10.7371]
f_16: -63008.72809, [11.1231,-0.447147,-0.337497,-0.436296,10.239,-0.53592,-0.845403,10.063,-0.392821,10.4645]
f_17: -63009.23308, [11.1803,-0.432239,-0.332587,-0.404232,10.2685,-0.506516,-0.799106,10.3629,-0.385195,10.6856]
f_18: -63008.7894, [11.2622,-0.468436,-0.344374,-0.478047,10.3156,-0.574504,-0.902504,10.8153,-0.403757,11.0101]
f_19: -63009.34215, [11.1975,-0.440596,-0.335008,-0.424752,10.2781,-0.524873,-0.832489,10.4992,-0.389132,10.7716]
f_20: -63009.07383, [11.0968,-0.518742,-0.366001,-0.564015,10.2152,-0.655587,-1.0122,10.0245,-0.437333,10.3941]
f_21: -63009.61615, [11.1718,-0.464273,-0.342831,-0.47844,10.2628,-0.573593,-0.913704,10.2994,-0.40213,10.6322]
f_22: -63009.61687, [11.3248,-0.614987,-0.398528,-0.716269,10.3853,-0.798486,-1.19986,10.744,-0.500564,11.0104]
f_23: -63010.42543, [11.1063,-0.588566,-0.38274,-0.763461,10.1778,-0.832115,-1.31589,10.3703,-0.504381,10.6507]
f_24: -63010.51364, [11.2461,-0.568719,-0.368417,-0.767037,10.3619,-0.823659,-1.36279,10.4502,-0.492815,10.7235]
f_25: -63010.55464, [11.1993,-0.565832,-0.365924,-0.772123,10.2892,-0.825778,-1.38783,10.4048,-0.490288,10.6854]
f_26: -63010.56045, [11.1996,-0.562139,-0.363411,-0.774561,10.3008,-0.825208,-1.40431,10.4121,-0.487716,10.6894]
f_27: -63010.56294, [11.1996,-0.52455,-0.338353,-0.802165,10.2944,-0.82281,-1.53433,10.4054,-0.461637,10.6866]
f_28: -63010.56942, [11.1979,-0.543388,-0.358481,-0.788481,10.3249,-0.822208,-1.4808,10.4167,-0.47848,10.6913]
f_29: -63010.57076, [11.2001,-0.544991,-0.356392,-0.787165,10.295,-0.824967,-1.48203,10.4078,-0.478137,10.6876]
f_30: -63010.57118, [11.1994,-0.544498,-0.355878,-0.786782,10.3042,-0.823787,-1.4799,10.4105,-0.477827,10.6885]
f_31: -63010.57104, [11.1989,-0.544692,-0.354937,-0.785012,10.3088,-0.823231,-1.47352,10.4116,-0.477376,10.6892]
f_32: -63010.57118, [11.1994,-0.544518,-0.355782,-0.786603,10.3047,-0.823731,-1.47925,10.4106,-0.477782,10.6886]
f_33: -63010.57114, [11.1988,-0.544666,-0.354751,-0.786355,10.301,-0.823331,-1.47845,10.4109,-0.4773,10.6888]
f_34: -63010.57118, [11.1993,-0.544533,-0.355679,-0.786578,10.3043,-0.823691,-1.47917,10.4107,-0.477733,10.6886]
f_35: -63010.57118, [11.1993,-0.54454,-0.355633,-0.786567,10.3046,-0.823668,-1.47912,10.4107,-0.477712,10.6886]
f_36: -63010.57118, [11.1993,-0.544547,-0.355604,-0.786561,10.3045,-0.823658,-1.4791,10.4107,-0.477699,10.6886]
FTOL_REACHED
Linear mixed model fit by maximum likelihood
 logLik: 31505.285592, deviance: -63010.571184

 Variance components:
                Variance    Std.Dev.
 indiv          1.254095    1.119864
 Residual       1.081380    1.039894
 Number of obs: 40000; levels of grouping factors: 200

  Fixed-effects parameters:
        Estimate Std.Error z value
[1,]     5.14017 0.0788611   65.18
[2,]     5.04852 0.0726618 69.4797
[3,]     4.99547 0.0735801 67.8915
[4,]     5.12455 0.0762583 67.1999

(Don't pay attention to the description of the variance components - they're not being displayed properly right now.)

A comparable fit with a derivative-free optimizer (BOBYQA as implemented in NLopt) required about 850 function evaluations which, of course, made it much slower.

@runehaubo

On Hessian computations for GLMMs:

I haven't used the genD function from numDeriv much; I have mainly used the grad and hessian functions. The problem I ran into when using the numDeriv hessian function was that I couldn't turn off the Richardson extrapolation part (I had a small email exchange with Paul Gilbert about this), which meant that it took many more function evaluations to obtain the Hessian than to optimize the likelihood function. Richardson extrapolations are good if the objective function is very far from quadratic and it is paramount that a very accurate Hessian is obtained. I don't think this is the situation we are in with GLMMs. From the ordinal examples I looked at, the differences in standard errors between numDeriv's hessian and my own central finite-difference implementation were very small (3rd decimal at most IIRC). I don't remember which examples I looked at, but my feeling is that differences will be small even if some variance parameters are close to the boundary or otherwise ill determined. In ordinal, I valued major speed gains over what seem to be irrelevant accuracy gains, and I would recommend the same for glmer.

I decided to split some of the tools functions in ordinal into a separate package, Rufus (Rune's utility functions), currently on R-Forge under the ordinal project, but which I am contemplating releasing to CRAN. Rufus has the simple central-difference gradient and hessian functions, so if you don't want those functions to live privately in lme4, you could grab them from there. (They are in my view fairly small self-contained functions that could easily live privately in lme4.)

@bbolker commented Nov 28, 2013

Thanks, Rune, that makes sense (it's sort of too bad that numDeriv doesn't implement method "simple" for Hessians -- it is allowed for gradients ... it wouldn't seem too hard for them to incorporate it). I will steal the central finite-differences code from Rufus. The only reason I was thinking about using genD was that it returns the gradient and Hessian information from a single function call (otherwise, if you use grad and hessian separately, you're repeating the gradient calculations, since you need them for the Hessian ...)

@runehaubo

Yes, I made that suggestion to Paul Gilbert, but he wasn't interested. If you grab deriv12 from Rufus, you will get both gradient (for "free") and hessian. It's there exactly for the reason you mention.
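
Not the actual Rufus/ordinal deriv12, but a minimal sketch of the idea: the one-sided evaluations used for the central-difference gradient are reused for the diagonal and the cross terms of the Hessian, which is why the gradient comes along for "free".

deriv12_sketch <- function(fun, x, delta = 1e-4, fx = fun(x)) {
    ## central-difference gradient and Hessian from shared function evaluations
    n <- length(x)
    g <- fplus <- fminus <- numeric(n)
    H <- matrix(NA_real_, n, n)
    for (i in seq_len(n)) {
        xp <- xm <- x
        xp[i] <- x[i] + delta
        xm[i] <- x[i] - delta
        fplus[i]  <- fun(xp)
        fminus[i] <- fun(xm)
        g[i]    <- (fplus[i] - fminus[i]) / (2 * delta)       # gradient
        H[i, i] <- (fplus[i] - 2 * fx + fminus[i]) / delta^2  # diagonal
    }
    for (i in seq_len(n)) for (j in seq_len(n)) if (j > i) {
        xpp <- xmm <- x
        xpp[c(i, j)] <- x[c(i, j)] + delta
        xmm[c(i, j)] <- x[c(i, j)] - delta
        ## off-diagonal second differences reuse fplus/fminus from above
        H[i, j] <- H[j, i] <-
            (fun(xpp) - fplus[i] - fplus[j] + 2 * fx -
             fminus[i] - fminus[j] + fun(xmm)) / (2 * delta^2)
    }
    list(gradient = g, Hessian = H)
}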

@bbolker commented Nov 28, 2013

A few further thoughts on setting tolerances for gradient checking, just so they're documented somewhere.

  • should look to see what Rune does
  • absolute tolerances (1e-4? 1e-5?) should be fine for theta parameters, because these are all relative to residual variance (for LMMs) or on the link scale (for GLMMs), so we expect the parameters to be roughly O(1) ...
  • beta parameters for GLMMs are a bit harder; we're on the link scale so parameters for categorical variables should be OK, but parameters for continuous variables could be weird. If the scale of the covariate is >>1 then the parameters will be <<1 and the absolute tolerances will be too lax, and vice versa if the scale of the covariate is <<1.
  • I suggest that we use absolute tolerances throughout, but provide the user the option of changing the absolute tolerances and/or using relative tolerances instead. (It's a little less critical because the results of a bad decision will be a false-positive or a false-negative warning [where there was no warning at all before!], rather than an error ...)

@bbolker commented Dec 9, 2013

still thinking about Rune's deriv12, which I'm happy to steal. That function assumes that the function can be evaluated in a radius of +/-delta (1e-4 by default). I'm wondering if we should add logic to (1) detect boundaries closer than delta (and use the distance to the boundary as delta instead) and (2) use one-sided differencing if the estimate is on the boundary. Obviously this logic could be tricky (especially when we go to second derivatives), and we will have to be much more careful in using gradient and second-derivative information to evaluate convergence when we are at or near the boundary, but it seems like it might be worth some effort. (The alternative is to skip the derivative calculations and convergence checking when we are at or near the boundary.) Thoughts ... ?
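
A rough sketch of the boundary-aware gradient logic described above (hypothetical, not an implementation; for lme4 theta parameters the lower bounds could come from getME(fit, "lower")):

bounded_grad <- function(fun, x, lower = rep(-Inf, length(x)), delta = 1e-4) {
    n <- length(x)
    g <- numeric(n)
    fx <- fun(x)
    for (i in seq_len(n)) {
        ## shrink the downward step if the lower boundary is closer than delta
        d.lo <- min(delta, x[i] - lower[i])
        xp <- xm <- x
        xp[i] <- x[i] + delta
        if (d.lo > 0) {
            xm[i] <- x[i] - d.lo
            ## (possibly asymmetric) two-sided difference; only first-order
            ## accurate when d.lo != delta
            g[i] <- (fun(xp) - fun(xm)) / (delta + d.lo)
        } else {
            ## parameter sits on the boundary: one-sided forward difference
            g[i] <- (fun(xp) - fx) / delta
        }
    }
    g
}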

@bbolker commented Jan 2, 2014

This is getting there. I don't want to close the issue at least until I've implemented some tests (which don't exist yet). Also, I think I may have the gradient tolerance set too low (right now it warns if max(|grad|)>0.001). A lot of Nelder-Mead examples fail, for example the standard fm2 <- lmer(Reaction~Days+(Days|Subject),sleepstudy) has a max gradient of -0.0012 (on the other hand, the bobyqa fit for the same model has max gradient of -3e-5, and 106 vs 191 function evaluations to boot ... more evidence that we should switch the default?)
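
For reference, the Nelder-Mead/bobyqa comparison is easy to reproduce along these lines (a sketch; it assumes, as shown later in this thread, that the derivative information is stored in @optinfo$derivs and that @optinfo$feval records the function-evaluation count):

library(lme4)
fm_nm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
              control = lmerControl(optimizer = "Nelder_Mead"))
fm_bq <- update(fm_nm, control = lmerControl(optimizer = "bobyqa"))
sapply(list(Nelder_Mead = fm_nm, bobyqa = fm_bq), function(m)
       c(max.grad = max(abs(m@optinfo$derivs$gradient)),
         feval    = m@optinfo$feval))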

@mmaechler

all these "absolute" scales 1e-4, 0.001 are they really applied to "scale free" (or "relative") criteria?

@bbolker commented Jan 2, 2014

Yes, in this particular case. For LMMs all of the parameters are 'theta' values, which are scale free/relative. It's not the case for GLMMs, where the betas aren't necessarily scaled. This is part of the difficulty of coming up with good general rules ...

@bbolker commented Feb 3, 2014

Hmm. Can't replicate the issue I commented about above (gradient of -0.0012 with Nelder_Mead).

Doug's analytical gradient stuff above is very exciting. Perhaps I'll see if Gabor Grothendieck can work on this next month ...

In principle I think we ought to be careful not to check gradients for singular cases (if the optimal fit is on the boundary, there's no reason the gradient shouldn't be large in directions normal to the active boundaries). However, I haven't been able to find a test case yet -- in the singular examples I've looked at, the gradient also goes to zero at the boundary (unless I've made a mistake computing the gradients in the boundary case ...)

@rubenarslan

Hi, am I understanding this discussion correctly as meaning that your new convergence checks are sensitive to sample size? Sorry to interject, I can make a new issue if that's better.

I am getting some new convergence warnings on models that used to run fine, with everything that I'm used to checking looking fine (scaled my predictors, no unity correlations, no zero variances).

I could replicate this using sample data (sleepstudy and Penicillin) by increasing the sample size. Same pattern here, warning is triggered but model seems fine (to my probably too naïve checks).

 data(Penicillin)
 summary(fm2 <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin))
 Penicillin2 = Penicillin[sample(x=nrow(Penicillin),size=200000,replace=T),]
 summary(fm2 <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin2))

Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00507687 (tol = 0.002)

@bbolker commented Mar 31, 2014

We have not done any systematic testing, but anecdotally it does seem that there's a tendency for more convergence warnings with bigger problems, although now that I think of it I have noticed more issues with large parameter sets than with large data (although the two are correlated in practice). Perhaps the discrepancy between the convergence criteria for our derivative-free optimizers and the size of the finite-difference gradient increases with problem size? (Naively I would have expected it to go the other way ...) Your example is both useful (because it gives us a way to isolate one effect for a single model structure) and disturbing (because I thought that most of the convergence warnings were occurring with glmer() fits, which are still using Nelder-Mead by default for part of the optimization process ...)

@rubenarslan

If this helps: my use case where I started seeing convergence warnings after coming back to older analyses a few months later definitely involved both large data and large parameter sets (i.e. at its largest 300k individuals in 40k families). It was the above warning in all my models, using both lmer and glmer with a binomial and a Poisson outcome. I also used bobyqa optimisers for some models, and after googling my warnings I also tried an optimizer that requires optimx (I can check which when I'm back at work).
The data aren't mine to share unfortunately, but I can of course try to generate some more examples or run some tests on my data.

@bbolker commented Apr 2, 2014

Still working on/thinking about this. I still don't have a strong intuition for why increasing sample sizes should be causing problems, but I have a guess: perhaps the gradients should be scaled relative to the curvature when testing convergence. That is, instead of the gradient, we should be computing the expected linear change over one standard error of the parameter (sketched below)? This is admittedly a wild guess, but it has a few nice features:

  • when the standard error is O(1) it won't change the convergence test very much.
  • it should take care of convergence warnings for large problems.
  • in one of the examples I have been looking at just now, the gradient element causing the problem (which was not fixed by any of the available optimizers, including nlminb) was indeed the one with a small standard error.

The right way to handle the problem would be to look closely at the convergence criteria for the various optimizers and figure out how they can stop when the derivatives are still non-trivial (by some definition).
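
A minimal sketch of that guess (fitted_model is a placeholder for any merMod object; the 1e-3 threshold is arbitrary). The Newton step solve(H, g), discussed below, is a closely related way of scaling by curvature.

## approximate "expected change over one standard error": gradient element
## times the Hessian-based standard error of the corresponding parameter
H  <- fitted_model@optinfo$derivs$Hessian
g  <- fitted_model@optinfo$derivs$gradient
se <- sqrt(diag(solve(H)))
max(abs(g * se)) < 1e-3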

@rubenarslan

This is beyond my level of understanding, I'm afraid. Can I just ask: if convergence warnings seem to be a bit trigger-happy at the moment, would you recommend other checks besides the ones I mentioned?

@runehaubo

As I understand you, you are talking about what I would call a relative convergence criterion. An approximate error estimate of the current estimate/iterate is given by the size of the Newton step close to the optimum (I know this as the method-independent error theorem). And the Newton step is exactly the gradient scaled by the variance-covariance matrix of the parameters evaluated at the current iterate.

You compute the Newton step/relative criterion with something like solve(H, g), or, if you like, scale the gradient with the Cholesky factor of H, the Hessian.

I generally think that it is nice to know how close one is to the optimum both in an absolute and a relative sense. In my mind they complement each other, but for any given application one may of course be more important than the other.

IIRC the relative criterion is somewhat similar to Bates' orthogonality criterion, but I don't have my notes with me to check right now.

I don’t remember all the details, but IIRC, there might be a problem with the relative criterion for parameters with large standard errors. In particular if they are close to zero, the error in those relative to their size (not uncertainty) can be substantial. You might argue that one cannot expect such estimates to be accurate given the substantial statistical uncertainty, but my feeling is that (some) people will expect parameters to be accurately determined irrespective of the statistical uncertainty.

@runehaubo

I played around with the Penicillin case and noticed some unexpected differences in the derivative calculations compared to the numDeriv results - especially for the Hessian:

> Penicillin2 <- Penicillin[sample(x=nrow(Penicillin),size=200000,replace=TRUE),]
> summary(fm2 <- lmer(diameter ~ 1 + (1|plate) + (1|sample),
+                     Penicillin2))
Linear mixed model fit by REML ['lmerMod']
Formula: diameter ~ 1 + (1 | plate) + (1 | sample)
   Data: Penicillin2

REML criterion at convergence: 284480.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3315 -0.8002  0.0340  0.6990  3.3897 

Random effects:
 Groups   Name        Variance Std.Dev.
 plate    (Intercept) 0.7719   0.8786  
 sample   (Intercept) 3.8450   1.9609  
 Residual             0.2424   0.4924  
Number of obs: 200000, groups: plate, 24; sample, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  22.9720     0.8204      28
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0632837 (tol = 0.002)
> ## Make the deviance function to evaluate derivatives:
> lmod <- lFormula(diameter ~ 1 + (1|plate) + (1|sample), Penicillin2)
> devfun2 <- do.call(mkLmerDevfun, lmod)
> parVals <- fm2@theta
> devfun2(parVals) == deviance(fm2)
[1] TRUE
> fm2@optinfo$derivs
$gradient
[1] 0.02867542 0.06328366

$Hessian
           [,1]        [,2]
[1,] 19.6484375  -0.1826172
[2,] -0.1826172 183.3125000

> lme4:::deriv12(devfun2, parVals)
$gradient
[1] 0.02867542 0.06328366

$Hessian
           [,1]        [,2]
[1,] 19.6484375  -0.1826172
[2,] -0.1826172 183.3125000

## Now look at the numDeriv Hessian:
> library(numDeriv)
> (nD.grad <- grad(devfun2, parVals))
[1] 0.03050477 0.05728953
> (nD.hess <- hessian(devfun2, parVals))
              [,1]          [,2]
[1,] 28.8397530035 -0.0003489932
[2,] -0.0003489932  1.2176228863
## Look at the Newton step:
> solve(nD.hess, nD.grad)
[1] 0.001058303 0.047050607
> with(fm2@optinfo$derivs, solve(Hessian, gradient))
[1] 0.0014626472 0.0003466799
> newPar1 <- fm2@theta - solve(nD.hess, nD.grad)
> devfun2(fm2@theta) > devfun2(newPar1) ## OK
[1] TRUE
> devfun2(fm2@theta) - devfun2(newPar1) ## OK
[1] 0.001438726
> (nD.grad1 <- grad(devfun2, newPar1))
[1] -0.002742818  0.024253791

So taking the Newton step does indeed lower the deviance and decrease the gradient. Not so using the deriv12 calculations:

> newPar2 <- fm2@theta - with(fm2@optinfo$derivs, solve(Hessian, gradient))
> (nD.grad1 <- grad(devfun2, newPar2))
[1] -0.01097560  0.06537335
> lme4:::deriv12(devfun2, newPar2)$gradient
[1] -0.01400564  0.06528862

I also noticed that in this case the Hessian calculation seemed very sensitive to the size of delta:

> sapply(10^(-(1:6)), function(d) {
+     lme4:::deriv12(devfun2, parVals, delta=d)$Hessian[2, 2]
+ })
[1] 1.218145e+00 1.242989e+00 4.163147e+00 1.833125e+02 3.248350e+04
[6] 2.949472e+06

That makes me wonder what the real Hessian is, and how the likelihood/deviance function looks when the curvature is this sensitive to the scale at which you look at it.

A couple of thoughts:
How much of this problem is due to inaccurately evaluated derivatives? Perhaps the risk of inaccuracies increases with sample size/information? Are the optimizers having a hard time with high-information cases, i.e. highly spiked likelihood functions?

@bbolker commented Apr 3, 2014

Quick comments:

  • I would be worried about the derivative calculations (i.e. that I had screwed something up in coding them), especially since I hacked the code considerably to try to allow in a sensible way for boundary cases (although arguably that shouldn't matter since we should be thinking much more carefully about what to do in singular cases anyway). However, in the smaller cases I've looked at, I matched numDeriv::grad correctly, so there obviously weren't glaring bugs (and here we are not talking about singular cases either)
  • I don't think this is a case of the optimizers having trouble -- so far, the investigations I've done have suggested that these are false positives -- that is, the optimizers are doing a reasonable job, but for one reason or another our strategies for estimating convergence are not working properly. (Remember that of the box-constrained optimizers we have available, only nlminb and L-BFGS-B are using explicit (finite-difference) derivative information, and they're only using gradients, not Hessians)
  • just to repeat, I think that the problem here is that (mostly) our convergence tests are too sensitive, not that our fits are actually bad. There is some evidence from what I've done that we should be ditching Nelder-Mead, but I think that's a relatively small part of the problem.

@bbolker commented Apr 8, 2014

I would still like to think about this more, but I'm leaning very strongly towards switching to a relative convergence criterion (as suggested by Rune), abs(solve(H, g)) < tol, and seeing how many of our problems go away ... comments?

@dmbates commented Apr 8, 2014

I noticed today that the usual model fit to the InstEval data also gives a false positive. In general, the deviance for models fit to data sets with a large number of observations and a large number of random effects tends to be large. It doesn't have to be, but most times the penalized residual sum of squares is large because of the large number of observations (RSS) and the large number of random effects (contribution to the penalty).

@bbolker commented Apr 9, 2014

Hmmm. On my system I have

 system.time(fm1 <- lmer(y ~ service*dept + (1|s) + (1|d), data=InstEval,
+                             REML=FALSE))
   user  system elapsed 
 36.171   5.189  63.783 
> fm1@optinfo$derivs$gradient
[1] -0.0009109499  0.0003493915

When I do it with REML instead I get

[1] -0.0011370867 -0.0006980554

which would be enough to trigger a warning if we still had the tolerance set at 0.001, but
lmerControl()$checkConv$check.conv.grad$tol is 0.002 in the current release ...

On the other hand, the relative error is much smaller:

with(fm2@optinfo$derivs,solve(Hessian,gradient))
[1] -2.069835e-08 -4.267004e-08

(I have modified my local copy to check the relative gradients rather than the gradients, but investigating @optinfo$derivs$grad in this way should be enough).

I certainly understand that the deviance should be roughly proportional to the number of observations (hadn't thought about the large number of random effects), but that doesn't immediately tell me that the convergence achieved by nonlinear optimizers should be worse in absolute terms ...

Should I check in the change to relative tolerance?

@rubenarslan

Let me know if you want me to install a feature branch and test this on my real data.


@bbolker commented Apr 9, 2014

thanks. An even simpler test would be to take a fitted example that gave you convergence warnings and take a look at the results of

relgrad <- with(fitted_model@optinfo$derivs,solve(Hessian,gradient))
max(abs(relgrad))

and see if it's reasonably small (e.g. <0.001?)

@rubenarslan

It is looking good. I can run some more tests when I return from a conference on Friday.

I am also getting some "very large eigenvalue" warnings in some models; I'm attaching the full output. The variables are all scaled properly, but it is zero-inflated data, so that warning may be proper.

> summary(Children <- glmer(children ~ byear + male + paternalage.mean +paternalage.diff + (1|idParents), data= krmh_married,family='poisson',control=glmerControl(optimizer="bobyqa") ))
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod
#[snip]
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00647578 (tol = 0.001)
> relgrad <- with(Children@optinfo$derivs,solve(Hessian,gradient))
> max(abs(relgrad))
[1] 2.893846e-07

> summary(Age <- lmer(age ~ urban + male + paternalage.mean +paternalage.diff + (1|idParents), data= rpqa_born_age,control=lmerControl(optimizer="bobyqa") ))
#[snip]
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0301586 (tol = 0.002)

  > relgrad <- with(Age@optinfo$derivs,solve(Hessian,gradient))
  > max(abs(relgrad))
  [1] 2.076169e-07

  > summary(Children_c <- glmer(children ~ meanCenter(spouses) + byear + urban + male + paternalage.mean +paternalage.diff + (1|idParents), data= rpqa,family='poisson',control=glmerControl(optimizer="bobyqa") ))

  Warning messages:
  1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
    Model failed to converge with max|grad| = 0.129715 (tol = 0.001)
  2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
    Model is nearly unidentifiable: very large eigenvalue
   - Rescale variables?
   > relgrad <- with(Children_c@optinfo$derivs,solve(Hessian,gradient))
   > max(abs(relgrad))
   [1] 2.818414e-06

@bachl commented Apr 11, 2014

I also checked some models that showed convergence warning messages. The models are fitted to quite large datasets and have crossed random factors. I used the following function to collect the results from Ben's test:

f_relgrad_chk = function(x) {
  cll = deparse(x@call, width.cutoff=500)
  conv_message = ifelse(is.null(x@optinfo$conv$lme4$messages), NA, x@optinfo$conv$lme4$messages)
  rel_grad = with(x@optinfo$derivs,solve(Hessian,gradient))
  max_rel_grad = max(abs(rel_grad))
  data.frame(cll, conv_message, max_rel_grad)
}
do.call(rbind, lapply(model.list, f_relgrad_chk))

Overall, I get small (<0.001) values for max(abs(relgrad)), and the results of all models are sensible.

I wrote the output to two csv-files:
https://dl.dropboxusercontent.com/u/3262123/conv_check2009.csv
https://dl.dropboxusercontent.com/u/3262123/conv_check2013.csv

Each file contains the messages for two models, which are expanded in several steps. Let me know if you need more information on the datasets or the models.

Hope that helps.

@drammock

As @bbolker requested on the mailing list back in April, here's a report regarding convergence warnings. My data are 69120 observations of a binary response, of which only about 4% are 1s. The model has four binary fixed effects predictors (with interactions) and a random intercept for "subject" (12 levels).

glmer(pressed ~ targetPresent*adj*idn*six + (1|subj), data=wl, family="binomial")
# Model failed to converge with max|grad| = 0.0121432 (tol = 0.001)
relgrad <- with(mixmod1@optinfo$derivs, solve(Hessian, gradient))
max(abs(relgrad))
# 0.0001775245

So in my case yes, scaling the gradient by the hessian did work.

@bbolker commented May 23, 2014

thanks!

@samfranks

To follow up on this mailing list post, I had a different convergence warning:

dm1 <- glmer(presence ~ desig*habitat + elevation + spei.Mar + spei.Aug + (1|name/mgroup/id), data=dat, family=binomial, control=glmerControl(optimizer="bobyqa"))
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge: degenerate  Hessian with 1 negative eigenvalues

My data consist of 103301 obs. The variables in the interaction are categorical (one with 3 levels, one with 5 levels), the other 3 are continuous variables.

Testing max(abs(relgrad)) returns

relgrad <- with(dm1@optinfo$derivs,solve(Hessian,gradient))
max(abs(relgrad))
0.000397608

which seems fine to me, though I'm not sure whether another test is required for this particular convergence warning, since max(abs(relgrad)) may be specific to checking the max|grad| warning? Any suggestions welcome @bbolker!
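
For what it's worth, the eigenvalue check from Rune's conv.check above can be applied directly to the stored Hessian; a sketch (assuming, as elsewhere in this thread, that the derivatives live in @optinfo$derivs):

## the "degenerate Hessian" warning is about curvature, which
## max(abs(relgrad)) does not test; inspect the eigenvalues directly
ev <- eigen(dm1@optinfo$derivs$Hessian, symmetric = TRUE,
            only.values = TRUE)$values
ev                                       # any clearly negative values?
sum(ev < -sqrt(.Machine$double.eps))     # count flagged as negative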

@bbolker commented Apr 2, 2015

Commenting on current status.

There was an e-mail exchange, which I can no longer dig up, where Doug (or others) suggested (1) scaling by the likelihood at the initial evaluation (potential problem: especially horrible starting points could throw off the scaling) or (2) scaling by the number of observations.

I was going to say that it would be really nice to have a systematic investigation of this, but I've now started one: see R code and PDF.

Some conclusions:

  • gradients at convergence do indeed scale with sample size, more or less (this is with the default bobyqa, would be worth trying with nlminb for comparison)
  • I expect that this is a fairly well-behaved example and that other cases might (?) do worse, but I'm imagining that it might make sense to set the threshold to something like (nobs * 1e-7) or (nobs * 1e-6) (a sketch of such a scaled check follows below)
  • our naive Hessian calculation clearly becomes unreliable for nobs>10^4 or so. (Why?)
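
A sketch of what a sample-size-scaled check might look like (the 1e-6 factor is only the guess from the bullet above; fitted_model is a placeholder):

g   <- fitted_model@optinfo$derivs$gradient
tol <- nobs(fitted_model) * 1e-6      # scale the gradient tolerance with nobs
max(abs(g)) < tol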
