better convergence tests #120
Comments
In case you want to do more in this respect, I have pasted in the relevant code from ordinal below (for context see pkg/ordinal/R/convergence.R on R-forge). This function is implemented for CLMs without random effects; implementation for CLMMs is still on my to-do list. That would require dealing with the issues related to theta values on the boundary and checking positive definiteness of the individual random-effects variance-covariance matrices. Note that negative convergence values indicate convergence failure and positive values indicate several levels of success. Feel free to use this for inspiration or to copy what you feel is relevant. Comments and critique are of course more than welcome. Cheers,
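Roughly, the check does something along these lines (a simplified sketch with invented return codes and tolerances, not the actual function in convergence.R):

```r
## Simplified sketch of a gradient/Hessian convergence check (invented return
## codes and tolerances): negative codes signal failure, positive codes signal
## degrees of success.
convCheck <- function(gradient, hessian, grad.tol = 1e-4, hess.tol = 1e-8) {
  if (any(!is.finite(gradient)) || any(!is.finite(hessian)))
    return(-2L)                                   # non-finite derivatives
  ev <- eigen(hessian, symmetric = TRUE, only.values = TRUE)$values
  if (min(ev) < hess.tol) return(-1L)             # Hessian not (numerically) positive definite
  if (max(abs(gradient)) < grad.tol) return(2L)   # gradient essentially zero
  1L                                              # converged, but gradient larger than hoped
}
```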
Might also be worth linking to http://www.inside-r.org/packages/cran/brglm/docs/separation.detection, or considering its strategy (from http://article.gmane.org/gmane.comp.lang.r.general/300335).
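Roughly, the idea there is to refit the model with an increasing cap on the number of IWLS iterations and watch whether some coefficients or standard errors keep inflating rather than stabilizing. A sketch of that strategy (not the brglm implementation):

```r
## Sketch of the separation-detection idea (not brglm::separation.detection):
## refit a glm with an increasing iteration cap; standard errors that keep
## growing suggest coefficients drifting to +/- infinity (separation).
detect_separation_sketch <- function(formula, data, family = binomial, nsteps = 10) {
  se <- sapply(seq_len(nsteps), function(i) {
    fit <- suppressWarnings(glm(formula, family = family, data = data,
                                control = glm.control(maxit = i)))
    sqrt(diag(vcov(fit)))
  })
  se / se[, 1]   # ratios relative to the first fit, one column per iteration cap
}
```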
Various rambling thoughts:
Yesterday I pushed a new version of lmer.tex to our private repository. In there I describe the calculations for the analytic gradient of the profiled log-likelihood or REML criterion. In general this is not a big win because of the extra solves it involves. It seems that cases of poor convergence are often associated with vector-valued random effects. I'll take a look at whether it is reasonable to evaluate the analytic gradient for the case of a single, vector-valued random-effects term.
I think that evaluation of the "difficult" term in the expression for the gradient will be relatively easy in the case of a single, vector-valued, random-effects term if L and ZtZ are packed into 3-dimensional arrays. The point is that L is block-diagonal. I managed to work it out in the LMMVector1 representation in the Julia MixedModels package, and it does give the correct values:

```julia
julia> gradient(f, th)
10-element Array{Float64,1}:
-14835.2
740.121
481.964
1068.77
-12533.1
1007.36
1801.64
-12864.3
422.39
-13846.4
julia> res - syr2k('L','N',1.,Ztr,fm1.u)/scale(fm1,true)
4x4 Array{Float64,2}:
-14835.2 -2.92906e-5 -0.000865546 0.0115895
740.12 -12533.0 -0.0199377 -0.00443789
481.963 1007.36 -12864.3 -0.0188079
1068.78 1801.64 422.39 -13846.5
```

This is the gradient at the starting estimates for ML in the simulated example sent by Laura Bringmann. An individual evaluation of the objective (negative twice the log-likelihood) takes about 0.5 sec on my desktop, so the central finite-difference gradient takes close to 9 seconds.

```julia
julia> @time f(th)
elapsed time: 0.466127221 seconds (63218368 bytes allocated)
-19516.97535188828
julia> @time gradient(f, th);
elapsed time: 8.667017285 seconds (1264367056 bytes allocated)
julia> @time gradient(f, th, :forward);
elapsed time: 4.763594825 seconds (695402176 bytes allocated)
```

I think the analytic gradient evaluation will be much faster than the finite-difference evaluation. I'll code it up now.
I got it coded up in Julia for the particular case of a single, vector-valued random-effects term. At the ML estimates for the simulated example sent by Laura Bringmann I get

```julia
julia> @time grad(fm1) # analytic gradient
elapsed time: 0.111372923 seconds (20682456 bytes allocated)
10-element Array{Float64,1}:
-0.0601314
-0.0343705
-0.00175826
-0.0214015
-0.00521561
-0.0271165
-0.0339831
0.0374451
-0.0356073
0.0844693
julia> @time gradient(f, th) # central finite-differences
elapsed time: 8.697922726 seconds (1264367056 bytes allocated)
10-element Array{Float64,1}:
-0.0601344
-0.0354681
-0.000934209
-0.023643
-0.00521386
-0.0267069
-0.0379575
0.0374439
-0.0370824
0.0849209
```

One of the gradient-based optimizers from NLopt does a great job on this problem now.

```julia
julia> dd = DataFrame(read_rda("/var/tmp/dd.rda")["dd"]);
julia> fm1 = lmm(:(Y ~ X1 + X2 + X3 + (X1 + X2 + X3|indiv)), dd);
julia> fit(fm1,true,:LD_MMA)
f_1: -19516.97535, [1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0]
f_2: -39241.77318, [1.9,-0.9,-0.9,-0.9,1.9,-0.9,-0.9,1.9,-0.9,1.9]
f_3: -55518.44113, [2.8,0.0,0.0,0.0,2.8,0.0,0.0,2.8,0.0,2.8]
f_4: -58876.02262, [3.58235,-0.358756,-0.28832,-0.421558,3.56465,-0.411242,-0.509618,3.56746,-0.26706,3.57527]
f_5: -60690.71466, [4.47944,0.362396,0.429531,0.318362,4.45232,0.283756,0.163937,4.45616,0.388081,4.46659]
f_6: -61952.03137, [5.55944,-0.246584,-0.178461,-0.289818,5.53232,-0.318704,-0.442857,5.53616,-0.210779,5.54659]
f_7: -62586.2446, [6.85544,0.101953,0.194459,-0.50237,6.82832,-0.612018,-0.85953,6.83216,0.143818,6.84259]
f_8: -62897.37305, [8.41064,-0.206747,-0.114241,0.0268298,8.38352,-0.0828178,-0.33033,8.38736,-0.164882,8.39779]
f_9: -63004.91419, [10.2769,-0.422837,-0.330331,-0.34361,10.2498,-0.453258,-0.70077,10.2536,-0.380972,10.264]
f_10: -62986.31918, [12.1431,-0.60186,-0.422365,-0.709167,11.8857,-0.814802,-1.07121,12.1198,-0.529517,12.1303]
f_11: -62992.28445, [12.1431,-0.520373,-0.350615,-0.627009,11.2443,-0.727098,-1.00885,11.9484,-0.439437,12.1087]
f_12: -63002.29961, [11.8773,-0.445522,-0.332703,-0.477038,10.5258,-0.574723,-0.869848,11.3596,-0.390401,11.6991]
f_13: -63008.64808, [11.2363,-0.425757,-0.330592,-0.371123,10.288,-0.476764,-0.743155,10.6242,-0.382067,10.973]
f_14: -63009.0832, [11.1644,-0.427679,-0.331244,-0.385482,10.2641,-0.489801,-0.767321,10.339,-0.383067,10.6628]
f_15: -63009.18824, [11.1915,-0.429101,-0.33164,-0.396257,10.2742,-0.499401,-0.785592,10.4396,-0.38369,10.7371]
f_16: -63008.72809, [11.1231,-0.447147,-0.337497,-0.436296,10.239,-0.53592,-0.845403,10.063,-0.392821,10.4645]
f_17: -63009.23308, [11.1803,-0.432239,-0.332587,-0.404232,10.2685,-0.506516,-0.799106,10.3629,-0.385195,10.6856]
f_18: -63008.7894, [11.2622,-0.468436,-0.344374,-0.478047,10.3156,-0.574504,-0.902504,10.8153,-0.403757,11.0101]
f_19: -63009.34215, [11.1975,-0.440596,-0.335008,-0.424752,10.2781,-0.524873,-0.832489,10.4992,-0.389132,10.7716]
f_20: -63009.07383, [11.0968,-0.518742,-0.366001,-0.564015,10.2152,-0.655587,-1.0122,10.0245,-0.437333,10.3941]
f_21: -63009.61615, [11.1718,-0.464273,-0.342831,-0.47844,10.2628,-0.573593,-0.913704,10.2994,-0.40213,10.6322]
f_22: -63009.61687, [11.3248,-0.614987,-0.398528,-0.716269,10.3853,-0.798486,-1.19986,10.744,-0.500564,11.0104]
f_23: -63010.42543, [11.1063,-0.588566,-0.38274,-0.763461,10.1778,-0.832115,-1.31589,10.3703,-0.504381,10.6507]
f_24: -63010.51364, [11.2461,-0.568719,-0.368417,-0.767037,10.3619,-0.823659,-1.36279,10.4502,-0.492815,10.7235]
f_25: -63010.55464, [11.1993,-0.565832,-0.365924,-0.772123,10.2892,-0.825778,-1.38783,10.4048,-0.490288,10.6854]
f_26: -63010.56045, [11.1996,-0.562139,-0.363411,-0.774561,10.3008,-0.825208,-1.40431,10.4121,-0.487716,10.6894]
f_27: -63010.56294, [11.1996,-0.52455,-0.338353,-0.802165,10.2944,-0.82281,-1.53433,10.4054,-0.461637,10.6866]
f_28: -63010.56942, [11.1979,-0.543388,-0.358481,-0.788481,10.3249,-0.822208,-1.4808,10.4167,-0.47848,10.6913]
f_29: -63010.57076, [11.2001,-0.544991,-0.356392,-0.787165,10.295,-0.824967,-1.48203,10.4078,-0.478137,10.6876]
f_30: -63010.57118, [11.1994,-0.544498,-0.355878,-0.786782,10.3042,-0.823787,-1.4799,10.4105,-0.477827,10.6885]
f_31: -63010.57104, [11.1989,-0.544692,-0.354937,-0.785012,10.3088,-0.823231,-1.47352,10.4116,-0.477376,10.6892]
f_32: -63010.57118, [11.1994,-0.544518,-0.355782,-0.786603,10.3047,-0.823731,-1.47925,10.4106,-0.477782,10.6886]
f_33: -63010.57114, [11.1988,-0.544666,-0.354751,-0.786355,10.301,-0.823331,-1.47845,10.4109,-0.4773,10.6888]
f_34: -63010.57118, [11.1993,-0.544533,-0.355679,-0.786578,10.3043,-0.823691,-1.47917,10.4107,-0.477733,10.6886]
f_35: -63010.57118, [11.1993,-0.54454,-0.355633,-0.786567,10.3046,-0.823668,-1.47912,10.4107,-0.477712,10.6886]
f_36: -63010.57118, [11.1993,-0.544547,-0.355604,-0.786561,10.3045,-0.823658,-1.4791,10.4107,-0.477699,10.6886]
FTOL_REACHED
Linear mixed model fit by maximum likelihood
logLik: 31505.285592, deviance: -63010.571184
Variance components:
Variance Std.Dev.
indiv 1.254095 1.119864
Residual 1.081380 1.039894
Number of obs: 40000; levels of grouping factors: 200
Fixed-effects parameters:
Estimate Std.Error z value
[1,] 5.14017 0.0788611 65.18
[2,] 5.04852 0.0726618 69.4797
[3,] 4.99547 0.0735801 67.8915
[4,] 5.12455 0.0762583 67.1999
```

(Don't pay attention to the description of the variance components; they're not being displayed properly right now.) A comparable fit with a derivative-free optimizer (BOBYQA as implemented in NLopt) required about 850 function evaluations, which of course made it much slower.
On Hessian computations for GLMMs: I decided to split some of the tools functions in ordinal into a separate package, Rufus (Rune's utility functions), currently on R-Forge under the ordinal project, but which I am contemplating releasing to CRAN. Rufus has the simple central-difference gradient and Hessian functions, so if you don't want those functions to live privately in lme4, you could grab them from there. (They are in my view fairly small, self-contained functions that could easily live privately in lme4.)
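For reference, those functions do something along these lines (a simplified sketch, not the actual Rufus code):

```r
## Simple central-difference gradient and Hessian (sketch, not the Rufus code).
fd_grad <- function(fun, x, delta = 1e-4) {
  sapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, delta)
    (fun(x + e) - fun(x - e)) / (2 * delta)
  })
}

fd_hess <- function(fun, x, delta = 1e-4) {
  ## Hessian as the (symmetrized) Jacobian of the central-difference gradient
  n <- length(x)
  H <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    e <- replace(numeric(n), i, delta)
    H[i, ] <- (fd_grad(fun, x + e, delta) - fd_grad(fun, x - e, delta)) / (2 * delta)
  }
  (H + t(H)) / 2
}
```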
Thanks, Rune, that makes sense (it's sort of too bad that ...)
Yes, I made that suggestion to Paul Gilbert, but he wasn't interested. If you grab ...
A few further thoughts on setting tolerances for gradient checking, just so they're documented somewhere.
Still thinking about Rune's ...
This is getting there. I don't want to close the issue, at least not until I've implemented some tests (which don't exist yet). Also, I think I may have the gradient tolerance set too low (right now it warns if max(|grad|) > 0.001). A lot of Nelder-Mead examples fail, for example the standard ...
All these "absolute" scales (1e-4, 0.001): are they really applied to "scale-free" (or "relative") criteria?
Yes, in this particular case. For LMMs all of the parameters are 'theta' values, which are scale-free/relative. That's not the case for GLMMs, where the betas aren't necessarily scaled. This is part of the difficulty of coming up with good general rules ...
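(For experimenting, the gradient-check tolerance is meant to be adjustable through lmerControl; something like the sketch below, assuming the current development interface, where argument names may still change:)

```r
library(lme4)
## Assumes the convergence checks are configurable via lmerControl
## (development interface; argument names may differ across versions).
ctrl <- lmerControl(check.conv.grad = .makeCC("warning", tol = 2e-3, relTol = NULL))
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, control = ctrl)
```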
Hmm. Can't replicate the issue I commented about above (gradient of -0.0012 with Nelder_Mead). Doug's analytical gradient stuff above is very exciting. Perhaps I'll see if Gabor Grothendieck can work on this next month ... In principle I think we ought to be careful not to check gradients for singular cases (if the optimal fit is on the boundary, there's no reason the gradient shouldn't be large in directions normal to the active boundaries). However, I haven't been able to find a test case yet -- in the singular examples I've looked at, the gradient also goes to zero at the boundary (unless I've made a mistake computing the gradients in the boundary case ...)
Hi, am I understanding this discussion correctly as meaning that your new convergence checks are sensitive to sample size? Sorry to interject; I can make a new issue if that's better. I am getting some new convergence warnings on models that used to run fine, with everything that I'm used to checking looking fine (scaled my predictors, no unity correlations, no zero variances). I could replicate this using sample data (sleepstudy and Penicillin) by increasing the sample size (see the sketch below). Same pattern here: the warning is triggered but the model seems fine (to my probably too naïve checks).
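Something along these lines is what I mean by increasing the sample size (a sketch; the replication factor is arbitrary):

```r
library(lme4)
## Stack copies of sleepstudy with distinct subject labels and refit the usual
## model; the fit looks essentially the same, but a max|grad| warning can appear.
big <- do.call(rbind, lapply(1:50, function(i)
  transform(sleepstudy, Subject = factor(paste(Subject, i, sep = ".")))))
fm_big <- lmer(Reaction ~ Days + (Days | Subject), data = big)
```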
We have not done any systematic testing, but anecdotally it does seem that there's a tendency for more convergence warnings with bigger problems, although now that I think of it I have noticed more issues with large parameter sets than with large data (although the two are correlated in practice). Perhaps the discrepancy between the convergence criteria for our derivative-free optimizers and the size of the finite-difference gradient increases with problem size? (Naively I would have expected it to go the other way ...) Your example is both useful (because it gives us a way to isolate one effect for a single model structure) and disturbing (because I thought that most of the convergence warnings were occurring with ...)
If this helps: my use case, where I started seeing convergence warnings after coming back to older analyses a few months later, was definitely both large data and large parameter sets (at its largest, 300k individuals in 40k families). It was the above warning in all my models, using both lmer and glmer with binomial and Poisson outcomes. I also used the bobyqa optimiser for some models, and after googling my warnings I also tried one that requires optimx (I can check which when I'm back at work).
Still working on/thinking about this. I still don't have a strong intuition for why increasing sample sizes should be causing problems, but I have a guess: perhaps the gradients should be scaled relative to the curvature when testing convergence. That is, instead of the gradient, we should be computing the expected linear change over one standard error of the parameter? This is admittedly a wild guess, but it has a few nice features:
The right way to handle the problem would be to look closely at the convergence criteria for the various optimizers and figure out how they can stop when the derivatives are still non-trivial (by some definition).
This is beyond my level of understanding, I'm afraid. Can I just ask: if convergence warnings seem to be a bit trigger-happy at the moment, would you recommend other checks besides the ones I mentioned?
As I understand you, you are talking about what I would call a relative convergence criterion. An approximate error estimate of a current estimate/iterate is given by the size of the Newton step (I know this as the method-independent error theorem) close to the optimum. And the Newton step is exactly the gradient scaled by the variance-covariance matrix of the parameters evaluated at the current iterate. You compute the Newton step/relative criterion with something like the snippet shown after this comment.

I generally think that it is nice to know how close one is to the optimum both in an absolute and a relative sense. In my mind they complement each other, but for any given application one may of course be more important than the other. IIRC the relative criterion is somewhat similar to Bates's orthogonality criterion, but I don't have my notes with me to check right now. I don't remember all the details, but IIRC there might be a problem with the relative criterion for parameters with large standard errors. In particular, if they are close to zero, the error in those relative to their size (not uncertainty) can be substantial. You might argue that one cannot expect such estimates to be accurate given the substantial statistical uncertainty, but my feeling is that (some) people will expect parameters to be accurately determined irrespective of the statistical uncertainty.
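Something like this (where fitted_model is a placeholder for a fitted lmer/glmer object from a version that stores the finite-difference derivative estimates):

```r
## Newton step = Hessian^{-1} %*% gradient, using the derivative estimates
## stored in the fitted object ("fitted_model" is a placeholder).
step <- with(fitted_model@optinfo$derivs, solve(Hessian, gradient))
max(abs(step))   # relative/scaled convergence criterion
```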
I played around with the Penicillin case and noticed some unexpected differences in the derivative calculations compared to the numDeriv results, especially for the Hessian:
So taking the Newton step does indeed lower the deviance and decrease the gradient. Not so using the deriv12 calculations:
I also noticed that in this case the Hessian calculation seemed very sensitive to the size of the finite-difference step:
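A hypothetical way to probe that (reusing the fd_hess sketch from earlier in this thread; the actual numbers would need to be checked):

```r
## Hypothetical probe of step-size sensitivity for the Penicillin fit:
## compare numDeriv's Hessian with central-difference Hessians at a few
## step sizes (fd_hess is the sketch posted above).
library(lme4)
library(numDeriv)
fit    <- lmer(diameter ~ 1 + (1 | plate) + (1 | sample), Penicillin)
devfun <- update(fit, devFunOnly = TRUE)
theta  <- getME(fit, "theta")
H_nd   <- numDeriv::hessian(devfun, theta)
H_fd   <- lapply(c(1e-3, 1e-4, 1e-5), function(d) fd_hess(devfun, theta, delta = d))
sapply(H_fd, function(H) max(abs(H - H_nd)))  # how much the Hessian moves with the step size
```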
That makes me wonder what the real Hessian is, and how the likelihood/deviance function looks when the curvature is this sensitive to the scale at which you look at it. A couple of thoughts:
Quick comments:
I would still like to think about this more, but I'm leaning very strongly towards switching to a relative convergence criterion (as suggested by Rune), ...
I noticed today that the usual model fit to the ...
Hmmm. On my system I have
When I do it with REML instead I get
which would be enough to trigger a warning if we still had the tolerance set at 0.001. On the other hand, the relative error is much smaller:
(I have modified my local copy to check the relative gradients rather than the gradients, but am still investigating.) I certainly understand that the deviance should be roughly proportional to the number of observations (I hadn't thought about the large number of random effects), but that doesn't immediately tell me that the convergence achieved by nonlinear optimizers should be worse in absolute terms ... Should I check in the change to relative tolerance?
Let me know if you want me to install a feature branch and test this on my real data.
Thanks. An even simpler test would be to take a fitted example that gave you convergence warnings, compute the gradient scaled by the Hessian (as in the snippet below), and see if it's reasonably small (e.g. < 0.001?):
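Something along these lines (fitted_model being the fitted lmer/glmer object that produced the warning):

```r
## Scaled gradient: solve the finite-difference Hessian against the gradient
## stored in the fitted object ("fitted_model" is a placeholder).
relgrad <- with(fitted_model@optinfo$derivs, solve(Hessian, gradient))
max(abs(relgrad))
```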
It is looking good. I can run some more tests when I return from a conference on Friday. I am also getting a "very large eigenvalue" warning in some models; I'm attaching the full output. The variables are all scaled properly, but it is zero-inflated data, so that warning may be warranted.
I also checked some models that showed convergence warning messages. The models are fitted to quite large datasets and have crossed random factors. I used a small helper function (sketched below) to collect the results from Ben's test:
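Roughly (this is not the exact function, and the slot holding the fitting messages may be named differently in other lme4 versions):

```r
## Sketch of the helper: compute the scaled gradient and collect convergence
## messages for a named list of fitted models, then write the results to CSV.
collect_relgrad <- function(models, file = "relgrad_results.csv") {
  rows <- lapply(names(models), function(nm) {
    m  <- models[[nm]]
    rg <- with(m@optinfo$derivs, solve(Hessian, gradient))
    data.frame(model           = nm,
               max_abs_relgrad = max(abs(rg)),
               messages        = paste(unlist(m@optinfo$conv$lme4$messages),
                                       collapse = "; "),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  write.csv(out, file, row.names = FALSE)
  out
}
```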
Overall, I get small (< 0.001) values for max(abs(relgrad)), and the results of all models are sensible. I wrote the output to two CSV files; each file contains the messages for two models, which are expanded in several steps. Let me know if you need more information on the datasets or the models. Hope that helps.
As @bbolker requested on the mailing list back in April, here's a report regarding convergence warnings. My data are 69120 observations of a binary response, of which only about 4% are 1s. The model has four binary fixed-effect predictors (with interactions) and a random intercept for "subject" (12 levels).
So in my case, yes, scaling the gradient by the Hessian did work.
Thanks!
To follow up on this mailing-list post, I had a different convergence warning:
My data consist of 103301 observations. The variables in the interaction are categorical (one with 3 levels, one with 5 levels); the other 3 are continuous variables. Testing max(abs(relgrad)) returns
which seems fine to me. Though I'm not sure whether another test is required for this particular convergence warning, or whether max(abs(relgrad)) is specific to checking the max|grad| warning? For instance, would looking at the eigenvalues of the stored Hessian (as below) be the right check here? Any suggestions welcome @bbolker!
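Here is what I have in mind (just a guess at the relevant diagnostic; fitted_model is a placeholder for the fitted object):

```r
## Guess at a diagnostic for the eigenvalue-type warning: inspect the spectrum
## of the Hessian estimate stored in the fitted object.
H  <- fitted_model@optinfo$derivs$Hessian
ev <- eigen(H, symmetric = TRUE, only.values = TRUE)$values
range(ev)                     # very large and/or non-positive eigenvalues
max(abs(ev)) / min(abs(ev))   # rough condition number
```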
Commenting on current status. There was an e-mail exchange, which I can no longer dig up, where Doug (or others) suggested (1) scaling by the likelihood at the initial evaluation (potential problem: especially horrible starting points could throw off the scaling) or (2) scaling by the number of observations. I was going to say that it would be really nice to have a systematic investigation of this, but I've now started one: see R code and PDF. Some conclusions:
In Banff, Rune commented that he had a suite of convergence tests for GLMMs in the ordinal package. We should think about adapting/stealing these; also worth looking at the KKT criterion calculators from John Nash's packages (not sure which package they're in -- I don't think they're in optimx any more). On a related topic, this test for misspecification looks interesting: http://dx.doi.org/10.1016/j.csda.2008.02.033