On fits to correlated and auto-correlated data

Observables in particle physics, and specifically in lattice QCD calculations, are often extracted from fits. Standard $\chi^2$ tests require a reliable determination of the covariance matrix and its inverse from correlated and auto-correlated data, a challenging task that often leads to close-to-singular estimates. Such difficulties motivate modifications of the definition of $\chi^2$, such as uncorrelated fits. We show how the goodness-of-fit, measured by the p-value, can still be estimated robustly for a broad class of such fits.


Introduction
In particle physics, physical information is often extracted from fits to correlated data. When their covariance matrix is well known, the minimization of the so-called correlated $\chi^2$ is, from the statistical point of view, the ideal method to obtain the best estimate of the fit parameters. In many cases model functions are guided by theoretical principles, but in practice one has to examine their validity case by case, and often specific model functions or data points are discarded based on a measure of their goodness-of-fit. For $N_x$ normally distributed data points, and for correlated fits to model functions depending on $N_A$ parameters, the expectation value of $\chi^2$ (at the minimum) equals the number of degrees of freedom, $N_x - N_A$. When there are many more fitted observables than fit parameters, a reduced $\chi^2$ close to 1 (defined as the ratio of the observed $\chi^2$ over $N_x - N_A$) provides a first good indication that a given model describes the data well. In addition, the goodness-of-fit is judged by computing the probability $Q$, often called p-value, of observing a $\chi^2$ larger than or equal to the observed one, which we call $\chi^2_{\rm obs}$. The p-value is easily obtained in closed analytic form due to the exact cancellation between the covariance matrix $C$ of the underlying data and the $C^{-1}$ used in the definition of the correlated $\chi^2$; it depends only on $\chi^2_{\rm obs}$ and on $N_x - N_A$. However, data sets with limited statistics often lead to close-to-singular estimates of $C^{-1}$. Moreover, for data generated by Monte-Carlo (MC) methods, specifically Markov chains, additional complications arise from the presence of auto-correlations, namely correlations along the Monte-Carlo history (or "time"). Hence, in all cases where a reliable estimation of the inverse of the covariance matrix is not possible and correlated fits cannot be performed, the standard $\chi^2$-test is not applicable. In this work we point out a robust solution to this problem.
Our object of study is primarily lattice QCD, where Monte-Carlo methods based on Markov chains are an essential ingredient to obtain non-perturbative predictions. Field configurations are generated by successive repetitions of a predefined update scheme, and as a consequence, expectation values of correlation functions, obtained from averages over measurements on these field configurations, are both correlated and auto-correlated. The Γ-method is the preferred choice to estimate the statistical errors of primary and derived observables [1] obtained from a Markov process, but it is difficult to turn it into a practical method for general matrix elements of the covariance matrix. Because of these issues, and the general difficulty of estimating covariances for large numbers of data points [2], ad hoc modifications of $C^{-1}$ are often used, such as so-called SVD cuts (see e.g. [3,4]) or uncorrelated fits. We will show how the expected value of $\chi^2$ and the goodness-of-fit can be estimated robustly in all these cases. In fact, we emphasize that this can also be done when the weights of the data points are independent of the covariance matrix, which may be a good strategy when fits are used to perform extrapolations away from the region where data exist.
Our text is organized as follows: in Section 2 we present the derivation of the formulae for the expectation value of $\chi^2$ and the corresponding p-value; in Section 3 we describe how to handle auto-correlations; in Section 4 we present a numerical test in a toy model, before concluding.

Method
We want to test whether data $Y_j$ at kinematic coordinates $x_j$ are described by model functions $\Phi(x,a)$, depending on the parameters $a_\alpha$, with $A_\alpha$ the exact values of the model parameters: the so-called null hypothesis. In our convention Greek indices run from 1 to $N_A$, Roman ones from 1 to $N_x$, repeated indices are summed over, and we use the natural scalar product $(r,s) = r_i s_i$ and norm $||r|| = (r,r)^{1/2}$. We also often use matrix-vector notation to suppress Roman indices. Given data $\bar y_j$ with normal distribution, with covariance matrix $C$, we want to find robust estimates for the fit parameters and a measure for the likelihood that our model is compatible with the data. Expectation values are defined in the usual manner. In particular we have
$\langle \delta\bar y_i\, \delta\bar y_j \rangle = C_{ij}\,, \qquad \delta\bar y_i = \bar y_i - \langle \bar y_i \rangle\,. \qquad (2.1)$
One possibility when such a normal distribution is approximately realized is when the $\bar y_j$ are averages over Monte Carlo data of length $N$ and $N$ is sufficiently large. In this case, all elements of $C$ are of order $1/N$, and $\langle \bar y_i \rangle \to Y_i$ in the limit $N \to \infty$. The natural measure for the distance of the model from the data is
$\chi^2(a) = ||W\,[\bar y - \phi(a)]||^2\,, \qquad \phi_j(a) = \Phi(x_j, a)\,,$
in terms of a symmetric and positive $N_x \times N_x$ weight matrix $W_{ij}$. A reasonable requirement is that $\chi^2$ remains finite as $N \to \infty$, which means that $W$ should be of $\mathrm{O}(N^{1/2})$: common choices for $W^2$ are the inverse covariance matrix $C^{-1}$ or the inverse of its diagonal projection, $W_{ij} = \delta_{ij}/\sqrt{C_{ii}}$ (no summation over $i$). The latter definition of $\chi^2$ leads to so-called uncorrelated fits. However, one can also consider weight matrices $W$ which scale more weakly with $N$, e.g. $\mathrm{O}(N^0)$. We will remark on that later.
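As a concrete illustration of these definitions, the distance measure and the two standard weight choices can be sketched as follows. This is a minimal NumPy sketch, not the reference implementation of Appendix B; the function names are ours.

```python
import numpy as np

def chi2(a, x, ybar, W, model):
    """chi^2(a) = || W [ybar - phi(a)] ||^2 with phi_j(a) = model(x_j, a)."""
    r = ybar - model(x, a)
    Wr = W @ r
    return float(Wr @ Wr)

def W_correlated(C):
    """W = C^{-1/2}, i.e. W^2 = C^{-1}: the correlated-fit choice."""
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(evals**-0.5) @ evecs.T

def W_uncorrelated(C):
    """W_ij = delta_ij / sqrt(C_ii): the uncorrelated-fit choice."""
    return np.diag(np.diag(C)**-0.5)
```

With `W_correlated`, $\chi^2$ reduces to the familiar $r^T C^{-1} r$; with `W_uncorrelated`, only the diagonal of $C$ enters and no matrix inversion is required.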
As usual, from the minimization of the $\chi^2$ function we obtain the best estimate of the parameters,
$\chi^2(\bar a) = \mathrm{Min}_a\, \chi^2(a)\,, \qquad \bar a = \bar a(\bar y)\,. \qquad (2.5)$
Here we assume that the position $\bar a(\bar y)$ of the minimum of $\chi^2$ is unique. Our goal is to derive an expression for the expectation value $\langle \chi^2(\bar a) \rangle$ predicted by the normal distribution of the data and under the assumption that eq. (2.1) holds. The condition for $\bar a$,
$\big(W\phi_\alpha(\bar a),\, W[\bar y - \phi(\bar a)]\big) = 0\,, \qquad \phi_\alpha \equiv \partial\phi/\partial a_\alpha\,,$
can be rewritten as
$P\, W[\bar y - \phi(\bar a)] = 0\,,$
with $P$ the projector onto the space spanned by $\{W\phi_\alpha\}$, namely
$P = W\phi_\alpha\, H^{-1}_{\alpha\beta}\, (W\phi_\beta)^T\,, \qquad H_{\alpha\beta} = (W\phi_\alpha, W\phi_\beta)\,,$
and we have
$P^2 = P\,, \qquad P^T = P\,, \qquad P\, W\phi_\alpha = W\phi_\alpha\,. \qquad (2.10)$

Expectation value of χ 2
We expand $\phi$ around $a = \bar a$ and obtain, using $(1-P)\,W\phi_\alpha(\bar a) = 0$,
$\chi^2(\bar a) = ||(1-P)\, W \delta\bar y||^2 + \ldots\,. \qquad (2.12)$
Taking $\delta\bar a = \mathrm{O}(N^{-1/2})$, like $\delta\bar y$, the expected value of $\chi^2$ is then
$\langle \chi^2(\bar a) \rangle = \mathrm{tr}\big[(1-P)\, W C W\big]\,. \qquad (2.13)$
We stress that in the result above even an approximate knowledge (up to statistical errors) of the covariance matrix is sufficient. The corrections to eq. (2.13) are seen to be $\mathrm{O}(N^{-1})$ in the following way. Expanding $\delta\bar a(\delta\bar y)$ in terms of $\delta\bar y$, the leading corrections are cubic in $\delta\bar y$ and thus, due to the two factors of $W$, of order $N^{-1/2}$. In the assumed symmetric (normal) distribution $\langle \delta\bar y_i\, \delta\bar y_j\, \delta\bar y_k \rangle$ vanishes, and skewed corrections to it are down by a factor $N^{-1/2}$, leading overall to $\mathrm{O}(N^{-1})$. However, in practice the estimate of $C$ from the data themselves introduces order-$N^{-1/2}$ uncertainties, as does the projector $P$, which is a function of $\bar a_\alpha$. The essential point is that $\langle \chi^2(\bar a) \rangle$ can easily be computed also with auto-correlated data, as we will show in the next section. For the special choice of an "uncorrelated fit", i.e. $W_{ij} = \delta_{ij}/\sqrt{C_{ii}}$,
$\langle \chi^2(\bar a) \rangle = N_x - \mathrm{tr}\big[P\, W C W\big]\,. \qquad (2.14)$
Similarly, for a "correlated fit" with $W^2 = C^{-1}$,
$\langle \chi^2(\bar a) \rangle = N_x - N_A\,.$
Above we used $W = \mathrm{O}(N^{1/2})$. However, one could also consider a more general functional dependence of $W$ on $C$, for instance by taking $W$ independent of $C$. This changes nothing in the above, since $\delta\bar a$ remains $\mathrm{O}(N^{-1/2})$. In Appendix C we discuss several other generalizations of the fit functions assumed above.
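The expected value $\mathrm{tr}[(1-P)WCW]$ is straightforward to evaluate numerically once the derivatives $\phi_\alpha$ at the minimum are available. A minimal sketch (function names are ours, not from the reference implementation of Appendix B):

```python
import numpy as np

def projector(W, J):
    """P = W J (J^T W^2 J)^{-1} J^T W, the projector onto span{W phi_alpha};
    J[:, alpha] = d phi / d a_alpha evaluated at the minimum."""
    WJ = W @ J
    H = WJ.T @ WJ
    return WJ @ np.linalg.solve(H, WJ.T)

def expected_chi2(W, C, J):
    """<chi^2(abar)> = tr[(1 - P) W C W], eq. (2.13)."""
    P = projector(W, J)
    WCW = W @ C @ W
    return float(np.trace(WCW - P @ WCW))
```

As a sanity check, for the correlated choice $W = C^{-1/2}$ one has $WCW = 1$ and the function returns $N_x - N_A$, independently of $C$ and of the model derivatives.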
At this point, it is natural to define the reduced $\chi^2$ for fits with arbitrary $W$ as $\chi^2 / \langle \chi^2(\bar a) \rangle$, with eq. (2.13) for $\langle \chi^2(\bar a) \rangle$. Although a reduced $\chi^2$ of order 1 is already a first indication of whether a given model fits the data well, the real discriminator for the goodness of a fit is the p-value.

Quality-of-fit
The probability for having a value of $\chi^2(\bar a)$ larger than or equal to a given value $\chi^2_{\rm obs}$,
$Q = \mathrm{Prob}\big(\chi^2(\bar a) \geq \chi^2_{\rm obs}\big)\,,$
is useful to decide whether the model (the null hypothesis) can be statistically rejected. In practice $\chi^2_{\rm obs}$ stands for the value of $\chi^2(\bar a)$ obtained in the particular fit to the existing data, and $Q$ gives the probability of finding a fit worse than the one at hand, i.e. with a larger value of $\chi^2(\bar a)$. If $Q$ is too low, one has significant reasons to doubt that the model describes the data, and usually dismisses the fit and modifies the model or the fit ranges. The probability $Q$ is often called quality-of-fit or p-value.
Using the variables $z = C^{-1/2}\, \delta\bar y$ it is given by
$Q(\chi^2_{\rm obs}, \nu) = \mathrm{Prob}\big((z, \nu z) \geq \chi^2_{\rm obs}\big)\,,$
in terms of the positive, symmetric matrix
$\nu = C^{1/2}\, W (1-P)\, W\, C^{1/2}\,.$
The latter takes over the role of the number of degrees of freedom in correlated fits, where $WCW = 1$ and $\mathrm{tr}[\nu] = N_x - N_A$. Denoting the strictly positive eigenvalues of $\nu$ by $\lambda_i$, we have
$Q(\chi^2_{\rm obs}, \nu) = \mathrm{Prob}\Big(\textstyle\sum_i \lambda_i z_i^2 \geq \chi^2_{\rm obs}\Big)\,, \qquad (2.18)$
with $z_i$ independent standard Gaussian variables. In this form, $Q$ is easily evaluated by a MC over $z_i$ drawn from a Gaussian distribution. A sample of order a thousand random numbers is sufficient for a precision of $Q$ of order 0.01, and our proposal is to use this as the estimate for the quality-of-fit. The difficult part is not the statistical error in the MC estimate of eq. (2.18) but the fact that $\nu$ has to be estimated from the data $\bar y$, and thus the eigenvalues $\lambda_i(\nu)$ are uncertain themselves. This leads to the expectation that the statistical error $\Delta Q$ of $Q$ is of order $N^{-1/2}$. In contrast to the case of a correlated $\chi^2$ fit, no enhancement by small eigenvalues (of the estimated $\nu$) will occur in $\Delta Q$. Instead the contribution of modes with small eigenvalues is suppressed. In Sect. 4 we will demonstrate in a toy model that $\Delta Q$ is small and can be neglected in practice.
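The Monte-Carlo evaluation of $Q$ described above can be sketched as follows (a minimal illustration, not the reference implementation of Appendix B; the function name is ours):

```python
import numpy as np

def p_value_from_nu(chi2_obs, nu, n_samples=2000, seed=12345):
    """Monte-Carlo estimate of Q = Prob( sum_i lambda_i z_i^2 >= chi2_obs ),
    with lambda_i the strictly positive eigenvalues of the symmetric matrix
    nu, and z_i independent standard Gaussian variables (eq. (2.18))."""
    lam = np.linalg.eigvalsh(nu)
    # keep only modes with lambda(nu) > 0, discarding numerical zeros
    lam = lam[lam > 1e-12 * np.max(np.abs(lam))]
    z = np.random.default_rng(seed).standard_normal((n_samples, lam.size))
    return float(np.mean((z**2) @ lam >= chi2_obs))
```

With the default of 2000 samples the binomial error of the estimate is at the level of 0.01, consistent with the precision quoted in the text.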
In general one should decide whether a fit is to be discarded based on the quality-of-fit. For example, if $Q(\chi^2_{\rm obs}, \nu) = 0.05$, there is only a 5% chance to find such a $\chi^2_{\rm obs}$ or worse even though the fit function describes the data on average. One will then dismiss fits with $Q(\chi^2_{\rm obs}, \nu) \leq 0.05$, or with a different threshold. When one performs a correlated fit and has a large number of data, more precisely when $N_x - N_A \gg 1$, $Q$ makes a sharp transition from $Q \approx 1$ to $Q \approx 0$ as $\chi^2_{\rm obs}$ crosses $\langle \chi^2(\bar a) \rangle$. But this is not the case in general. To understand this behavior, we examine the region where $Q$ makes the transition from $Q \approx 1$ to $Q \approx 0$. Its width can be estimated by the variance of $\chi^2(\bar a)$. Using eqs. (2.12), (2.13) we obtain
$\langle [\chi^2(\bar a)]^2 \rangle - \langle \chi^2(\bar a) \rangle^2 = 2\, \mathrm{tr}[\nu^2]\,,$
valid for arbitrary weights $W$. For a correlated fit with $WCW = 1$, the relative width is suppressed,
$\sqrt{2\,\mathrm{tr}[\nu^2]}\,\big/\,\mathrm{tr}[\nu] = \sqrt{2/(N_x - N_A)}\,.$
For a large number of data points, a step-function behavior is approached. However, in general this is not the case. Assume for example that the eigenvalues of $\nu$ are geometrically distributed, namely $\lambda_j = \lambda_0\, r^j$ with $r < 1$. Then we have
$\frac{2\,\mathrm{tr}[\nu^2]}{(\mathrm{tr}[\nu])^2} \;\to\; \frac{2(1-r)}{1+r}$
for a large number of data points. In general the width is therefore not suppressed at all, and it is necessary to examine the p-value.
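The contrast between the correlated case and a geometric spectrum of $\nu$ is easy to verify numerically (a small check; the function name is ours):

```python
import numpy as np

def relative_width_sq(lam):
    """(Var chi^2) / <chi^2>^2 = 2 tr[nu^2] / (tr[nu])^2
    for eigenvalues lam of nu."""
    lam = np.asarray(lam, dtype=float)
    return 2.0 * np.sum(lam**2) / np.sum(lam)**2

# Correlated fit (all eigenvalues equal): ratio = 2/(N_x - N_A), vanishing
# for many degrees of freedom.  Geometric spectrum lam_j = lam0 * r**j:
# ratio -> 2(1-r)/(1+r), independent of the number of data points, so the
# transition of Q never sharpens.
```

For instance, at $r = 0.5$ the ratio saturates at $2/3$ no matter how many data points are added, while for a correlated fit with 100 degrees of freedom it is already down to 0.02.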
For an illustration the reader may look ahead at Q-functions in our toy model in Fig. 3.

Auto-correlated Monte Carlo data
In the above formulae the covariance matrix is assumed to be known, but in practice we have to replace it by an estimate from the Monte Carlo data, often in the presence of non-negligible auto-correlations, which we discuss next.
We consider explicitly the case where MC data from Markov chains with different update schemes or different equilibrium distributions enter the fit. The specialization to multiple MC chains with the same update scheme, often called replica, is straightforward. On a single MC chain labelled by $E$ we assume we have estimators $p^E_i(t)$ at MC time $t$ of several quantities labelled by $i$, such that the chain average $\bar p^E_i$ converges to $P_i$ for $N_E \to \infty$. In general, the $Y$-variables in our fits are functions of primary observables $P$, estimated by $\bar y_j = \eta_j(\bar p)$. Note that if different chains yield estimators for the same observable $P_i$, we just replace it by the (weighted) average over the different chains. This average is then considered part of the function $\eta_j$.

Γ-method
The preferred method for dealing with auto-correlations in the error estimate of $\bar p^E_i$ is the so-called Γ-method [1,5]. Different error estimators are discussed in Appendix D. We refer the interested reader to Ref. [6] for a proposal to estimate $\langle \chi^2(\bar a) \rangle$ and the p-value based on bootstrap resampling, from data which are blocked in MC time with a block length large compared to the exponential auto-correlation time (see also Refs. [7,8] for practical applications).
The Γ-method inherits its name from the matrix-valued auto-correlation function
$\Gamma^E_{ij}(t) = \big\langle\, \delta p^E_i(t_0 + t)\; \delta p^E_j(t_0)\, \big\rangle_E\,.$
Here the average $\langle \cdot \rangle_E$ may be defined as the $t_0$-average of an infinitely long chain. Linearizing the error propagation and applying the chain rule as in [1] leads to the same simple form above ($\Delta p \to \Delta y$) also for the auto-correlation function of derived quantities, with the linearized fluctuations given by
$\delta \bar y^E_i = \frac{\partial \eta_i}{\partial P_j}\, \delta \bar p^E_j\,.$
Since the covariance matrix of the variables $\bar y^E_i$ reads
$C^E_{ij} = \frac{1}{N_E} \sum_{t=-\infty}^{\infty} \Gamma^E_{ij}(t)\,, \qquad (3.5)$
the expected $\chi^2$, eq. (2.13), becomes
$\langle \chi^2(\bar a) \rangle = \sum_E \frac{1}{N_E} \sum_{t} \mathrm{tr}\big[\Gamma^E(t)\, W(1-P)W\big]$
in the presence of auto-correlations. With finite-length MC data the above formulae need to be supplemented by summation windows $w_E$ in $t$, over which the MC estimate of $\Gamma^E_{ij}(t)$, denoted by $\Gamma^{E,\mathrm{MC}}_{ij}(t)$, is summed. This leads to the estimator
$E_f(w) = \sum_E \frac{1}{N_E} \sum_{|t| \leq w_E} \mathrm{tr}\big[\Gamma^{E,\mathrm{MC}}(t)\, W(1-P)W\big]\,, \qquad (3.8)$
and to an analogous estimator $E_s(w)$ based on the window-summation of $\mathrm{tr}\big[\Gamma^{E,\mathrm{MC}}(t)\, W P W\big]$. (3.10) In both cases the standard automatic windowing procedure of Ref. [1] or the one of Ref. [9] may be applied, simply replacing $\Gamma_F(t)$ in Ref. [1] by $\mathrm{tr}[\Gamma(t)\, W P W]$ or $\mathrm{tr}[\Gamma(t)\, W(1-P)W]$, while nothing changes in the discussions of Refs. [1,9]. Note that the knowledge of the full covariance matrix is not required. Its presence is captured by the traces in eq. (3.8) and eq. (3.10). We may use the same automatic determination of the windows ensemble by ensemble, possibly with the addition of a tail [9] due to large auto-correlations, and will have balanced statistical and systematic errors (from the truncation of the $t$-sum) of size $\Delta E_{f,s}$, eq. (3.11). Assuming that the required $w_E$ are not much different, for strong correlations (not auto-correlations) eq. (3.8) will be more precise, while in the opposite case eq. (3.10) will be better. Obviously one may just evaluate both and take the one with the smaller error. The covariance matrix may be estimated from $\Gamma^{E,\mathrm{MC}}_{ij}$ by truncating the $t$-sum in eq. (3.5) to the summation windows $w_E$ obtained from the automatic procedure for $E_{f,s}$. The p-value is then obtained from eq. (2.18). In Sect. 4 we demonstrate in a toy model that this is a valid approach, namely the p-value is stable against increasing the window size. Note that only modes with $\lambda(\nu) > 0$ enter eq. (2.18).
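A window-summed estimate of this kind can be sketched for a single chain as follows. This is a simplified illustration with a fixed window $w$ rather than the automatic procedure of Ref. [1], and the function names are ours:

```python
import numpy as np

def gamma_mc(dy, t):
    """MC estimate of Gamma_ij(t) from fluctuations dy of shape (N, N_x)."""
    N = dy.shape[0]
    return dy[: N - t].T @ dy[t:] / (N - t)

def windowed_chi2_exp(dy, M, w):
    """Window-summed estimate of (1/N) * sum_{|t|<=w} tr[Gamma(t) M],
    using Gamma(-t) = Gamma(t)^T.  With M = W (1 - P) W this is the
    single-chain analogue of the estimator summing tr[Gamma(t) W(1-P)W]."""
    total = np.trace(gamma_mc(dy, 0) @ M)
    for t in range(1, w + 1):
        G = gamma_mc(dy, t)
        total += np.trace((G + G.T) @ M)
    return float(total / dy.shape[0])
```

For uncorrelated measurements the estimate already saturates at $w = 0$; with auto-correlations the window has to be opened until a plateau is reached, exactly as for the error of a single observable.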
Two remarks on the error estimates, eq. (3.11), are in order. These are of the type error-of-the-error. As is common [1,5,9], we use the Madras-Sokal approximation, which consists of neglecting the connected parts of a dynamical four-point function. We are not aware of a general argument on the quality of this approximation, but F. Virotta found that it works quite well in a simple interacting field theory [10]. Secondly, eq. (3.11) neglects the statistical error of $P$, which is a function of $\bar a$ and of the same order in $N^{-1}$ as $\Delta E_{f,s}$. In Appendix A we provide the analytic formula for its evaluation, and in all our experiments we find it to be subdominant compared to the terms in eq. (3.11).
Note that when the data can be partitioned into sufficiently many large blocks, i.e. when auto-correlations are not an issue, one may consider the bootstrap approach to get an estimate of the error of the p-value.

Toy model
As an illustration, we consider the free one-component scalar theory with standard action, on a $T \times L^d$ torus (periodic boundary conditions), discretized on a hyper-cubic lattice with spacing $a$ and standard discrete forward derivative $\partial_\mu$. Our observables are the often-used time-momentum correlation functions $G_{\phi\phi}(t,p)$ and $G_{\phi^2\phi^2}(t,p)$. In the absence of interactions, the Green function $G_{\phi\phi}$ on a torus with finite time extent $T$ is obtained by periodizing the infinite-$T$ propagator, whose exponential decay rate $\omega(p)$ satisfies the lattice dispersion relation, with momenta $p_i = \frac{2\pi}{L} k_i$, $k_i = 0, 1, \ldots, L/a - 1$. A further correlator is $G_{\phi^2\phi^2}$, which contains a second, disconnected contribution that dominates at long distances. On a given MC configuration, one may estimate $G_{\phi\phi}$ from the momentum-projected fields, and for the case of negligible auto-correlations, i.e. measurements separated by MC time much larger than the exponential auto-correlation time, the covariance matrix of $N$ measurements of the estimators $E(t,p)$ takes a simple form in terms of $g(t) \equiv G_{\phi\phi}(t,p)$. We note that off-diagonal matrix elements are exponentially suppressed by $e^{-|t-t'| m}$ and there is the typical strong exponential noise/signal enhancement at large $t \approx t'$. For the noisy $G_{\phi^2\phi^2}$ correlator, we employ an estimator with a maximal volume average.

Simulation
To illustrate our method on auto-correlated data similar to standard lattice QCD calculations, we simulated the free theory in 1+1 dimensions with the following algorithm. We performed a MC iteration composed of a Hybrid Monte Carlo (HMC) trajectory of length 4 molecular-dynamics units [13], followed by one Metropolis sweep. Both the integration of the equations of motion in the HMC and the width of the Metropolis proposal were tuned to have ≈ 80% acceptance. The large trajectory length was chosen to tame auto-correlations, and the Metropolis sweep was added to avoid exceedingly large auto-correlations specific to the free theory [14]. We set $am = 0.075$.
[Figure: the left and right panels show the estimator $E_f(w)$, eq. (3.8), and the p-value for several choices of $\chi^2_{\rm obs}$, as a function of the window size $w$ in units of measurements, for the fit range [4,26]. Several replica are considered and error bars are obtained from the fluctuations over the replica.]
We now investigate the determination of the lowest energy level by a fit to the correlation function $G^{\rm conn}_{\phi^2\phi^2}(t)$. It is asymptotically dominated by the lowest level, a two-particle state with each particle at momentum $p = 0$, with energy $aE_{\rm exact} = 2a\omega(0) = 2\cosh^{-1}(1 + a^2 m^2/2) \approx 0.15$. The higher levels are two-particle states with non-zero momenta $|p| \geq 2\pi/L$. This leads to a gap of $a\Delta E \approx 0.2$. We study the typical approach to the problem of extracting the lowest level through fits of the functional form $A_{\rm fit}\, e^{-m_{\rm fit} t}$ to $G^{\rm conn}_{\phi^2\phi^2}(t)$. Of course these suffer from the contaminations $\mathrm{O}(\exp(-t\Delta E))$ which are neglected in the fit function. Furthermore, the signal-to-noise problem is quite severe for $G^{\rm conn}_{\phi^2\phi^2}(t)$. We therefore consider fits only up to an upper cut, which we set to $t_{\rm max}/a = 26$.
We demonstrate the determination of $\langle \chi^2(\bar a) \rangle$ in Fig. 2 for a typical uncorrelated fit. Neglecting auto-correlations ($w = 0$) leads to a significant underestimate of $\langle \chi^2(\bar a) \rangle$. However, this is not an issue, because one can easily sum up to an appropriate window size $w$, even with only 2000 measurements as used here. In fact, the window summation of $\langle \chi^2(\bar a) \rangle$ saturates earlier than that of typical observables, e.g. $\langle \phi^2 \rangle$, Fig. 1.

p-value
We can identify statistically good fits as those which have an acceptable p-value, say $Q \gtrsim 0.05$. For a correlated fit with a known covariance matrix we have $\nu = 1 - P$, and thus $\langle \chi^2(\bar a) \rangle = N_x - N_A$. Furthermore, its p-value is simply given by
$Q(\chi^2_{\rm obs}) = \frac{\Gamma\big((N_x - N_A)/2,\ \chi^2_{\rm obs}/2\big)}{\Gamma\big((N_x - N_A)/2\big)}$
in terms of the upper incomplete gamma function. In Fig. 3 we compare this particular case to $Q(\chi^2_{\rm obs}, \nu)$, eq. (2.18), for uncorrelated fits. Several replica of 2000 measurements are used. For each replicum the computation of $\langle \chi^2(\bar a) \rangle$ employs the automatic windowing procedure [1] for $E_f$. The determined window $w_{\rm opt}$ is used to estimate the matrix $\nu$ from the covariance matrix. The uncertainty of $Q$ due to the statistical fluctuations of $\nu$ is then estimated from the width of the replica distribution. This very small uncertainty is given by the widths of the solid lines in the left panel of Fig. 3.
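For even $N_x - N_A$ the regularized upper incomplete gamma function reduces to a finite sum, which makes it easy to check the consistency of the closed form with a Monte-Carlo estimate in the style of eq. (2.18). A dependency-free sketch (function names are ours):

```python
import math
import numpy as np

def q_correlated(chi2_obs, ndof):
    """Closed-form p-value of a correlated fit,
    Q = Gamma(ndof/2, chi2_obs/2) / Gamma(ndof/2),
    written as a finite sum, valid for even ndof."""
    assert ndof % 2 == 0 and ndof > 0
    x = chi2_obs / 2.0
    term, total = 1.0, 1.0
    for j in range(1, ndof // 2):
        term *= x / j
        total += term
    return math.exp(-x) * total

def q_montecarlo(chi2_obs, lam, n_samples=200000, seed=7):
    """Monte-Carlo p-value from the eigenvalues lam of nu."""
    z = np.random.default_rng(seed).standard_normal((n_samples, len(lam)))
    return float(np.mean((z**2) @ np.asarray(lam) >= chi2_obs))
```

For a correlated fit all eigenvalues of $\nu = 1 - P$ equal one, and the two routines agree within the Monte-Carlo error.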
While in correlated fits with a large number of degrees of freedom a criterion $\chi^2/\langle \chi^2(\bar a) \rangle \approx 1$ is a good discriminator between good and bad fits, in general the p-value needs to be considered. Indeed, for uncorrelated fits it is even more important to base the acceptance on the p-value, as demonstrated by Fig. 3. Acceptable p-values are still present for $\chi^2_{\rm obs}$ significantly above $\langle \chi^2(\bar a) \rangle$, also for large $N_x - N_A$. The right panel of the figure shows the $Q$-function for a different case, namely a fit to $G_{\phi\phi}$, where we use the analytic covariance matrix without uncertainties. Again, values of $\chi^2_{\rm obs}/\langle \chi^2(\bar a) \rangle$ above 1.5 still have rather good p-values. Since $Q$ is easily evaluated, there is no drawback in always basing the acceptance of a fit on it. Also auto-correlations are easy to control: Fig. 2 shows the dependence of $Q$ on the summation window of the auto-correlation functions. One may again just use the $w_{\rm opt}$ determined in the analysis of $\langle \chi^2(\bar a) \rangle$.

Determination of the fit window
At this point we turn to the typical problem of finding the minimal value of the lower end of the fit range, $t_{\rm min}$, for which the single-exponential ansatz describes the data well. We want to compare the use of uncorrelated fits to correlated ones, but the covariance matrices estimated with the statistics of our test runs are (typically) not positive definite. Therefore we generated a long MC chain with 40000 measurements solely to estimate $C$, and use $W = C^{-1/2}$ in the definition of the correlated $\chi^2$. Of course, this is not a viable procedure in general, while uncorrelated fits are always possible. In Fig. 4 we show one out of several replica of 4000 measurements each. We compare the results of fits to $G^{\rm conn}_{\phi^2\phi^2}(t)$ to the true asymptotic values of the parameters. When the p-values become acceptable, the fitted parameters agree reasonably well with their exact asymptotic values. This illustrates that uncorrelated fits can be judged by the computed p-value. In contrast, the naive goodness-of-fit criterion $\chi^2(\bar a)/(N_x - N_A) \approx 1$ is useless, since $\chi^2(\bar a)/(N_x - N_A) < 1$ for all fits shown. We remind the reader that the correlated fits could be done only by estimating the covariance matrix with ten times the statistics. Quite often the data fluctuate more, and in a correlated fashion, around the true correlation function. One can then have fits with acceptable p-values, i.e. the fits describe the data over a certain range of a kinematical parameter (here $t$), but this does not guarantee that the asymptotic values of the model parameters (here $A$ and $E_{\rm exact}$) are determined. In our numerical experiment we have seen that in six out of ten cases $A_{\rm fit}$ differed from the exact value by 30% to 50%, and by more than two standard deviations, even when the p-value was acceptable. In that sense, the replicum shown is a lucky case.
While we observe smaller statistical errors in the parameters from correlated fits for a fixed set of data points, uncorrelated fits tend to give acceptable p-values for smaller t min . No particular difference is then observed in the precision of the estimated best fit parameters.

Conclusions
Estimating the covariance matrix C from limited correlated and auto-correlated data, in particular the lower part of its spectrum, often proves to be challenging. Here we investigated a method that bypasses the need for the inverse covariance matrix and is numerically robust against fluctuations of its small eigenvalues.
By studying the expansion of the χ 2 function about its minimum, we have obtained expressions to compute the expected value of χ 2 and the corresponding p-value, for arbitrary positive weight matrices W . The formulae contain the covariance matrix in such a way that the upper part of the spectrum of C predominantly determines the expected χ 2 and the p-value. They therefore provide robust statistical tests for the fits.
We observe that it is important to use the p-value to discriminate between good and bad fits, not just the reduced χ 2 . An investigation in a toy model illustrates how the method works with moderate statistics and with auto-correlations present. Unfortunately, the p-values of uncorrelated fits are somewhat less sharply discriminating than the ones of correlated fits. This is the price that one has to pay for their far superior robustness against statistical fluctuations. Thus, when the correlation of the data is well known (very long MC chains), correlated fits are the preferred choice. Of course, one can also balance the pros and cons of the two extreme choices by using intermediate weight matrices W together with the proper p-value.

A Statistical error of P
Here we derive the explicit form of the projector and other useful relations. We begin from the minimum condition, which we rewrite, to leading order in the fluctuations, as
$\phi_\alpha^T(\bar a)\, W^2 \big[\delta\bar y - \phi_\beta(\bar a)\, \delta\bar a_\beta\big] = 0\,. \qquad (A.1)$
Solving for the fluctuations of the parameters, we obtain
$\delta\bar a_\alpha = H^{-1}_{\alpha\beta}\, \phi_\beta^T(\bar a)\, W^2\, \delta\bar y\,, \qquad H_{\alpha\beta} = (W\phi_\alpha, W\phi_\beta)\,, \qquad (A.2)$
which tells us how to propagate the errors from the data to the fitted parameters. More specifically, the covariance matrix of $\bar a_\alpha$ reads
$C^{(\bar a)}_{\alpha\beta} = \big[H^{-1}\, \phi^T W^2\, C\, W^2 \phi\, H^{-1}\big]_{\alpha\beta}\,. \qquad (A.3)$
Now, combining eq. (A.2) and eq. (A.1), one quickly obtains
$P = W \phi_\alpha\, H^{-1}_{\alpha\beta}\, \phi_\beta^T\, W\,,$
which satisfies all the properties outlined in the main text. The partial derivative $\partial P/\partial \bar a_\alpha$, combined with eq. (A.3), gives us the explicit form of the contribution to the error of $\langle \chi^2(\bar a) \rangle$ that originates from the dependence of $P$ upon $\bar a_\alpha$, which we neglect. In the derivation of the result above, we have used the properties of $P$ listed in eq. (2.10).
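The covariance matrix of the fitted parameters is the usual sandwich formula of generalized least squares, which can be sketched numerically as follows (a minimal illustration of linear error propagation with arbitrary weights; the function name is ours):

```python
import numpy as np

def param_covariance(W, C, J):
    """Sandwich formula for the covariance of the fitted parameters:
    C_a = H^{-1} J^T W^2 C W^2 J H^{-1}, with H = J^T W^2 J and
    J[:, alpha] = d phi / d a_alpha evaluated at the minimum."""
    W2 = W @ W
    H = J.T @ W2 @ J
    Hinv = np.linalg.inv(H)
    return Hinv @ (J.T @ W2 @ C @ W2 @ J) @ Hinv
```

For the correlated choice $W^2 = C^{-1}$ the expression collapses to the familiar $(J^T C^{-1} J)^{-1}$; for any other weight matrix the data covariance $C$ no longer cancels and remains explicitly in the sandwich.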

B Reference implementations
A reference implementation in MATLAB/Octave and Python is publicly available at https://github.com/mbruno46/chiexp. Documentation on the syntax can be consulted online at https://mbruno46.github.io/chiexp, locally from the git repository after downloading it, or through the corresponding help functions in MATLAB or Python interactive sessions.

C Generalizations
In the main text we have considered the case of data described by a single fit function, but more complicated cases can be easily accommodated. When different sets of data $Y^{(1)}, Y^{(2)}$ (with different sizes $M_1$ and $M_2$) are described by different models $\phi^{(1)}, \phi^{(2)}$ with a few common parameters, it suffices to extend $N_x = M_1 + M_2$ to incorporate all points, such that the combined vector $\phi = \big(\phi^{(1)}, \phi^{(2)}\big)$ enters the definitions of $\chi^2$ and $P$. If the two data sets are correlated, e.g. they originate from the same MC ensembles, but estimating the full $N_x \times N_x$ covariance matrix and its inverse turns out to be impractical, the method proposed in this manuscript allows one to arbitrarily regularize the covariance matrix in the definition of $\chi^2$ (e.g. to consider just the diagonal blocks $M_1 \times M_1$ and $M_2 \times M_2$) and still have a reliable statistical interpretation of the associated fit.
Similarly, there are situations where we might want to include the error of the kinematic coordinates $\bar x_i$ (for instance, they might be obtained from averages over MC chains which may also be correlated with the data points $\bar y_i$). Such cases are easily incorporated by extending the residual vector $r$ in the definition of $\chi^2 = ||Wr||^2$ to include the fluctuations of the coordinates. The matrix $W^2$ may then correspond to various regularizations of the inverse of the extended covariance matrix
$C = \begin{pmatrix} \langle \delta\bar y\, \delta\bar y^T \rangle & \langle \delta\bar y\, \delta x^T \rangle \\ \langle \delta x\, \delta\bar y^T \rangle & \langle \delta x\, \delta x^T \rangle \end{pmatrix}, \qquad (C.3)$
and nothing changes in the derivation of $\langle \chi^2(\bar a) \rangle$ and in the discussion of the goodness-of-fit for the model function $\Phi$. In the equations above, $X_i$ represent the true values of the kinematic coordinates, and $\delta x_i = \bar x_i - X_i$.

D.2 Priors and the Bayesian approach
Gaussian priors on the parameters, a special case of a Bayesian analysis, are easily taken into account by extending the vector $r$ in $\chi^2 = ||Wr||^2$ by the entries $(a_\alpha - \tilde a_\alpha)/\sigma_{\tilde a_\alpha}$, with $\tilde a_\alpha$ and $\sigma_{\tilde a_\alpha}$ representing the prior and its width for the parameter $a_\alpha$. One can easily verify that for correlated fits $\langle \chi^2(\bar a) \rangle = N_x + N_{\rm priors} - N_A$.

D.3 Master fields
Recently M. Lüscher has derived formulae [15] for statistical errors based on the invariance under translations of space-time, rather than of Monte-Carlo time. We write
$\bar y_i = \frac{1}{V} \sum_s y_i(s)\,,$
with $s$ labelling the sites of a given large-volume configuration, called a master field, and $V$ the number of sites. Following Ref. [16], the Γ-method is then defined from the counterpart of $\Gamma^{\rm MC}(t)$, which we may now call $\Gamma^{\rm MF}_{ij}(s)$, and which is evaluated from the fluctuations $\delta y_i(s) = y_i(s) - \bar y_i$. As described in Ref. [15], the variance of $\bar y_i$ may be obtained by summing $\Gamma^{\rm MF}_{ii}(s)$ up to distances $|s| < R$, thus introducing only an exponentially small bias of order $e^{-2 m_\pi R}$. Our formulae then hold after replacing $\Gamma^{\rm MC}_{ij}$ with $\Gamma^{\rm MF}_{ij}$, e.g. in eq. (3.6), and after providing an appropriate windowing procedure.