This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 1
A connection between pattern classification by
machine learning and statistical inference with the
General Linear Model
J.M. Górriz, C. Jiménez-Mesa, F. Segovia, J. Ramírez, SiPBA group and J. Suckling
Abstract—A connection between the general linear model
(GLM) with frequentist statistical testing and machine learning
(MLE) inference is derived and illustrated. Initially, the estima-
tion of GLM parameters is expressed as a Linear Regression
Model (LRM) of an indicator matrix; that is, in terms of the
inverse problem of regressing the observations. Both approaches,
i.e. GLM and LRM, apply to different domains, the observation
and the label domains, and are linked by a normalization value in
the least-squares solution. Subsequently, we derive a more refined
predictive statistical test: the linear Support Vector Machine
(SVM), that maximizes the class margin of separation within
a permutation analysis. This MLE-based inference employs a
residual score and associated upper bound to compute a better
estimation of the actual (real) error. Experimental results demon-
strate how parameter estimations derived from each model result
in different classification performance in the equivalent inverse
problem. Moreover, using real data, the MLE-based inference
including model-free estimators demonstrates an efficient trade-
off between type I errors and statistical power.
Index Terms—General Linear Model, Linear Regression
Model, Pattern Classification, upper bounds, permutation tests,
cross-validation
I. INTRODUCTION
Despite the popularity of machine learning (MLE) as a
solution for a wide range of complex problems [28], [14], there
remains an open question about its usefulness for between-
group statistical inference. Neuroimaging in particular has
embraced MLE as a technology to deliver diagnostic and prog-
nostic classification [3], [20] of neurological and psychiatric
disorders. Nevertheless, the mainstay of neuroimaging studies
are observational and mechanistic, seeking to identify regional
between-group differences in brain structure and function.
Efforts with MLE in this space are increasing with continuous
output variables ([4], with remarks in [31]) rather than the
more typical categorical classifications.
Manuscript received Jan, 2021 , J.M. Górriz, C. Jiménez-Mesa, F. Segovia,
J. Ramírez, SiPBA group are with the Data Science and Computational
Intelligence Institute, University of Granada, Granada, 18071 Spain, e-mail:
gorriz@ugr.es, jg825@cam.ac.uk. J.M. Górriz and J. Suckling are with the
Dpt. Psychiatry, Herchel Smith Buidling for Brain & Mind Sciences, Forvie
Site Robinson Way, University of Cambridge, Cambridge CB2 0SZ, UK. The
SiPBA group are R. Romero-García, A. Ortiz, F.J. Martínez, C.G. Puntonet,
I.A. Illán, J.E. Eloy, D. Castillo, D. Salas, J.M. Mateos
Part of the data used in the preparation of this article were ob-
tained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI)
database (http://www.loni.ucla.edu/ADNI). As such, the investigators within
the ADNI contributed to the design and implementation of ADNI
and/or provided data but did not participate in analysis or writing
of this manuscript. ADNI investigators include (complete listing avail-
able at http://www.loni.ucla.edu/ADNI/Collaboration/ADNI Manuscript Cita-
tions.pdf).
Several advances for combining p-value maps have been
proposed based on the concept of prevalence [18], [32] that
go beyond the fixed and mixed (random) effects models [10].
Common to these approaches is the assumption of a mixing
of subject classifications at each voxel that is more realistic
than those assumed in classic random effect approaches; for
example, homogeneity of the binary activation pattern [32],
and offers the possibility of a new framework for modern
statistics.
The concept of prevalence as a fraction of individuals
correctly classified by MLE algorithms in group comparisons
is not novel in neuroimaging, and is indeed the main focus
of predictive inference. As an example, out-of-sample gen-
eralization approaches, such as Cross-Validation (CV), try to
estimate on unseen data the accuracy (Acc) of a classifier in a
binary classification problem. Although the methods and goals
of predictive CV inference are distinct from classical extrap-
olation procedures [24], they are exploited within frameworks
aimed at assessing statistical significance [31]. Bootstrapping,
binomial or permutation (“resampling”) tests [39] are all
examples that have been demonstrated as competitive outside
classical statistics, filling otherwise-unmet inferential needs.
In a pattern classification problem we usually assume
the existence of classes (H1) that can be differentiated by
classifiers with their performance measured in terms of Acc
or prevalence on an independent dataset. Then, we accept
(improperly in a statistical sense) the alternative hypothesis
H1 using empirical confidence intervals such as standard
deviations of the classification Acc from dataset folds. In cases
of limited sample sizes, the most popular k-fold CV method
[23] is sub-optimal under unstable conditions [12], [13], [37].
In such circumstances, the predictive power of the trained
classifiers can be arguable. Moreover, it has been partially
demonstrated that when using only a classifier’s empirical
Acc as a test statistic, the probability of detecting differences
between two distributions is lower than that of a bona fide
statistical test [33], [22].
Beyond empirical techniques for the estimation of per-
formance, MLE is well-established in data-driven statistical
learning theory (SLT), which is primarily devoted to problems
of estimating dependencies with limited amounts of data [36].
Although CV-MLE approaches were not originally designed
to test hypotheses based on prevalence in brain mapping [11],
they are theoretically grounded to provide confidence intervals
(protected inference) in the classification of image patterns
formulated as maps of statistical significance [15]. This can
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 2
be achieved by assessing the upper bounds of the actual error
in a binary classification problem (a confidence interval), and
by using simple significance tests of a population proportion
[15]. This results in improvements to the test’s statistical power
based on Acc. Thus, assessing with high probability the quality
of the fitting function (and its generalization ability) in terms
of in- and out-of-sample predictions can be conceptualized,
under a hypothesis testing scenario, as the inverse problem
of “carefully rejecting H0”; that is, the problem of rejecting
H1, and thus accepting H0 (that there is no effect, or it is not
significant).
In this paper we show a connection between the classical
general linear model (GLM), including random effect models,
with the MLE framework for the estimation of model/classifier
parameters and subsequent analyses to achieve the level of sig-
nificance in group comparisons. In this sense, inference based
on the parametric T statistic and prevalence-based probability
tests are two different paths solving the same problem. We
also show a novel method for achieving statistical significance
using MLE and permutation tests based on concentration ine-
qualities. This approach assesses the worst case of the actual
error and proposes an estimation of the observed distribution
of permuted data.
II. METHODS: CLASSICAL AND MLE STATISTICAL
INFERENCES
A. Background on classical statistics in neuroimaging
The GLM [10] is defined for a single observation level, e.g.
in a between-subject comparison, as:
y = Xθ +  (1)
where y is the N × 1 observation vector with units of time,
voxels, etc.,  is the N ×1 vector of errors that is assumed to
be Gaussian distributed, X is the N×M matrix containing the
explanatory variables or constraints, and θ is the M×1 vector
of parameters explaining the observations y. Note that: i) for
a hierarchical observation model each level requires the prior
estimation of the previous levels; and ii) in terms of MLE,
X plays the role of multidimensional labels or regressors
acting on the observations y. In the classic GLM, θ is usually
estimated by a Maximum Likelihood (ML) criterion based on
the Gaussianity assumption and is given by:
θˆ = (XtC−1 X)
−1XtC−1 y (2)
where C is the covariance matrix of errors. Inferences on
this estimate1: how large are the components of θ and the re-
lationship between classical GLM and MLE-based prevalence
inferences can be obtained using a linear compound specified
by a contrast weight vector c, and writing a T statistic as:
T =
ctθˆ√
ctCov(θˆ)c
(3)
1Here, we refer to voxelwise inference since we use a threshold u to classify
voxels i as “active” if Ti ≥ u. Clusterwise inference uses a cluster-forming
threshold to define contiguous suprathreshold regions [29].
where Cov(θˆ) = (XtC−1 X)
−1. This T statistic gives us
the probability of observing the ML estimation under H0 and
when it is small enough, e.g. p < 0.05, the linear compound
is considered significantly different from zero. As an example,
given a set of two parameters in θ = [θ1, θ2]T , if we select
c = [1− 1] we are assessing how large is the first parameter
with respect to the second; i.e. the difference θ1 − θ2. Thus,
if the T statistic suggests a small probability, the difference
is statistically significant and observations are generated from
different sources.
A similar procedure could be established based on a
Bayesian estimation and inference to handle complex hi-
erarchical observational models. This framework is based
on Expectation Maximization (EM) for parameter estimation
along with known priors and a priori probability models,
with the aim of evaluating the posterior probability (ppm).
By thresholding the ppm, relationships between this and the
frequentist approach can be established including similarities
(statistical power) and differences (specificity) [10].
1) Least Squares of the GLM: The GLM can be estimated
without any assumptions about the noise model by simply
solving the associated Least Squares (LS) problem. Therefore,
if we assume that  = 0 in the GLM, the problem is now
to find the “best” set of parameters θi that explains each
observation yi by:
yj =
M∑
i=1
Xjiθi; for j = 1, . . . , N (4)
Thus, we need to solve the linear regression problem given
in equation 4 to estimate the parameters θi. The most popular
estimation method is LS, in which we select the coefficients
θ to minimize the residual sum of squares:
RS(θ) =
N∑
j=1
(yj −
M∑
i=1
Xjiθi)
2 (5)
The solution to this problem (∂RS(θ)∂θ = 0), the Markov-
Gauss estimate, provides the smallest variance among all linear
unbiased estimates and is given by:
θˆ = (XtX)−1Xty (6)
similar to the GLM solution (equation 2) but assuming C = I
in the latter model; that is, if the errors are assumed to be
independently and identically distributed the ML estimation is
equivalent to the LS solution in equation 6.
B. Converting the estimation of θ into a LS classification
problem
In the LS multiclass classification problem, the goal is to
design M linear functions fi(y) = wtiy, given a set of input
patterns yi and according to a suitable mean squared error
(MSE) criterion with respect to some desired discrete output
binary code xi, i.e. labels. Note that, in general, this setup
is found in neuroimage analyses where the design matrix
contains discrete values, e.g. experimental conditions in fixed-
effects analysis or in random effect modeling between groups.
Recently, the residual score or classification error obtained
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 3
from several methodologies beyond LS (e.g. by applying
the fitted linear hyperplanes to new unseen data) have been
deployed to establish a CV Acc test from data with permuted
labels [31], [15].
1) The inverse problem: LS for regressing an indicator
matrix: Consider the general inverse problem; that is, given a
set of observations {yi}, for i = 1, . . . , N , we are interested
in explaining a set of “explanatory” binary-coded variables xi
(labels) by a matrix W of parameters. This problem, referred
to here as the inverse problem in the label domain, is also
known as the linear regression of an Indicator Matrix or the
linear regression model (LRM) [16]. In this model, we regress
the explanatory variables instead of regressing the observed
responses as in the GLM. This regression can be more accurate
depending on the nature of the data to be fitted e.g. for a
low number of discrete classes in the specified design matrix
X = [xim].
If we have M classes then X is a N ×M matrix, where
each row i = 1, . . . , N contains a single xim = 1, for m =
1, . . . ,M , Y is the N × P matrix of column responses yi
and W is a P ×M coefficient matrix. Thus, we fit a linear
regression model of the form:
X = YW (7)
where the P dimension allows the inclusion of several re-
sponses (multimodality or multiframe acquisitions) given the
same indicator response matrix X. Following the methodology
as that leading to equation 6, the best estimation is given by:
Wˆ = (YtY)−1YtX (8)
which regresses inputs of observations on to a novel set of
labels or constraints:
Xˆ = YWˆ (9)
The novel set Xˆ can be seen as a guess of the constraints for
the set of observation vectors yi, or an approximation of the
posterior probability p(class = m|y). Thus, it allows us to
compute an error model as:
LS = X− Xˆ (10)
2) Connection between θ and w: For simplicity, and to
connect with the GLM as shown in section II-A, let P = 1 in
the LRM, then W = w is a 1×M row vector and Y = y is
an N × 1 column vector. A simple relation between the GLM
and LRM approximations can be found taking into account
that:
X = ywˆ + LS (11)
at the LS solution. Thus, multiplying both sides on the right
by wt we can solve the equation for y and obtain the
corresponding GLM as:
y = (X− LS)θ˜ (12)
where we define θ˜ = wˆt(wˆwˆt)−1 and the GLM noise model
is derived using  = −LSθ˜. The scalar term of equation 12
can be expressed with the LS solution as:
(wˆwˆt)−1 = (yty)2/((Xty)tXty) =
(
∑N
i=1 y
2
i )
2∑M
m=1
∑
i,j yimyjm
(13)
where yim denotes observation i belonging to class m. Thus,
a LS linear regression of the observations can be described by
a GLM regression on the observations (i.e. a linear regression
on the explanatory variables), and vice versa.2
3) Inference of the inverse GLM based on MLE: The LRM
can be seen as a generalization of the GLM for the responses,
coding x as a vector of continuous noisy responses, that is, by
constructing vector targets for each class [16]. From equation
11, which is equivalent to the inverse GLM in equation 1,
inference on this model based on MLE could proceed as
follows. Based on a set of data pairs (yi,xi), we estimate the
set of parameters w using a similar expression to equation 8 or
other more refined predictive algorithms [6], [25], e.g. SVM.
After the fitting process, we assess its significance under the
null hypothesis, likewise the T-statistic inference on the GLM,
on an independent set ΩCV using a CV Acc test statistic:
TCV =
∑
i∈ΩCV
||(xi − (yiwˆ)t||2 (14)
The null distribution is modeled by randomly rearranging
labels a large number of times pi to create artificial data sets,
(yi,xpip), for p = 1, . . . , O, i.e. a permutation test, and eval-
uating the sum of squared residuals TCV with every unseen
sample within the permuted and original set. Consequently,
the p-value is defined by3:
pvalue =
card{TpiCV < TCV }+ 1
O + 1
(15)
where card(.) is the cardinality of a set and TCV and TpiCV
are the CV Acc tests on the original and permuted sets,
respectively. These p-values under the null hypothesis are
pivotal quantities and, in principle, could be used for multiple
testing correction, instead of the statistic image based on
accuracy. The distribution of minimum p-values, pminj , used
in the test to deal with the multiple comparison problem is
limited by the number of permutations, 1O ≤ pminj , and may
cause considerable loss of power [38]. Other methods, such
as RFT [8] or a correction based on the false-discovery rate
(FDR) can be used once the uncorrected p-values have been
obtained for each voxel in the image. In the experiments, due
to the discreteness of the p-values that is strongly limited for
computational reasons (O = 1000), we employ a combination
of correction methods for multiple testing, e.g. a single-
threshold test applied to the map of uncorrected p-values
for comparison purposes in the control of FWE rates, or
a Bonferroni corrected p-value calculated at each voxel for
assessing power.
In the latter test, also known as P-test [31], we assume that
we have a good procedure for estimating w. However, CV
is a standard procedure for estimating the actual error of any
classifier, which is found to be unstable in limited samples
sizes [37], [13]. We could improve this estimation by including
a term to cope with the possibility that the fitting process is
2given a GLM on the observations, we can define a LRM on the explanatory
variables as w˜ = θT (θθT )−1, at the ML solution, with an error LS =
−w˜
3the correction factor +1 in the numerator and denominator is justified by
the inclusion of the original sample set in the test
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 4
not as good as expected, and thus the resulting estimate is
not a good predictor. In this sense, other alternatives [1], [26]
could be tested by the assessment of the worst case based on
concentration inequalities and the resubstitution estimate as:
TRes =
∑
i∈Ω
||(xi − (yiwˆ)t||2 + ∆2(N,P ) (16)
where ∆(N,P ) is model-free upper bound of the actual risk
[13], [36], [15], [5] with a probability at least 1−α and Ω is
the dataset at hand.
III. EXPERIMENTAL RESULTS
In the first of the experiments, and to clearly state the prob-
lem and the solutions, we consider a simple group comparison
with only a single (second) level analysis (a Bayesian approach
of this problem is equivalent to the GLM based inference on
this single level), using a binary design matrix, e.g. it models
the population-specific effects. This is the well-known case-
control design often used as the basis for a diagnostic test;
e.g. Alzheimer patients vs unaffected controls. We adjusted
the GLM and the equivalent problems using LRM and SVM
by regressing observed variables using a simple explanatory
matrix X and a Gaussian model for the noise to obtain two
parameters θ1, θ2, as follows:
Y|N×1 = X|N×2θ|2×1 + |N×1
where, as an example,
X =

1 0
0 1
1 0
1 0
... ...

is a matrix of explanatory variables containing 1s and 0s in-
dicating the class membership of the observation using a two-
element binary code. A more general hierarchical model with
a non-binary design matrix (including regressors, covariates,
etc.) could be processed the same way by fitting the set of
parameters step by step by pattern regression, however we
are interested in assessing the connection between θ and w
in this paper for a binary (design matrix) pattern classification
problem. The objective is two-fold: i) the estimation of model
parameters using both methodologies and domains, linking
them by the theoretical connection in equation 12; and, ii) to
assess how completely they explain observations and labels in
both domains. The second objective can be tackled by showing
the estimations and the group of observations in both domains,
and by quantitatively evaluating the classification error in the
equivalent label domain, given the expected ideal values for
model parameters.
In the last part of this section, we show the inference
analysis derived from the two methodologies in each domain.
We regressed on the observations and on the labels to con-
struct and assess the spatially extended statistical processes,
generating maps of significance, using the MRI ADNI dataset
[15]. In doing so we compared Statistical Parametric Maps
(SPMs, a two-sample T-statistic similar to equation 3), where
significance is first individually assessed at each voxel, and
then combined using three configurations: first, with a cluster-
defining threshold of P = 0.001 (uncorrected for multiple
comparisons) alone; second, then adding a cluster extent
threshold (CET) = 10 voxels; and third, a Family Wise Error
correction at P = 0.05 on the clusters based on random field
theory (RFT) [8]. In addition, the P-tests described in section
II-B3 were also conducted.
A. Simulated Data with Noisy Observations (DG1)
A N -dimensional Gaussian noise vector v was randomly
drawn with zero mean and an N × N covariance matrix
with 2-norm equal to 1. A vector of observations was then
constructed by adding the noise to a binary vector (a column
in the explanatory matrix of indicators); i.e. y = Xk + v
for k ∈ {1, 2}. The design matrix was then obtained by
X = [XkX¯k], where (¯.) denotes logical negation.
Once the observations were artificially drawn (see figure
1), with increasing sample size we regressed both explanatory
variables (LRM by LS and SVM) and observations (GLM) to
obtain a set of two parameters for each model θ = [θ1, θ2],
w = [w1, w2]. All these methods can be employed to estimate
the regressed observed variables using equations 1 and 12,
given the explanatory matrix and the estimated parameters, as
shown in figure 2. In this figure we also plot the distribution
of the T-statistic over 1000 simulations (top), a sample of this
distribution that shows the variability of the estimation using
the GLM around the ideal value of θ2 − θ1 = 1 (middle) and
the estimated observation by each of the models (bottom).
Connected with the previous one-sample GLM estimations,
we plot in figure 3 the estimated parameters explaining the
observations using all the methods along with the observations
they model. Note the large variability of the GLM estimation
with increasing sample size. In figure 4 we show the inverse
problem; how the methods estimate the w from the point of
view of the label regression. In this case, it is readily seen that
the one sample GLM model provides a sub-optimal estimation
at different sample sizes; i.e. the red curve lies above the blue
curve. As expected, the use of these parameters in the dual
classification problem results in a larger empirical error as
shown in figure 5.
From these results we can conclude that the link between the
two approaches resides in the differing nature of the regression
procedure. In both domains there is an implicit classification
task once the parameters, that better explain the corresponding
observations, are derived. These parameters are fitted taking
into account only the empirical data available (including a
noise model, if present). Therefore, wm for a given model
m, can be used to regress the observations to obtain a novel
data set in the label space (new regressed labels), which can be
associated with the states (or classifications) of the explanatory
matrix. This classification task provides an empirical error
(figures 5 and ??). Other methods could be used to obtain
such parameters in a (non-)linear fashion. As an example, we
compared the decision boundary obtained by LRM with that
with SVM (figures 5 and ??), which illustrates the differences
between methods in terms of generalization ability.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 5
Fig. 1: Simulated data with noisy observations (DG1) example
B. Empirical data: a case-control design of the ADNI Dataset
The data used were obtained from the Alzheimer’s Disease
Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).
The ADNI database contains 1.5 T and 3.0 T T1-weighted
structural MRI scans from patients with Alzheimer’s disease
(AD), Mild Cognitive Impairment (MCI), and cognitively
normal controls (NC) acquired at multiple time points. Here
we only included 1.5T structural MRI. The original database
contains more than 1000 T1-weighted MRI images, although
for the proposed study only the first MRI examination of
each subject was included, resulting in 417 structural MRIs.
Following the recommendation of the National Institute on
Aging and the Alzheimer’s Association (NIA-AA) for the
use of imaging biomarkers [21], we considered the group
comparison NC vs. AD for establishing a clear framework
for comparing statistical paradigms (SPM and TCV ), since
the MCI class is strictly based on clinical criteria, without
including any other biomarker information [27]. Demographic
data is summarized in Table I. The dataset was preprocessed
using standardised neuroimaging methods and protocols im-
plemented in the SPM software (www.fil.ion.ucl.ac.uk/spm/),
including registration in MNI space by spatial normalization
and segmentation to differentiate grey and white matter and
other brain tissues [9]. Here, we used the grey matter (GM)
estimates.
TABLE I: Demographic details of the MRI ADNI dataset, with
group means and their standard deviations
Status Number Age Gender (M/F) MMSE
NC 229 75.97±5.0 119/110 29.00±1.0
AD 188 75.36±7.5 99/89 23.28±2.0
1) Assessing the statistical power: We fitted the set of
parameters using linear SVM and evaluated the TCV statistic
on the original dataset; see figure 6. As shown in this figure,
the resubstitution estimate is more optimistic in the Acc
distribution than the K-fold based estimate. Note that this
analysis is independent of the selected fold as we performed
∼ 106 folds, one per voxel. However, both are optimistic since
the mean of the distribution is not clearly located around 0.5
(it is already shifted to the right, beyond the effect due to
truly significant regions). The effect is even larger when the
groups are slightly imbalanced, simulating the case of over-
powered datasets, as shown in the bottom of the figure 6.
However, note how the corrected bound [13] clearly shifts the
Acc obtained by resubstitution to the left, resulting in a better
(more conservative) estimation of the statistic across the whole
volume.
Based on the TCV and TRes values from the original dataset,
and those obtained using a permutation analysis (O = 1000)
for a selection of structures (e.g. left hippocampus, a brain
structure with a well-established role in the progression of
AD), we compared the SPMs processed with the inference
approaches described in section II-B3. Note that the large
number of voxels that composes an image limits the per-
mutation analysis to specific structures. Results from the
left hippocampus are depicted in figure 7. The permutation
analysis reveals how the power of the TCV approach is affected
in this featured region, where a true effect might be found in
almost the entire structure. The statistical power of the TRes is
preserved through the permutation procedure (2058 detected
voxels vs 1024 voxels out of 2237, figure 7). It is also worth
mentioning the CDF of the errors derived in this particular
region and the corresponding distribution of p-values, recalling
that the dataset included patients with advanced AD, and thus
the selected structure should be significantly affected by the
disease.
To extend the analysis to the whole volume, we approx-
imately simulated the null distribution outside the left hip-
pocampal region in two steps. First, we computed the set of
p-values in the left hippocampus (around 2 · 103 voxels) fol-
lowing equation 15 and determined the averaged T threshold,
Tth, that approximately provided an appropriate significance
level, e.g. 0.05. Then, assuming that for any T < Tth the
probability of an observation is p-value< 0.05, we thresholded
the remainder of the image to obtain the significant voxels
showing an effect. This approach clearly requires multiple-
comparison correction as several dependent or independent
statistical tests are being performed simultaneously at a given
significance level. Therefore, we made the significance level
more conservative with α = 0.001 to reduce the presence
of false positives (FP) in the permutation analyses shown in
section II-B3 and then compared it with the inference made
using the three SPM configurations across the whole volume.
In figure 8 we show the detection ability together with the
control of type I errors in the TRes approach (map in red font).
Note how the permutation test affects the detection ability of
the classical k-fold CV approach (map in green font), and how
the uncorrected voxelwise approaches (in blue font, bottom left
and middle) inflates the number of FPs.
2) Controlling type I error: It is important to evaluate the
ability of the inference methods for controlling the FP rates
[15], [7]. As an example, in [15] the ability of upper-bound
based inference to control type I errors with increasing sample
size was demonstrated using a global test for a proportion
within the whole volume. Moreover, in [7] clusterwise in-
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 6
Fig. 2: Estimated Observations and T-statistic distribution. Note that in the GLM model we use the covariance matrix of the
noise to evaluate equation 2; that is, in the estimation of θ. We show the comparison between non-normalized statistics of all
the estimations, i.e. suppressing the covariance term in the GLM, in a random (R=1000) simulation. This clearly demonstrates
that only on average does the ML statistic converge to the ideal value θ2− θ1 = 1 unlike the single sample of this distribution
shown in the bottom left.
Fig. 3: Distribution of observations (y) and estimations of θ for GLM, LRM and SVM in DG1
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 7
Fig. 4: Estimations of the parameter w regressing the observations with increasing sample size in DG1
Fig. 5: Classification boundaries and empirical errors given the observations (y) in GLM, LRM and SVM (N = 1000, DG1).
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 8
Fig. 6: Bottom: Distribution of voxelwise accuracies of the empirical dataset in two cases: balanced (188 vs 188) and imbalanced
groups (188 vs 229), using k = 10-fold, resubstitution and concentration inequalities [15]. Up: 3D distribution of the accuracies
using k-fold CV and the corrected Accuracy by upper-bounding.
ferences were demonstrated to be under-conservative as they
inflate false positives when the analysis is performed on
resting-state fMRI data. In this paper, we focus our analysis on
specific standardized MRI structures for computational reasons
[19], e.g. the left Heschl gyrus region, instead of doing that on
whole-volume searches as already analyzed in [15], [7] and are
interested in comparing the statistic images derived from the
devised methods; i.e. the number of activated voxels and FWE
rates that arise from them, rather than a specific inference to
control the FWE rate.
In this analysis two groups of subjects (N = 114) were
randomly drawn from a relatively large (N = 228) pool of
NC, and the corresponding p-values, e.g. the ones defined
in equation 15, were computed accordingly. Thus, the null
hypothesis of no group difference in brain activation is true
by construction. The proportion of analyses that give rise to
any significant results; that is, the number of FPs detected,
should be approximately equal to the significance level.
First, we estimated the voxelwise activation rate provided
by the uncorrected SPM within standardized areas [34] in this
randomization analysis. In this case, each voxel statistic is
tested individually and the activation rate for each region is
simply the overall number of suprathrehold voxels divided by
the number of analysis (1000) times the number of voxels
within the region (Nv). Nevertheless, the ensemble of such
partial results (the omnibus test) using specific inferences
provides the estimated FPs given in [15], [7]. The estimated
FWE rates are the number of analyses with any significant
group activation divided by the number of analyses. From this
figure we selected two dummy regions, left Heschl gyrus and
left hippocampus, since they are small and provide extreme
values for the between-group difference.
A total of O = 1k random group draws were undertaken to
obtain the statistic images for each configuration (SPM unc,
SPM unc with CET=10 and P-tests) and then, the empirical FP
rates were computed on the selected regions using the same
Omnibus test. The estimated FP rates are simply the number of
significant results divided by the total number of permutations.
To establish a fair comparison between parametric (SPM unc,
SPM unc CET = 10) and non-parametric maps (CV P-
tests), we employed the same inference method for these
configurations. In particular, we employ a single-threshold test
[19] applied to the p-value images of each method and select
a threshold α on the frequency that the minimum p-value
distribution across the region pmini , for i = 1, . . . , O derived
from randomization is less or equal the minimum p-value of
the test image (see section II-B3). The FP rate is estimated
accordingly as the proportion of randomisation values less or
equal to the number of occurrences divided by O.
Figure 9 illustrates similar results as those described in
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 9
Fig. 7: Permutation analysis of the hippocampus. Note that O = 1000 and the upper bound [13] was obtained with a probability
at least 0.05 and the similarity of the histograms for the p values derived from the analysis on the hippocampus (left, O = 1000)
and on whole volume (right, O = 263). In both cases the number of regions detected by Tres (red map) was larger than that
detected by TCV (green map).
Fig. 8: Parametric and non-parametric statistical maps. Note the trade-off in detection and control of the FWE of the Tres
approach (red map) compared with TCV (green map) and the three SPM configurations (blue maps): on the left, a cluster-
defining threshold of P = 0.001 (uncorrected for multiple comparisons), in the middle, adding a cluster extent threshold = 10
voxels, and on the right a Family Wise Error correction at 0.05 on the clusters.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 10
[15], [7] for global analyses. First, in the figure above we
show the activation rate for each structure by evaluating the
voxel statistic individually. Results are in agreement with the
expected significance levels. On the contrary, the Omnibus
test on p-value images reports more FPs than expected in all
SPM configurations, except for the one based on the over-
conservative clusterwise inference, e.g. voxelwise SPM FWE
corrected. Although no precise control of FPs is assured, we
found our CV-P tests controls FWE below the significance
level, whilst the same inference on the SPMs had FWE ranging
20% − 70%. In other words, the methods based on P = α
CET = 10 voxels has a FWE-corrected P value of 0.2− 0.7.
The P-tests (Tres and TCV ) maps based on the single-
threshold test provide a better control of the type I error than
those based on SPM, whilst it is worth mentioning that un-
corrected SPM-based inferences are clearly dependent on the
selected structure, e.g. in the left hippocampus is close to being
valid, unlike the results found in previous global analyses [7].
A simple single-threshold inference on the minimum p-value
distribution derived from randomization relieves this issue. We
also show how the permutation approach based on the upper
bounds provides a similar estimated FWE rate as that based
on the k-fold CV P-test, but with larger statistical power as
shown in the preceding section.
IV. DISCUSSION
In the context of classification for statistical inference, there
are two primary strategies, either: i) performing k-fold cross-
validation and assessing Acc in several averaged folds; or,
ii) proposing a cross-validation based statistic (P-test) using
an estimation of the actual error of the classifier on a new
set of samples (equation 8). In both cases, if this residual
square (error) is small a good classification is achieved and
constitutes evidence against the H0. In the second approach,
to simulate the null distribution researchers employ a technique
(section II-B3) that is also used in frequentist inference:
the permutation test. A set of label permutations, pip for
p = 1, . . . , O, is generated and then applied to the dataset,
using the same observations, y, and permuted constraints,
xpip , estimating the parameters wpip and computing a set of
residuals for all the permutations. The p-value is derived by
dividing the number of times we randomly obtain a residual
score less than the one we obtain with the original value over
the number of permutations; i.e. p− value = p(RSpi < RS).
This methodology is called a CV-P test [31], where LRM
could be replaced by SVM or another predictive algorithm.
Several limitations are found using only LRM for estimating
the posterior probability. Linear regression is only operative in
binary classification, e.g the regression could be negative or
even greater than zero [16]. Indeed, and as shown in the exam-
ple, in this case there is a strong correspondence between GLM
and LRM for a single level analysis in group comparisons.
Thus, complex classifiers and other loss functions are needed
for relieving bad estimations on the set of parameters. Beyond
that, the selected predictive algorithms build their P-tests on
the CV strategy that could be a biased estimator of the actual
error in heterogeneous datasets, such as those encountered in
neuroimaging [37].
Frequentist and Bayesian inferences depend on specified
models when proposing a T statistic and fitting parameters of
the GLM. This is partly solved again by the use of permutation
analysis in the estimation of the null distribution, but what
about the T statistic definition? This is also described in
terms of the error covariance matrix, which must be estimated
on empirical data in limited sample sizes. In the synthetic
examples we assumed a known covariance matrix in the
formulation of the GLM. Despite that, the T-statistic following
on from the best guess fluctuated around the ideal value
and resulted in low classification rates. How is frequentist or
Bayesian analysis actually undertaken? Again, there are model
selection and parameter fitting stages to achieve where, in
complex scenarios with a limited sample size, heuristics are the
common solutions [40]. Indeed, in the high dimensional case
or under the assumption of complex models, the performance
and operation of these approaches are arguable [31]. Where
the estimation of parameters is computationally costly, the
tendency is to use heuristics for solving such issues. For
example, in the FSL tools based on Bayesian inference, such
as BET (Brain extraction tool), TBSS (tract-based spatial
statistics), FLIRT (FMRIB’s linear image registration tool),
PRELUDE/FUGUE (phase unwarping and MRI unwarping),
and MELODIC ICA, the use of heuristics is common practice
and the estimation of the full posterior distribution of model
parameters is biased.
In summary, limited samples sizes and the selec-
tion/estimation of any specific model is still an issue in
neuroimaging, made more difficult when the model and the
interaction between model parameters becomes too complex
for an accurate posterior probability estimation, or a feasible
numerical computation of the Bayes rule. Given the connection
between the two observation models - GLM and LRM - in
this paper we propose a statistical inference that leverages
an agnostic theory about the estimation of dependencies,
established in the pattern classification problem with limited
amounts of data [36], [17].
In this sense, given the connection between the two
paradigms of statistical inference, we are supported by MLE
algorithms to provide new statistical tests, e.g. the P-tests, to
highlight differences in patterns of imaging-derived measures
between groups. The P-test based on the upper bound cor-
rection provides the same type I error control as the k-fold
CV approach, and a trade-off in statistical power between
clusterwise inferences (invalid for global analyses) and over-
conservative voxelwise parametric and k-fold CV P-test infer-
ences, as shown in the experiments.
V. CONCLUSIONS
In this paper we propose the application of permutation
tests and agnostic theory to the set of regressed outputs by
the definition of the residual score or Acc test. The latter
framework is a consequence of the connection between the
classification problem and statistical inference based on the
GLM. Then, we employed permutation tests and a better
estimation of actual error based on concentration inequalities
to provide a trade-off between the type I error and statistical
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 11
Fig. 9: Estimated activation and FP rates and histograms of the activated voxels for the selected structures at given significance
levels α. On the left, the Left hippocampus analysis: we show at the top the FP rates derived from the Omnibus test. In
the middle, the probability of activation for individual analyses (type I error control) and the histogram of activated voxels
(counts vs. number of activated voxels per test). On the right, left Heschl gyrus analysis. Note: no FPs were detected using
over-conservative voxelwise SPM inference.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2021.3101662, IEEE Journal of
Biomedical and Health Informatics
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. XX, NO. X, JANUARY 2021 12
power. Previous results have demonstrated the ability of such
estimator to provide maps of significance [15] where a random
simulation on controls resulted in a nominal rate of FPs.
In particular, we see equivalence in the estimation of the
observation and (explanatory) label domains, thus any test
performed in the label space using an Acc test is similar to
those used in neuroimaging over the last decade. Moreover,
prevalence (the scores in equations 14 and 16) is a valid
measure for statistical inference without using any model as
first assumptions. Our approach computes this score using all
the data available, instead of using a k-fold strategy, and with
the resulting set of accuracies we estimate the true value based
on the upper bounds with probability at least 1 − α. Then, a
permutation analysis is derived using this measure to simulate
the distribution of the null hypothesis and finally, a test can
be formulated as a classic statistical inference. Putative design
tasks and random experiments on empirical datasets to assess
type I error and statistical power, respectively, confirm the
nominal performance of the methodology, and demonstrate its
potential.
ACKNOWLEDGMENTS
This work was partly supported by the Ministerio de Ciencia
e Innovación (España)/ FEDER under the RTI2018-098913-
B100 project, by the Consejería de Economía, Innovación,
Ciencia y Empleo (Junta de Andalucía) and FEDER under
CV20-45250, A-TIC-080-UGR18 and P20-00525 projects,
and by the Ministerio de Universidades under the FPU Pre-
doctoral Grant FPU 18/04902.
REFERENCES
[1] A. Antós, et al. Data-dependent margin-based generalization bounds for
classification. Journal of Machine Learning Research 3 (2002) 73?98
[2] C.J.C Burges. A tutorial on support vector machines for pattern recogni-
tion Data Mining and Knowledge Discovery, 2 (2) (1998), pp. 121-167
[3] D.Bzdok. Classical Statistics and Statistical Learning in Imaging Neu-
roscience. Front. Neurosci., 06 October 2017
[4] J. R.Cohen, et al. Decoding continuous behavioral variables from
neuroimaging data. Front. Neurosci. 5. 2011.
[5] T.M. Cover. Geometrical and Statistical properties of systems of linear
inequalities with applications in pattern recognition. IEEE Transactions
on Electronic Computers. EC-14: 326?334 (1965)
[6] F. DeMartino, et al. Combining multivariate voxel selection and support
vector machines for mapping and classification of fMRI spatial patterns
NeuroImage, 43 (1) (2008), pp. 44-58
[7] A.Eklund, et al. Cluster failure: Inflated false positives for fMRI.
Proceedings of the National Academy of Sciences Jul 2016, 113 (28)
7900-7905.
[8] R.S.J. Frackowiak, et al. Human Brain Function (Second Edition). Chap.
44. Introduction to Random Field Theory. ISBN 978-0-12-264841-0
Academic Press. 867-879, 2004.
[9] K.J.Friston, et al. Statistical Parametric Maps in functional imaging: A
general linear approach Hum. Brain Mapp. 2:189-210 (1995)
[10] K.J.Friston, et al. Classical and Bayesian inference in neuroimaging:
theory NeuroImage, 16 (2) (2002), pp. 465-483
[11] K. J.Friston, Sample size and the fallacies of classical inference. Neu-
roImage 81 (2013) 503?504
[12] J.M.Górriz, et al. A Machine Learning Approach to Reveal the Neu-
roPhenotypes of Autisms. International journal of neural systems,
1850058. 2019.
[13] J.M.Górriz, et al. On the computation of distribution-free performance
bounds: Application to small sample sizes in neuroimaging. Pattern
Recognition 93, 1-13, 2019.
[14] J.M.Gorriz, et al. Artificial intelligence within the interplay between
natural and artificial computation: Advances in data science, trends and
applications. Neurocomputing Volume 410, 14 October 237-270 2020.
[15] J.M.Gorriz, et al. Statistical Agnostic Mapping: A framework in neuro-
imaging based on concentration inequalities. Information Fusion Volume
66, February 2021, Pages 198-212
[16] T. Hastie, et al. The elements of statistical learning theory. Data Mining
inference and prediction. Ed Springer. isbn 0-387-95284-5. 2001
[17] D. Haussler. Decision theoretic generalizations of the PAC model for
neural net and other learning applications. Information and Computation
Volume 100, Issue 1, September 1992, Pages 78-150
[18] R. Heller, et al. 2007. Conjunction group analysis: An alternative to
mixed/random effect analysis. NeuroImage 37, 1178-1185.
[19] A.P. Holmes. Nonparametric Analysis of Statistic Images from Func-
tional Mapping Experiments. Journal of Cerebral Blood Flow and
Metabolism. 16:7-22
[20] I.A. Illan, et al. Automatic assistance to Parkinson’s disease diagnosis
in DaTSCAN SPECT imaging. Medical Physics. 2012
[21] C.C.Jack,Jr. ,et al. NIA-AA Research Framework: Toward a biological
definition of Alzheimer?s disease. Alzheimers Dement. 2018 Apr; 14(4):
535?562.
[22] I. Kim, et al. Classification accuracy as a proxy for two sample testing
Annals of Stat., 2020
[23] R.A.Kohavi. Study of CV and bootstrap for accuracy estimation and
model selection. Proc. of the 14th international joint conference on AI
- Vol. 2 pp 1137-1143 (1995)
[24] M.A Lindquist, et al. Ironing out the statistical wrinkles in "ten ironic
rules". Neuroimage. 2013 Nov 1;81:499-502.
[25] F.J.Martinez, et al. Studying the Manifold Structure of Alzheimer’s Dis-
ease: A Deep Learning Approach Using Convolutional Autoencoders.
IEEE J Biomed Health Inform. 2019 Jun 17.
[26] P.Massart. Some applications of concentration inequalities to statistics.
Annales de la Faculté des Sciences de Toulouse, 2000.
[27] G.M.McKhann, et al. The diagnosis of dementia due to Alzheimer?s
disease: recommendations from the National Institute on Aging and the
Alzheimer?s Assocation Workgroup. Alzheimers Dement. 2011;7:263?9.
[28] J.Mouro-Miranda, et al. Classifying brain states and determining the
discriminating activation patterns: Support vector machine on functional
MRI data. NeuroImage, 28, 980?995. (2005).
[29] T.E.Nichols. Multiple testing corrections, nonparametric methods, and
random field theory. NeuroImage 62 (2012) 811?815
[30] M.Ojala, et al. Permutation tests for studying classifier performance.
Journal of Machine Learning Research. 2010; 11:1833?1863.
[31] P.T. Reiss, et al. Cross-validation and hypothesis testing in neuro-
imaging: an irenic comment on the exchange between Friston and
Lindquist et al. Neuroimage. 2015 August 1; 116: 248?254
[32] J.D. Rosenblatt, et al. Revisiting multi-subject random effects in fMRI:
Advocating prevalence estimation. NeuroImage 84 (2014): 113-121.
[33] J.D. Rosenblatt, et al. Better-than-chance classification for signal detec-
tion. Biostatistics (2016).
[34] N. Tzourio-Mazoyer, et al. Automated anatomical labeling of activations
in spm using a macroscopic anatomical parcellation of the MNI MRI
single subject brain. Neuroimage 2002; 15: 273-289.
[35] V. Vapnik, A.Y. Chervonenkis On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and
Its Applications, 16:264?280, 1971.
[36] V. Vapnik. Estimation dependencies based on Empirical Data. Springer-
Verlach. 1982 ISBN 0-387-90733-5
[37] G. Varoquaux. Cross-validation failure: Small sample sizes lead to large
error bars. NeuroImage 180 (2018) 68?77.
[38] A.M.Winkler. Permutation inference for the general linear model. Neu-
roImage Volume 92, 15 May 2014, Pages 381-397.
[39] A.M.Winkler, et al. Non?parametric combination and related permuta-
tion tests for neuroimaging. Human brain mapping 37.4 (2016): 1486-
1511.
[40] M.K. Woolrich, et al. Bayesian analysis of neuroimaging data in FSL.
NeuroImage 45 (2009) S173?S186.