Edited by: Qinghua Jiang, Harbin Institute of Technology, China

Reviewed by: Charles M. Rowland, Quest Diagnostics, United States; Kui Zhang, Michigan Technological University, United States

^{†}These authors have contributed equally to this work

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Complex diseases are believed to be the consequence of intracellular network(s) involving a range of factors. An improved understanding of a disease-predisposing biological network could lead to better identification of genes and pathways that confer disease risk and therefore inform drug development. Group differences in biological networks, which are often characterized as graphs of nodes and edges, are attributable to the effects of these nodes and edges. Here we introduced pointwise mutual information (PMI) as a measure of the connection between a pair of nodes with either a linear relationship or nonlinear dependence. We then proposed a PMI-based network regression (PMINR) model to differentiate patterns of network changes (in nodes or edges) linked to a disease outcome. Through simulation studies with various sample sizes and inter-node correlation structures, we showed that PMINR can accurately identify these changes with higher power than current methods and is robust to the network topology. Finally, we evaluated the practical performance of PMINR using publicly available data on lung cancer and gene methylation data on aging and Alzheimer’s disease. We concluded that PMI is able to capture the generic inter-node correlation pattern in biological networks, and PMINR is a powerful and efficient approach for biological network analysis.

A complex disease is understood to be the consequence not of abnormality involving a single biomolecule (e.g., RNA, protein, metabolite) but of their network(s) and possibly a variety of other factors (

A biological network is commonly described as a graph such that nodes (or vertices) are used to represent biomolecules and edges to represent consequences or physiological interactions between vertices. In general, both the node effects (e.g., the magnitude of each gene’s expression in regulation network) and the edge effects (e.g., the strength of connection) can contribute to the disease. A given biological network is characterized with respect to what the nodes represent and what the nature of the interactions is between these nodes (edges) (

It is particularly challenging to quantify inter-node connection strength precisely with a unified metric, especially when involving group (e.g., patients versus healthy controls) differences in biological networks (

In more detail, our approach concerns regression methodology for assessing relationships between a disease outcome and a particular biological network, with adjustment for potential confounding factors. Below we first introduce pointwise mutual information (PMI) to measure the strength of connection between a pair of nodes in the network; PMI is currently in common use in machine learning and text mining.

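The PMI of two node variables X and Y is defined as

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)},$$

where p(x, y) is the joint density of X and Y, and p(x) and p(y) are the corresponding marginal densities. The densities are estimated with the bivariate kernel density estimator (BKDE)

$$\hat{f}(\mathbf{x}; H) = \frac{1}{n} \sum_{i=1}^{n} K_{H}(\mathbf{x} - \mathbf{X}_{i}),$$

where $\mathbf{x} = (x, y)^{T}$ and $\mathbf{X}_{i} = (X_{i}, Y_{i})^{T}$, $K_{H}(\mathbf{x}) = |H|^{-1/2} K(H^{-1/2}\mathbf{x})$, $H$ is the bandwidth matrix and $K$ is the kernel function.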
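As a rough illustration of how a kernel-density-based PMI could be computed for every sample, the sketch below evaluates log(f(x, y) / (f(x) f(y))) at each observed pair. It is not the authors' implementation: the Gaussian product kernel, the diagonal bandwidth matrix, Scott's rule of thumb, and the function names `scott_bandwidth` and `pmi_per_sample` are all assumptions made for this example, and no spline interpolation is used.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def scott_bandwidth(values, dim):
    """Scott's rule of thumb, h = sd * n^(-1/(dim + 4)) (an assumed choice)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sd * n ** (-1.0 / (dim + 4))


def pmi_per_sample(xs, ys):
    """Return log(f(x, y) / (f(x) f(y))) evaluated at every observed pair."""
    n = len(xs)
    hx1, hy1 = scott_bandwidth(xs, 1), scott_bandwidth(ys, 1)  # marginal bandwidths
    hx2, hy2 = scott_bandwidth(xs, 2), scott_bandwidth(ys, 2)  # joint bandwidths
    pmi = []
    for x, y in zip(xs, ys):
        fx = sum(gaussian_kernel((x - a) / hx1) for a in xs) / (n * hx1)
        fy = sum(gaussian_kernel((y - b) / hy1) for b in ys) / (n * hy1)
        # Diagonal bandwidth matrix H = diag(hx2^2, hy2^2) gives a product kernel.
        fxy = sum(
            gaussian_kernel((x - a) / hx2) * gaussian_kernel((y - b) / hy2)
            for a, b in zip(xs, ys)
        ) / (n * hx2 * hy2)
        pmi.append(math.log(fxy / (fx * fy)))
    return pmi


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y_dep = [xi + 0.3 * rng.gauss(0, 1) for xi in x]  # strongly dependent on x
y_ind = [rng.gauss(0, 1) for _ in x]              # independent of x
mean_dep = sum(pmi_per_sample(x, y_dep)) / len(x)
mean_ind = sum(pmi_per_sample(x, y_ind)) / len(x)
print(f"mean PMI, dependent pair:   {mean_dep:.3f}")
print(f"mean PMI, independent pair: {mean_ind:.3f}")
```

Each sample receives its own PMI value, so an edge can enter a regression model as a per-subject covariate. The average PMI of the dependent pair should be clearly larger than that of the independent pair.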

Assume that we have a biological network with m nodes. Let Y be the binary response variable and Z_{s} (s = 1, …, S) be the covariates. The PMINR model is

$$\mathrm{logit}\, P(Y = 1) = \alpha_{0} + \sum_{s=1}^{S} \alpha_{s} Z_{s} + \sum_{i=1}^{m} \beta_{i} X_{i} + \sum_{i<j} \gamma_{ij} I_{ij} \widehat{\mathrm{PMI}}_{ij},$$

where X_{i} denotes the i-th node, I_{ij} is an indicator variable for whether an edge connects node X_{i} and node X_{j}, and $\widehat{\mathrm{PMI}}_{ij}$ denotes the estimator of PMI between node X_{i} and node X_{j} using BKDE. The regression coefficients are denoted by α_{s}, β_{i} and γ_{ij}. Here, we use cubic- and quadratic-spline interpolation to construct the BKDE-based estimator of PMI. PMINR naturally decomposes the change of the whole network into node changes and edge changes. Using a likelihood ratio test, it can test whether the whole network is significantly associated with the response variable, and using a Wald test it can identify which nodes or edges are related to the response variable.
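The two tests described above can be sketched for a toy network with one node term and one edge term. This is a minimal illustration under stated assumptions, not the PMINR software: the model is taken to be a logistic regression, the edge column is a simulated placeholder standing in for an estimated PMI, and `solve`, `fit_logistic` and the effect sizes of 0.8 are hypothetical. With two tested coefficients the likelihood ratio statistic has 2 degrees of freedom, for which the chi-square tail probability is exactly exp(-stat / 2).

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def _mu_and_info(X, beta):
    """Fitted probabilities and Fisher information at the given coefficients."""
    mu = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row)))) for row in X]
    p = len(beta)
    info = [[sum(m * (1 - m) * row[j] * row[k] for row, m in zip(X, mu))
             for k in range(p)] for j in range(p)]
    return mu, info


def fit_logistic(X, y, iters=25):
    """Newton-Raphson fit; returns coefficients, standard errors, log-likelihood."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        mu, info = _mu_and_info(X, beta)
        grad = [sum((yi - m) * row[j] for row, yi, m in zip(X, y, mu)) for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    mu, info = _mu_and_info(X, beta)
    loglik = sum(yi * math.log(m) + (1 - yi) * math.log(1 - m) for yi, m in zip(y, mu))
    # Standard errors are the square roots of the diagonal of the inverse information.
    se = [math.sqrt(solve(info, [float(j == k) for k in range(p)])[j]) for j in range(p)]
    return beta, se, loglik


rng = random.Random(1)
n = 500
node = [rng.gauss(0, 1) for _ in range(n)]
edge = [rng.gauss(0, 1) for _ in range(n)]  # placeholder for an estimated PMI
y = [1 if rng.random() < 1 / (1 + math.exp(-(0.8 * a + 0.8 * e))) else 0
     for a, e in zip(node, edge)]

beta, se, ll_full = fit_logistic([[1.0, a, e] for a, e in zip(node, edge)], y)
_, _, ll_null = fit_logistic([[1.0] for _ in range(n)], y)

lrt_stat = 2 * (ll_full - ll_null)   # whole-network test, 2 degrees of freedom
lrt_p = math.exp(-lrt_stat / 2)      # exact chi-square tail for df = 2
wald_p = [math.erfc(abs(b / s) / math.sqrt(2)) for b, s in zip(beta, se)]
print(f"LRT p = {lrt_p:.3g}; Wald p (node, edge) = {wald_p[1]:.3g}, {wald_p[2]:.3g}")
```

For a network with more than two tested terms the closed-form tail probability no longer applies, and a general chi-square distribution function would be needed.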

To make our simulation more realistic, we set as our model network the topological structure from the pathway of insulin resistance downloaded from Kyoto Encyclopedia of Genes and Genomes (KEGG) including 26 nodes and 37 edges (

The simulated network structure based on the Insulin resistance pathway from KEGG.

the product moment network regression (PMNR) which uses the common linear correlation to represent the between-node connection strength.

the DGCA method which is differential gene correlation analysis (i.e., edge effect) to assess the difference in gene-gene regulatory relationships under different conditions (

the RANK method which can detect the whole pathway due to either correlations or mean changes (

Each scenario included four situations: (1) only nodes of the network having effects, (2) only edges of the network having effects, (3) both nodes and edges having effects, with the nodes not hanging on the edge (e.g., node X_{6} and edge E_{4,10}), and (4) both nodes and edges having effects, with the nodes hanging on the edge (e.g., node X_{4} and edge E_{4,10}).

In scenario 1, we generated data using linear correlation to represent the network edges and evaluated the performance of all four methods. We randomly assigned the effecting node and edge for each of the four aforementioned situations. The simulated node data followed a multivariate normal distribution N_{m}(0, Σ), generated in R, where the m×m covariate matrix Σ is built from the pairwise correlations ρ_{ij}. Each ρ_{ij} is assigned by randomly choosing a number from 0.1 to 0.55 in steps of 0.05, and the eigenvalues are calculated to check that the covariance matrix is positive definite. We then generated the response variable Y from the node terms X_{i} and edge terms E_{ij} that differ between the two groups (cases and controls), with β_{i} and γ_{ij} denoting the corresponding effect sizes on Y; β_{i} = 0 and γ_{ij} = 0 correspond to no effect.
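The node-data generation for scenario 1 can be sketched as follows. This is an illustration only: it assumes unit variances (so the covariance matrix equals the correlation matrix), uses a toy five-node edge list rather than the KEGG pathway, and checks positive definiteness by attempting a Cholesky factorization instead of inspecting eigenvalues. The function names are hypothetical.

```python
import math
import random


def cholesky(S):
    """Return the lower-triangular Cholesky factor, or None if S is not positive definite."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:
                    return None
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def random_correlation(m, edges, rng, max_tries=1000):
    """Draw edge correlations from 0.10, 0.15, ..., 0.55 until the matrix is positive definite."""
    grid = [0.10 + 0.05 * k for k in range(10)]
    for _ in range(max_tries):
        S = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
        for i, j in edges:
            S[i][j] = S[j][i] = rng.choice(grid)
        L = cholesky(S)
        if L is not None:
            return S, L
    raise ValueError("no positive definite matrix found")


def sample_mvn(L, n, rng):
    """Draw n vectors from N(0, L L^T) by transforming independent standard normals."""
    m = len(L)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(m)]
        rows.append([sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)])
    return rows


def sample_correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(7)
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]  # hypothetical toy network
Sigma, L = random_correlation(5, toy_edges, rng)
data = sample_mvn(L, 4000, rng)
r01 = sample_correlation([row[0] for row in data], [row[1] for row in data])
print(f"target rho(0,1) = {Sigma[0][1]:.2f}, sample estimate = {r01:.2f}")
```

Redrawing the correlations whenever the factorization fails plays the same role as the eigenvalue check described above.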

We further considered three other patterns of nonlinear relationships between the network nodes (scenarios 2 to 4), for example X_{j} = (X_{i})^{2} in scenario 4. The data were generated based on the pre-defined nonlinear relationship. For instance, if we assign the sine relationship between node X_{4} and node X_{10}, then X_{10} = α*sin(X_{4}) + ε, where the parameter α represents the nonlinear connection strength between X_{4} and X_{10}. Note that the nonlinear sine relationship between X_{4} and X_{10} can be depicted by the linear relationship between sin(X_{4}) and X_{10}. We set E_{4,10} = α*sin(X_{4})*X_{10} to generate the response.
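A small sketch may help show why a linear correlation can miss such nonlinear edges. It uses the quadratic pattern of scenario 4; the connection strength α = 1, the noise standard deviation 0.5 and the sample size 5000 are assumptions chosen for this example and are not taken from the study.

```python
import math
import random


def pearson(a, b):
    """Pearson product-moment correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(2)
alpha = 1.0  # assumed nonlinear connection strength
x_i = [rng.gauss(0, 1) for _ in range(5000)]
x_j = [alpha * v ** 2 + 0.5 * rng.gauss(0, 1) for v in x_i]  # X_j = alpha * X_i^2 + noise

r_raw = pearson(x_i, x_j)                           # linear correlation of the raw nodes
r_transformed = pearson([v ** 2 for v in x_i], x_j)  # correlation after transforming X_i
print(f"corr(X_i, X_j)   = {r_raw:.3f}")
print(f"corr(X_i^2, X_j) = {r_transformed:.3f}")
```

The raw correlation is close to zero even though X_j is determined by X_i up to noise, whereas the correlation with the transformed node is strong. This mirrors the remark that a nonlinear relationship between two nodes can be depicted by a linear relationship after transforming one of them.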

For each scenario, 1000 replicates were used to evaluate the performance of type I error and power under different sample sizes (300, 400, 500, 600, 1000). We further designed four other scenarios under the same settings as above, except that the changing node and edges are fixed rather than randomly selected for each replicate.
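The replicate-based evaluation can be sketched generically: the empirical type I error is the rejection rate when data are simulated under the null, and the empirical power is the rejection rate when an effect is present. To keep the example short, a one-sample z-test on a mean stands in for the network tests; the effect size of 0.1 and the helper names are assumptions for illustration only.

```python
import math
import random


def z_test_pvalue(sample):
    """Two-sided z-test of mean 0 with known unit variance (a stand-in test)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(simulate, test, n_rep=1000, alpha=0.05):
    """Fraction of replicates whose p value falls below alpha."""
    return sum(test(simulate()) < alpha for _ in range(n_rep)) / n_rep


rng = random.Random(3)


def draw(mean, n):
    """Simulate one replicate of n observations with the given mean."""
    return [rng.gauss(mean, 1) for _ in range(n)]


type1, power = {}, {}
for n in (300, 600, 1000):
    type1[n] = rejection_rate(lambda: draw(0.0, n), z_test_pvalue)  # null: no effect
    power[n] = rejection_rate(lambda: draw(0.1, n), z_test_pvalue)  # assumed effect 0.1
    print(f"n = {n:4d}  type I error = {type1[n]:.3f}  power = {power[n]:.3f}")
```

With 1000 replicates the Monte Carlo standard error of a rejection rate near 0.05 is about 0.007, so small deviations from the nominal level are expected.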

We first applied PMINR to analyze the gene expression data on lung cancer, available from Gene Expression Omnibus (GEO) with accession number

We then applied PMINR to the gene methylation data from the ROSMAP study, which comprises two parts: the Religious Orders Study (ROS) and the Memory and Aging Project (MAP). The ROS is a longitudinal clinical-pathologic cohort study of aging and Alzheimer’s disease (AD).


Type I error of PMINR, PMNR, RANK and DGCA.


The statistical power of PMINR, PMNR, RANK and DGCA under scenario 1.

The statistical power of PMINR, PMNR, RANK and DGCA under scenario 2.

The statistical power of PMINR, PMNR, RANK and DGCA under scenario 3.

The statistical power of PMINR, PMNR, RANK and DGCA under scenario 4.


Lung cancer network regression of various methods with p values in parentheses.

Method | Edge | Node |

PMINR | ||

PMNR | ||

DGCA | ||

RANK | global network (0.022) | global network (0.022) |


AD network regression of various methods with p values in parentheses.

Method | Edge | Node |

PMINR | ||

PMNR | ||

DGCA | ||

RANK | global network (0.012) | global network (0.012) |

It should be noted that under Bonferroni correction, only

In recognition of the importance of biological networks as in complex diseases (

Findings from the NSCLC dataset are consistent with earlier reports. Increasing expression of

The systemic failure of calmodulin degradation, and thus of Ca(2+)/ calmodulin dependent signaling pathways, may be important in the etiopathogenesis of AD. Both

The apparent limitation in assuming known biological network structure can actually be useful for learning network structure which determines every possible edge with the highest degree of data matching, and a joint probability distribution of network nodes can reflect more than one network structure. In practice, most biologists can roughly describe the specific network for the corresponding biological process, and multiple databases (such as KEGG) help to establish the network structure. The inference of PMINR directly plugs the estimate of inter-node correlation into the regression model and does not account for the uncertainty in that estimate. Such an inference procedure may lead to biased estimates and power loss, especially with smaller sample sizes. The p values in the present study do not account for multiple testing. The node tests and the edge tests are often highly correlated, so it is not straightforward to correct the p values or control the false discovery rate. However, not taking multiple testing into account may make the interpretation of the results unclear, given that the truth is often unknown in practice. It is desirable to develop methods that can calculate the effective number of independent tests to further address the multiple testing issue. In addition, caution should be used in interpreting the estimated individual node and edge effects, given the potential for statistical mediation of effects within the network.

In conclusion, PMI captures the general inter-node correlation pattern in biological networks, and PMINR is powerful and efficient for biological network analysis.

Publicly available datasets were analyzed in this study. The datasets analyzed for this study can be found in the GEO with accession number

ZY conceived the study. JJ and WL contributed to the data analysis. YZ, ML, FX, and JZ contributed to the data interpretation. ZY, WL, and JJ wrote the manuscript with help from JZ. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to thank GEO for providing the lung cancer data, and thank all the participants of the ROSMAP Study. The results published here are in whole or in part based on data obtained from the AMP-AD Knowledge Portal (

The Supplementary Material for this article can be found online at: