EnrichDO: a Global Weighted Model for Disease Ontology Enrichment Analysis

Liang Cheng

College of Bioinformatics Science and Technology, Harbin Medical University

Haixiu Yang

College of Bioinformatics Science and Technology, Harbin Medical University

Hongyu Fu

College of Bioinformatics Science and Technology, Harbin Medical University

2024-09-19

Introduction

Disease Ontology (DO) enrichment analysis is an effective means to discover the associations between genes and diseases. However, most current DO-based enrichment methods were unable to solve the over enriched problem caused by the “true-path” rule. To address this problem, we presents EnrichDO, a double weighted iterative model, which is based on the latest annotations of the human genome with DO terms and integrates the DO graph topology on a global scale. On one hand, to reinforce the saliency of direct gene-DO annotations, different initial weights are assigned to directly annotated genes and indirectly annotated genes, respectively. On the other hand, to detect locally most significant node between the parent and its children, less significant nodes are dynamically down-weighted. EnrichDO exhibits high accuracy that it can identify more specific DO terms, which alleviates the over enriched problem.

EnrichDO encompasses various statistical models and visualization schemes for discovering the associations between genes and diseases from biological big data. Currently uploaded to CRAN, EnrichDO aims to provide a more convenient and effective DO enrichment analysis tool.

library(EnrichDO)
#> 

Weighted DO Enrichment Analysis

EnrichDO presents a double weighted iterative model for DO enrichment analysis. Based on the latest annotations of the human genome with DO terms, EnrichDO can identify locally significant enriched terms by applying different initial weights and dynamic weights for annotated genes and integrating the DO graph topology on a global scale. EnrichDO presents an effective and flexible model, which supplies various statistical testing models and multiple testing correction methods.

doEnrich function

In EnrichDO, we use doEnrich to realize the enrichment analysis of DO by integrating topological properties of DO graph structures.

Result description

In the following example, several genes (demo.data) are randomly selected from the protein-coding genes for analysis. The parameters of doEnrich are default.

demo.data=c(1636,351,102,2932,3077,348,4137,54209,5663,5328,23621,3416,3553)
demo_result<-doEnrich(interestGenes=demo.data)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    13 nodes    2279 genes to be scored
#> LEVEL: 9 54 nodes    6504 genes to be scored
#> LEVEL: 8 130 nodes   9483 genes to be scored
#> LEVEL: 7 198 nodes   11209 genes to be scored
#> LEVEL: 6 220 nodes   12574 genes to be scored
#> LEVEL: 5 198 nodes   12936 genes to be scored
#> LEVEL: 4 103 nodes   12824 genes to be scored
#> LEVEL: 3 30 nodes    11683 genes to be scored
#> LEVEL: 2 5 nodes 8032 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored
show(demo_result)
#> 
#> ------------------------- EnrichResult object -------------------------
#> Method of enrichment:
#>   Global Weighted Model
#>   'hypergeomTest' Statistical model with the 'BH' Multiple hypothesis correction
#> Enrichment cutoff layer: 1
#> interestGenes number: 13
#> 231 DOTerms scored: 231 terms with p < 0.01
#> Parameter setting:
#>   Enrichment cutoff layer: 1
#>   Doterm gene number limit: minGsize 5, maxGsize 5000
#>   Enrichment threshold: 0.01

Running doEnrich will output the terms and total genes involved in each layer of Directed acyclic graph (DAG) to the user. The show method can be used to present the overall results to user.

The result of doEnrich is demo_result which contains enrich, interestGenes, test, method, m, maxGsize, minGsize, delta, traditional, penalize. There are 16 columns of enrich, including:

  • The standard ID corresponding to the DO Term (DOID).

  • the standard name of the DO Term (DOTerm), each DO Term has a unique DOID.

  • We constructed a directed acyclic graph according to the is_a relationship between each node in the DO database, and each DO Term has a corresponding level (level).

  • The DO database stores the parent node of each DO Term (parent.arr) and its number (parent.len). For example, “B-cell acute lymphoblastic leukemia” (DOID:0080638) is_a “acute lymphoblastic leukemia” (DOID:9952) and “lymphoma” (DOID:0060058), then the node “B-cell acute lymphoblastic leukemia” is a child of “acute lymphoblastic leukemia” and “lymphoma”, and the child is a more specific biological classification than its parent.

  • child nodes of the DO Term (child.arr) and their number (child.len).

  • the latest GeneRIF information is used to annotate DO Terms, each DO Term has its corresponding disease-associated genes (gene.arr), and its number (gene.len).

  • Assigning a weight to each gene helps assess the contribution of different genes to DO Terms (weight.arr).

  • The smaller the weights of indirectly annotated genes, the less contribution of these genes in the enrichment analysis.(gene.w).

  • the P-value of the DO Term (p), which arranges the order of enrich, and the value of P-value correction (p.adjust).

  • the genes of interest annotated to this DO Term (cg.arr) and its number (cg.len).

  • the number of genes in the interest gene set (ig.len), this represents the number of genes that are actually used for enrichment analysis.

Generally, a significant P value of the enrichment results should be less than 0.05 or 0.01, indicating a significant association between the interesting gene set and the disease node. In the enrich, the node with the most significant enrichment is DOID:0080832, and the DO Term is “mild cognitive impairment”, with its P-value being 9.22e-16. These results suggests that there is statistical significance between the interesting gene set and the DO Term of mild cognitive impairment.

The data frame doterms contains the information of all DO Terms including DAG structures. You can use showDoTerms to see the details.

head(doterms)
#>        DOID level     gene.arr   weight.arr parent.arr parent.len child.arr
#> 1 DOID:3720    13          595            1  DOID:3721          1          
#> 2 DOID:3722    13          596            1  DOID:3721          1          
#> 3 DOID:4927    13 9788, 74.... 1, 1, 1,....  DOID:4928          1          
#> 4 DOID:5746    13 9354, 66.... 1, 1, 1,....  DOID:3605          1          
#> 5 DOID:7024    13   4583, 1045         1, 1  DOID:4928          1          
#> 6 DOID:7642    13 8289, 45.... 1, 1, 1,....  DOID:4928          1          
#>   child.len gene.len                                   DOTerm
#> 1         0        1              extramedullary plasmacytoma
#> 2         0        1            solitary osseous plasmacytoma
#> 3         0       30                         Klatskin's tumor
#> 4         0       15        ovarian serous cystadenocarcinoma
#> 5         0        2 mucinous intrahepatic cholangiocarcinoma
#> 6         0        5            cholangiolocellular carcinoma
showDoTerms(doterms)
#> 
#> -------------annotation information for DO terms---------------
#> There are 4813 DOTerms with 10 col
#>   DOID: DOterm ID on enrichment
#>   level: the hierarchy of DOterm in the DAG graph
#>   gene.arr: all genes related to the DOterm
#>   weight.arr: gene weights in each node
#>   parent.arr: the parent node of the DOterm
#>   child.arr: child nodes of the DOterm
#>   DOTerm: the standard name of DOterm
#>   .len: represents the corresponding quantity

Application cases of doEnrich function

1.Weighted enrichment analysis with multiple parameters. Each parameter in the following example is suitable for enrichment analysis with weights. You can modify the parameter values as required.

weighted_demo<-doEnrich(interestGenes=demo.data,
                           test="fisherTest",
                           method="holm",
                           m=1,
                           minGsize=10,
                           maxGsize=2000,
                           delta=0.05,
                           penalize=TRUE)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    12 nodes    2278 genes to be scored
#> LEVEL: 9 50 nodes    5376 genes to be scored
#> LEVEL: 8 116 nodes   7751 genes to be scored
#> LEVEL: 7 181 nodes   9463 genes to be scored
#> LEVEL: 6 193 nodes   10144 genes to be scored
#> LEVEL: 5 174 nodes   9756 genes to be scored
#> LEVEL: 4 85 nodes    9088 genes to be scored
#> LEVEL: 3 18 nodes    4599 genes to be scored
#> LEVEL: 2 1 nodes 1605 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored

2.The parameter penalize is used to alleviate the impact of different magnitudes of p-values, default value is TRUE. When set to FALSE, the degree of reduction in weight for non-significant nodes is decreased, resulting in a slight increase in significance for these nodes, i.e., their p-value will be reduced.

penalF_demo<-doEnrich(interestGenes=demo.data, penalize=FALSE)
#>       -- Descending rights test--
#> LEVEL: 13    1 nodes 72 genes to be scored
#> LEVEL: 12    2 nodes 457 genes to be scored
#> LEVEL: 11    3 nodes 907 genes to be scored
#> LEVEL: 10    13 nodes    2279 genes to be scored
#> LEVEL: 9 54 nodes    6504 genes to be scored
#> LEVEL: 8 130 nodes   9483 genes to be scored
#> LEVEL: 7 198 nodes   11209 genes to be scored
#> LEVEL: 6 220 nodes   12574 genes to be scored
#> LEVEL: 5 198 nodes   12936 genes to be scored
#> LEVEL: 4 103 nodes   12824 genes to be scored
#> LEVEL: 3 30 nodes    11683 genes to be scored
#> LEVEL: 2 5 nodes 8032 genes to be scored
#> LEVEL: 1 0 nodes 0 genes to be scored

3.Using the traditional enrichment analysis method, it does not reduce weights according to the DAG structure. Parameters test, method, m, maxGsize and minGsize can be used flexibly.

Tradition_demo<-doEnrich(demo.data, traditional=TRUE)
#>       -- Traditional test--

writeDoTerms function

writeDoTerms can output DOID, DOTerm, level, genes, parents, children, gene.len, parent.len and child.len in the data frame doterms as text. The default file name is “doterms.txt”.

writeDoTerms(doterms,file=file.path(tempdir(), "doterms.txt"))

writeResult function

The writeResult function can output DOID, DOTerm, p, p.adjust, geneRatio, bgRatio and cg in the data frame enrich as text. The default file name is “result.txt”.

geneRatio represents the intersection of the DO term annotated genes with the interest gene set divided by the interest gene set, and bgRatio represents all genes of the DO term divided by the background gene set.

writeResult has four parameters. EnrichResult indicates the enrichment result of doEnrich, file indicates the write address of a file. The parameter Q (and P) indicates that a DO term is output only when p.adjust (and p value) is less than or equal to Q (and P). The default values for P and Q are 1.

writeResult(EnrichResult = demo_result,file=file.path(tempdir(), "result.txt"), Q=1, P=1)

Visualization of enrichment results

EnrichDO provides four methods to visualize enrichment results, including bar plot (drawBarGraph), bubble plot (drawPointGraph), tree plot (drawGraphviz) and heatmap (drawHeatmap), which can show the research results more concisely and intuitively. Pay attention to the threshold setting for each visual method, if the threshold is too low, the display is insufficient.

drawBarGraph function

drawBarGraph can draw the top n nodes with the most significant p-value as bar chart, and the node’s p-value is less than delta (By default, n is 10 and delta is 1e-15).

drawBarGraph(EnrichResult=demo_result, n=10, delta=0.05)
bar plot

bar plot

drawPointGraph function

drawPointGraph can draw the top n nodes with the most significant p-value as bubble plot, and the node’s p-value is less than delta (By default, n is 10 and delta is 1e-15).

drawPointGraph(EnrichResult=demo_result, n=10, delta=0.05)
point plot

point plot

drawGraphViz function

drawGraphViz draws the DAG structure of the most significant n nodes, and labelfontsize can set the font size of labels in nodes (By default, n is 10 and labelfontsize is 14). Each node has a corresponding disease name displayed.

In addition, the drawGraphViz function can also display the P-value of each node in the enrichment analysis (pview=TRUE), and the number of overlapping genes of each DO term and interest gene set (numview=TRUE).


drawGraphViz(EnrichResult=demo_result, n=10, numview=FALSE, pview=FALSE, labelfontsize=17)
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:6] "DOID:680" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:6] "DOID:0050890" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" ...
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:5] "DOID:936" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:7] "DOID:649" "DOID:0050117" "DOID:936" "DOID:4" "DOID:331" ...
#>  chr [1:4] "DOID:0080599" "DOID:934" "DOID:0050117" "DOID:4"
#>  chr [1:4] "DOID:2468" "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:2] "DOID:0014667" "DOID:4"
tree plot

tree plot

drawHeatmap function

drawHeatmap function visualizes the strength of the relationship between the top DOID_n nodes from enrichment results and the genes whose weight sum ranks the top gene_n in these nodes. And the gene must be included in the gene of interest. readable indicates whether the gene is displayed as its symbol.

drawHeatmap also provides additional parameters from the pheatmap function, which you can set according to your needs. Default DOID_n is 10, gene_n is 50, fontsize_row is 10, readable is TRUE.

drawHeatmap(interestGenes=demo.data,
            EnrichResult=demo_result,
            gene_n=10,
            fontsize_row=8,
            readable=TRUE)
#> gene symbol conversion result:
#> 
#> 'select()' returned 1:1 mapping between keys and columns
heatmap

heatmap

convenient drawing

Draw(drawBarGraph ,drawPointGraph ,drawGraphViz) from writeResult output files, so you don’t have to wait for the algorithm to run.

#Firstly, read the wrireResult output file,using the following two lines
data <- read.delim(file.path(system.file("examples", package="EnrichDO"), "result.txt"))
enrich <- convDraw(resultDO=data)
#> The enrichment results you provide are stored in enrich
#> Now you can use the drawing function

#then, Use the drawing function you need
drawGraphViz(enrich=enrich)    #Tree diagram
#>  chr [1:6] "DOID:680" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:6] "DOID:0050890" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" ...
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:2] "DOID:150" "DOID:4"
#>  chr [1:4] "DOID:3324" "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:10] "DOID:0060004" "DOID:3213" "DOID:331" "DOID:438" "DOID:863" ...
#>  chr [1:4] "DOID:2468" "DOID:1561" "DOID:150" "DOID:4"

drawPointGraph(enrich=enrich, delta = 0.05)  #Bubble diagram

drawBarGraph(enrich=enrich, delta = 0.05)    #Bar plot

Session information

sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                               
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] EnrichDO_0.1-1
#> 
#> loaded via a namespace (and not attached):
#>   [1] fgsea_1.24.0           colorspace_2.1-0       ggtree_3.10.1         
#>   [4] gson_0.1.0             qvalue_2.30.0          XVector_0.38.0        
#>   [7] fs_1.6.3               aplot_0.2.2            rstudioapi_0.16.0     
#>  [10] farver_2.1.1           graphlayouts_1.1.1     hash_2.2.6.3          
#>  [13] ggrepel_0.9.5          bit64_4.0.5            AnnotationDbi_1.64.1  
#>  [16] fansi_1.0.6            scatterpie_0.2.1       codetools_0.2-20      
#>  [19] splines_4.2.2          cachem_1.0.8           GOSemSim_2.28.1       
#>  [22] knitr_1.47             polyclip_1.10-6        jsonlite_1.8.8        
#>  [25] GO.db_3.16.0           png_0.1-8              pheatmap_1.0.12       
#>  [28] graph_1.76.0           ggforce_0.4.2          readr_2.1.5           
#>  [31] compiler_4.2.2         httr_1.4.7             Matrix_1.6-5          
#>  [34] fastmap_1.1.1          lazyeval_0.2.2         cli_3.6.2             
#>  [37] tweenr_2.0.3           htmltools_0.5.8        tools_4.2.2           
#>  [40] igraph_2.0.3           gtable_0.3.4           glue_1.7.0            
#>  [43] GenomeInfoDbData_1.2.9 reshape2_1.4.4         dplyr_1.1.4           
#>  [46] fastmatch_1.1-4        Rcpp_1.0.12            enrichplot_1.18.4     
#>  [49] Biobase_2.58.0         jquerylib_0.1.4        vctrs_0.6.5           
#>  [52] Biostrings_2.66.0      ape_5.7-1              nlme_3.1-164          
#>  [55] ggraph_2.2.1           xfun_0.44              stringr_1.5.1         
#>  [58] lifecycle_1.0.4        clusterProfiler_4.10.0 DOSE_3.24.2           
#>  [61] org.Hs.eg.db_3.18.0    zlibbioc_1.44.0        MASS_7.3-60.0.1       
#>  [64] scales_1.3.0           tidygraph_1.3.1        vroom_1.6.5           
#>  [67] hms_1.1.3              parallel_4.2.2         RColorBrewer_1.1-3    
#>  [70] yaml_2.3.8             memoise_2.0.1          gridExtra_2.3         
#>  [73] ggplot2_3.5.1          ggfun_0.1.4            HDO.db_0.99.1         
#>  [76] yulab.utils_0.1.4      sass_0.4.9             stringi_1.8.3         
#>  [79] RSQLite_2.3.6          highr_0.11             S4Vectors_0.42.1      
#>  [82] tidytree_0.4.6         BiocGenerics_0.44.0    BiocParallel_1.32.6   
#>  [85] GenomeInfoDb_1.34.9    rlang_1.1.3            pkgconfig_2.0.3       
#>  [88] bitops_1.0-7           evaluate_0.23          lattice_0.22-6        
#>  [91] purrr_1.0.2            labeling_0.4.3         treeio_1.26.0         
#>  [94] patchwork_1.2.0        cowplot_1.1.3          shadowtext_0.1.3      
#>  [97] bit_4.0.5              tidyselect_1.2.1       plyr_1.8.9            
#> [100] magrittr_2.0.3         R6_2.5.1               IRanges_2.32.0        
#> [103] generics_0.1.3         DBI_1.2.2              pillar_1.9.0          
#> [106] withr_3.0.0            KEGGREST_1.38.0        RCurl_1.98-1.14       
#> [109] tibble_3.2.1           crayon_1.5.2           utf8_1.2.4            
#> [112] tzdb_0.4.0             rmarkdown_2.26         viridis_0.6.5         
#> [115] grid_4.2.2             data.table_1.15.4      blob_1.2.4            
#> [118] Rgraphviz_2.42.0       digest_0.6.35          tidyr_1.3.1           
#> [121] gridGraphics_0.5-1     stats4_4.2.2           munsell_0.5.0         
#> [124] viridisLite_0.4.2      ggplotify_0.1.2        bslib_0.7.0