Goals
1) Make the Sage coexpression software runnable by any data analyst in R.
Document the analysis steps, package functions, and adjustable parameters. Write start-to-finish vignette(s) with sufficient detail to guide a new analyst through the software, including parameter settings and other intermediate decisions. Refactoring the software to enhance readability or to make decision points and adjustable parameters explicit is within scope for this goal.
2) Clearly explain the methodology underlying the coexpression algorithms.
This includes contrasting the Sage algorithms with the separately published “WGCNA” algorithms and explaining the rationale for the differences and the decision process for choosing which to use.
3) Make the Sage coexpression software publicly available.
The code must be available through a standard R distribution channel (CRAN or Bioconductor). An option is to merge Sage-specific algorithms or steps into WGCNA (if the two are sufficiently overlapping).
4) Make the Sage coexpression software perform well, on commonly available hardware.
Currently, special high-capacity hardware is required to run the Sage coexpression software on commonly encountered data set sizes. The goal is to optimize the code to run on commonly available hardware. At a minimum, the code should be runnable on the 68 GB, quad-core servers available through Amazon Web Services. Ideally, the code would also be runnable on a “heavy” Sage laptop (8 GB total RAM).
Strategy
The steps for Sage Coexpression are:
1) Compute correlation coefficient matrix.
2) Determine optimal value for the scale free exponent, beta, and collect regression statistics.
3) Compute the topological overlap matrix (TOM).
4) Perform hierarchical clustering of genes, based on TOM.
5) Detect and label modules in TOM, using "Dynamic Tree Cutting".
6) Merge modules based on hierarchical clustering of representative genes.
7) Cluster samples hierarchically.
8) Compute intra/inter-module network statistics, per gene.
9) Produce diagnostic plots (dendrograms, heat maps, statistical scatter plots).
10) Produce tabular output of module membership, network statistics, and scale-free regression statistics.
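Steps 1 and 3 can be illustrated with a minimal sketch, written in plain Python so that it is self-contained. It is not the Sage or UCLA-WGCNA code. It assumes an unsigned network with soft-thresholded adjacency |cor|^beta and the standard unsigned topological overlap formula; the function names, the toy data, and the value beta = 6 are hypothetical, and no gene is assumed to be constant across samples.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(expr):
    """Step 1: gene-by-gene correlation; expr[g] holds gene g across samples."""
    n = len(expr)
    return [[1.0 if i == j else pearson(expr[i], expr[j]) for j in range(n)]
            for i in range(n)]

def adjacency(cor, beta):
    """Soft thresholding: a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [[0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
            for i in range(n)]

def topological_overlap(adj):
    """Step 3: TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # l_ij: adjacency shared through every third gene u
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            w = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = w
    return tom

# Toy data: 3 genes measured in 5 samples (hypothetical values).
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.9],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
cor = correlation_matrix(expr)
tom = topological_overlap(adjacency(cor, beta=6))
# Step 4 would cluster genes on the dissimilarity 1 - TOM.
dissimilarity = [[1.0 - v for v in row] for row in tom]
```

Both the correlation and the TOM computation touch every pair of genes, and the TOM loop additionally sums over every third gene, which is consistent with these two steps dominating the run time.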
Elsewhere (http://sagebionetworks.jira.com/wiki/display/SCICOMP/Coexpression+Evaluation) we have shown that, for realistic dataset sizes, the vast majority of time is spent in steps 1 and 3. Steps 1, 3, and 4 are identical in the Sage and UCLA-WGCNA code bases, but the UCLA-WGCNA code uses compiled/optimized software for these three steps. Further, we have seen that steps 2 and 5 (with the right parameter choices in the UCLA package) produce extremely similar results. (Note: the observed similarity is no surprise, since the two code bases are forks of an original set of algorithms and have evolved separately for approximately 6 years.)
Our strategy, therefore, is:
- leverage the UCLA-WGCNA package for the "common" steps 1 through 5, gaining significant performance
- provide the user a parameter choice at step 5, to do "tree cutting" in the manner of the Sage algorithm, or in that of the UCLA-WGCNA algorithm
- provide two algorithms for step 6 (module merging), allowing a user to choose the Sage or UCLA-WGCNA algorithm
- leverage the UCLA-WGCNA dendrogram/module plotting algorithm in step 9
- maintain the Sage algorithms for the Sage-specific post-processing, i.e. steps 7 and 8 and the heat maps in step 9
External dependencies
WGCNA::cor -- the compiled/accelerated Pearson correlation computation
WGCNA::pickSoftThreshold -- optimal choice of scale free exponent
WGCNA::TOMdist -- the compiled/accelerated TOM computation
flashClust::flashClust -- the compiled/accelerated hierarchical clustering computation
WGCNA::scaleFreePlot -- scatter plot of network connectivity, with regression line
dynamicTreeCut::cutreeDynamic -- tree cutting / module determination
WGCNA::plotDendroAndColors -- graphic function to plot dendrogram with colored modules aligned underneath
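The choice of the scale free exponent can be sketched as follows. This is a simplified illustration of the scale-free topology criterion, not the WGCNA::pickSoftThreshold implementation: it bins gene connectivity, regresses log10 of the bin frequency on log10 of the mean connectivity per bin, and picks the smallest candidate beta whose fit has a negative slope and an R^2 above a cutoff. The function names, the number of bins, and the cutoff of 0.85 are assumptions made for this example.

```python
import math

def linear_fit(xs, ys):
    """Least-squares slope and R^2 of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def scale_free_fit(connectivity, n_bins=10):
    """Regress log10 p(k) on log10 k over equal-width bins of connectivity."""
    lo, hi = min(connectivity), max(connectivity)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all genes equally connected: everything lands in bin 0
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for k in connectivity:
        b = min(int((k - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += k
    xs, ys = [], []
    for c, s in zip(counts, sums):
        if c > 0 and s > 0:
            xs.append(math.log10(s / c))                  # mean k in the bin
            ys.append(math.log10(c / len(connectivity)))  # bin frequency
    if len(xs) < 2:
        return 0.0, 0.0  # not enough occupied bins to fit a line
    return linear_fit(xs, ys)

def pick_beta(cor, candidates, r2_cut=0.85):
    """Smallest candidate beta with negative slope and R^2 >= r2_cut, else None."""
    n = len(cor)
    for beta in candidates:
        k = [sum(abs(cor[i][j]) ** beta for j in range(n) if j != i)
             for i in range(n)]
        slope, r2 = scale_free_fit(k)
        if slope < 0 and r2 >= r2_cut:
            return beta, slope, r2
    return None

# Candidate grid in steps of 0.5 (hypothetical choice of range).
candidates = [x / 2 for x in range(2, 41)]
```

The slope and R^2 returned for the chosen beta play the role of the regression statistics collected in step 2.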
Sage software dependencies
module merging, by analyzing most-highly-connected genes in each module
fixed-cluster-number tree cutting for sample module definition
computation of within- and between- module per-gene connectivity statistics
heatmap generation for correlation and TOM matrices
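The per-gene connectivity statistics and the notion of a most-highly-connected gene can be sketched as below. The exact statistics computed by the Sage code are not specified here, so this sketch assumes the simplest definitions: within-module connectivity is the sum of a gene's adjacencies to genes in its own module, and between-module connectivity is the remainder of its total connectivity. The function names, module labels, and edge weights are hypothetical.

```python
def module_connectivity(adj, modules):
    """Per-gene total, within-module, and between-module connectivity."""
    n = len(adj)
    stats = []
    for i in range(n):
        total = sum(adj[i][j] for j in range(n) if j != i)
        within = sum(adj[i][j] for j in range(n)
                     if j != i and modules[j] == modules[i])
        stats.append({"gene": i, "module": modules[i], "k_total": total,
                      "k_within": within, "k_between": total - within})
    return stats

def hub_genes(stats):
    """Gene with the largest within-module connectivity in each module."""
    best = {}
    for s in stats:
        current = best.get(s["module"])
        if current is None or s["k_within"] > current["k_within"]:
            best[s["module"]] = s
    return {module: s["gene"] for module, s in best.items()}

# Toy symmetric adjacency built from a hypothetical edge list.
edges = {(0, 1): 0.9, (0, 2): 0.5, (1, 2): 0.4, (3, 4): 0.8,
         (0, 3): 0.1, (0, 4): 0.1, (1, 3): 0.2, (2, 3): 0.1, (2, 4): 0.3}
adj = [[0.0] * 5 for _ in range(5)]
for (i, j), w in edges.items():
    adj[i][j] = adj[j][i] = w

modules = ["blue", "blue", "blue", "brown", "brown"]
stats = module_connectivity(adj, modules)
hubs = hub_genes(stats)
```

A module-merging step could then compare modules through their hub genes, for example by clustering the hubs' expression profiles; how the Sage code does this in detail is not shown here.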
The Package
The source code for the created package is available on our Atlassian-hosted SVN repository, under the SCICOMP project:
http://sagebionetworks.jira.com/source/browse/SCICOMP/trunk/Coexpression/SageBionetworksCoex
Package functions
The main 'points of entry' into the package are:
performCoexFromFiles - run the entire package using a file of gene expression data as input
performCoexpressionAnalysis - run the analysis portion of Coexpression, taking a data frame as input
clusterGenes - run the rote, time-consuming portion of Coexpression, taking a data frame as input
modulesFromGeneTree - choose modules from a gene dendrogram, taking the output of 'clusterGenes' as input
clusterSamples, intraModularStatistics - analysis steps auxiliary to module determination
createDiagnosticPlots - create dendrogram, heatmap plots, etc., from the results of the coexpression analysis
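The idea behind fixed-cluster-number tree cutting, as used for sample clustering, can be sketched as follows. This is not the code of clusterSamples; it assumes average linkage (the linkage actually used by the package is not stated here) and stops merging once the requested number of clusters remains, which yields the same partition as cutting the full dendrogram into that many clusters. The function name and the toy distances are hypothetical.

```python
def average_linkage_clusters(dist, n_clusters):
    """Agglomerate with average linkage until n_clusters clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def between(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = between(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy distance matrix for 4 samples (hypothetical values).
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
labels = average_linkage_clusters(dist, n_clusters=2)
```

This naive loop is only meant to show the logic; compiled routines such as flashClust exist precisely because this step is expensive on real data.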
Package features
Comparison to Sage code base
Correlation computation is faster, with identical results.
TOM computation is faster, with identical results.
Hierarchical clustering is faster, with identical results.
Scale-free exponent (beta) determination is similar, with very similar results and regression statistics.
"Dynamic tree cutting" algorithm is the same, with very similar results.
Diagnostic plot set is reduced from 12 to 8, omitting redundant plots.
Additional Features
Option to do tree cutting and/or subsequent merging by UCLA-WGCNA algorithm or by 'Sage classic' algorithm.
Separation of rote, time-consuming steps (correlation, TOM computation) from tree cutting.
Separation of analysis from plotting.
Separation of analysis from file system, to facilitate Synapse integration.
Evaluation
Dataset | # Probes | # Samples | Sage Time | Sage Space | Package Time | Package Space | Sage beta | Package beta | Gene trees same? | Module difference (%) |
---|---|---|---|---|---|---|---|---|---|---|
Female mouse liver | 3600 | 135 | --- | --- | --- | --- | 6.5 | 6.5 | YES | 3.7% |
Cranio | 2534 | 249 | --- | --- | --- | --- | 4.0 | 4.5 | YES | 44% / 0.9%* |
Methylation, top 5K genes | 5000 | 555 | --- | --- | --- | --- | 8.5 | 8.5 | YES | 0 |
Colon cancer, top 5K genes | 5000 | 322 | --- | --- | --- | --- | 3 | 3.5 | YES | 11% / 0.5%* |
Human liver cohort, top 5K genes | 5000 | 427 | --- | --- | --- | --- | 6.5 | 5.5 | YES | 33% / 0* |
Methylation (full set)** | 27,578 | 555 | | | | | | | | |
Colon cancer, top 40K genes** | 40,000 | 322 | | | | | | | | |
Human liver cohort** | 40,102 | 427 | | | | | | | | |
* With beta in the Package forced to match beta in the Sage code.
** These were run on an Amazon Elastic Compute Cloud (EC2) "High-Memory Quadruple Extra Large" Unix server with 68 GB of RAM.
Note: We skip the performance evaluation for the small data sets (shown as --- in the time and space columns).
The details of the differences summarized in the table can be found here: http://sagebionetworks.jira.com/wiki/display/SCICOMP/Package+Comparison+Details
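A rough calculation shows why the full-size data sets need high-memory hardware. The sketch below estimates the size of a single dense probes-by-probes matrix (such as the correlation matrix or the TOM), assuming 8-byte double-precision values, which is how R stores numeric matrices. How many such matrices are held in memory at once is not stated here, so the per-matrix figure is a lower bound; the function name is hypothetical.

```python
def dense_matrix_gib(n_probes, bytes_per_value=8):
    """Size in GiB of one dense n_probes x n_probes matrix of doubles."""
    return n_probes * n_probes * bytes_per_value / 2 ** 30

# Probe counts taken from the evaluation table.
for name, n in [("Top 5K genes", 5000),
                ("Methylation (full set)", 27578),
                ("Human liver cohort", 40102)]:
    print(f"{name}: {dense_matrix_gib(n):.2f} GiB per matrix")
```

Under these assumptions a 5,000-probe matrix takes well under 1 GiB, while a single 40,102-probe matrix takes roughly 12 GiB, already more than the 8 GB of a laptop before any working copies are made.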
Goals, Revisited
Goal | How we met it |
---|---|
Make the Sage coexpression software runnable by any data analyst in R | Created easy to use, documented R package. (TODO: Vignette, training class) |
Clearly explain the methodology underlying the coexpression algorithms. | Included links to literature in the R package. |
Make the Sage coexpression software publicly available. | TBD |
Make the Sage coexpression software perform well, on commonly available hardware. | Used UCLA's accelerated algorithms. Profiled datasets of up to 27,000 genes on inexpensive, high capacity cloud resources. |