Using the Package
Get the source from GitHub and build the package:

```
# Install the dependencies, from R:
> install.packages(pkgs=c("WGCNA", "flashClust", "dynamicTreeCut"))

# Now, from the command line, clone the GitHub repository:
git clone https://github.com/Sage-Bionetworks/SageBionetworksCoex.git

# Again from the command line, build and install the package:
R CMD INSTALL SageBionetworksCoex
```
In R, load the library:
```
> library(SageBionetworksCoex)
```
For guidance on using the package:
```
?SageBionetworksCoex
```
Goals
1) Make the Sage coexpression software runnable by any data analyst in R.
Document the analysis steps, package functions, and adjustable parameters. Write start-to-finish vignette(s) with sufficient detail to guide a new analyst through the software, including making parameter settings and other intermediate decisions. Refactoring the software to enhance readability and/or to make decision points and adjustable parameters explicit is within scope for this goal.
2) Clearly explain the methodology underlying the coexpression algorithms.
This includes contrasting the Sage algorithms with the separately published “WGCNA” algorithms and explaining the rationale for the differences and the decision process for choosing which to use.
3) Make the Sage coexpression software publicly available.
The code must be available through a standard R distribution channel (CRAN or Bioconductor). An option is to merge Sage-specific algorithms or steps into WGCNA (if the two are sufficiently overlapping).
4) Make the Sage coexpression software perform well on commonly available hardware.
Currently, special high-capacity hardware is required to run the Sage coexpression software on commonly encountered data set sizes. The goal is to optimize the code to run on commonly available hardware; at a minimum, the code should be runnable on the 68 GB, quad-core servers available through Amazon Web Services. Ideally, the code would be runnable on a "heavy" Sage laptop (8 GB total RAM).
Strategy
The steps for Sage Coexpression are:
1. Compute the correlation coefficient matrix.
2. Determine the optimal value for the scale-free exponent, beta, and collect regression statistics.
3. Compute the topological overlap matrix (TOM).
4. Perform hierarchical clustering of genes, based on the TOM.
5. Detect and label modules in the TOM, using "dynamic tree cutting".
6. Merge modules based on hierarchical clustering of representative genes.
7. Cluster samples hierarchically.
8. Compute intra-/inter-module network statistics, per gene.
9. Produce diagnostic plots (dendrograms, heat maps, statistical scatter plots).
10. Produce tabular output of module membership, network statistics, and scale-free regression statistics.
...
- leverage the UCLA-WGCNA package for the "common" steps, 1 through 5, gaining significant performance improvements (a sketch of these shared steps, using the published WGCNA functions, follows this list)
- provide the user a parameter choice at step 5, to do "tree cutting" in the manner of the Sage algorithm or in that of the UCLA-WGCNA algorithm
- provide two algorithms for step 6 (module merging), allowing a user to choose the Sage or UCLA-WGCNA algorithm
- leverage the UCLA-WGCNA dendrogram/module plotting algorithm in step 9
- maintain the Sage algorithms for the Sage-specific post-processing, i.e. the sample clustering in step 7, the statistics in step 8, and the heat maps in step 9, accelerating step 8 with compiled code
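To make the division of labor concrete, here is a minimal sketch of steps 1 through 6 using the published WGCNA, flashClust, and dynamicTreeCut functions that the package leverages. This is not the SageBionetworksCoex API; the expression matrix datExpr (samples in rows, probes in columns) and the specific parameter values are assumptions chosen for illustration.

```
# Sketch of steps 1-6 via the underlying WGCNA/flashClust/dynamicTreeCut functions.
# 'datExpr' is an assumed expression matrix: samples in rows, probes in columns.
library(WGCNA)
library(flashClust)
library(dynamicTreeCut)

# Steps 1-2: correlations and scale-free exponent (beta) selection with regression statistics.
# If no candidate power reaches the default R^2 cut, powerEstimate is NA (see the fallback sketch later).
sft  <- pickSoftThreshold(datExpr, powerVector = 1:20)
beta <- sft$powerEstimate

# Step 3: adjacency and topological overlap matrix (TOM)
adj     <- adjacency(datExpr, power = beta)
tom     <- TOMsimilarity(adj)
dissTOM <- 1 - tom

# Step 4: hierarchical clustering of genes on the TOM dissimilarity
geneTree <- flashClust(as.dist(dissTOM), method = "average")

# Step 5: dynamic tree cutting to define and label modules
moduleLabels <- cutreeDynamic(dendro = geneTree, distM = dissTOM, deepSplit = 2)
moduleColors <- labels2colors(moduleLabels)

# Step 6: merge modules whose representative (eigengene) profiles are highly correlated
merged <- mergeCloseModules(datExpr, moduleColors, cutHeight = 0.25)
```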
External dependencies
- WGCNA::cor -- the compiled/accelerated Pearson correlation computation
...
- module merging, by analyzing the most highly connected genes in each module
- fixed-cluster-number tree cutting for sample module definition
- computation of within- and between-module per-gene connectivity statistics (but using additional compiled code to dramatically speed up analysis of large modules)
- heatmap generation for correlation and TOM matrices
- reduction of the scale-free regression threshold (R^2) when no solution is found (see the sketch below)
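To illustrate the last point, the snippet below shows one way such a fallback could work: retry the scale-free fit with progressively lower R^2 cuts until pickSoftThreshold returns a usable exponent. This is a hedged sketch, not the package's actual logic; the chooseBeta helper, the threshold schedule, and datExpr are assumptions.

```
# Hedged sketch of a falling R^2 threshold; not the package's exact logic.
library(WGCNA)

chooseBeta <- function(datExpr, r2Cuts = c(0.90, 0.85, 0.80, 0.75)) {
  for (r2 in r2Cuts) {
    sft <- pickSoftThreshold(datExpr, RsquaredCut = r2, powerVector = 1:20, verbose = 0)
    if (!is.na(sft$powerEstimate)) {
      # A beta satisfying this R^2 cut was found; return it with the regression statistics.
      return(list(beta = sft$powerEstimate, r2Used = r2, fit = sft$fitIndices))
    }
  }
  stop("No scale-free exponent found, even at the lowest R^2 threshold")
}
```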
The Package
The source code for the created package is available on our Atlassian-hosted SVN repository, under the SCICOMP project:
...
- Correlation computation is faster, with identical results (a quick check appears below this list).
- TOM computation is faster, with identical results.
- Hierarchical clustering is faster, with identical results.
- Scale-free exponent (beta) determination is similar, with very similar results and regression statistics.
- The "dynamic tree cutting" algorithm is the same, with very similar results.
- Intramodular statistics are faster, with identical results.
- The diagnostic plot set is reduced from the original 12, omitting redundant plots.
- Sample clustering was omitted, having been deemed unnecessary.
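As a quick, illustrative check of the first point, WGCNA's compiled correlation can be compared directly against stats::cor; the toy matrix dimensions below are arbitrary.

```
# Illustrative check: WGCNA's compiled correlation matches stats::cor,
# while running noticeably faster on large inputs.
library(WGCNA)

set.seed(1)
x <- matrix(rnorm(200 * 1000), nrow = 200)   # 200 samples x 1000 probes (toy size)

system.time(c1 <- stats::cor(x))
system.time(c2 <- WGCNA::cor(x))
all.equal(c1, c2)   # TRUE, up to numerical tolerance
```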
Additional Features
- Option to do tree cutting and/or subsequent merging by the UCLA-WGCNA algorithm or by the 'Sage classic' algorithm.
- Separation of the rote, time-consuming steps (correlation, TOM computation) from tree cutting.
- Separation of analysis from plotting.
- Separation of analysis from the file system, to facilitate Synapse integration.
Performance and Limits
Coexpression has O(n^3) ("cubic") time complexity and O(n^2) ("quadratic") space complexity, where n is the number of probes in the dataset. The time complexity is primarily due to the TOM computation. The space complexity is due to the need to hold the n x n correlation and TOM matrices in memory.

The R language has an inherent limit of about 2 billion (2^31 - 1) elements on the size of a vector or matrix (http://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html), which means the maximum size of a square matrix is 46340 x 46340. A square matrix in R requires 8n^2 bytes, since values are double precision and R has no provision for single-precision representation of floating point values; at the maximum size such a matrix requires about 16 GiB (17.2 GB) of contiguous memory. The coexpression package uses two n x n matrices (as mentioned above) and at times creates temporary variables equal to, or a large fraction of, the n x n matrix size. Therefore, to support maximum dataset sizes, machines used to run coexpression should have dozens of GB of memory. In the empirical evaluations summarized below, we use Amazon Web Services' "High-Memory Quadruple Extra Large" machines, which have 68 GB of RAM and cost (at the time this is written) $2.88/hour (http://aws.amazon.com/ec2/instance-types/).
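The memory arithmetic above is easy to reproduce in R; the snippet below is purely illustrative.

```
# Bytes needed for one n x n matrix of doubles, at R's single-object element limit.
n <- 46340                      # largest n with n^2 below R's 2^31 - 1 element limit
n^2 < 2^31 - 1                  # TRUE
bytesPerMatrix <- 8 * n^2       # doubles are 8 bytes; R has no single-precision storage
bytesPerMatrix / 2^30           # ~16 GiB for one n x n matrix
2 * bytesPerMatrix / 2^30       # ~32 GiB just for the correlation and TOM matrices
```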
Evaluation
Datasets were run through the original Sage coexpression code and through the newly created package to compare performance and similarity of results.
Similarity questions: Is the scale-free exponent 'beta' the same? Are the dendrograms the same? How similar are the resultant modules?
Performance questions: For datasets having more than 18,000 probes, how much time and space does each algorithm use?
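To make the module-similarity question concrete, the sketch below cross-tabulates two sets of per-gene module labels. It is not the partition-comparison metric linked in the table footnotes; compareModules, colorsSage, and colorsPackage are illustrative names.

```
# 'colorsSage' and 'colorsPackage' are assumed character vectors of per-probe
# module labels from the two runs, in the same probe order.
compareModules <- function(colorsSage, colorsPackage) {
  tab <- table(Sage = colorsSage, Package = colorsPackage)
  # Fraction of genes falling outside the best-matching package module
  # for their Sage module (a simple, greedy mismatch measure).
  mismatch <- 1 - sum(apply(tab, 1, max)) / length(colorsSage)
  list(crossTab = tab, mismatchFraction = mismatch)
}
```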
Dataset | # Probes | # Samples | Sage Time | Sage Space | Package Time | Package Space | Sage beta | Package beta | Gene trees same, independent beta? | Gene trees same, same beta? | Module difference****, independent beta | Module difference****, same beta |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Female mouse liver | 3,600 | 135 | --- | --- | --- | --- | 6.5 | 6.5 | TRUE | TRUE | 3.7% | 3.7% |
Cranio | 2,534 | 249 | --- | --- | --- | --- | 4.0 | 4.5 | FALSE | TRUE | 44% | 0.9%* |
Methylation, top 5K genes | 5,000 | 555 | --- | --- | --- | --- | 8.5 | 8.5 | TRUE | TRUE | 0 | 0 |
Colon cancer, top 5K genes | 5,000 | 322 | --- | --- | --- | --- | 3 | 3.5 | FALSE | TRUE | 11% | 0.5%* |
Human liver cohort, top 5K genes | 5,000 | 427 | --- | --- | --- | --- | 6.5 | 5.5 | TRUE | TRUE | 1.0% | 1.0% |
PARC* | 18,392 | 960 | 5h:55m | 83.9 GB | 1h:40m | 71 GB | 8 | 7.5 | FALSE | FALSE | 4.7% | 0.6% |
Methylation (full set)* | 27,578 | 555 | 24h:45m | 180 GB | 6h:38m | 196 GB | 8 | 11.5 | FALSE | FALSE | 14% | 0.2% |
Colon cancer, top 45K genes*** | 45,000 | 322 | --- | --- | 5h:52m | 368 GB | --- | --- | --- | --- | --- | --- |
Human liver cohort*** | 40,102 | 427 | --- | --- | 5h:13m | 313 GB | --- | --- | --- | --- | --- | --- |
* These were run on an Amazon Elastic Compute Cloud (EC2) "High-Memory Quadruple Extra Large" Unix server, having 68 GB of RAM.
** UCLA-WGCNA package also runs out of memory. (An alternative is to use the WGCNA preprocessing step of K-means decomposition, which has been shown to work with >50K genes.)
*** Run on Sage Bionetworks' "Belltown" Unix server, having 256GB RAM.
**** http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:compare_partitions
Note: We skip the performance evaluation for the small data sets.
The details of the differences summarized in the table can be found here: http://sagebionetworks.jira.com/wiki/display/SCICOMP/Package+Comparison+Details
Conclusions
The new package has considerably better time performance than the original code. Though the two algorithms take the same approach to computing 'beta', the results can vary greatly. When beta is the same, the dendrograms and modules are similar or identical. However, module determination is very sensitive to beta, and beta can vary greatly with small changes in the regression statistics, as can be seen on the 'details' page.
Goals, Revisited
Goal | How we met it |
---|---|
Make the Sage coexpression software runnable by any data analyst in R | Created an easy-to-use, documented R package. (TODO: vignette, training class) |
Clearly explain the methodology underlying the coexpression algorithms. | Included links to literature in the R package documentation. |
Make the Sage coexpression software publicly available. | TBD (see below) |
Make the Sage coexpression software perform well on commonly available hardware. | Used UCLA's accelerated algorithms. Accelerated the 'intra-module statistics' computation. Profiled datasets of up to 27,578 probes on inexpensive, high-capacity cloud resources. |
Choices for package 'publication' include:
- Make it available on the Sage website
- Make it available in Synapse
- Publish to Bioconductor
- Publish to CRAN
- Request that UCLA include it as part of the WGCNA package