Coexpression Package

Goals

1) Make the Sage coexpression software runnable by any data analyst in R.

Document the analysis steps, package functions, and adjustable parameters.  Write start-to-finish vignette(s) with sufficient detail to guide a new analyst through the software, including parameter settings and other intermediate decisions.  Refactoring the software to enhance readability or to make decision points and adjustable parameters explicit is within scope for this goal.

2) Clearly explain the methodology underlying the coexpression algorithms.

This includes contrasting the Sage algorithms with the separately published “WGCNA” algorithms and explaining the rationale for the differences and the decision process for choosing which to use.

3) Make the Sage coexpression software publicly available.

The code must be available through a standard R distribution channel (CRAN or Bioconductor).  An option is to merge Sage-specific algorithms or steps into WGCNA (if the two are sufficiently overlapping).

4) Make the Sage coexpression software perform well on commonly available hardware.

Currently, special high-capacity hardware is required to run the Sage coexpression software on commonly encountered data set sizes.  The goal is to optimize the code to run on commonly available hardware.  At a minimum, the code should be runnable on the 68 GB, quad-core servers available through Amazon Web Services.  Ideally, the code would be runnable on a “heavy” Sage laptop (8 GB total RAM).
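As a rough illustration of why memory is the constraint, the arithmetic below estimates the size of one dense probes-by-probes matrix. It assumes 8-byte doubles and counts only the matrices themselves (for example a correlation matrix and a TOM), not temporary copies. The function names are hypothetical, and the sketch is written in Python purely for illustration; it is not taken from the package.

```python
from math import isqrt

BYTES_PER_DOUBLE = 8
GIB = 2 ** 30


def dense_matrix_bytes(n_probes):
    """Bytes needed to hold one n_probes x n_probes matrix of doubles."""
    return n_probes * n_probes * BYTES_PER_DOUBLE


def max_probes_for(ram_gib, n_matrices=2):
    """Largest probe count whose n_matrices dense matrices fit in ram_gib GiB.

    Assumption: n_matrices dense matrices are held in memory at once
    (2 here, e.g. correlation matrix plus TOM); real peak usage may be higher.
    """
    budget_cells = int(ram_gib * GIB) // (n_matrices * BYTES_PER_DOUBLE)
    return isqrt(budget_cells)


# Probe counts taken from the Evaluation table on this page.
for name, n in [("Methylation (full set)", 27578), ("Human liver cohort", 40102)]:
    print(f"{name}: {dense_matrix_bytes(n) / GIB:.1f} GiB per dense matrix")

for ram in (8, 68):
    print(f"{ram} GiB RAM: up to about {max_probes_for(ram)} probes")
```

Under these assumptions, the Human liver cohort (40,102 probes) needs roughly 12 GiB for each dense matrix, which already exceeds the 8 GB laptop target before any working copies are made.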

Strategy

The steps for Sage Coexpression are: 

  1. Compute correlation coefficient matrix.
  2. Determine the optimal value of the scale-free exponent, beta, and collect regression statistics.
  3. Compute the topological overlap matrix (TOM).
  4. Perform hierarchical clustering of genes, based on TOM.
  5. Detect and label modules in TOM, using "Dynamic Tree Cutting".
  6. Merge modules based on hierarchical clustering of representative genes.
  7. Cluster samples hierarchically.
  8. Compute intra/inter-module network statistics, per gene.
  9. Produce diagnostic plots (dendrograms, heat maps, statistical scatter plots).
  10. Produce tabular output of module membership, network statistics, and scale-free regression statistics.
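Steps 1 and 3, together with the soft-threshold adjacency that links them, can be sketched as follows. This is a minimal pure-Python illustration, assuming Pearson correlation, an unsigned adjacency |cor|^beta, and the standard unsigned topological overlap formula. The function names and the toy data are hypothetical, and neither the Sage nor the UCLA-WGCNA implementation is reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(profiles):
    """Step 1: gene-by-gene correlation; profiles[g] holds gene g across samples."""
    n = len(profiles)
    return [[1.0 if i == j else pearson(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]


def adjacency(cor, beta):
    """Soft threshold: a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [[0.0 if i == j else abs(cor[i][j]) ** beta
             for j in range(n)] for i in range(n)]


def topological_overlap(adj):
    """Step 3: TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0] * n for _ in range(n)]  # diagonal is 1 by convention
    for i in range(n):
        for j in range(i + 1, n):
            # The zero diagonal makes the u == i and u == j terms vanish.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom


# Toy data: 4 genes measured in 5 samples (illustrative values only).
toy_profiles = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 3, 4, 1, 2],
    [1, 3, 2, 5, 4],
]
toy_tom = topological_overlap(adjacency(correlation_matrix(toy_profiles), beta=6))
for row in toy_tom:
    print([round(v, 3) for v in row])
```

The nested loops in `topological_overlap` perform on the order of n^3 multiplications for n genes, which is why a compiled implementation of this step matters for large probe counts.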

Elsewhere (http://sagebionetworks.jira.com/wiki/display/SCICOMP/Coexpression+Evaluation) we have shown that, for realistic dataset sizes, the vast majority of the run time is spent in steps 1 and 3.  Steps 1, 3, and 4 are identical in the Sage and UCLA-WGCNA code bases, but the UCLA-WGCNA code uses compiled, optimized software for these three steps.  Further, we have seen that steps 2 and 5 (with the right parameter choices in the UCLA package) produce extremely similar results.  (The observed similarity is no surprise: the two code bases are forks of an original set of algorithms and have evolved separately for approximately 6 years.)
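Step 2 can be illustrated with a short sketch of a scale-free topology fit: for each candidate beta, compute each gene's connectivity, bin the connectivities, and regress log10 of the bin frequency on log10 of the mean bin connectivity. This is a pure-Python illustration under stated assumptions: the number of bins, the candidate range 1 to 12, the 0.85 R-squared cut-off, and all function names are illustrative choices, not settings confirmed for either code base.

```python
from math import log10


def connectivity(cor, beta):
    """k_i = sum over j != i of |cor_ij| ** beta."""
    n = len(cor)
    return [sum(abs(cor[i][j]) ** beta for j in range(n) if j != i)
            for i in range(n)]


def scale_free_fit(k, n_bins=10):
    """Return (slope, r_squared) of log10 p(k) regressed on log10 k.

    Returns (0.0, 0.0) when fewer than two usable bins exist.
    """
    lo, hi = min(k), max(k)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all k are equal
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for value in k:
        b = min(int((value - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += value
    # One point per non-empty bin: (log mean connectivity, log frequency).
    pts = [(log10(s / c), log10(c / len(k)))
           for s, c in zip(sums, counts) if c > 0 and s > 0]
    if len(pts) < 2:
        return 0.0, 0.0
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def pick_beta(cor, candidates=range(1, 13), r2_cut=0.85):
    """Smallest candidate beta with a negative slope and R^2 >= r2_cut, else None."""
    for beta in candidates:
        slope, r2 = scale_free_fit(connectivity(cor, beta))
        if slope < 0 and r2 >= r2_cut:
            return beta
    return None
```

The slope and R-squared returned by `scale_free_fit` correspond to the kind of regression statistics that step 2 collects; a negative slope is required because a scale-free network has many weakly connected genes and few hubs.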

Our strategy, therefore, is:

  • Leverage the UCLA-WGCNA package for the "common" steps, 1 through 5, gaining significant performance.

  • Provide the user a parameter choice at step 5, to do "tree cutting" in the manner of either the Sage algorithm or the UCLA-WGCNA algorithm.

  • Provide two algorithms for step 6 (module merging), allowing the user to choose the Sage or the UCLA-WGCNA algorithm.

  • Leverage the UCLA-WGCNA dendrogram/module plotting algorithm in step 9.

  • Maintain the Sage algorithms for the Sage-specific post-processing, i.e. step 7, step 8, and the heat maps in step 9.
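The choice offered at step 5 concerns how the gene dendrogram from step 4 is cut into modules. As a simplified stand-in, the Python sketch below clusters genes by average linkage on a dissimilarity such as 1 - TOM and cuts at a fixed height. This is explicitly not Dynamic Tree Cutting, whose variable-height logic is not described on this page; the linkage method, the cut height, and the function name are assumptions for illustration only.

```python
def average_linkage_modules(dissim, cut_height):
    """Label genes by average-linkage clustering cut at a fixed height.

    dissim is a symmetric gene-by-gene dissimilarity matrix (e.g. 1 - TOM).
    Clusters are merged closest-first until the closest remaining pair is
    farther apart than cut_height. Returns one integer module label per gene.
    """
    clusters = [[i] for i in range(len(dissim))]

    def distance(c1, c2):
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        return sum(dissim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > cut_height:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(dissim)
    for module, members in enumerate(clusters, start=1):
        for gene in members:
            labels[gene] = module
    return labels


# Toy dissimilarities: genes 0 and 1 are close, genes 2 and 3 are close.
toy_dissim = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(average_linkage_modules(toy_dissim, cut_height=0.5))
```

Because average-linkage merge heights never decrease, stopping at the first merge above `cut_height` gives the same modules as building the full tree and cutting it at that height. A user-facing option for step 5 would swap this cutting rule for the Sage or the UCLA-WGCNA one.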

UCLA-WGCNA dependencies

Sage software dependencies

Package software

Evaluation

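| Dataset | # Probes | # Samples | Sage Time | Sage Space | Package Time | Package Space | Sage Beta | Package Beta | Gene trees same? | Module difference (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Female mouse liver | 3600 | 135 | --- | --- | --- | --- | | | | |
| Methylation (gene subset) | ?? | 555 | | | | | | | | |
| Colon cancer (small gene subset) | ?? | 322 | --- | --- | --- | --- | | | | |
| Cranio | 2534 | 249 | --- | --- | --- | --- | | | | |
| Methylation (full set) | 27,578 | 5555 | | | | | | | | |
| Colon cancer (large gene subset) | | 322 | | | | | | | | |
| Human liver cohort | 40,102 | 427 | | | | | | | | |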

Note: We skip the performance evaluation for the small data sets (marked "---" in the time and space columns).
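This page does not define how the "Module difference (%)" column is computed. One plausible reading, sketched below in Python as an assumption, is the percentage of genes whose module assignment differs between the two code bases after module labels are paired by greatest overlap. Labels such as colours or integers are arbitrary, so they must be matched before comparing. The function name and the greedy matching rule are hypothetical.

```python
from collections import Counter


def module_difference_percent(labels_a, labels_b):
    """Percent of genes assigned to non-matching modules.

    labels_a and labels_b give the module of each gene (same gene order) in
    two runs. Modules are paired one-to-one, greedily, starting from the pair
    sharing the most genes; genes outside a matched pair count as different.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both assignments must cover the same, non-empty gene list")
    overlap = Counter(zip(labels_a, labels_b))
    used_a, used_b = set(), set()
    agree = 0
    for (a, b), count in overlap.most_common():
        if a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            agree += count
    return 100.0 * (len(labels_a) - agree) / len(labels_a)


# Toy example: labels from two hypothetical runs over the same five genes.
sage_modules = ["blue", "blue", "red", "red", "grey"]
package_modules = [1, 1, 2, 3, 3]
print(module_difference_percent(sage_modules, package_modules))
```

Greedy matching is simple but not guaranteed to find the pairing with the highest total agreement; an optimal assignment method could be substituted if the two sets of modules overlap in complicated ways.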
