Using the Package

Get the source from GitHub and build the package:

Code Block

# Install the dependencies (run in R):
install.packages(pkgs = c("WGCNA", "flashClust", "dynamicTreeCut"))
# Clone the GitHub repository (run from the command line):
git clone https://github.com/Sage-Bionetworks/SageBionetworksCoex.git
# Build and install the package (run from the command line):
R CMD INSTALL SageBionetworksCoex

In R, load the library:

Code Block
library(SageBionetworksCoex)

For guidance on using the package:

Code Block
?SageBionetworksCoex

Goals

1) Make the Sage coexpression software runnable by any data analyst in R.

Document the analysis steps, package functions, and adjustable parameters. Write start-to-finish vignette(s) with enough detail to guide a new analyst through the software, including choosing parameter settings and making other intermediate decisions. Refactoring the software to improve readability, or to make decision points and adjustable parameters explicit, is within scope for this goal.

2) Clearly explain the methodology underlying the coexpression algorithms.

This includes contrasting the Sage algorithms with the separately published “WGCNA” algorithms and explaining the rationale for the differences and the decision process for choosing which to use.

3) Make the Sage coexpression software publicly available.

The code must be available through a standard R distribution channel (CRAN or Bioconductor). One option is to merge the Sage-specific algorithms or steps into WGCNA, if the two overlap sufficiently.

4) Make the Sage coexpression software perform well on commonly available hardware.

Currently, special high-capacity hardware is required to run the Sage coexpression software on commonly encountered data set sizes. The goal is to optimize the code to run on commonly available hardware. At a minimum, the code should be runnable on the 68 GB, quad-core servers available through Amazon Web Services. Ideally, the code would also be runnable on a “heavy” Sage laptop (8 GB total RAM).
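To give a feel for why memory is the limiting resource, the following is a rough back-of-the-envelope sketch, not a measurement of the Sage or WGCNA code. It estimates the size of a single dense, double-precision probes-by-probes matrix (such as a correlation matrix). The function name `dense_matrix_gb` is hypothetical, and the assumption that the algorithms hold several such matrices in memory at once is ours, not something stated on this page.

```r
# Approximate size, in GB, of one dense probes x probes matrix of doubles.
# Assumes 8 bytes per value and ignores R's per-object overhead.
dense_matrix_gb <- function(n_probes, bytes_per_value = 8) {
  n_probes^2 * bytes_per_value / 1e9
}

# Probe counts taken from the data sets profiled on this page.
probe_counts <- c(5000, 18392, 27578, 40102)

estimates <- data.frame(
  probes        = probe_counts,
  gb_per_matrix = round(dense_matrix_gb(probe_counts), 2)
)
print(estimates)
```

Under these assumptions a single matrix for 5,000 probes needs about 0.2 GB, while one for 40,102 probes needs roughly 12.9 GB, so even one such matrix for a full-size data set exceeds the 8 GB laptop target before any working copies are made.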

Strategy

The steps for Sage Coexpression are:

...

Performance questions: For datasets having >18,000 probes, how much time and space does each algorithm use?

Dataset	# Probes	# Samples	Sage Time	Sage Space	Package Time	Package Space	Sage beta	Package beta	Gene trees same, independent beta?	Gene trees same, same beta?	Module difference****, independent beta	Module difference****, same beta
Female mouse liver	3600	135	---	---	---	---	6.5	6.5	TRUE	TRUE	3.7%	3.7%
Cranio	2534	249	---	---	---	---	4.0	4.5	FALSE	TRUE	44%	0.9%
Methylation, top 5K genes	5000	555	---	---	---	---	8.5	8.5	TRUE	TRUE	0	0
Colon cancer, top 5K genes	5000	322	---	---	---	---	3	3.5	FALSE	TRUE	11%	0.5%
Human liver cohort, top 5K genes	5000	427	---	---	---	---	11	11	TRUE	TRUE	1.0%	1.0%
PARC*	18,392	960	5h:55m	83.9 GB	1h:40m	71 GB	8	7.5	FALSE	FALSE	4.7%	0.6%
Methylation (full set)*	27,578	555	24h:45m	180 GB	6h:38m	196 GB	8	11.5	FALSE	FALSE	14%	0.2%
Colon cancer, top 45K genes***	45,000	322	Out of memory after 16h:12m**	276 GB	5h:52m	368 GB	---	---	---	---	---	---
Human liver cohort***	40,102	427	Out of memory after 16h:46m**	276 GB	5h:13m	313 GB	---	---	---	---	---	---

* These were run on an Amazon Elastic Compute Cloud (EC2) "High-Memory Quadruple Extra Large" Unix server with 68 GB of RAM.

** Note: The UCLA WGCNA package also runs out of memory. (An alternative is to use the WGCNA preprocessing step of K-means decomposition, which has been shown to work with >50K genes.)

*** Run on Sage Bionetworks' "Belltown" Unix server, which has 256 GB of RAM.

**** http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:compare_partitions
Note: We skip the performance evaluation for the small data sets.
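As an illustration of what a "module difference" between two module assignments can mean, here is a minimal sketch in base R. It is an assumption on our part, not necessarily the metric computed by the compare_partitions page linked above: it reports the fraction of gene pairs that are grouped together by one assignment but separated by the other (one minus the Rand index). The function name `partition_disagreement` and the toy module labels are hypothetical.

```r
# Fraction of gene pairs on which two module assignments disagree
# (1 - Rand index). Only the grouping matters, so the module labels
# used by the two methods do not need to match.
partition_disagreement <- function(modules_a, modules_b) {
  stopifnot(length(modules_a) == length(modules_b))
  same_a <- outer(modules_a, modules_a, "==")  # TRUE if the pair shares a module in A
  same_b <- outer(modules_b, modules_b, "==")  # TRUE if the pair shares a module in B
  pairs  <- upper.tri(same_a)                  # count each unordered pair once
  mean(same_a[pairs] != same_b[pairs])
}

# Toy example: six genes, labeled differently by two methods.
sage_modules    <- c("blue", "blue", "brown", "brown", "grey", "grey")
package_modules <- c("1", "1", "2", "2", "2", "3")
difference <- partition_disagreement(sage_modules, package_modules)
print(difference)
```

In this toy example 3 of the 15 gene pairs are grouped differently, so the result is 0.2. Because `outer` builds a full genes-by-genes matrix, the sketch is only suitable for small inputs.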

...

Goals, Revisited

Goal	How we met it
Make the Sage coexpression software runnable by any data analyst in R.	Created an easy-to-use, documented R package. (TODO: training class)
Clearly explain the methodology underlying the coexpression algorithms.	Included links to literature in the R package documentation.
Make the Sage coexpression software publicly available.	TBD (see below)
Make the Sage coexpression software perform well on commonly available hardware.	Used UCLA's accelerated algorithms. Accelerated the 'intra-module statistics' computation. Profiled datasets of up to 27,000 genes on inexpensive, high-capacity cloud resources.

Choices for package 'publication' include:

...
