Coexpression Evaluation

Coexpression Evaluation

Objective

The objective of this study is to compare the Sage Coexpression algorithm to the WGCNA coexpression algorithm, in terms of (1) time and memory usage, (2) similarity of output.

Preparation

We started up an instance of the BioConductor AMI, as described here:

http://bioconductor.org/help/bioconductor-cloud-ami/

We installed the UCLA WGCNA code from CRAN as described here:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/

We installed the Sage Coexpression code as follows:

1) We installed a version of main script, modified to become an R function.  This script is now located under SCICOMP/trunk/Coexpression/Profiling/R/SageCoexpr.R.  We added an option to use the UCLA-accelerated correlation computation function, as opposed to the organic stats::cor function.  To capture time stamps, we invoked the R function proc.time() at various points in the function, as described below.

2) We installed a modified version of R-tomfunctions.R, in which global variables were converted to function parameters.  This is necessary when calling the functions from the main script as a function.  (The converted code can be found under SCICOMP/trunk/Coexpression/SageBionetworksCoex/R.)

3) We installed the required libraries:

- We installed these from CRAN:  "MASS", "class", "MVA", "cluster", "survival", "rpart", "lattice", "scatterplot3d", "impute", "Hmisc"

- We installed the "sma" package from our local unix system

We created a modified version of the top level WGCNA function "blockwiseModules" with time stamps at various points, as described below.  This modified function is now located under SCICOMP/trunk/Coexpression/Profiling/R/blocwiseModulesTimeStamped.R.

We wrote scripts to execute the Sage and WGCNA coexpression code from Unix, wrapped in the Unix 'time' command to capture time and space usage.

Computing Resources

For this study we used Amazon's Elastic Compute Cloud (EC2), since it allows us (1) to access servers as needed, allowing parallel execution, and (2) it allows experiments to be uninfluenced by other computations using Sage's servers.  We used and EC2 "Large Instance" for a pilot study, and a "High-Memory Quadruple Extra Large Instance" for the main study.  Details of these instance types are given here:

http://aws.amazon.com/ec2/instance-types/

Data Sets

Pilot Study:  We used the "Female Mouse Liver" data set from the WGCNA tutorial, having 3600 probes and 135 samples, running on an EC-2          "Large Instance", with 7.5 GB memory.

We used a colon cancer data set having 54,675 probe sets (Affy U133Plus2 array pattern) and 322 samples, provided by Justin Guinney.
We used a PARC data set having 18,392 probes (Illumina Ref8v3 bead array) and 960 samples (two diff treatments on ea. of 480 samples), provided by Lara Mangravite.  We also ran the "control" and "treated" halves of this dataset independently.

We used a methylation dataset having 27,578 probes (Illumina Human Methylation 27k array) and 555 samples, provided by Vitalina Komashko.
The latter three datasets were run on EC2 High-Memory Quadruple Extra Large Instances.

We used a dataset called "Cranio" with 2534 genes and 249 samples.

The input and output data are available in this folder:

/work/platform/coexpressionEval/

Execution Steps

The four steps are (1) preprocess the data, (2) run the Sage coexpression code, (3) run the WGCNA code, (4) compare the output.  The details for executing these steps are available under SCICOMP/trunk/Coexpression/Profiling/README.txt

Results

This table summarizes the study results.  Details for each data set are on separate pages linked below. ("Sage-mod" refers to the Sage Coexpression algorithm with the correlation computation replaced by the accelerated WGCNA correlation computation.)

Dataset

# Probes

# Samples

Sage time

Sage mem (GB)

Sage-mod time

Sage-mod space (GB)

WGCNA time

WGCNA space (GB)

Gene partition difference

Dataset

# Probes

# Samples

Sage time

Sage mem (GB)

Sage-mod time

Sage-mod space (GB)

WGCNA time

WGCNA space (GB)

Gene partition difference

Female mouse liver

3600

135

11m:36s

6.5

10m:58s

6.1

1m:29s

1.9

7.4%

Colon cancer

54,675

322

NA

NA

NA

NA

2h:47m:29s*

36.2*

NA

Methylation

27,578

555

24h:45m:34s

180

19h:40m:23s

187.1

5h:11m:05s

101

9.5%

PARC, all samples

18,392

960

5h:55m:16s

83.9

4h:25m:27s

84.6

1h:38m:55s

51.6

20.9%

PARC, control

18,392

480

4h:53m:15s

75.7

4h:05m:44s

83.8

1h:32m:55s

46.7

25.2%

PARC, treated

18,392

480

4h:52m:21s

76.0

4h:07m:34s

83.5

1h:32m:52s

46.8

30.9%

Cranio

2534

249

5:19

3.6

4:52

3.6

0:34

1.4

42.8%

* Fails to complete when run as a single block of genes.  Completes with these statistics when run under the WGCNA option to split genes into 'blocks' of 10,000. 

 Note:  The 'gene partition difference' algorithm is described here: http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:compare_partitions

Female mouse liver:  (Note, on this page we also show the time/space for multiple runs and the effect of using the WGCNA K-means division preprocessing step.)

http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+--+3600+probe+tutorial+data+set

Colon cancer:

http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+Colon+cancer

Methylation:

http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+Methylation

PARC (all samples):

http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+PARC+all+samples

PARC (control samples):

http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+PARC+control+samples