Coexpression Evaluation
Objective
The objective of this study is to compare the Sage Coexpression algorithm to the WGCNA coexpression algorithm in terms of (1) time and memory usage and (2) similarity of output.
Preparation
We started up an instance of the Bioconductor AMI, as described here:
http://bioconductor.org/help/bioconductor-cloud-ami/
We installed the UCLA WGCNA code from CRAN as described here:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
We installed the Sage Coexpression code as follows:
1) We installed a version of the main script, modified to become an R function. This script is now located under SCICOMP/trunk/Coexpression/Profiling/R/SageCoexpr.R. We added an option to use the UCLA-accelerated correlation computation function in place of the standard stats::cor function. To capture time stamps, we invoked the R function proc.time() at various points in the function, as described below.
2) We installed a modified version of R-tomfunctions.R, in which global variables were converted to function parameters. This is necessary when calling the functions from the main script as a function. (The converted code can be found under SCICOMP/trunk/Coexpression/SageBionetworksCoex/R.)
3) We installed the required libraries:
- We installed these from CRAN: "MASS", "class", "MVA", "cluster", "survival", "rpart", "lattice", "scatterplot3d", "impute", "Hmisc"
- We installed the "sma" package from our local unix system
We created a modified version of the top level WGCNA function "blockwiseModules" with time stamps at various points, as described below. This modified function is now located under SCICOMP/trunk/Coexpression/Profiling/R/blocwiseModulesTimeStamped.R.
We wrote scripts to execute the Sage and WGCNA coexpression code from Unix, wrapped in the Unix 'time' command to capture time and space usage.
Computing Resources
For this study we used Amazon's Elastic Compute Cloud (EC2), since it (1) lets us access servers as needed, allowing parallel execution, and (2) keeps the experiments uninfluenced by other computations running on Sage's servers. We used an EC2 "Large Instance" for a pilot study and a "High-Memory Quadruple Extra Large Instance" for the main study. Details of these instance types are given here:
http://aws.amazon.com/ec2/instance-types/
Data Sets
Pilot Study: We used the "Female Mouse Liver" data set from the WGCNA tutorial, having 3600 probes and 135 samples, running on an EC2 "Large Instance" with 7.5 GB of memory.
We used a colon cancer data set having 54,675 probe sets (Affy U133Plus2 array pattern) and 322 samples, provided by Justin Guinney.
We used a PARC data set having 18,392 probes (Illumina Ref8v3 bead array) and 960 samples (two different treatments applied to each of 480 samples), provided by Lara Mangravite. We also ran the "control" and "treated" halves of this dataset independently.
We used a methylation dataset having 27,578 probes (Illumina Human Methylation 27k array) and 555 samples, provided by Vitalina Komashko.
The latter three datasets were run on EC2 High-Memory Quadruple Extra Large Instances.
We used a dataset called "Cranio" with 2534 genes and 249 samples.
The input and output data are available in this folder:
/work/platform/coexpressionEval/
Execution Steps
The four steps are (1) preprocess the data, (2) run the Sage coexpression code, (3) run the WGCNA code, (4) compare the output. The details for executing these steps are available under SCICOMP/trunk/Coexpression/Profiling/README.txt
Results
This table summarizes the study results. Details for each data set are on separate pages linked below. ("Sage-mod" refers to the Sage Coexpression algorithm with the correlation computation replaced by the accelerated WGCNA correlation computation.)
| Dataset | # Probes | # Samples | Sage time | Sage mem (GB) | Sage-mod time | Sage-mod mem (GB) | WGCNA time | WGCNA mem (GB) | Gene partition difference |
|---|---|---|---|---|---|---|---|---|---|
| Female mouse liver | 3600 | 135 | 11m:36s | 6.5 | 10m:58s | 6.1 | 1m:29s | 1.9 | 7.4% |
| Colon cancer | 54,675 | 322 | NA | NA | NA | NA | 2h:47m:29s* | 36.2* | NA |
| Methylation | 27,578 | 555 | 24h:45m:34s | 180 | 19h:40m:23s | 187.1 | 5h:11m:05s | 101 | 9.5% |
| PARC, all samples | 18,392 | 960 | 5h:55m:16s | 83.9 | 4h:25m:27s | 84.6 | 1h:38m:55s | 51.6 | 20.9% |
| PARC, control | 18,392 | 480 | 4h:53m:15s | 75.7 | 4h:05m:44s | 83.8 | 1h:32m:55s | 46.7 | 25.2% |
| PARC, treated | 18,392 | 480 | 4h:52m:21s | 76.0 | 4h:07m:34s | 83.5 | 1h:32m:52s | 46.8 | 30.9% |
| Cranio | 2534 | 249 | 5m:19s | 3.6 | 4m:52s | 3.6 | 0m:34s | 1.4 | 42.8% |
* Fails to complete when run as a single block of genes. Completes with these statistics when run under the WGCNA option to split genes into 'blocks' of 10,000.
Note: The 'gene partition difference' algorithm is described here: http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:compare_partitions
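The linked page defines the exact metric used in the table. Purely to illustrate the general idea of scoring disagreement between two module assignments (a greedy best-match disagreement score, which is an assumption and not necessarily the linked method), a sketch:

```python
from collections import Counter

def partition_difference(labels_a, labels_b):
    """Fraction of genes whose module assignment disagrees, after
    greedily matching each module in partition A to its
    best-overlapping module in partition B. Illustrative only."""
    assert len(labels_a) == len(labels_b)
    overlap = Counter(zip(labels_a, labels_b))
    matched = 0
    used_a, used_b = set(), set()
    # Match module labels one-to-one, largest overlap first.
    for (a, b), n in overlap.most_common():
        if a not in used_a and b not in used_b:
            matched += n
            used_a.add(a)
            used_b.add(b)
    return 1.0 - matched / len(labels_a)
```

Identical partitions (up to relabeling) score 0.0; a single gene out of four placed differently scores 0.25.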
Female mouse liver: (Note, on this page we also show the time/space for multiple runs and the effect of using the WGCNA K-means division preprocessing step.)
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+--+3600+probe+tutorial+data+set
Colon cancer:
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+Colon+cancer
Methylation:
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+Methylation
PARC (all samples):
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+PARC+all+samples
PARC (control samples):
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+PARC+control+samples
PARC (treated samples):
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+PARC+treated+samples
Cranio:
http://sagebionetworks.jira.com/wiki/display/SCICOMP/Results+-+CRANIO
Key to Sage algorithm steps
- Read Input
- Correlation -- compute correlation matrix
- Scale-free Analysis -- determine the best value of beta (aka "power")
- TOM -- compute the topological overlap matrix
- HierClust -- compute the dendrogram, based on the TOM
- Module Detection -- partition the gene set into modules, based on the dendrogram
- Hier Clust of PCs -- merge modules whose centers are close
- Hier Clust of Samples
- Within-module Network Properties -- statistics on the discovered modules
- Data Output
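The Correlation and TOM steps above are the heart of both pipelines. As a sketch (in Python, not the production R code; unsigned network, with beta fixed for illustration), the standard topological overlap formula TOM_ij = (sum_k a_ik * a_kj + a_ij) / (min(k_i, k_j) + 1 - a_ij) can be computed as:

```python
import numpy as np

def tom_similarity(expr, beta=6):
    """Topological overlap matrix from an expression matrix
    (samples x genes), via the standard unsigned formulas:
    adjacency a_ij = |cor(x_i, x_j)|^beta, then the TOM ratio.
    Direct dense computation, cubic in the number of genes."""
    a = np.abs(np.corrcoef(expr, rowvar=False)) ** beta  # adjacency
    np.fill_diagonal(a, 0.0)
    k = a.sum(axis=1)          # connectivity of each gene
    l = a @ a                  # shared-neighbor weight sums
    denom = np.minimum.outer(k, k) + 1.0 - a
    tom = (l + a) / denom
    np.fill_diagonal(tom, 1.0)
    return tom
```

Clustering is then typically run on the dissimilarity 1 - TOM; the matrix is symmetric with entries in [0, 1].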
Key to WGCNA algorithm steps
- Compute Blocks -- determine 'blocks' of genes to decompose the dataset (not used)
- TOM -- compute correlation and Topological Overlap Matrix
- HierClust -- compute the dendrogram, based on the TOM
- Module Detection -- partition the gene set into modules, based on the dendrogram
- Analyze Module Statistics
- Reassign genes, remove modules -- check whether any already-assigned genes should be re-assigned, and remove modules flagged for removal
Discussion
The most time-intensive part of the Sage algorithm is the computation of the topological overlap matrix (TOM). This is not surprising, since the direct formula is cubic in the number of probes in the dataset. The WGCNA package uses an optimized algorithm for TOM computation and is dramatically faster. It also uses optimized algorithms for correlation computation and hierarchical clustering, which are quadratic-time steps, so moderate improvements are gained there as well.
The Sage Coexpression and WGCNA coexpression computations are identical up through hierarchical clustering. They then have slightly different approaches to tree cutting and post processing, which lead to differences in modules.
The Sage Coexpression computation features automatic determination of the scale-free parameter, 'beta', while this parameter is a user-input to the WGCNA code. The latter provides graphical analysis to help the user select an appropriate value. The Sage Coexpression algorithm generates a number of diagnostic plots (cluster diagrams, heat maps, etc.) which are not part of the WGCNA algorithm.
The R language limits vectors, and hence matrices, to 2^31 - 1 elements, which caps the number of probes that can be analyzed together at 46,340. The WGCNA algorithm features a k-means preprocessing step to divide the dataset into separate 'blocks' for analysis, which provides a work-around for large data sets.
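The 46,340 ceiling follows directly from R's maximum vector length of 2^31 - 1 elements: it is the largest side length of a square matrix fitting in that bound. A quick sanity check (in Python here, though the limit is R's):

```python
import math

R_MAX_VECTOR_LEN = 2**31 - 1             # R's maximum vector length (elements)
max_side = math.isqrt(R_MAX_VECTOR_LEN)  # largest n with n*n <= 2^31 - 1

print(max_side)  # 46340, the probe-count ceiling quoted above
assert max_side**2 <= R_MAX_VECTOR_LEN < (max_side + 1)**2
```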
The steps of the two algorithms are shown side by side:
| Sage | UCLA WGCNA |
|---|---|
| | QC: check for excessive missing values, across genes and samples |
| | QC: discard outlying samples |
| | Optional: split genes into 'blocks' using k-means clustering |
| compute correlation coefficient | |
| determine optimal value for beta | select optimal value for beta interactively |
| compute TOM | compute correlation and TOM |
| perform hierarchical clustering over TOM (using 'hclust') | hierarchical clustering (using 'flashClust') |
| detect and label modules in TOM (using Sage's version of 'cutreeDynamic') | detect modules (using UCLA's version of 'cutreeDynamic') |
| | check modules for statistical meaningfulness ("at least a minimum number have a correlation with the eigengene higher than a given cutoff") |
| | check whether any already-assigned genes should be re-assigned (apparently related to moving genes between 'blocks') |
| | remove modules left too small by the previous step |
| merge modules based on hierarchical clustering of representative genes (each module's "representative gene" is taken to be the most highly connected one, and modules are merged if their Pearson correlation exceeds the given threshold) | merge close modules ("merges modules whose MEs ('module eigengenes') fall on one branch of a hierarchical clustering tree") |
| (module detection is done at this point) | |
| hierarchical clustering of samples | |
| compute module/network statistics | |
| output numerous image and data files | |
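To illustrate the eigengene-based merge on the WGCNA side of the comparison, here is a minimal sketch. It assumes modules are given as lists of gene indices; the names module_eigengene and merge_close_modules are hypothetical, not the package's API, and the pairwise-correlation threshold stands in for WGCNA's tree-cut criterion:

```python
import numpy as np

def module_eigengene(expr, gene_idx):
    """First principal component of a module's standardized
    expression (samples x genes); one value per sample."""
    x = expr[:, gene_idx]
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, 0]

def merge_close_modules(expr, modules, cor_cut=0.8):
    """Repeatedly merge any pair of modules whose eigengenes
    correlate (in absolute value) above cor_cut."""
    modules = [list(m) for m in modules]
    merged = True
    while merged:
        merged = False
        for i in range(len(modules)):
            for j in range(i + 1, len(modules)):
                e_i = module_eigengene(expr, modules[i])
                e_j = module_eigengene(expr, modules[j])
                if abs(np.corrcoef(e_i, e_j)[0, 1]) > cor_cut:
                    modules[i] += modules.pop(j)  # absorb module j
                    merged = True
                    break
            if merged:
                break
    return modules
```

Two modules driven by the same underlying expression profile collapse into one, while an independent module survives.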
Miscellaneous Notes
Creating an AMI for EC2
See this child page: Coexpression EC2 Set-Up Details
Unix utility to measure time and memory usage
From 'man time'
The default format is:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps
U Total number of CPU-seconds that the process used directly (in user mode), in seconds.
S Total number of CPU-seconds used by the system on behalf of the process (in kernel mode), in seconds.
E Elapsed real (wall clock) time used by the process, in [hours:]minutes:seconds. (used in this study)
P Percentage of the CPU that this job got. This is just user + system times divided by the total running time.
X Average amount of shared text in the process, in Kilobytes.
D Average size of the process's unshared data area, in Kilobytes.
M Maximum resident set size of the process during its lifetime, in Kilobytes. (used in this study)
I Number of file system inputs by the process.
O Number of file system outputs by the process.
F Number of major, or I/O-requiring, page faults that occurred while the process was running.
R Number of minor, or recoverable, page faults.
W Number of times the process was swapped out of main memory.
In our study we use the "E" parameter for reporting the time used by the algorithm and the "M" parameter for the space used.
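For the results table, the "%E" field was converted to comparable time units and the "%M" kilobyte count to GB. A small sketch of that conversion (hypothetical helper names, assuming GNU time's field formats):

```python
def elapsed_to_seconds(e):
    """Convert a GNU time %E field ([hours:]minutes:seconds)
    to total seconds."""
    seconds = 0.0
    for part in e.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def max_rss_kb_to_gb(m_kb):
    """Convert a %M field (maximum resident set size, in KB) to GB."""
    return m_kb / (1024.0 ** 2)

assert elapsed_to_seconds("1:29") == 89.0        # 1m:29s
assert elapsed_to_seconds("5:11:05") == 18665.0  # 5h:11m:05s
```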