Coexpression has O(n^3) ("cubic") time complexity and O(n^2) ("quadratic") space complexity, where n is the number of probes in the dataset. The time complexity is primarily due to the TOM computation. The space complexity is due to the need to hold the n x n correlation and TOM matrices in memory.

The R language has an inherent limit on the size of a vector or matrix of about 2 billion elements (2^31 - 1; see http://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html). This means the maximum size of a square matrix is 46340 x 46340. A square matrix in R requires 8n^2 bytes. (Values are double precision, and R has no provision for single-precision representation of floating-point values.) At the maximum size, such a matrix requires about 17.2 GB (16 GiB) of contiguous memory.

The coexpression package uses two n x n matrices (as mentioned above) and at times creates temporary variables whose sizes equal, or are a large fraction of, the n x n matrix size. Therefore, to support maximum dataset sizes, machines used to run coexpression should have dozens of GB of memory. In the empirical evaluations summarized below, we use Amazon Web Services' "High-Memory Quadruple Extra Large" machines having 68 GB RAM, each of which costs (at the time this is written) $2.88/hour (http://aws.amazon.com/ec2/instance-types/).
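The sizing arithmetic above can be checked with a short script. This is only a sketch, written in Python rather than R for convenience; the function names are illustrative and are not part of the coexpression package. It assumes the 2^31 - 1 element limit and 8 bytes per double-precision value.

```python
import math

# Assumption: R's per-vector element limit of 2^31 - 1, and 8-byte doubles.
R_MAX_ELEMENTS = 2**31 - 1
BYTES_PER_DOUBLE = 8


def max_square_dim(max_elements=R_MAX_ELEMENTS):
    """Largest n such that an n x n matrix fits within the element limit."""
    return math.isqrt(max_elements)


def matrix_bytes(n):
    """Bytes needed for one n x n double-precision matrix (8n^2)."""
    return BYTES_PER_DOUBLE * n * n


n_max = max_square_dim()
one_matrix = matrix_bytes(n_max)

print(f"max dimension: {n_max}")
print(f"one matrix:    {one_matrix / 1e9:.2f} GB ({one_matrix / 2**30:.2f} GiB)")
# Correlation and TOM matrices held together, before any temporaries.
print(f"two matrices:  {2 * one_matrix / 1e9:.2f} GB")
```

Two matrices at the maximum size already come to roughly 34 GB before any temporary copies are made, which is consistent with the recommendation of dozens of GB of memory.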

Evaluation

Datasets were run through the original Sage coexpression code and the newly created package to compare performance and similarity of results.

Similarity questions: Is the scale-free exponent 'beta' the same? Are the dendrograms the same? How similar are the resultant modules?

Performance questions: For datasets having >18,000 probes, how much time and space does each algorithm use?

Dataset	# Probes	# Samples	Sage Time	Sage Space	Package Time	Package Space	Sage beta	Package beta	Gene trees same, independent beta?	Gene trees same, same beta?	Module difference, independent beta	Module difference, same beta
Female mouse liver	3600	135	---	---	---	---	6.5	6.5	TRUE	TRUE	3.7%	3.7%
Cranio	2534	249	---	---	---	---	4.0	4.5	FALSE	TRUE	44%	0.9%
Methylation, top 5K genes	5000	555	---	---	---	---	8.5	8.5	TRUE	TRUE	0%	0%
Colon cancer, top 5K genes	5000	322	---	---	---	---	3	3.5	FALSE	TRUE	11%	0.5%
Human liver cohort, top 5K genes	5000	427	---	---	---	---	11	11	TRUE	TRUE	1.0%	1.0%
PARC*	18,392	960	5h:55m	83.9 GB	1h:40m	71 GB	8	7.5	FALSE	FALSE	4.7%	0.6%
Methylation (full set)*	27,578	555	24h:45m	180 GB	13h:20m	196 GB	8	11.5	FALSE	FALSE	14%	0.2%
Colon cancer, top 40K genes*	40,000	322	Out of memory**	---	Out of memory**	---	---	---	---	---	---	---
Human liver cohort*	40,102	427	Out of memory**	---	Out of memory**	---	---	---	---	---	---	---

Conclusions: The new package has considerably better time performance: on the two large datasets that ran to completion, it took 1h:40m versus 5h:55m (PARC) and 13h:20m versus 24h:45m (full methylation set). Although the two algorithms take the same approach to computing 'beta', the resulting values can differ substantially (for example, 8 versus 11.5 on the full methylation set). When beta is the same, the dendrograms and modules are similar or identical. However, module determination is very sensitive to beta, which in turn can change greatly with small changes in regression statistics, as can be seen here: http://sagebionetworks.jira.com/wiki/display/SCICOMP/Package+Comparison+Details
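The timing difference can be expressed as a speedup factor using the two table rows that report run times. The snippet below is a rough illustration in Python; the helper names are hypothetical, and the durations are copied from the table as written.

```python
import re


def parse_duration_minutes(text):
    """Convert a duration written like '5h:55m' into total minutes."""
    match = re.fullmatch(r"(\d+)h:(\d+)m", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = map(int, match.groups())
    return 60 * hours + minutes


def speedup(sage_time, package_time):
    """Ratio of Sage run time to package run time."""
    return parse_duration_minutes(sage_time) / parse_duration_minutes(package_time)


# (Sage time, package time) for the rows in the table that report timings.
timings = {
    "PARC": ("5h:55m", "1h:40m"),
    "Methylation (full set)": ("24h:45m", "13h:20m"),
}

for name, (sage_time, package_time) in timings.items():
    print(f"{name}: {speedup(sage_time, package_time):.2f}x")
```

With only two timed runs, these ratios describe the measured cases and should not be read as a general scaling law.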

* These were run on an Amazon Elastic Compute Cloud (EC2) "High-Memory Quadruple Extra Large" Unix server with 68 GB of RAM.
