My previous attempts to identify CIMP in the TCGA colorectal and colon datasets using 27k and 450k arrays failed. The purpose of this exercise was to understand precisely how the analyses are performed in already published papers. I chose the most recent paper on CIMP in CRC from Peter Laird's group at USC: "Genome-scale analysis of aberrant DNA methylation in colorectal cancer" by T. Hinoue et al., published in Genome Research in 2011. Why I chose it:
- They used the Illumina 27k platform
- The descriptions of their analyses are pretty clear. In addition, Dr Hinoue provided me with a Sweave file describing some of their analyses
- They have the main clinical and technical information available: gender, age, tumor site, tumor stage and batch
- Dr Hinoue has also provided a file with the mutation status of BRAF, KRAS and P53. Supplementary files that they didn't provide with the paper and weren't uploaded to GEO can be found on their group website here.
- In the series matrix from GEO (GSE25062) they marked all probes excluded from the analysis as "null", except those on the X and Y chromosomes, whose exclusion was clear only from the Sweave file. This helped me identify the excluded probes on the platform early on, without having to learn how to do it myself. It turned out later that this is very easy to do with the R/Bioconductor package GenomicRanges. A how-to can be found here.
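To make the probe filtering concrete, here is a minimal sketch in Python (the original work was done in R). It assumes the series matrix has already been read into a dictionary of raw text fields per probe, and that a chromosome annotation is available as a plain lookup table. The probe IDs, the function names `parse_beta` and `usable_probes`, and the rule that a probe with a "null" in any sample counts as excluded are all my assumptions for illustration, not taken from the paper or the Sweave file.

```python
def parse_beta(field):
    """Convert one series-matrix field to a float, or None for "null"."""
    text = field.strip().strip('"')
    return None if text.lower() in ("null", "") else float(text)


def usable_probes(beta_rows, probe_chrom, sex_chroms=("X", "Y")):
    """Return the probe IDs that survive both filters, in input order.

    beta_rows   : dict of probe ID -> list of raw text fields, one per sample
    probe_chrom : dict of probe ID -> chromosome name (hypothetical annotation)
    """
    kept = []
    for probe_id, fields in beta_rows.items():
        values = [parse_beta(f) for f in fields]
        # Assumption: a "null" in any sample means the probe was excluded.
        if any(v is None for v in values):
            continue
        # Sex-chromosome probes are not marked "null", so drop them explicitly.
        if probe_chrom.get(probe_id) in sex_chroms:
            continue
        kept.append(probe_id)
    return kept


# Hypothetical toy input; real probe IDs and values come from GSE25062.
beta_rows = {
    "cg_example_1": ["0.12", "0.80", "0.45"],
    "cg_example_2": ["null", "null", "null"],
    "cg_example_3": ["0.33", "0.31", "0.90"],
}
probe_chrom = {"cg_example_1": "7", "cg_example_2": "12", "cg_example_3": "X"}

kept = usable_probes(beta_rows, probe_chrom)
print(kept)
```

In this toy input only the first probe is kept: the second is masked as "null" and the third sits on chromosome X.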
A few important points:
- They verified the clusters obtained with the RPMM algorithm (designed for beta-distributed data) by logit-transforming the data into M values and clustering them with the R/Bioconductor package ConsensusClusterPlus (link to the paper describing the algorithm). The paper is well written and easy to understand, and I found that I like the package based on it more than I like pvclust. Since I personally prefer M values, I focused on reproducing their results with this package rather than with RPMM.
- It wasn't clear anywhere whether their results were generated with batch/age/gender-adjusted data or with batch-adjusted data only. Only when I requested age information from Dr Hinoue did he mention that 25 patients have no age information, which made me realize that they hadn't adjusted for it. I wanted to understand how the obtained clusters differ with and without age/gender adjustment.
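The logit transformation mentioned above can be sketched in a few lines of Python. This uses the common definition of the M value as the base-2 logit of the beta value, M = log2(beta / (1 - beta)). The clamping constant `eps` and the function names are my own assumptions, added only to keep the transform finite at beta values of exactly 0 or 1; the paper may have handled boundary values differently.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M value (base-2 logit)."""
    # Assumption: clamp away from 0 and 1 so the logarithm stays finite.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M value back to a beta value."""
    return 1.0 / (1.0 + 2.0 ** (-m))


# A beta of 0.5 (half methylated) maps to an M value of 0.
print(beta_to_m(0.5))
```

M values are unbounded and roughly symmetric around 0, which is why clustering methods that do not assume beta-distributed data are usually run on them rather than on the raw beta values.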