Re-analysis of the data presented in Hinoue 2012

My previous attempts to identify CIMP in TCGA colorectal and colon datasets using 27k and 450k arrays failed. The purpose of this exercise was to understand precisely how the analyses are performed in already published papers. I chose the most recent paper on CIMP in CRC from Peter Laird's group at USC: "Genome-scale analysis of aberrant DNA methylation in colorectal cancer" by T. Hinoue et al published in Genome Research in 2011. Why I chose it:

They used Illumina 27k platform
The descriptions of their analyses are pretty clear, in addition Dr Hinoue has provided me with a Sweave file describing some of their analyses
They have the main clinical and technical information available: gender, age, tumor site, tumor stage and batch
Dr Hinoue has also provided a file with the mutation status of BRAF, KRAS and P53. Supplementary files that they didn't provide with the paper and weren't uploaded to GEO can be found on their group website here.
In the series matrix from GEO (GSE25062) they marked all probes excluded from the analysis as "null" (except those that belong to X and Y chromosomes, which was clear only from the Sweave file) which helped me in early identification of these probes on the platform instead of learning how to do it myself. It appeared later that it is very easy to do with R/Bioconductor package Genomic Ranges. Howto can be found here.

Few important points:

They verified the clusters that they obtained using RPMM algorithm (specific for beta-distributed data) by doing logit transformation of the data (into M value) and using R/Bioconductor package ConsensusClusterPlus (link to the paper describing the algorithm. The paper is well written and easy to understand. I found that I like the package based on this publication more than I like pvclust). Therefore since I personally prefer M values I focused on reproducing their results using this package rather than RPMM.
It wasn't clear anywhere whether their results were generated with batch/age/gender adjusted data or only batch adjusted data. Only when I requested age information from Dr Hinoue he mentioned that 25 patients do not have any age information which made me realize that they didn't adjust for it. I wanted to understand the difference in obtained clusters with and without age/gender adjustment.

Brief unorganized description of the data:

Total number of patients: 154. Tumor: 125, normal: 29. Tumor patients cam from 2 batches broken down into 100 and 25 patients. Gender information is available for all patients. Age information is available only for 100 patients that came from a single batch. "Raw" data is represented as an output from Genome Studio, can be downloaded from GEO. They sequences BRAF, KRAS and TP53, they also profiled gene expression for 25 patients (don't know the tumor/normal distribution).

Probes: 5060 probes in the series matrix are masked as "null" and supposedly they represent the probes which CpG site and 5 bp around it has SNPs, overlap with repetitive regions, not uniquely aligned at 20 nt at the 3' term of the probe, overlap with regions of deletions and insertions in the human genome, those that have P value >=0.05. Apparently that doesn't include probes from X and Y chromosomes because those probes were removed later as indicated in the Sweave file. This wasn't clear right away. When I removed the "null" probes and XY probes I ended up with the same number (?) as indicated in the Sweave document.

Outline:

Hinoue data: 125 tumor patients, batch removed, M value, remove 5060 "null probes" identified from the series matrix. Take 10% the most variable probes. Pvclust to identify cluster stability. Parameters: euclidean distance, hierarchical clustering, complete linkage, 10000 bootstraps

The best results I can get is 69 to 83 AU value for cluster stability even after 10,000 bootstraps. Does it mean that the clusters are weak and they just accepted it as is?

Will using a different package make any difference? Use ConsensusClusterPlus package with the same parameters as pvclust. Hierarchical clustering, evaluate 20 clusters, use 80% of the data for bootstrapping. They claimed that the identified 4 clusters.

It seems that there is some separation but the size of the clusters is very uneven, nothing like was presented in the paper.

After receiving the Sweave file I found that 5060 "null" probes don't include probes from XY chromosomes. Then I took the most variable probes I used for clustering and identified that 25% of those are from X or Y chromosomes.

Next:

Carefully identify the probes, remove the from the raw data (5060 + ~800).
Create two datasets: batch only normalized (125 tumor patients) and gender,age normalized (no need to remove batch since they all came from a single batch; 100 patients).
Repeat ConsensusClusterPlus with the most variable probes (HC, complete linkage, euclidean distance). Use 10% of the original probe number as described in the Sweave document. They followed the vignette directly.
Repeat ConsensusClusterPlus using K-means, K=2:6, Pearson correlation. Use 10% of the original probe number as described in the Sweave document.

Batch only normalized, HC, euclidean distance.

May be there are like 5 or 6 clusters but not 4. It definitely doesn't look the the clusters identified in the paper.

Gender and age adjusted, HC, euclidean distance.

This looks significantly worse.

K-means clustering, Pearson correlation and the seed value provided in the package. Batch removed:

> icl[["clusterConsensus"]]
      k cluster clusterConsensus
 [1,] 2       1        0.8790997
 [2,] 2       2        0.8825873
 [3,] 3       1        0.8900365
 [4,] 3       2        0.8615545
 [5,] 3       3        0.8763479
 [6,] 4       1        0.7145161
 [7,] 4       2        0.8060535
 [8,] 4       3        0.9781181
 [9,] 4       4        0.9889430
[10,] 5       1        0.8289404
[11,] 5       2        0.7577152
[12,] 5       3        0.8221909
[13,] 5       4        0.7363796
[14,] 5       5        0.9454712
[15,] 6       1        0.8789223
[16,] 6       2        0.7593188
[17,] 6       3        0.7090342
[18,] 6       4        0.7150963
[19,] 6       5        0.9857516
[20,] 6       6        0.9189523

Summary table for association with clinical variable for K=2,3,4,5,6 (only batch is removed)

K	2	3	4	5	6
Age	0.78	0.88	0.7	0.6345	0.444
Gender	0.12	0.05	2.02e-03	5.35e-03	8.097155e-03
Rectal/colon	0.0016	4.74e-04	4.44e-03	7.41e-03	2.2e-03
Tumor stage	0.36	0.43	0.23	0.3	0.58
BRAF	1.33e-05	1.03e-11	1.67e-20	1.01e-19	5.15e-19
KRAS	3.87e-03	0.04	6.7e-06	1.22e-04	5.97e-06
KRAS type	1.97e-02	0.24	7.14e-03	1.59e-02	2.32e-03
TP53	0.96	3.45e-06	6.86e-03	1.75e-03	1.26e-03
MLH1 mutation	1.26e-04	5.11e-10	1.45e-20	8.96e-20	4.12e-19

Gender and age are removed, ConsensusClusterPlus, K means, Pearson correlation:

Cluster consensus:

> icl[["clusterConsensus"]]
      k cluster clusterConsensus
 [1,] 2       1        0.9008927
 [2,] 2       2        0.9411635
 [3,] 3       1        0.8680856
 [4,] 3       2        0.8760303
 [5,] 3       3        0.7333630
 [6,] 4       1        0.7745014
 [7,] 4       2        0.7901681
 [8,] 4       3        0.8024310
 [9,] 4       4        0.8247401
[10,] 5       1        0.9153414
[11,] 5       2        0.8141924
[12,] 5       3        0.6573712
[13,] 5       4        0.6708655
[14,] 5       5        0.6319631
[15,] 6       1        0.8695192
[16,] 6       2        0.8568005
[17,] 6       3        0.7128842
[18,] 6       4        0.7407073
[19,] 6       5        0.5585342
[20,] 6       6        0.7426816

K	2	3	4	5	6
Tumor stage	0.36	0.44	0.66	0.83	0.76
Rectal/colon	8.69e-03	1.68e-02	6.12e-02	0.15	0.13
BRAF	9.18e-05	1.46e-12	1.81e-12	1.16e-13	1.63e-10
KRAS	2.41e-02	3.36e-02	5.51e-02	2.53e-04	3.32e-03
KRAS type	6.98e-02	3.705241e-01	0.55	9.23e-03	0.46
TP53	0.72	5.90e-03	4.15e-03	1.40e-02	2.14e-02
MLH1 methyl.	9.19e-05	1.46e-12	2.01e-10	1.03e-13	1.63e-10