Attempt to reproduce the results in Shen 2007 using TCGA 450k data

The CpGs were selected based on the genes used in Shen, 2007 paper. Here is the list of genes and how the CpGs were selected on the platform. Final set of CpGs is 578.

Read the data. I had some concern about potential outliers, so looked at PC1 and PC2 vs sample group to identify unusual samples. PC2 didn't show any outliers, but on the plot of PC1 vs sample group I identified unusual tumor samples:
> library(corpcor) > u<-fast.svd(sweep(mvalM,1,rowMeans(mvalM))) > boxplot(u$v[,1]~tech$Sample_Group) > boxplot(u$v[,1]~tech$Sample_Group)$out [1] 0.08914879 -0.16326579 -0.12731540 0.10161632 -0.14924534 -0.13352175 [7] -0.17971343 0.09317337 -0.15414415 > names(boxplot(x~tech$Sample_Group)$out) [1] "5775041088_R06C01" "5775041088_R06C02" "6042308158_R03C01" "6042308158_R05C01" [5] "6042316008_R03C01" "6057825020_R05C01" "6264509018_R05C02" "6929718053_R06C02" [9] "6929718054_R04C01" > boxplot(x~tech$Sample_Group,ylab="PC1",main="Mval Colon Array, Slide, Age, Gender removed") > text(x=boxplot(x~tech$Sample_Group)$group,y=boxplot(x~tech$Sample_Group)$out,labels=names(boxplot(x~tech$Sample_Group)$out))
Identification of differentially methylated CpGs from the set of Shen's genes. In Shen,2007 they used Welch t-test to identify differentially methylated regions (DMR). Started with dmpFinder from the minfi package.
> dmp <- dmpFinder(mvalShen, pheno=tech$Sample_Group, type="categorical") > head(dmp) intercept f pval qval 307 -0.17219256 465.1685 7.140400e-57 5.193259e-55 306 -0.50635458 435.5644 1.030627e-54 3.747909e-53 299 -0.43932370 426.4806 4.956005e-54 1.201512e-52 305 -0.07570649 414.9318 3.768098e-53 6.851406e-52 310 -0.19271258 375.9914 4.661262e-50 6.780333e-49 408 0.45519864 340.9060 4.281293e-47 5.189687e-46 > table(dmp$pval<=0.05) FALSE TRUE 154 428
420 CpGs were identified as differentially methylated. I looked at how many CpGs for each were identified as differentially methylated and found that some CpGs that belong to genes such ERalpha, MyoD1, HPP1,N33, SFRP1 were also identified as DMR. In the paper those genes were considered to be associated with age and therefore hypermethylated in all cancer. I am surprised to have this signal because I did remove age and gender from the data. I lowered the p value cutoff from 0.05 to 0.001 and I still see CpGs from the "control" genes
> tapply(markSign$Symbol,as.factor(markSign$Symbol),length) COX2/PTGS2 DAPK/DAPK1 ERalpha/ESR1 HPP1/TMEEF2 Megalin/LRP2 MGMT NA 3 20 7 9 66 MINT1 MINT12 MINT17 MINT2 MINT27 MINT31 1 1 NA 3 1 NA MLH1 MyoD1 N33/TUSC3 Neurog1 p14/CDKN2D p16/CDKN2A 1 16 7 22 3 NA RASSF1A RIZ1/PRDM2 RUNX3 SFRP1 SOCS1 TERT 8 8 38 10 15 36 THBS1 THBS2 3 26
Cluster analysis of all Shen genes' CpGs. Cluster analysis of tumor samples only. In the paper they used hierarchical clustering with distance = euclidean and clustering method = average. Clustering results and a heatmap:
distTumor<-dist(t(mvalShenTumor)) tumor190average<-hclust(distTumor,method="average") plot(hclust(distTumor,method="average"),hang=-1,labels=F,xlab="Colon 190 tumor samples") ##code for the heatmap: heatmap.2(mvalShenTumor,hclustfun=function(x)hclust(x,method="average"),scale="none",trace="none",col=brewer.pal(n=11,"RdYlGn"))

Visually on the heatmap I don't see a clear separation between the clusters. Next test cluster stability using pvclust R library. First do 1000 bootstraps:
> library(pvclust) > result<-pvclust(mval,method.dist="euclidean",method.hclust="average") #default nboot=1000 > plot(result,labels=F)
It seems that it identified 9 clusters, the majority of them are very small, they look like hey have 2 patients maximum, one cluster seems bigger. I also looked at the plot of AU p values vs standard error
I looked at the p values and standard error for the selected clusters:
seplot(result,identify=TRUE) [1] 53 76 95 > print(result,which=c(53,76,95)) Cluster method: average Distance : euclidean Estimates on edges: au bp se.au se.bp v c pchi 53 0.871 0.005 0.117 0.001 0.719 1.851 0.506 76 0.883 0.002 0.204 0.001 0.827 2.016 0.944 95 0.853 0.001 0.303 0.001 0.968 2.018 0.686

According to the handout here: http://www.is.titech.ac.jp/~shimo/prog/pvclust/ I need to look at the p value +/- 2 SE for that pvalues: "Take the 21st cluster for example. AU p-value is estimated as 0.795, with SE (standard error) 0.733. By an analogue of standard normal theory, the "true" AU p-value is roughly estimated to exist in between (AU - 2 * SE) and (AU + 2 * SE). From this rough inference, the true AU p-value seems to be in between -0.671 and 2.261. However AU is defined to be between 0 and 1, so this inference is of no meaning"
So lets take a look at each of these clusters, for 53:
> 0.871-2*0.177 [1] 0.517 > 0.871+2*0.177 [1] 1.225 Unreal, now for 76: > 0.883+2*0.204 [1] 1.291 > 0.883-2*0.204 [1] 0.475
My understanding is that I need to increase number of bootstraps to improve standard error. I did 10,000. Clusters and p values vs standard errors:
It seems that after 10,000 I have fewer clusters (with alpha=0.95) and SE is smaller on overall.
> seplot(result,identify=T) [1] 74 81 90 99 104 112 > print(result, which=c(74, 81, 90, 99, 104, 112)) Cluster method: average Distance : euclidean Estimates on edges: au bp se.au se.bp v c pchi 74 0.887 0 0.112 0 1.055 2.263 0.700 81 0.860 0 0.147 0 1.129 2.210 0.258 90 0.853 0 0.155 0 1.213 2.262 0.316 99 0.957 0 0.141 0 0.973 2.688 0.738 104 0.815 0 0.373 0 1.407 2.302 0.817 112 0.954 0 0.110 0 1.006 2.694 0.958 ##More stable clusters: > pvpick(result,alpha=0.95) $clusters $clusters[[1]] [1] "6042316015_R02C02" "6929718079_R05C02" $clusters[[2]] [1] "6929718053_R05C01" "6929718065_R03C02" $clusters[[3]] [1] "5775041084_R06C01" "5775041088_R03C01" "6042308159_R03C01" [4] "6042316015_R02C01" "6042316015_R04C01" "6042316001_R05C01" [7] "6042316001_R03C02" "6042316011_R01C01" "6042316011_R02C01" [10] "6042316011_R05C01" "6057825035_R05C02" "6057825020_R02C01" [13] "6057825020_R01C02" "6057825020_R02C02" "6057825028_R01C01" [16] "6057825028_R02C01" "6057825028_R03C01" "6057825028_R01C02" [19] "6057825028_R02C02" "6057825002_R01C01" "6057825002_R02C01" [22] "6264509018_R01C01" "6264509018_R01C02" "6264509087_R02C01" [25] "6264509087_R03C01" "6264509127_R03C01" "6264496026_R06C02" [28] "6929718053_R01C01" "6929718053_R02C01" "6929718054_R03C01" [31] "6929718054_R01C02" "6929718054_R02C02" "6929718054_R03C02" [34] "6929718065_R02C01" "6929718065_R02C02" "6929718079_R01C01" [37] "6929718079_R03C01" "6929718079_R06C01" "6929718079_R04C02" [40] "6929718086_R01C02" $clusters[[4]] [1] "6042308158_R06C01" "6042316001_R04C01" "6929718054_R04C02" $clusters[[5]] [1] "6042308159_R04C02" "6042316001_R06C01" "6057825020_R03C01" [4] "6057825002_R03C01" "6264509087_R03C02" $clusters[[6]] [1] "6264509127_R02C01" "6929718053_R03C02" $clusters[[7]] [1] "5775041065_R01C01" "6264509087_R06C01" "6929718054_R06C01" $clusters[[8]] [1] "5775041088_R03C02" "6042308165_R03C01" "6042308165_R01C02" [4] "6042308159_R01C02" "6042308158_R01C02" "6042308158_R04C02" [7] "6042316001_R01C02" "6042316001_R02C02" "6042316011_R06C01" [10] "6042316010_R06C02" "6057825035_R04C02" "6057825020_R01C01" [13] "6057825020_R06C01" "6057825020_R03C02" "6057825020_R06C02" [16] "6057825028_R04C01" "6057825028_R05C01" "6057825002_R04C01" [19] "6057825002_R05C01" "6264509087_R01C01" "6264509087_R01C02" [22] "6264509087_R02C02" "6264509127_R05C01" "6929718053_R02C02" [25] "6929718053_R04C02" "6929718054_R01C01" "6929718065_R01C01" [28] "6929718065_R05C01" "6929718065_R01C02" "6929718079_R02C01" [31] "6929718079_R05C01" "6929718079_R01C02" "6929718079_R03C02" $clusters[[9]] [1] "6042316015_R03C01" "6929718079_R02C02" $edges [1] 21 39 76 78 86 89 107 112 129 allCluster<-pvpick(result,alpha=0.95)$clusters signCl<-unlist(allClusters) x<-rep("color",ncol(mval)) k<-match(signCl,colnames(mval)) x[k]<-"red" x[-k]<-"black" write.table(x,"ClusterColoring.txt",sep="\t") heatmap.2(mvalShenTumor,hclustfun=function(x) hclust(x,method="average"),scale="none",trace="none",col=brewer.pal(n=11,"RdYlGn"),symkey=F,labRow=F,labCol=F,ColSideColors=as.character(clCol[,1]))
I guess this is ok, identify the patients that went to stable clusters and mark them on the heatmap:
Analysis of the most variable CpGs from Shen's genes. Selected the most variable CpGs (based on standard deviation >= 0.5) out of 578 to see how it will affect the overall clustering. Also for the comparison the cluster the was published the in Shen, 2007 paper:

It seems that with this set of CpGs (my cluster) (corresponding genes are labeled on the side) I see more prominent clusters and differences in methylation between cluster. I did bootstrap analysis with 1000 bootstraps and the results weren't very impressive:

Next I repeated with 10,000 bootstraps and got better AU values and standard errors:
> seplot(result,identify=T) [1] 44 120 159 165 187 > print(result, which=c(44,120,159,165.187)) Cluster method: average Distance : euclidean Estimates on edges: au bp se.au se.bp v c pchi 44 0.910 0.001 0.040 0 0.855 2.197 0.698 120 0.910 0.000 0.097 0 1.014 2.354 0.643 159 0.868 0.001 0.060 0 1.036 2.153 0.619 165 0.935 0.000 0.137 0 0.995 2.505 0.125
5. One CpG per Shen's gene. Justin pointed out that my cluster is driven by CpGs from just a few genes. Next I selected one most variable CpG for each gene and made heatmaps (with average and complete clustering). Two genes (p16/CDKN2A and MINT31) were completely excluded because their CpGs were duplicates (shared with other genes).

It is very surprising to see that MINT27 seems to hypomethylated in all tumors, as well as ERalpha, and MyoD1. The last two were supposed to be the a-type genes, methylated in all cancers. I need to check the heatmap in datasets where age and gender weren't removed.

6. Analysis of the most variable CpGs in the same dataset where age and gender were not removed. Since I can't reproduce the result presented in Shen, 2007 another solution came up. They didn't not remove age and gender from their data, so I need to repeat that (keep age/gender) and see if I get better clusters.

Steps:

Data preparation. Load the M value dataset after removing intensity and color. Use snm to remove array and slide from the entire dataset
7 outliers were identified in the PC1 (not PC2) of the normalized data. For the sake of consistency remove the same 9 outliers that were taken out in the dataset above (7 belong to 9). Also remove cell lines and normal samples.
First, build a heatmap with the CpGs selected as the most variable in the analysis with age/gender/array/slide removed. Here is the heatmap and the result of the bootstrap analysis (nboot=10000)

It is clear that even after 10000 bootstaps the clusters are not very stable. It is even worse result than the one I got when I removed age and gender.
Build heatmap with the most variable CpGs per Shen gene selected based on the current normalization (only slide/array removed). Next to it is the results of the bootstap analysis with 10000 bootstraps. It is clear the the identified clusters are very unstable.

It is interesting to observe that in all cases MLH1 appears to be hypemethylated in ALL tumors even the most variable CpG (within the promoter region) was selected. In many papers describing CIMP phenotype hypermethylation of MLH1 is highly correlated with CIMP-H phenotype and mutation in BRAF gene (Nature 2006, Shen, 2007). Could it be that all these tumors are BRAF positive (mutated?). Unfortunately, we still don't have sequencing data.

Conclusions:

It seems that me that there is some difference in DNA methylation between multiple tumor cases, however, it is not driven but the same genes that were used for CIMP phenotype in Shen, 2007 paper.
Type-A genes appear to have the opposite methylation status in my dataset vs the results obtained in Shen, 2007. For example, they claim that ER2 alpha, MyoD1, HPP1 are hypermethylated in their dataset. In my dataset these genes appear to have the opposite signal. It would be useful to investigate methylation status of all CpGs related to these genes on the platform because although I assumed that in Shen, 2007 paper they used promoter regions I don't know that for a fact.
MLH1 appears to be hypermethylated in almost all tumor (even based on the most variable CpG on the platform). In Shen, 2007 paper I could see clear separation between several clusters based on MLH1 methylation. In fact, they claim that "the best single marker to predict CIMP1 group is hMLH1 methylation" (page 18656, right column, paragraph 3). This cluster is also characterized exclusively by BRAF gene mutation. In the other Nature, 2006 paper they also associated MLH1 hypermethylation with BRAF mutation. Although we don't have somatic mutation data for these patients, I wonder if they are all BRAF mutants. For the patients win 27k platform we had more variety (BRAF/KRAS mutations). Again, for this analysis I selected CpGs that were supposedly within the promoter region of MLH1 but it will be useful to investigate all CpGs that are associated with MLH1 on this platform.
Finally, in Nature 2006 paper they proposed another marker platform. I should take a look at it and see where I can get better results.
At this point it might be more productive to switch back to 27k platform which unfortunately has a big batch effect and they don't have the raw data uploaded (only processed in Genome Studio)