For the pilot project we want to recreate the results obtained in Shen, 2007 paper. To do that I need to identify IDs of the CpGs that they used in their analyses. Here is the list:
Classical CIMP markers: MINT1, MINT2, MINT27, MINT12, MINT31, MINT17
Type C genes (methylated only in cancer): p14 (9p21), MLH1 (2p22), THBS1 (15q14), THBS2 (6q), MGMT (10q26), COX2 (1q25.2), Megalin (1q21.1), RIZ1 (1p36), p16 (9p21), RASSF1A (I remember from my PhD work that this was my very good positive control for methylation in cancer cell lines; 3p21.31), DAPK (9p34.1), TIMP3 (22q12.1), hTERT (5p15.33), Neurog1 (5q31.1), SOCS1 (16p13.13), RUNX3 (1p36.11)
Type A genes (methylated in normal and cancer): ER alpha (6q25.1), MyoD1 (11p15.4), N33 (8p22), SFRP1 (8p12), HPP1 (2q32.3)
They provided the sequences for some of the markers and indicated the publications from which they got the sequence for the other markers for the methylation assays. Since it was too much pain to retrieve the sequences for the primers I ended up retrieving CpGs near the genes listed and then finding only the CpGs within CpG islands using Illumina's HMM predictions (part of the Bioconductor package). The reason for that is because I am pretty sure in 2007 they focused on the promoter regions and CpG islands are only within promoter regions of the genes. For the MINTs I mapped the sequences to hg19 (find the table here) and then found all CpGs on 450k platform within those sequences.
Number of CpGs profiles on 450k platform for every gene listed:
> tapply(x$Symbol,as.factor(x$Symbol),length) COX2/PTGS2 DAPK/DAPK1 ERalpha/ESR1 HPP1/TMEEF2 MGMT MLH1 17 29 68 17 176 7 Megalin/LRP2 MyoD1 N33/TUSC3 Neurog1 RASSF1A RIZ1/PRDM2 21 16 21 31 51 66 RUNX3 SFRP1 SOCS1 TERT THBS1 THBS2 91 37 28 100 33 57 TIMP3 p14/CDKN2D p16/CDKN2A NA 3 3 |
No CpGs were found for TIMP3. Exclude from further analyses.
Now the number of CpGs within CpG island for every gene based on Illumina's HMM predictions:
> tapply(x$Is_HMM_Island,as.factor(x$Symbol),function(x) length(which(x==1))) COX2/PTGS2 DAPK/DAPK1 ERalpha/ESR1 HPP1/TMEEF2 MGMT MLH1 12 11 24 7 142 0 Megalin/LRP2 MyoD1 N33/TUSC3 Neurog1 RASSF1A RIZ1/PRDM2 13 16 7 24 33 32 RUNX3 SFRP1 SOCS1 TERT THBS1 THBS2 59 12 23 87 15 42 TIMP3 p14/CDKN2D p16/CDKN2A NA 3 3 |
Somehow MLH1 doesn't have any CpGs that are predicted to be within a CpG island. So I manually looked at the CpGs that were pulled down for this gene to see if they are within the promoter region.
Looking for CpGs within MINT regions:
> mint1<-subset(chrloc, Chromosome_37=="5" & Coordinate_37<75380277 & Coordinate_37>75379746) > mint1 Probe_ID Chromosome_37 Coordinate_37 200977 cg10583931 5 75379800 469458 cg26904700 5 75379748 ##MINT2: > mint2 Probe_ID Chromosome_37 Coordinate_37 185088 cg09671962 2 58655463 188239 cg09859179 2 58655104 456366 cg26154670 2 58655041 ##MINT27 mint27<-subset(chrloc, Chromosome_37=="10" & Coordinate_37>46970744 & Coordinate_37<46970985) > mint27 Probe_ID Chromosome_37 Coordinate_37 336461 cg18513313 10 46970868 ##MINT12: > mint12<-subset(chrloc, Chromosome_37=="7" & Coordinate_37>129424950 & Coordinate_37<129425501) > mint12 Probe_ID Chromosome_37 Coordinate_37 28893 cg01390574 7 129425451 356420 cg19776616 7 129425330 ##MINT31: > mint31<-subset(chrloc, Chromosome_37=="17" & Coordinate_37>48636113 & Coordinate_37<48636784) > mint31 Probe_ID Chromosome_37 Coordinate_37 157895 cg08133931 17 48636626 266041 cg14315444 17 48636344 309320 cg16829453 17 48636646 ##MINT17: > mint17<-subset(chrloc, Chromosome_37=="12" & Coordinate_37>117316933 & Coordinate_37<117317411) > mint17 Probe_ID Chromosome_37 Coordinate_37 291793 cg15824316 12 117317338 |
Total number of CpGs is 582. However there are 4 duplicate CpG IDs probably because they are within the promoter regions of a couple of genes. I decided to kick them out from the analysis. The final working set of CpGs is 578.