Shen, 2007 CIMP CpGs on 450k platform
For the pilot project we want to recreate the results obtained in the Shen, 2007 paper. To do that, I need to identify the IDs of the CpGs that correspond to the markers they used in their analyses. Here is the list of markers:
Classical CIMP markers: MINT1, MINT2, MINT27, MINT12, MINT31, MINT17
Type C genes (methylated only in cancer): p14 (9p21), MLH1 (3p22), THBS1 (15q14), THBS2 (6q), MGMT (10q26), COX2 (1q25.2), Megalin (2q31.1), RIZ1 (1p36), p16 (9p21), RASSF1A (3p21.31; I remember from my PhD work that this was a very good positive control for methylation in cancer cell lines), DAPK (9q34.1), TIMP3 (22q12.1), hTERT (5p15.33), Neurog1 (5q31.1), SOCS1 (16p13.13), RUNX3 (1p36.11)
Type A genes (methylated in normal and cancer): ER alpha (6q25.1), MyoD1 (11p15.4), N33 (8p22), SFRP1 (8p12), HPP1 (2q32.3)
They provided the sequences for some of the markers and, for the others, cited the publications from which the methylation assay sequences were taken. Since retrieving all the primer sequences was too painful, I instead retrieved the CpGs near the listed genes and then kept only the CpGs within CpG islands, using Illumina's HMM predictions (part of the Bioconductor package). My reasoning is that in 2007 they almost certainly focused on promoter regions, and CpG islands frequently overlap gene promoters (they also occur in gene bodies and intergenic regions, so this filter is only an approximation of "promoter CpGs"). For the MINTs I mapped the sequences to hg19 (find the table here) and then found all CpGs on the 450k platform within those sequences.
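The gene-based selection described above can be sketched in a few lines. This is a minimal illustration in Python rather than the code actually used: the rows, probe IDs, gene names and the field names `probe_id`, `genes` and `hmm_island` are all hypothetical placeholders standing in for the real 450k annotation table, and I am assuming that a probe may list several genes separated by semicolons and that an empty island field means "not in a predicted island".

```python
# Toy annotation rows; field names, gene names and probe IDs are hypothetical
# placeholders, not real 450k annotation content.
ANNOTATION = [
    {"probe_id": "probe_a", "genes": "GENE1;GENE1", "hmm_island": "1:1000-2000"},
    {"probe_id": "probe_b", "genes": "GENE1", "hmm_island": ""},
    {"probe_id": "probe_c", "genes": "GENE2;GENE3", "hmm_island": "2:500-900"},
    {"probe_id": "probe_d", "genes": "", "hmm_island": "3:10-50"},
]


def probes_for_genes(rows, target_genes, islands_only=False):
    """Return (gene, probe_id) pairs for probes annotated to any target gene."""
    targets = set(target_genes)
    hits = []
    for row in rows:
        # Assumption: an empty island field means "not in a predicted island".
        if islands_only and not row["hmm_island"]:
            continue
        # A probe can list several genes, or the same gene once per transcript.
        genes = {g for g in row["genes"].split(";") if g}
        for gene in sorted(genes & targets):
            hits.append((gene, row["probe_id"]))
    return hits


all_hits = probes_for_genes(ANNOTATION, ["GENE1", "GENE2"])
island_hits = probes_for_genes(ANNOTATION, ["GENE1", "GENE2"], islands_only=True)
print(all_hits)
print(island_hits)
```

Running the same function twice, with and without `islands_only`, gives the two per-gene views used in the rest of these notes: all probes near a gene, and only those inside a predicted island.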
Number of CpGs profiled on the 450k platform for every gene listed:
No CpGs were found for TIMP3, so it is excluded from further analyses.
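A per-gene tally makes it easy to spot genes that have no probes at all. The sketch below is Python with toy data; the gene names, probe IDs and the helper `count_per_gene` are hypothetical and only illustrate the bookkeeping, not the actual counts.

```python
from collections import Counter


def count_per_gene(hits, target_genes):
    """Count probes per gene, keeping genes with zero probes in the result."""
    counts = Counter(gene for gene, _probe in hits)
    return {gene: counts.get(gene, 0) for gene in target_genes}


# Toy (gene, probe_id) pairs; names are placeholders.
hits = [("GENE1", "probe_a"), ("GENE1", "probe_b"), ("GENE2", "probe_c")]
targets = ["GENE1", "GENE2", "GENE3"]

counts = count_per_gene(hits, targets)
missing = [gene for gene, n in counts.items() if n == 0]
kept = [gene for gene in targets if gene not in missing]

print(counts)
print("no probes:", missing)
print("kept:", kept)
```

Building the result from the target list rather than from the hits is the point here: a gene with no probes still appears, with a count of zero, instead of silently vanishing from the table.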
Now the number of CpGs within a CpG island for every gene, based on Illumina's HMM predictions:
Somehow MLH1 doesn't have any CpGs that are predicted to be within a CpG island, so I manually inspected the CpGs retrieved for this gene to see whether they lie within the promoter region.
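The manual promoter check could also be expressed as a strand-aware distance to the transcription start site (TSS). This is only a sketch in Python: the function `in_promoter`, the window of 1500 bp upstream and 500 bp downstream, and the coordinates are all assumptions chosen for illustration, not the definition of "promoter" used for MLH1 here.

```python
def in_promoter(cpg_pos, tss, strand, upstream=1500, downstream=500):
    """True if cpg_pos is within the window around the TSS, in transcript orientation.

    The default window sizes are arbitrary assumptions for illustration.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # offset > 0 means downstream of the TSS along the direction of transcription.
    offset = cpg_pos - tss if strand == "+" else tss - cpg_pos
    return -upstream <= offset <= downstream


# Toy coordinates, not real gene positions.
TSS = 10_000
plus = [in_promoter(p, TSS, "+") for p in (8_400, 8_500, 10_500, 10_501)]
minus = [in_promoter(p, TSS, "-") for p in (11_500, 11_501, 9_500, 9_499)]
print(plus)
print(minus)
```

For a gene on the minus strand, "upstream" means higher genomic coordinates, which is why the offset is flipped before the comparison.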
Looking for CpGs within MINT regions:
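Once the MINT sequences are mapped to genomic intervals, finding the probes inside them is a point-in-interval lookup. The Python sketch below uses made-up region names, coordinates and probe IDs, and assumes 1-based coordinates with both interval ends inclusive; the real intervals come from the hg19 mapping mentioned earlier.

```python
# Toy intervals (name, chromosome, start, end) and probes (id, chromosome, position).
# All names and coordinates are hypothetical placeholders.
REGIONS = [("region_x", "chr1", 100, 200), ("region_y", "chr2", 50, 80)]
PROBES = [
    ("probe_a", "chr1", 100),
    ("probe_b", "chr1", 201),
    ("probe_c", "chr2", 80),
    ("probe_d", "chr3", 60),
]


def probes_in_regions(probes, regions):
    """Map each region name to the probes whose position falls inside it."""
    hits = {name: [] for name, _chrom, _start, _end in regions}
    for probe_id, chrom, pos in probes:
        for name, r_chrom, start, end in regions:
            # Assumption: 1-based coordinates, both ends inclusive.
            if chrom == r_chrom and start <= pos <= end:
                hits[name].append(probe_id)
    return hits


region_hits = probes_in_regions(PROBES, REGIONS)
print(region_hits)
```

The nested loop is fine for a handful of MINT regions; for many regions a sorted or indexed interval structure would be the usual replacement.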
The total number of CpGs is 582. However, 4 CpG IDs appear more than once, probably because they fall within the promoter regions of more than one gene. I removed the redundant copies from the analysis, so the final working set is 578 unique CpGs.
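The duplicate handling can be sketched as follows, in Python with placeholder probe IDs. The helper `drop_duplicate_ids` is hypothetical; it keeps the first copy of every ID and reports which IDs were repeated. With 4 IDs each listed twice, this kind of step takes 582 entries down to 578, which matches the numbers above under that assumption.

```python
from collections import Counter


def drop_duplicate_ids(probe_ids):
    """Return (unique IDs in first-seen order, sorted list of IDs seen more than once)."""
    counts = Counter(probe_ids)
    duplicated = sorted(pid for pid, n in counts.items() if n > 1)
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    unique = list(dict.fromkeys(probe_ids))
    return unique, duplicated


# Toy IDs; placeholders, not real probe names.
ids = ["probe_a", "probe_b", "probe_a", "probe_c", "probe_b"]
unique_ids, duplicated_ids = drop_duplicate_ids(ids)
print(len(ids), "entries,", len(unique_ids), "unique")
print("repeated:", duplicated_ids)
```

Reporting the repeated IDs alongside the unique list makes it possible to check afterwards which genes shared them.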