...
They provided the sequences for some of the markers and indicated the publications from which they got the sequence for the other markers for the methylation assays. Use them to map the CpG regions they assays. Map them to hg19:Since it was too much pain to retrieve the sequences for the primers I ended up retrieving CpGs near the genes listed and then finding only the CpGs within CpG islands using Illumina's HMM predictions (part of the Bioconductor package). The reason for that is because I am pretty sure in 2007 they focused on the promoter regions and CpG islands are only within promoter regions of the genes. For the MINTs I mapped the sequences to hg19 (find the table here) and then found all CpGs on 450k platform within those sequences.
Number of CpGs profiles on 450k platform for every gene listed:
Code Block | ||
---|---|---|
| ||
> tapply(x$Symbol,as.factor(x$Symbol),length)
COX2/PTGS2 DAPK/DAPK1 ERalpha/ESR1 HPP1/TMEEF2 MGMT MLH1
17 29 68 17 176 7
Megalin/LRP2 MyoD1 N33/TUSC3 Neurog1 RASSF1A RIZ1/PRDM2
21 16 21 31 51 66
RUNX3 SFRP1 SOCS1 TERT THBS1 THBS2
91 37 28 100 33 57
TIMP3 p14/CDKN2D p16/CDKN2A
NA 3 3 |
No CpGs were found for TIMP3. Exclude from further analyses.
Now the number of CpGs within CpG island for every gene based on Illumina's HMM predictions:
Code Block | ||
---|---|---|
| ||
> tapply(x$Is_HMM_Island,as.factor(x$Symbol),function(x) length(which(x==1)))
COX2/PTGS2 DAPK/DAPK1 ERalpha/ESR1 HPP1/TMEEF2 MGMT MLH1
12 11 24 7 142 0
Megalin/LRP2 MyoD1 N33/TUSC3 Neurog1 RASSF1A RIZ1/PRDM2
13 16 7 24 33 32
RUNX3 SFRP1 SOCS1 TERT THBS1 THBS2
59 12 23 87 15 42
TIMP3 p14/CDKN2D p16/CDKN2A
NA 3 3 |
Somehow MLH1 doesn't have any CpGs that are predicted to be within a CpG island. So I manually looked at the CpGs that were pulled down for this gene to see if they are within the promoter region.
Looking for CpGs within MINT regions:
Code Block | ||
---|---|---|
| ||
> mint1<-subset(chrloc, Chromosome_37=="5" & Coordinate_37<75380277 & Coordinate_37>75379746)
> mint1
Probe_ID Chromosome_37 Coordinate_37
200977 cg10583931 5 75379800
469458 cg26904700 5 75379748
##MINT2:
> mint2
Probe_ID Chromosome_37 Coordinate_37
185088 cg09671962 2 58655463
188239 cg09859179 2 58655104
456366 cg26154670 2 58655041
##MINT27
mint27<-subset(chrloc, Chromosome_37=="10" & Coordinate_37>46970744 & Coordinate_37<46970985)
> mint27
Probe_ID Chromosome_37 Coordinate_37
336461 cg18513313 10 46970868
##MINT12:
> mint12<-subset(chrloc, Chromosome_37=="7" & Coordinate_37>129424950 & Coordinate_37<129425501)
> mint12
Probe_ID Chromosome_37 Coordinate_37
28893 cg01390574 7 129425451
356420 cg19776616 7 129425330
##MINT31:
> mint31<-subset(chrloc, Chromosome_37=="17" & Coordinate_37>48636113 & Coordinate_37<48636784)
> mint31
Probe_ID Chromosome_37 Coordinate_37
157895 cg08133931 17 48636626
266041 cg14315444 17 48636344
309320 cg16829453 17 48636646
##MINT17:
> mint17<-subset(chrloc, Chromosome_37=="12" & Coordinate_37>117316933 & Coordinate_37<117317411)
> mint17
Probe_ID Chromosome_37 Coordinate_37
291793 cg15824316 12 117317338 |
Total number of CpGs is 582. However there are 4 duplicate CpG IDs probably because they are within the promoter regions of a couple of genes. I decided to kick them out from the analysis. The final working set of CpGs is 578.