Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device.
Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device. Atlassian cookies and tracking notice, (opens new window)
27k platform - how to identify probes that overlap with known SNPs and repetitive regions
May 17, 2012
According to these publications here and here when analyzing Illumina Human Methylation 27k arrays probes should be removed from consideration if they:
Contain SNPs within the 5 base pairs from the interrogated CpG site
Overlap with a repetitive element
Not uniquely mapped to the human genome (which is because they overlap with the repetitive region?)
Overlap with the known regions of insertions and deletions
Are on X,Y chromosomes
Have P values => 0.05
While 5 and 6 are very easy to do 1-4 are not very easy without any experience. I found that following code snippets from the vignette for R/Bioconductor package IlluminaHumanMethylation450kprobe makes it very easy. Also, Tim Triche is a very helpful person (author of the package)! Since some of the code is a little outdated I am posting here an updated version of the key functions.
Get the genomic locations of each interrogated CpG probe. Add 5 bp around the position. I will look for SNPs within this region since it will affect probe hybridization.
Here is the parallelized version of identifying the SNPs using library parallel. It returns CpG ID and associated SNP ID.
CpG.snps.by.chr = mclapply(chs, function(ch) {
snps = getSNPlocs(paste('ch', ch, sep=''), as.GRanges=TRUE)
seqlevels(snps) <- gsub('ch','chr',seqlevels(snps))
names(snps) = paste('rs',elementMetadata(snps)$RefSNP_id,sep='')
message(paste('Scanning for CpG SNPs in probes on chromosome', ch))
overlapping = findOverlaps(CpGs.27k, snps)
overlap = as.matrix(overlapping) #Originally I used matchMatrix but this function seems to be deprecated
results = data.frame(Probe_ID=as.character(names(CpGs.27k)[overlap[,1]]), rsID=as.character(names(snps)[overlap[,2]]))
return(results)
})
source("getSNPCpGprobeOverlap.R")
Combine results into a nice looking data frame. Total number of CpGs identified is 2894, some of them are repetitive = 239. So the total to be removed because of the SNPs is ~2500.
Run the following function. RM = repeat masker track, TRF = tandem repeats finder
CpG.rpts.by.chr = mclapply(chs, function(ch){
chr = Hsapiens[[paste('chr',ch,sep='')]]
rpts = union(masks(chr)$RM, masks(chr)$TRF)
probes = which(seqnames(CpGs.27k)==paste('chr',ch,sep=''))
# note how we have to use RangedData instead!!
CpGs.chr = ranges(CpGs.27k[probes])
#message(paste('Scanning for repeats in CpG sites on chromosome', ch))
overlapping = findOverlaps(CpGs.chr, rpts)
overlap = as.matrix(overlapping)
results = data.frame(Probe_ID=as.character(names(CpGs.chr)[overlap[,1]]))
return(results)
})
source("getRepetitiveREgionCpGProbeOverlap.R")
Note: I am looking only within 5 bp of the CpG site, not the entire probe. It might be wrong. The function returns the list of CpGs that were mapped to the repetitive regions on each chromosome. Combine everything into a nice data frame. Total number of CpGs identified: 2716. Seems that there is one duplicate: