IlluminaHumanMethylation450k.db Bioconductor package

To run a test analysis of DNA methylation data, I need to cluster the data based on the 27 markers described in Shen, 2007. Seven of those markers are MINTs, and I need to work out their most likely locations on the 450k array. I will use the Bioconductor IlluminaHumanMethylation450k.db package for annotation (i.e. the chromosome and position of each probe), but some of its features are confusing. Here I test what each annotation feature in the package really means.

IlluminaHumanMethylation450kCHR36: What chromosome does the target sequence for a probe align to, in build 36?

> x <- IlluminaHumanMethylation450kCHR36
> xx <- as.list(x)[1:3]
> xx
$cg00000029
[1] "16"
$cg00000108
[1] "3"
$cg00000109
[1] "3"

IlluminaHumanMethylation450kCHR37: What chromosome does the target sequence for a probe align to, in build 37?

> x <- IlluminaHumanMethylation450kCHR37
> xx <- as.list(x)[1:3]
> xx
$cg00000029
[1] "16"
$cg00000108
[1] "3"
$cg00000109
[1] "3"

IlluminaHumanMethylation450kCHRLOC: IlluminaHumanMethylation450kCHRLOC is an R object that maps manufacturer identifiers to the starting position of the gene. The position of a gene is measured as the number of base pairs. The CHRLOCEND mapping is the same as the CHRLOC mapping except that it specifies the ending base of a gene instead of the start. Each manufacturer identifier maps to a named vector of chromosomal locations, where the name indicates the chromosome. Due to inconsistencies that may exist at the time the object was built, these vectors may contain more than one chromosome and/or location. If the chromosomal location is unknown, the vector will contain an NA. Chromosomal locations on both the sense and antisense strands are measured as the number of base pairs from the p (5’ end of the sense strand) to q (3’ end of the sense strand) arms. Chromosomal locations on the antisense strand have a leading "-" sign (e.g. -1234567). Since some genes have multiple start sites, this field can map to multiple locations. Mappings were based on data provided by: UCSC Genome Bioinformatics (Homo sapiens) ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19 With a date stamp from the source of: 2010-Mar22

> x <- IlluminaHumanMethylation450kCHRLOC
> xx <- as.list(x)[1:3]
> xx
$cg00000029
      16 
53468350 
$cg00000108
       3        3 
37440967 37440967 
$cg00000109
        3         3 
171757417 171758343
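The named-vector layout and the sign convention described above can be unpacked with base R alone. The following is a minimal sketch on mock data copied from the output above; the entry `cgEXAMPLE` and its negative coordinate are hypothetical, added only to illustrate the antisense convention.

```r
# Mock CHRLOC entries; the first two are copied from the printed output,
# the third (cgEXAMPLE) is a hypothetical antisense example.
chrloc <- list(
  cg00000029 = c("16" = 53468350),
  cg00000109 = c("3" = 171757417, "3" = 171758343),
  cgEXAMPLE  = c("7" = -1234567)
)

# Flatten to one row per (probe, gene start): the vector names give the
# chromosome, the sign gives the strand, the absolute value the position.
chrloc_table <- do.call(rbind, lapply(names(chrloc), function(id) {
  v <- chrloc[[id]]
  data.frame(
    probe_id = id,
    chr      = names(v),
    start    = unname(abs(v)),
    strand   = unname(ifelse(v < 0, "-", "+")),
    stringsAsFactors = FALSE
  )
}))

print(chrloc_table)
```

Note that these are gene start positions rather than probe positions, which is why a single probe can carry more than one value.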

IlluminaHumanMethylation450kCHR: IlluminaHumanMethylation450kCHR is an R object that provides mappings between a manufacturer identifier and the chromosome that contains the gene of interest. Each manufacturer identifier maps to a vector of chromosomes. Due to inconsistencies that may exist at the time the object was built, the vector may contain more than one chromosome (e.g., the identifier may map to more than one chromosome). If the chromosomal location is unknown, the vector will contain an NA. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7

> x <- IlluminaHumanMethylation450kCHR
> xx <- as.list(x)[1:3]
> xx
$cg00000029
[1] "16"
$cg00000108
[1] "3"
$cg00000109
[1] "3"

IlluminaHumanMethylation450kCPG36: CpG location annotations against genome build 36

> as.list(IlluminaHumanMethylation450kCPG36)[1:3]
$cg00000029
[1] 52025613
$cg00000108
[1] 37434210
$cg00000109
[1] 173398731

IlluminaHumanMethylation450kCPG37: CpG location annotations against genome build 37

> as.list(IlluminaHumanMethylation450kCPG37)[1:3]
$cg00000029
[1] 53468112
$cg00000108
[1] 37459206
$cg00000109
[1] 171916037 
> x<-toTable(IlluminaHumanMethylation450kCPG37)
> dim(x)
[1] 485577      2

IlluminaHumanMethylation450kCPGCOORDINATE: IlluminaHumanMethylation450kCPGCOORDINATE is an R object that provides mappings between a manufacturer identifier and the CpG coordinate as defined by Illumina in the manifest. Simple probe mapping to CpG coordinate within the human genome, Build 36, as defined by Illumina. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 19-Sep-2009. Note that, although the description says Build 36, the values returned for these three probes are identical to the CPG37 values shown earlier, not the CPG36 ones.

> as.list(IlluminaHumanMethylation450kCPGCOORDINATE)[1:3]
$cg00000029
[1] 53468112
$cg00000108
[1] 37459206
$cg00000109
[1] 171916037
> x<-toTable(IlluminaHumanMethylation450kCPGCOORDINATE)
> dim(x)
[1] 485577      2
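To see which genome build CPGCOORDINATE actually follows, the three printed values can be compared directly. This is a small sketch on the numbers copied from the outputs above, not a check over the whole array.

```r
# Values copied from the three outputs printed above.
cpg36    <- c(cg00000029 = 52025613, cg00000108 = 37434210, cg00000109 = 173398731)
cpg37    <- c(cg00000029 = 53468112, cg00000108 = 37459206, cg00000109 = 171916037)
cpgcoord <- c(cg00000029 = 53468112, cg00000108 = 37459206, cg00000109 = 171916037)

# Compare probe by probe, aligning on probe ID rather than on order.
matches_36 <- all(cpgcoord == cpg36[names(cpgcoord)])
matches_37 <- all(cpgcoord == cpg37[names(cpgcoord)])

cat("matches build 36:", matches_36, "\n")
cat("matches build 37:", matches_37, "\n")
```

With the package loaded, the same comparison could presumably be run on all probes by building the vectors with `unlist(as.list(...))` on each map; that step is an untested assumption here.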

IlluminaHumanMethylation450kCPGILOCATION: Map a probe ID to CpG island with which it is associated, and find what sort of relationship the probe has to said island (in it, shore, etc.) Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011

> as.list(IlluminaHumanMethylation450kCPGILOCATION)[1:3]
$cg00050873
[1] "chrY:9363680-9363943:N_Shore"
$cg00063477
[1] "chrY:22737825-22738052:S_Shelf"
$cg00121626
[1] "chrY:21664481-21665063:N_Shore" 
> x<-as.list(IlluminaHumanMethylation450kCPGILOCATION)
> names(x)[1:4]
[1] "cg00050873" "cg00063477" "cg00121626" "cg00212031"
> grep("cg00000109",names(x))
integer(0)
> grep("cg00000108",names(x))
integer(0)
> x<-toTable(IlluminaHumanMethylation450kCPGILOCATION)
> dim(x)
[1] 309465      2
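The CPGILOCATION strings pack four fields into one value (chromosome, island start, island end, relation to the island). A minimal base R sketch for splitting them, using the three strings printed above as mock input:

```r
# Mock input copied from the as.list() output above.
iloc <- c(
  cg00050873 = "chrY:9363680-9363943:N_Shore",
  cg00063477 = "chrY:22737825-22738052:S_Shelf",
  cg00121626 = "chrY:21664481-21665063:N_Shore"
)

# Split on ":" and "-" to get chr, start, end, relation.
parts <- strsplit(iloc, "[:-]")

island_table <- data.frame(
  probe_id = names(iloc),
  chr      = vapply(parts, `[`, character(1), 1),
  start    = as.integer(vapply(parts, `[`, character(1), 2)),
  end      = as.integer(vapply(parts, `[`, character(1), 3)),
  relation = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE,
  row.names = NULL
)
print(island_table)

# Exact-match membership test; unlike grep(), this does not treat the
# probe ID as a pattern.
"cg00000108" %in% names(iloc)
```

Only 309465 of the 485577 probes have an entry in this map, which is consistent with cg00000108 and cg00000109 not being found above.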

IlluminaHumanMethylation450kCPGINAME: Map a probe ID to the CpG island with which it is associated. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011

> as.list(IlluminaHumanMethylation450kCPGINAME)[1:3]
$cg00050873
[1] "chrY:9363680-9363943"
$cg00063477
[1] "chrY:22737825-22738052"
$cg00121626
[1] "chrY:21664481-21665063"
> x<-toTable(IlluminaHumanMethylation450kCPGINAME)
> dim(x)
[1] 309465      2

IlluminaHumanMethylation450kCPGIRELATION: What relationship does a probe have to the nearest UCSC CpG island? Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011.

> x<-toTable(IlluminaHumanMethylation450kCPGIRELATION)
> dim(x)
[1] 309465      2
> head(x,3)
  cpgiview.Probe_ID cpgiview.relationship
1        cg00050873               N_Shore
2        cg00063477               S_Shelf
3        cg00121626               N_Shore

IlluminaHumanMethylation450kCPGS: IlluminaHumanMethylation450kCPGS is an R object that provides mappings between Illumina probe IDs and the number of CpGs in the source sequence targeted by Illumina, based on information in their manifest. Probe mapping to CpG density within the human genome, Build 36, as defined by Illumina. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011.

> x<-toTable(IlluminaHumanMethylation450kCPGS)
> dim(x)
[1] 485577      2
> head(x,3)
    Probe_ID CpGs
1 cg00035864    2
2 cg00050873    4
3 cg00061679    1

IlluminaHumanMethylation450kDMR: IlluminaHumanMethylation450kDMR provides mappings to Illumina’s DMR annotations for the 450k probes, which indicate what sort of differentially methylated region is associated with the probe target sequence (if any).

> x<-toTable(IlluminaHumanMethylation450kDMR)
> head(x,3)
    Probe_ID  DMR
1 cg26198148 RDMR
2 cg26488634 CDMR
3 cg00023415 CDMR
> dim(x)
[1] 37337     2

IlluminaHumanMethylation450kENHANCER: IlluminaHumanMethylation450kENHANCER provides mappings to Illumina’s ENHANCER annotations for the 450k probes (which in turn were suggested by Ben Berman). Illumina/UCSC enhancer binding site annotations. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011

> x<-toTable(IlluminaHumanMethylation450kENHANCER)
> dim(x)
[1] 485577      2
> head(x,3)
    Probe_ID enhancer
1 cg00035864        0
2 cg00050873        0
3 cg00061679        0

IlluminaHumanMethylation450kENTREZID: Each manufacturer identifier is mapped to a vector of Entrez Gene identifiers. An NA is assigned to those manufacturer identifiers that can not be mapped to an Entrez Gene identifier at this time. If a given manufacturer identifier can be mapped to different Entrez Gene identifiers from various sources, we attempt to select the common identifiers. If a consensus cannot be determined, we select the smallest identifier. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7

> x<-toTable(IlluminaHumanMethylation450kENTREZID)
> dim(x)
[1] 331373      2
> head(x,3)
    probe_id gene_id
1 cg03123289       1
2 cg03630821       1
3 cg10734734       1
> tail(x,3)
         probe_id   gene_id
331371 cg14625636 100499483
331372 cg19180498 100499483
331373 cg25327452 100499483

IlluminaHumanMethylation450kGENENAME: IlluminaHumanMethylation450kGENENAME is an R object that maps manufacturer identifiers to the corresponding gene name. Each manufacturer identifier maps to a named vector containing the gene name. The vector name corresponds to the manufacturer identifier. If the gene name is unknown, the vector will contain an NA. Gene names currently include both the official (validated by a nomenclature committee) and preferred names (interim selected for display) for genes. Efforts are being made to differentiate the two by adding a name to the vector. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7

> x<-toTable(IlluminaHumanMethylation450kGENENAME)
> head(x,3)
    probe_id              gene_name
1 cg03123289 alpha-1-B glycoprotein
2 cg03630821 alpha-1-B glycoprotein
3 cg10734734 alpha-1-B glycoprotein
> dim(x)
[1] 331373      2

IlluminaHumanMethylation450kISCPGISLAND: IlluminaHumanMethylation450kISCPGISLAND is an R object that provides mappings between a manufacturer identifier and whether or not the CpG site being assayed meets Illumina’s criteria for being in a CpG island as defined by Illumina in the manifest. Value is a simple TRUE or FALSE for each probe, based simply on the Illumina manifest, TRUE meaning that the CpG site is within a CpG island. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 19-Sep-2009. In the toTable() output below, the flag appears as 1/0 in a column named Is_HMM_Island rather than as TRUE/FALSE.

> x<-toTable(IlluminaHumanMethylation450kISCPGISLAND)
> dim(x)
[1] 485577      2
> head(x,3)
    Probe_ID Is_HMM_Island
1 cg00035864             0
2 cg00050873             1
3 cg00061679             0

IlluminaHumanMethylation450kMAP: IlluminaHumanMethylation450kMAP is an R object that provides mappings between manufacturer identifiers and cytoband locations.

> x<-toTable(IlluminaHumanMethylation450kMAP)
> dim(x)
[1] 332836      2
> head(x,3)
    probe_id cytogenetic_location
1 cg03123289              19q13.4
2 cg03630821              19q13.4
3 cg10734734              19q13.4

IlluminaHumanMethylation450kPROBELOCATION: IlluminaHumanMethylation450kPROBELOCATION is an R object that maps between Illumina probe IDs and the transcript(s) they target, along with a description of where in the associated gene the interrogated CpG site happens to be. Map a probe ID to all of the transcripts it is associated with, and where in the corresponding gene (body, 5’ UTR, 3’ UTR, within 1500bp of the TSS, etc.) the probe target sequence aligns. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011

> x<-toTable(IlluminaHumanMethylation450kPROBELOCATION)
> head(x,3)
  probelocation.Probe_ID probelocation.location
1             cg00035864      NR_001550:TSS1500
2             cg00050873      NM_001164471:Body
3             cg00050873      NR_001553:TSS1500
> dim(x)
[1] 686365      2
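Because one probe can map to several transcripts, the PROBELOCATION table has more rows (686365) than there are probes. The sketch below splits the "transcript:region" strings and collapses the regions per probe; it runs on a mock data frame built from the three rows printed above, with column names shortened to `Probe_ID` and `location` for readability (the real columns carry a `probelocation.` prefix).

```r
# Mock of the first three rows of the toTable() output above,
# with shortened (hypothetical) column names.
ploc <- data.frame(
  Probe_ID = c("cg00035864", "cg00050873", "cg00050873"),
  location = c("NR_001550:TSS1500", "NM_001164471:Body", "NR_001553:TSS1500"),
  stringsAsFactors = FALSE
)

# Everything before the first ":" is the transcript, the rest is the region.
ploc$transcript <- sub(":.*$", "", ploc$location)
ploc$region     <- sub("^[^:]*:", "", ploc$location)

# One entry per probe, listing the distinct gene regions it falls in.
regions_per_probe <- tapply(
  ploc$region, ploc$Probe_ID,
  function(r) paste(unique(r), collapse = ";")
)

print(ploc)
print(regions_per_probe)
```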

IlluminaHumanMethylation450kREFSEQ: IlluminaHumanMethylation450kREFSEQ is an R object that provides mappings between manufacturer identifiers and RefSeq identifiers. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7

> x<-toTable(IlluminaHumanMethylation450kREFSEQ)
> dim(x)
[1] 1233708       2
> head(x,3)
    probe_id accession
1 cg03123289 NM_130786
2 cg03630821 NM_130786
3 cg10734734 NM_130786

IlluminaHumanMethylation450kSYMBOL: Each manufacturer identifier is mapped to an abbreviation for the corresponding gene. An NA is reported if there is no known abbreviation for a given gene. Symbols typically consist of 3 letters that define either a single gene (ABC) or multiple genes (ABC1, ABC2, ABC3). Gene symbols can be used as key words to query public databases such as Entrez Gene. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7

> x<-toTable(IlluminaHumanMethylation450kSYMBOL)
> dim(x)
[1] 331373      2
> head(x,3)
    probe_id symbol
1 cg03123289   A1BG
2 cg03630821   A1BG
3 cg10734734   A1BG

After comparing these annotations with the adf file provided by TCGA for the array, I concluded that *CHR37 and *CPG37 give the chromosome and chromosomal position of the probes on the array. Alternatively, the adf file could be used for annotation instead of this package, although it lacks some of the useful annotations.
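Given that conclusion, the chromosome and position can be combined into a single per-probe table. The helper below, `probe_table`, is a hypothetical name introduced for illustration; it is a sketch that works on plain named lists shaped like the `as.list()` output shown earlier, and is demonstrated on mock data rather than on the package objects.

```r
# Combine per-probe chromosome and position lists into one data frame.
# Only probes present in both lists are kept; if an entry holds several
# values, the first one is used.
probe_table <- function(chr_list, pos_list) {
  ids <- intersect(names(chr_list), names(pos_list))
  data.frame(
    probe_id = ids,
    chr = vapply(chr_list[ids], function(v) as.character(v[1]),
                 character(1), USE.NAMES = FALSE),
    pos = vapply(pos_list[ids], function(v) as.numeric(v[1]),
                 numeric(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Mock data copied from the CHR37 and CPG37 outputs above.
chr37 <- list(cg00000029 = "16", cg00000108 = "3", cg00000109 = "3")
cpg37 <- list(cg00000029 = 53468112, cg00000108 = 37459206, cg00000109 = 171916037)

tab <- probe_table(chr37, cpg37)
print(tab)
```

With the package loaded, the same helper could presumably be called as `probe_table(as.list(IlluminaHumanMethylation450kCHR37), as.list(IlluminaHumanMethylation450kCPG37))`; that call has not been tested here.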