To run a test analysis of DNA methylation data, I need to cluster the data based on the 27 markers described in Shen, 2007. Seven of those markers are MINTs, and I need to work out their most likely locations on the 450k array. I will use the Bioconductor IlluminaHumanMethylation450k.db package for annotation (i.e. the chromosome and location of each probe), but some of its annotation features are confusing. Here I test what each annotation feature in the package really means.
IlluminaHumanMethylation450kCHR36: What chromosome does the target sequence for a probe align to, in build 36?
IlluminaHumanMethylation450kCHR37: What chromosome does the target sequence for a probe align to, in build 37?
IlluminaHumanMethylation450kCHRLOC: IlluminaHumanMethylation450kCHRLOC is an R object that maps manufacturer identifiers to the starting position of the gene. The position of a gene is measured as the number of base pairs. The CHRLOCEND mapping is the same as the CHRLOC mapping except that it specifies the ending base of a gene instead of the start. Each manufacturer identifier maps to a named vector of chromosomal locations, where the name indicates the chromosome. Due to inconsistencies that may exist at the time the object was built, these vectors may contain more than one chromosome and/or location. If the chromosomal location is unknown, the vector will contain an NA. Chromosomal locations on both the sense and antisense strands are measured as the number of base pairs from the p (5’ end of the sense strand) to q (3’ end of the sense strand) arms. Chromosomal locations on the antisense strand have a leading "-" sign (e.g. -1234567). Since some genes have multiple start sites, this field can map to multiple locations. Mappings were based on data provided by: UCSC Genome Bioinformatics (Homo sapiens) ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19 With a date stamp from the source of: 2010-Mar22
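The sign convention described for CHRLOC can be illustrated with a minimal Python sketch. This is not the package's R API; the function name `parse_chrloc` and the returned dictionary layout are assumptions made for illustration only. It takes one entry of the named vector (the chromosome name and the signed position) and separates the strand from the unsigned coordinate.

```python
def parse_chrloc(chromosome, signed_position):
    """Split a CHRLOC-style signed position into strand and coordinate.

    Hypothetical helper: a leading "-" marks the antisense strand,
    and None stands in for R's NA (location unknown).
    """
    if signed_position is None:
        return None
    strand = "-" if signed_position < 0 else "+"
    return {
        "chromosome": chromosome,
        "strand": strand,
        "position": abs(signed_position),
    }


# The example value from the description above, on an arbitrary chromosome.
print(parse_chrloc("7", -1234567))
```

Since a probe may map to several locations, a caller would apply this to every element of the vector rather than to the first one only.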
IlluminaHumanMethylation450kCHR: IlluminaHumanMethylation450kCHR is an R object that provides mappings between a manufacturer identifier and the chromosome that contains the gene of interest. Each manufacturer identifier maps to a vector of chromosomes. Due to inconsistencies that may exist at the time the object was built, the vector may contain more than one chromosome (e.g., the identifier may map to more than one chromosome). If the chromosomal location is unknown, the vector will contain an NA. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7
IlluminaHumanMethylation450kCPG36: CpG location annotations against genome build 36
IlluminaHumanMethylation450kCPG37: CpG location annotations against genome build 37
IlluminaHumanMethylation450kCPGCOORDINATE: IlluminaHumanMethylation450kCPGCOORDINATE is an R object that provides mappings between a manufacturer identifier and the CpG coordinate as defined by Illumina in the manifest. Simple probe mapping to CpG coordinate within the human genome, Build 36, as defined by Illumina. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 19-Sep-2009
IlluminaHumanMethylation450kCPGILOCATION: Map a probe ID to the CpG island with which it is associated, and find what sort of relationship the probe has to said island (in it, shore, etc.). Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011
IlluminaHumanMethylation450kCPGINAME: Map a probe ID to the CpG island with which it is associated. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011
IlluminaHumanMethylation450kCPGIRELATION: What relationship does a probe have to the nearest UCSC CpG island? Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011.
IlluminaHumanMethylation450kCPGS: IlluminaHumanMethylation450kCPGS is an R object that provides mappings between Illumina probe IDs and the number of CpGs in the source sequence targeted by Illumina, based on information in their manifest. Probe mapping to CpG density within the human genome, Build 36, as defined by Illumina. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011.
IlluminaHumanMethylation450kDMR: IlluminaHumanMethylation450kDMR provides mappings to Illumina’s DMR annotations for the 450k probes, which indicate what sort of differentially methylated region is associated with the probe target sequence (if any).
IlluminaHumanMethylation450kENHANCER: IlluminaHumanMethylation450kENHANCER provides mappings to Illumina’s ENHANCER annotations for the 450k probes (which in turn were suggested by Ben Berman). Illumina/UCSC enhancer binding site annotations. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011
IlluminaHumanMethylation450kENTREZID: Each manufacturer identifier is mapped to a vector of Entrez Gene identifiers. An NA is assigned to those manufacturer identifiers that cannot be mapped to an Entrez Gene identifier at this time. If a given manufacturer identifier can be mapped to different Entrez Gene identifiers from various sources, we attempt to select the common identifiers. If a consensus cannot be determined, we select the smallest identifier. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7
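One possible reading of the selection rule above, sketched in Python. The package documentation does not spell out the exact procedure, so this is an assumption: "common identifiers" is taken to mean the intersection across sources, and "smallest" is taken to mean numerically smallest. The function name `pick_entrez_ids` and the source labels are hypothetical.

```python
def pick_entrez_ids(ids_by_source):
    """Choose Entrez Gene IDs for one probe from several sources.

    Assumed rule: keep the IDs shared by every source that reported
    any; if none are shared, fall back to the numerically smallest ID.
    Returns None (standing in for NA) when no source reported an ID.
    """
    id_sets = [set(ids) for ids in ids_by_source.values() if ids]
    if not id_sets:
        return None
    common = set.intersection(*id_sets)
    if common:
        return sorted(common, key=int)
    return [min(set.union(*id_sets), key=int)]


# Hypothetical source labels with made-up identifiers.
print(pick_entrez_ids({"source_a": ["672", "675"], "source_b": ["672"]}))
print(pick_entrez_ids({"source_a": ["675"], "source_b": ["672"]}))
```

IDs are kept as strings and compared with `key=int` so that "99" sorts below "672", which plain string comparison would get wrong.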
IlluminaHumanMethylation450kGENENAME: IlluminaHumanMethylation450kGENENAME is an R object that maps manufacturer identifiers to the corresponding gene name. Each manufacturer identifier maps to a named vector containing the gene name. The vector name corresponds to the manufacturer identifier. If the gene name is unknown, the vector will contain an NA. Gene names currently include both the official (validated by a nomenclature committee) and preferred names (interim selected for display) for genes. Efforts are being made to differentiate the two by adding a name to the vector. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7
IlluminaHumanMethylation450kISCPGISLAND: IlluminaHumanMethylation450kISCPGISLAND is an R object that provides mappings between a manufacturer identifier and whether or not the CpG site being assayed meets Illumina’s criteria for being in a CpG island as defined by Illumina in the manifest. Value is a simple TRUE or FALSE for each probe, based simply on the Illumina manifest, TRUE meaning that the CpG site is within a CpG island. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 19-Sep-2009
IlluminaHumanMethylation450kMAP: IlluminaHumanMethylation450kMAP is an R object that provides mappings between manufacturer identifiers and cytoband locations.
IlluminaHumanMethylation450kPROBELOCATION: IlluminaHumanMethylation450kPROBELOCATION is an R object that maps between Illumina probe IDs and the transcript(s) they target, along with a description of where in the associated gene the interrogated CpG site happens to be. Map a probe ID to all of the transcripts it is associated with, and where in the corresponding gene (body, 5’ UTR, 3’ UTR, within 1500bp of the TSS, etc.) the probe target sequence aligns. Mappings were based on data provided by: Illumina http://Illumina.com Downloaded 11-Jan-2011
IlluminaHumanMethylation450kREFSEQ: IlluminaHumanMethylation450kREFSEQ is an R object that provides mappings between manufacturer identifiers and RefSeq identifiers. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7
IlluminaHumanMethylation450kSYMBOL: Each manufacturer identifier is mapped to an abbreviation for the corresponding gene. An NA is reported if there is no known abbreviation for a given gene. Symbols typically consist of 3 letters that define either a single gene (ABC) or multiple genes (ABC1, ABC2, ABC3). Gene symbols can be used as key words to query public databases such as Entrez Gene. Mappings were based on data provided by: Entrez Gene ftp://ftp.ncbi.nlm.nih.gov/gene/DATA With a date stamp from the source of: 2010-Sep7
After comparing these annotations with the ADF file provided by TCGA for the array, I deduced that *CHR37 and *CPG37 give the chromosome and the chromosomal location of the probes on the array. Alternatively, the ADF file could be used for annotation instead of this package, although it lacks some of the useful annotations.
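A comparison of this kind can be sketched in Python as follows. It assumes both sources have already been loaded into dictionaries keyed by probe ID, with a (chromosome, position) tuple as the value; the loading step, the function name `compare_annotations`, and all probe IDs and coordinates below are made up for illustration and are not the actual data used here.

```python
def compare_annotations(package_loc, adf_loc):
    """Count probes whose (chromosome, position) agree between two sources."""
    summary = {"agree": 0, "disagree": 0, "missing_in_adf": 0}
    for probe_id, location in package_loc.items():
        if probe_id not in adf_loc:
            summary["missing_in_adf"] += 1
        elif adf_loc[probe_id] == location:
            summary["agree"] += 1
        else:
            summary["disagree"] += 1
    return summary


# Made-up probe IDs and coordinates, for illustration only.
package_loc = {"probe_a": ("1", 1000), "probe_b": ("X", 2500), "probe_c": ("7", 42)}
adf_loc = {"probe_a": ("1", 1000), "probe_b": ("X", 2600)}

print(compare_annotations(package_loc, adf_loc))
# {'agree': 1, 'disagree': 1, 'missing_in_adf': 1}
```

If nearly all probes fall under "agree" for a given pair of package mappings, that pair is the likely source of the chromosome and position; a high "disagree" count would suggest a different genome build or a different coordinate convention.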