TCGA now provides DNA methylation data as .idat files, Illumina's proprietary file format. Two files are provided for each patient: a "green" (Cy3) scan and a "red" (Cy5) scan. To my knowledge there are two ways to decode these files: reading them into GenomeStudio (Illumina's software) or using the Bioconductor minfi package, which is designed to read Illumina HumanMethylation450 (450k) arrays. Since I do not have access to a copy of GenomeStudio, I used minfi.
Steps:
> colnames(sdrf)
 [1] "Extract.Name"                          # patient ID in this format: TCGA-A6-4107-01A-02D-1407-05
 [2] "Protocol.REF"                          # looks like this: jhu-usc.edu:labeling:HumanMethylation450:01
 [3] "Labeled.Extract.Name"                  # the same as Extract.Name
 [4] "Label"                                 # Cy5 or Cy3
 [5] "Term.Source.REF"                       # MGED Ontology
 [6] "Protocol.REF.1"                        # looks like this: jhu-usc.edu:hybridization:HumanMethylation450:01
 [7] "Hybridization.Name"                    # patient IDs
 [8] "Array.Design.REF"                      # one value: "Illumina.com:PhysicalArrayDesign:HumanMethylation450"
 [9] "Term.Source.REF.1"                     # one value: caArray
[10] "Protocol.REF.2"                        # one value: jhu-usc.edu:image_acquisition:HumanMethylation450:01
[11] "Protocol.REF.3"                        # one value: jhu-usc.edu:feature_extraction:HumanMethylation450:01
[12] "Scan.Name"                             # patient IDs
[13] "Array.Data.File"                       # actual file name corresponding to patient IDs
[14] "Comment..TCGA.Archive.Name."           # like this: jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0, 7 levels. 7 batches?
[15] "Comment..TCGA.Data.Type."              # one value: DNA methylation
[16] "Comment..TCGA.Data.Level."             # one value: Level 1
[17] "Comment..TCGA.Include.for.Analysis."   # one value: yes
[18] "Protocol.REF.4"                        # one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01
[19] "Normalization.Name"                    # patient IDs
[20] "Derived.Array.Data.Matrix.File"        # names for the Level 2 data, like this: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-2.TCGA-A6-4107-01A-02D-1407-05.txt
[21] "Comment..TCGA.Archive.Name..1"         # looks like this: jhu-usc.edu_COAD.HumanMethylation450.Level_2.1.1.0, 7 levels
[22] "Comment..TCGA.Data.Type..1"            # one value: DNA methylation
[23] "Comment..TCGA.Data.Level..1"           # one value: Level 2
[24] "Comment..TCGA.Include.for.Analysis..1" # one value: yes
[25] "Protocol.REF.5"                        # one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01
[26] "Normalization.Name.1"                  # patient IDs
[27] "Derived.Array.Data.Matrix.File.1"      # names for the Level 3 data: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-3.TCGA-A6-4107-01A-02D-1407-05.txt
[28] "Comment..TCGA.Archive.Name..2"         # looks like this: jhu-usc.edu_COAD.HumanMethylation450.Level_3.1.1.0, 7 levels
[29] "Comment..TCGA.Data.Type..2"            # one value: DNA methylation
[30] "Comment..TCGA.Data.Level..2"           # one value: Level 3
[31] "Comment..TCGA.Include.for.Analysis..2" # one value: yes
Here are a few lines from the final targets file. The only required column is Basename, which gives each patient's file path without the color suffix, so that appending _Grn.idat or _Red.idat yields the actual file names. I keep the other columns because they are convenient to have later for data normalization.
> head(targets)
                   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID  Array      Slide
1 TCGA-A6-4107-01A-02D-1407-05          A6         1407        tumor   1.1.0 R01C01 5775041065
2 TCGA-AA-3510-01A-01D-1407-05          AA         1407        tumor   1.1.0 R02C01 5775041065
3 TCGA-AZ-4308-01A-01D-1407-05          AZ         1407        tumor   1.1.0 R03C01 5775041065
4 TCGA-A6-2679-01A-02D-1407-05          A6         1407        tumor   1.1.0 R04C01 5775041065
5 TCGA-AZ-4682-01B-01D-1407-05          AZ         1407        tumor   1.1.0 R05C01 5775041065
6 TCGA-AA-3492-01A-01D-1407-05          AA         1407        tumor   1.1.0 R06C01 5775041065
                                                                  Basename
1 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R01C01
2 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R02C01
3 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R03C01
4 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R04C01
5 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R05C01
6 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R06C01
Column descriptions:
Sample_Name = patient barcode
Sample_Well = tissue source site, i.e. the collecting center (2nd field in the barcode)
Sample_Plate = plate ID (6th field in the barcode)
Sample_Group = sample type (normal/tumor)
Pool_ID = processing batch (from the TCGA archive name)
Array = position of the array on the slide
Slide = slide, which contains 12 arrays
Basename = path to the files, without the _Grn.idat/_Red.idat suffix
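As an illustration of how the columns described above could be derived, here is a minimal sketch in base R. It uses a small toy data frame in place of the real sdrf, and it assumes that Array.Data.File follows the pattern <Slide>_<Array>_<Grn|Red>.idat; the variable idat_dir and the toy values are hypothetical, and Pool_ID is left out. The sample type rule (codes 01-09 tumor, 10-19 normal) follows the TCGA barcode convention.

```r
# Toy stand-in for the two sdrf columns used here. The Array.Data.File naming
# (<Slide>_<Array>_<Grn|Red>.idat) is an assumption, not taken from the real file.
sdrf <- data.frame(
  Extract.Name = c("TCGA-A6-4107-01A-02D-1407-05", "TCGA-A6-4107-01A-02D-1407-05",
                   "TCGA-AA-3510-01A-01D-1407-05", "TCGA-AA-3510-01A-01D-1407-05"),
  Array.Data.File = c("5775041065_R01C01_Grn.idat", "5775041065_R01C01_Red.idat",
                      "5775041065_R02C01_Grn.idat", "5775041065_R02C01_Red.idat"),
  stringsAsFactors = FALSE
)

# Hypothetical directory holding the Level 1 .idat files
idat_dir <- "./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1"

# Strip the color suffix and keep one row per array (Grn and Red share a base name)
base <- sub("_(Grn|Red)\\.idat$", "", sdrf$Array.Data.File)
keep <- !duplicated(base)
barcode <- sdrf$Extract.Name[keep]
base <- base[keep]

# Split the barcode and the <Slide>_<Array> name into their fields
fields <- do.call(rbind, strsplit(barcode, "-", fixed = TRUE))
chip <- do.call(rbind, strsplit(base, "_", fixed = TRUE))

# Sample type code = first two characters of the 4th barcode field
type_code <- as.integer(substr(fields[, 4], 1, 2))
sample_group <- ifelse(type_code < 10, "tumor",
                       ifelse(type_code < 20, "normal", "control"))

targets <- data.frame(
  Sample_Name = barcode,
  Sample_Well = fields[, 2],
  Sample_Plate = fields[, 6],
  Sample_Group = sample_group,
  Array = chip[, 2],
  Slide = chip[, 1],
  Basename = file.path(idat_dir, base),
  stringsAsFactors = FALSE
)

# The file names obtained by attaching the color suffixes to Basename
grn_files <- paste0(targets$Basename, "_Grn.idat")
red_files <- paste0(targets$Basename, "_Red.idat")
```

On real data, running file.exists(grn_files) and file.exists(red_files) before reading is a cheap way to catch a wrong Basename.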
3. Read the files
> library(minfi)
> RGset <- read.450k.exp(targets = targets)  # newer minfi releases call this function read.metharray.exp
> # Get the methylation set, from which the methylated and unmethylated probe intensities can be extracted:
> Mset.raw <- preprocessRaw(RGset)  # applies no normalization, although the package offers several methods, including those implemented in GenomeStudio
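Once the methylated and unmethylated intensities are available, a common next step is to turn them into beta values. The sketch below shows the arithmetic on toy matrices in base R so that it runs without minfi. With minfi loaded, the two matrices would come from getMeth(Mset.raw) and getUnmeth(Mset.raw). The offset of 100 is the value usually associated with Illumina's definition and is used here as an assumption; the probe and sample names are made up.

```r
# Toy methylated (meth) and unmethylated (unmeth) intensities:
# rows are probes, columns are samples. Values are invented for illustration.
meth <- matrix(c(900, 100, 450, 0), nrow = 2,
               dimnames = list(c("probe1", "probe2"), c("s1", "s2")))
unmeth <- matrix(c(0, 800, 450, 0), nrow = 2, dimnames = dimnames(meth))

# Beta value = M / (M + U + offset); the offset keeps probes with very low
# total intensity from producing unstable or undefined ratios.
offset <- 100
beta <- meth / (meth + unmeth + offset)
```

A probe with no signal in either channel (probe2 in s2) gets a beta of 0 here rather than NaN, which is the purpose of the offset.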
Reading all the files took about 10 minutes on Belltown (R 2.14.1). The resulting data set contains 247 patients.