
TCGA now provides DNA methylation data as .idat files, Illumina's proprietary file format. For each sample two files are provided: a green-channel (Cy3) scan and a red-channel (Cy5) scan. To my knowledge there are two ways to decode these files: reading them into GenomeStudio (Illumina's software) or using the Bioconductor minfi package, which is designed to read Illumina HumanMethylation450 (450k) arrays. Since I do not have a copy of GenomeStudio, I used minfi.

Steps:

1. Download the data

2. Use the .sdrf file provided with the downloaded data to make the targets file, which lets the minfi functions read the data into R

Here is a description of the columns in the .sdrf file:

> colnames(sdrf)
 [1] "Extract.Name" #patient ID in this format: TCGA-A6-4107-01A-02D-1407-05                
 [2] "Protocol.REF" # looks like this: jhu-usc.edu:labeling:HumanMethylation450:01              
 [3] "Labeled.Extract.Name" # the same as Extract Name                
 [4] "Label"  #Cy5 or Cy3                              
 [5] "Term.Source.REF" #MGED Ontology                  
 [6] "Protocol.REF.1"#looks like this: jhu-usc.edu:hybridization:HumanMethylation450:01                
 [7] "Hybridization.Name" #patient IDs               
 [8] "Array.Design.REF" #one value: "Illumina.com:PhysicalArrayDesign:HumanMethylation450"              
 [9] "Term.Source.REF.1" #one value: caArray
[10] "Protocol.REF.2" #one value: jhu-usc.edu:image_acquisition:HumanMethylation450:01                
[11] "Protocol.REF.3" #one value: jhu-usc.edu:feature_extraction:HumanMethylation450:01              
[12] "Scan.Name"  #patient IDs                      
[13] "Array.Data.File" #Actual file name corresponding to patient IDs                   
[14] "Comment..TCGA.Archive.Name."  #like this: jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0, 7 levels. 7 Batches?        
[15] "Comment..TCGA.Data.Type." #one value: DNA methylation         
[16] "Comment..TCGA.Data.Level."  #one value: Level 1          
[17] "Comment..TCGA.Include.for.Analysis."  #one value: yes
[18] "Protocol.REF.4" #one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01                    
[19] "Normalization.Name"   #patient IDs
[20] "Derived.Array.Data.Matrix.File" #names for the Level 2 data!Like this: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-2.TCGA-A6-4107-01A-02D-1407-05.txt     
[21] "Comment..TCGA.Archive.Name..1" #looks like this jhu-usc.edu_COAD.HumanMethylation450.Level_2.1.1.0, 7 levels       
[22] "Comment..TCGA.Data.Type..1" #one value: DNA methylation       
[23] "Comment..TCGA.Data.Level..1"  #one value: Level 2        
[24] "Comment..TCGA.Include.for.Analysis..1" #one value: yes
[25] "Protocol.REF.5"  #one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01                     
[26] "Normalization.Name.1" #patient IDs             
[27] "Derived.Array.Data.Matrix.File.1" #names for the level 3 data: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-3.TCGA-A6-4107-01A-02D-1407-05.txt
[28] "Comment..TCGA.Archive.Name..2"  #looks like this: jhu-usc.edu_COAD.HumanMethylation450.Level_3.1.1.0, 7 levels    
[29] "Comment..TCGA.Data.Type..2" #one value: DNA methylation  
[30] "Comment..TCGA.Data.Level..2" #one value: Level 3         
[31] "Comment..TCGA.Include.for.Analysis..2" #one value: yes
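
The core of step 2 can be sketched roughly as follows. This is a minimal sketch rather than the exact script used here: it assumes that the Array.Data.File values have the form <Slide>_<Array>_Grn.idat or <Slide>_<Array>_Red.idat, and that each sample therefore appears once per colour channel. The toy `sdrf` data frame and the `idat_dir` path are illustrative stand-ins for the real file and download location.

```r
# Toy stand-in for the two .sdrf columns needed here (assumed layout:
# one row per colour channel, so every sample appears twice).
sdrf <- data.frame(
  Extract.Name = rep(c("TCGA-A6-4107-01A-02D-1407-05",
                       "TCGA-AA-3510-01A-01D-1407-05"), each = 2),
  Array.Data.File = c("5775041065_R01C01_Grn.idat", "5775041065_R01C01_Red.idat",
                      "5775041065_R02C01_Grn.idat", "5775041065_R02C01_Red.idat"),
  stringsAsFactors = FALSE
)

# Assumed location of the Level 1 files; adjust to the local download.
idat_dir <- "./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1"

# Strip the channel suffix, then keep one row per sample.
prefix <- sub("_(Grn|Red)\\.idat$", "", sdrf$Array.Data.File)
keep <- !duplicated(prefix)

targets <- data.frame(
  Sample_Name = sdrf$Extract.Name[keep],
  Slide = sub("_.*$", "", prefix[keep]),     # part before the underscore
  Array = sub("^.*_", "", prefix[keep]),     # part after the underscore
  Basename = file.path(idat_dir, prefix[keep]),
  stringsAsFactors = FALSE
)
```

With a real .sdrf file the toy data frame would be replaced by something like `read.delim()` on the downloaded file, which also produces the dotted column names listed above.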

Here are a few lines from the final targets file. The only required column is Basename, which gives the path prefix of each sample's files, so that _Grn.idat or _Red.idat can be appended to it. I keep the other columns because they are convenient to have later for data normalization.

> head(targets)
Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID  Array      Slide
1 TCGA-A6-4107-01A-02D-1407-05          A6         1407        tumor   1.1.0 R01C01 5775041065
2 TCGA-AA-3510-01A-01D-1407-05          AA         1407        tumor   1.1.0 R02C01 5775041065
3 TCGA-AZ-4308-01A-01D-1407-05          AZ         1407        tumor   1.1.0 R03C01 5775041065
4 TCGA-A6-2679-01A-02D-1407-05          A6         1407        tumor   1.1.0 R04C01 5775041065
5 TCGA-AZ-4682-01B-01D-1407-05          AZ         1407        tumor   1.1.0 R05C01 5775041065
6 TCGA-AA-3492-01A-01D-1407-05          AA         1407        tumor   1.1.0 R06C01 5775041065
                                                                Basename
1 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R01C01
2 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R02C01
3 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R03C01
4 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R04C01
5 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R05C01
6 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R06C01
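
Because each Basename must resolve to both a _Grn.idat and a _Red.idat file, it can be worth checking the pairs before reading anything. The following is a small sketch of such a check; `has_both_idats` is a hypothetical helper, not part of minfi, and the demonstration uses empty placeholder files in a temporary directory rather than real IDAT files.

```r
# Hypothetical helper: TRUE where both channel files exist for a Basename.
has_both_idats <- function(basename) {
  file.exists(paste0(basename, "_Grn.idat")) &
    file.exists(paste0(basename, "_Red.idat"))
}

# Demonstration with empty placeholder files (not real IDAT data).
demo_dir <- tempfile("idat_demo")
dir.create(demo_dir)
complete <- file.path(demo_dir, "5775041065_R01C01")  # gets both channels
partial  <- file.path(demo_dir, "5775041065_R02C01")  # gets green only
invisible(file.create(paste0(complete, c("_Grn.idat", "_Red.idat"))))
invisible(file.create(paste0(partial, "_Grn.idat")))

ok <- has_both_idats(c(complete, partial))
```

On the real targets file the call would be `has_both_idats(targets$Basename)`, and any FALSE entries point to samples with a missing channel file.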

Column descriptions:

Sample_Name = patient barcode
Sample_Well = tissue source site (2nd field in the barcode)
Sample_Plate = plate ID (6th field in the barcode)
Sample_Group = sample type (normal/tumor)
Pool_ID = processing batch (TCGA archive name)
Array = array on a slide
Slide = slide that contains 12 arrays
Basename = path prefix of the sample's files (directory plus Slide_Array, without the _Grn.idat/_Red.idat suffix)
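
The barcode-derived columns can be obtained by splitting the barcode on "-". This is a sketch under the standard TCGA convention that the first two digits of the 4th field encode the sample type (01-09 tumor, 10-19 normal, 20-29 control). The third barcode below is invented purely to show a normal sample; it is not from this data set.

```r
barcodes <- c("TCGA-A6-4107-01A-02D-1407-05",
              "TCGA-AA-3510-01A-01D-1407-05",
              "TCGA-XX-0000-11A-01D-1407-05")  # invented, illustrates a normal

# One row per barcode, one column per dash-separated field.
fields <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

# First two characters of the 4th field are the sample type code.
sample_code <- as.integer(substr(fields[, 4], 1, 2))

annot <- data.frame(
  Sample_Name  = barcodes,
  Sample_Well  = fields[, 2],   # 2nd field of the barcode
  Sample_Plate = fields[, 6],   # 6th field of the barcode
  Sample_Group = ifelse(sample_code < 10, "tumor",
                 ifelse(sample_code < 20, "normal", "control")),
  stringsAsFactors = FALSE
)
```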

3. Read the files

> library(minfi)
> RGset <- read.450k.exp(targets = targets) # later minfi releases call this function read.metharray.exp
#Convert the red/green intensities into methylated and unmethylated signals:
> Mset.raw <- preprocessRaw(RGset) #this applies no normalization, although the package has a few methods available, including the methods implemented in GenomeStudio

It took about 10 min to read all the files on Belltown (R-2.14.1). The result was a total of 247 patients.
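
Once the methylated and unmethylated signals are available, they are commonly combined into beta values. The sketch below shows the arithmetic in base R on invented toy matrices, using the Illumina-style formula beta = M / (M + U + offset) with an offset of 100. In minfi the two matrices would come from `getMeth(Mset.raw)` and `getUnmeth(Mset.raw)`, and `getBeta(Mset.raw, type = "Illumina")` should give the equivalent result; the probe and sample names here are placeholders.

```r
# Toy intensity matrices (probes x samples); all values are invented.
meth <- matrix(c(900, 100, 500, 800, 50, 450), nrow = 3,
               dimnames = list(c("cg_a", "cg_b", "cg_c"), c("s1", "s2")))
unmeth <- matrix(c(100, 900, 500, 200, 950, 450), nrow = 3,
                 dimnames = dimnames(meth))

# Offset that stabilises the ratio when both intensities are low.
offset <- 100

# Element-wise beta value for every probe and sample.
beta <- meth / (meth + unmeth + offset)
```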