TCGA now provides DNA methylation data as .idat files, Illumina's proprietary file format. For each patient, two files are provided: the red-channel (Cy5) scan and the green-channel (Cy3) scan. To my knowledge, there are two ways to decode these files: reading them into GenomeStudio (Illumina's software) or using the Bioconductor minfi package, which is designed to read Illumina HumanMethylation450 (450k) arrays. Since I do not have a copy of GenomeStudio, I used the minfi package.
Steps:
1. Download the data
2. Use the .sdrf file provided with the downloaded data to make the targets file, which the minfi functions use to read the data into R
Here is a description of the columns in the .sdrf file:
> colnames(sdrf)
[1] "Extract.Name" #patient ID in this format: TCGA-A6-4107-01A-02D-1407-05
[2] "Protocol.REF" # looks like this: jhu-usc.edu:labeling:HumanMethylation450:01
[3] "Labeled.Extract.Name" # the same as Extract Name
[4] "Label" #Cy5 or Cy3
[5] "Term.Source.REF" #MGED Ontology
[6] "Protocol.REF.1" #looks like this: jhu-usc.edu:hybridization:HumanMethylation450:01
[7] "Hybridization.Name" #patient IDs
[8] "Array.Design.REF" #one value: "Illumina.com:PhysicalArrayDesign:HumanMethylation450"
[9] "Term.Source.REF.1" #one value: caArray
[10] "Protocol.REF.2" #one value: jhu-usc.edu:image_acquisition:HumanMethylation450:01
[11] "Protocol.REF.3" #one value: jhu-usc.edu:feature_extraction:HumanMethylation450:01
[12] "Scan.Name" #patient IDs
[13] "Array.Data.File" #Actual file name corresponding to patient IDs
[14] "Comment..TCGA.Archive.Name." #like this: jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0, 7 levels. 7 Batches?
[15] "Comment..TCGA.Data.Type." #one value: DNA methylation
[16] "Comment..TCGA.Data.Level." #one value: Level 1
[17] "Comment..TCGA.Include.for.Analysis." #one value: yes
[18] "Protocol.REF.4" #one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01
[19] "Normalization.Name" #patient IDs
[20] "Derived.Array.Data.Matrix.File" #names for the Level 2 data, like this: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-2.TCGA-A6-4107-01A-02D-1407-05.txt
[21] "Comment..TCGA.Archive.Name..1" #looks like this jhu-usc.edu_COAD.HumanMethylation450.Level_2.1.1.0, 7 levels
[22] "Comment..TCGA.Data.Type..1" #one value: DNA methylation
[23] "Comment..TCGA.Data.Level..1" #one value: Level 2
[24] "Comment..TCGA.Include.for.Analysis..1" #one value: yes
[25] "Protocol.REF.5" #one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01
[26] "Normalization.Name.1" #patient IDs
[27] "Derived.Array.Data.Matrix.File.1" #names for the level 3 data: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-3.TCGA-A6-4107-01A-02D-1407-05.txt
[28] "Comment..TCGA.Archive.Name..2" #looks like this: jhu-usc.edu_COAD.HumanMethylation450.Level_3.1.1.0, 7 levels
[29] "Comment..TCGA.Data.Type..2" #one value: DNA methylation
[30] "Comment..TCGA.Data.Level..2" #one value: Level 3
[31] "Comment..TCGA.Include.for.Analysis..2" #one value: yes
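The dotted and numbered column names above are what R produces when it reads a header containing spaces, brackets, and repeated names. Here is a minimal sketch of that behavior. The miniature file is made up for illustration, and the raw header spellings (for example "Comment [TCGA Archive Name]") are my assumption about how the .sdrf is written, not copied from the real file.

```r
# Hypothetical miniature .sdrf: tab-delimited, with raw headers that contain
# spaces, brackets, and a repeated name (assumed spellings, not the real file).
sdrf_file <- tempfile(fileext = ".sdrf.txt")
writeLines(c(
  paste("Extract Name", "Protocol REF", "Protocol REF",
        "Comment [TCGA Archive Name]", sep = "\t"),
  paste("TCGA-A6-4107-01A-02D-1407-05",
        "jhu-usc.edu:labeling:HumanMethylation450:01",
        "jhu-usc.edu:hybridization:HumanMethylation450:01",
        "jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0", sep = "\t")
), sdrf_file)

# read.delim() runs the header through make.names(unique = TRUE) by default
# (check.names = TRUE): invalid characters become dots and repeated names
# get .1, .2, ... suffixes.
sdrf <- read.delim(sdrf_file, stringsAsFactors = FALSE)
print(colnames(sdrf))
```

This is why "Protocol.REF.1" through "Protocol.REF.5" and the "..1"/"..2" comment columns appear: they are repeats of the same header, numbered in the order they occur.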
Here are a few lines from the final targets file. The only required column is Basename, which gives, for every patient, the file path to which _Grn.idat or _Red.idat is appended. I keep the other columns because they are convenient to have later for data normalization.
Sample_Name = patient barcode
Sample_Well = tissue source site (2nd field in the barcode)
Sample_Plate = BCR plate ID (6th field in the barcode)
Sample_Group = sample type (normal/tumor)
Pool_ID = processing batch (TCGA archive name)
Array = array on a slide
Slide = slide that contains 12 arrays
Basename = path to the files, ending in Slide_Array (without the _Grn.idat or _Red.idat suffix)
Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID Array Slide
1 TCGA-A6-4107-01A-02D-1407-05 A6 1407 tumor 1.1.0 R01C01 5775041065
2 TCGA-AA-3510-01A-01D-1407-05 AA 1407 tumor 1.1.0 R02C01 5775041065
3 TCGA-AZ-4308-01A-01D-1407-05 AZ 1407 tumor 1.1.0 R03C01 5775041065
4 TCGA-A6-2679-01A-02D-1407-05 A6 1407 tumor 1.1.0 R04C01 5775041065
5 TCGA-AZ-4682-01B-01D-1407-05 AZ 1407 tumor 1.1.0 R05C01 5775041065
6 TCGA-AA-3492-01A-01D-1407-05 AA 1407 tumor 1.1.0 R06C01 5775041065
Basename
1 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R01C01
2 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R02C01
3 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R03C01
4 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R04C01
5 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R05C01
6 ./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1/5775041065_R06C01
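A targets table like the one above could be assembled from the .sdrf roughly as follows. This is only a sketch under stated assumptions: `make_targets` is a hypothetical helper, the toy `sdrf` stands in for the real file, and the Array.Data.File naming (Slide_Array_Grn.idat / Slide_Array_Red.idat) is inferred from the Basename column rather than confirmed. The tumor/normal split uses the sample-type code in the 4th barcode field (01-09 tumor, 10-19 normal).

```r
# Toy stand-in for the real .sdrf: two rows (one per color) for each sample.
sdrf <- data.frame(
  Extract.Name = rep(c("TCGA-A6-4107-01A-02D-1407-05",
                       "TCGA-AA-3510-01A-01D-1407-05"), each = 2),
  Array.Data.File = c("5775041065_R01C01_Grn.idat", "5775041065_R01C01_Red.idat",
                      "5775041065_R02C01_Grn.idat", "5775041065_R02C01_Red.idat"),
  Comment..TCGA.Archive.Name. =
    "jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0",
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the per-color rows into one targets row per array.
make_targets <- function(sdrf, idat_dir) {
  # Strip the color suffix, then keep the first row for each Slide_Array prefix.
  prefix  <- sub("_(Grn|Red)\\.idat$", "", sdrf$Array.Data.File)
  keep    <- !duplicated(prefix)
  prefix  <- prefix[keep]
  barcode <- sdrf$Extract.Name[keep]
  fields  <- strsplit(barcode, "-", fixed = TRUE)
  field   <- function(i) vapply(fields, `[`, character(1), i)
  # Sample-type code = first two characters of the 4th barcode field.
  type_code <- as.integer(substr(field(4), 1, 2))
  data.frame(
    Sample_Name  = barcode,
    Sample_Well  = field(2),
    Sample_Plate = field(6),
    Sample_Group = ifelse(type_code < 10, "tumor",
                          ifelse(type_code < 20, "normal", "control")),
    # Archive version = whatever follows "Level_1." in the archive name.
    Pool_ID      = sub("^.*Level_1\\.", "", sdrf$Comment..TCGA.Archive.Name.[keep]),
    Array        = sub("^[^_]+_", "", prefix),
    Slide        = sub("_.*$", "", prefix),
    Basename     = file.path(idat_dir, prefix),
    stringsAsFactors = FALSE
  )
}

targets <- make_targets(
  sdrf, "./DNA_Methylation/JHU_USC__HumanMethylation450/Level_1")
print(targets)
```

Deduplicating on the Slide_Array prefix rather than on the barcode keeps the sketch safe if a patient was ever hybridized on more than one array.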
3. Read the files
> library(minfi)
> RGset<-read.450k.exp(targets=targets)
Reading all the files took about 10 min on Belltown (R-2.14.1). The result contains a total of 247 patients.
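Since the read takes a while, it may be worth checking first that every Basename has both of its .idat files. The sketch below uses only base R; `missing_idats` is a hypothetical helper, and the demonstration creates empty placeholder files in a temporary directory rather than touching real data.

```r
# Hypothetical helper: return the Basename entries that lack either color file.
missing_idats <- function(basenames) {
  grn <- paste(basenames, "_Grn.idat", sep = "")
  red <- paste(basenames, "_Red.idat", sep = "")
  basenames[!(file.exists(grn) & file.exists(red))]
}

# Demonstration with empty placeholder files (not real IDAT data).
demo_dir <- tempfile("idat_demo")
dir.create(demo_dir)
complete   <- file.path(demo_dir, "5775041065_R01C01")
incomplete <- file.path(demo_dir, "5775041065_R02C01")
file.create(paste(complete, c("_Grn.idat", "_Red.idat"), sep = ""))
file.create(paste(incomplete, "_Grn.idat", sep = ""))  # red scan left out on purpose

bad <- missing_idats(c(complete, incomplete))
print(basename(bad))
```

With a real targets table the call would be `missing_idats(targets$Basename)`; an empty result means every array has both scans. As far as I know, newer minfi releases name the reading function `read.metharray.exp`, while `read.450k.exp` is the name in the version used here.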