TCGA now provides DNA methylation data as .idat files, Illumina's proprietary file format. For each patient two files are provided: a "red" (Cy5) scan and a "green" (Cy3) scan. To my knowledge there are two ways to decode these files: reading them into GenomeStudio (Illumina's software) or using the Bioconductor minfi package, which is designed to read Illumina HumanMethylation450 (450k) arrays. Since I do not have access to a copy of GenomeStudio, I used the minfi package.
Steps:
1. Download the data
2. Use the .sdrf file provided with the downloaded data to make the targets file, which lets the minfi functions read the data into R. Here is a description of the columns in the .sdrf file:
> colnames(sdrf)
[1] "Extract.Name" #patient ID in this format: TCGA-A6-4107-01A-02D-1407-05
[2] "Protocol.REF" # looks like this: jhu-usc.edu:labeling:HumanMethylation450:01
[3] "Labeled.Extract.Name" # the same as Extract Name
[4] "Label" #Cy5 or Cy3
[5] "Term.Source.REF" #MGED Ontology
[6] "Protocol.REF.1" #looks like this: jhu-usc.edu:hybridization:HumanMethylation450:01
[7] "Hybridization.Name" #patient IDs
[8] "Array.Design.REF" #one value: "Illumina.com:PhysicalArrayDesign:HumanMethylation450"
[9] "Term.Source.REF.1" #one value: caArray
[10] "Protocol.REF.2" #one value: jhu-usc.edu:image_acquisition:HumanMethylation450:01
[11] "Protocol.REF.3" #one value: jhu-usc.edu:feature_extraction:HumanMethylation450:01
[12] "Scan.Name" #patient IDs
[13] "Array.Data.File" #Actual file name corresponding to patient IDs
[14] "Comment..TCGA.Archive.Name." #like this: jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0, 7 levels. 7 Batches?
[15] "Comment..TCGA.Data.Type." #one value: DNA methylation
[16] "Comment..TCGA.Data.Level." #one value: Level 1
[17] "Comment..TCGA.Include.for.Analysis." #one value: yes
[18] "Protocol.REF.4" #one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01
[19] "Normalization.Name" #patient IDs
[20] "Derived.Array.Data.Matrix.File" #names for the Level 2 data, like this: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-2.TCGA-A6-4107-01A-02D-1407-05.txt
[21] "Comment..TCGA.Archive.Name..1" #looks like this jhu-usc.edu_COAD.HumanMethylation450.Level_2.1.1.0, 7 levels
[22] "Comment..TCGA.Data.Type..1" #one value: DNA methylation
[23] "Comment..TCGA.Data.Level..1" #one value: Level 2
[24] "Comment..TCGA.Include.for.Analysis..1" #one value: yes
[25] "Protocol.REF.5" #one value: jhu-usc.edu:within_bioassay_data_set_function:HumanMethylation450:01
[26] "Normalization.Name.1" #patient IDs
[27] "Derived.Array.Data.Matrix.File.1" #names for the level 3 data: jhu-usc.edu_COAD.HumanMethylation450.1.lvl-3.TCGA-A6-4107-01A-02D-1407-05.txt
[28] "Comment..TCGA.Archive.Name..2" #looks like this: jhu-usc.edu_COAD.HumanMethylation450.Level_3.1.1.0, 7 levels
[29] "Comment..TCGA.Data.Type..2" #one value: DNA methylation
[30] "Comment..TCGA.Data.Level..2" #one value: Level 3
[31] "Comment..TCGA.Include.for.Analysis..2" #one value: yes
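The dotted names and numeric suffixes above are what R produces when it imports the file. A minimal sketch, assuming the .sdrf was read with read.delim() and its default check.names = TRUE; the raw header spellings and the cell values below are assumptions for illustration, not copied from the actual file:

```r
# Raw SDRF-style headers (assumed spellings; the real file has 31 columns)
raw_headers <- c("Extract Name", "Protocol REF", "Protocol REF",
                 "Comment [TCGA Archive Name]", "Comment [TCGA Archive Name]")

# Write a tiny tab-delimited stand-in for the .sdrf file (made-up cell values)
sdrf_path <- tempfile(fileext = ".sdrf.txt")
writeLines(c(
  paste(raw_headers, collapse = "\t"),
  paste(c("TCGA-A6-4107-01A-02D-1407-05", "p1", "p2", "a1", "a2"),
        collapse = "\t")
), sdrf_path)

# check.names = TRUE (the default) turns spaces and brackets into dots
# and appends .1, .2, ... to repeated headers
sdrf <- read.delim(sdrf_path, stringsAsFactors = FALSE)
print(colnames(sdrf))
```

This is why repeated headers such as Protocol REF show up as Protocol.REF, Protocol.REF.1, and so on in the listing.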
Here are a few lines from the final targets file. The only essential column is Basename, which gives, for every patient, the file location to which _Grn.idat or _Red.idat can be appended. I keep the other information because it is convenient to have later for data normalization.
Sample_Name = patient barcode
Sample_Well = collection center (tissue source site, 2nd field in the barcode)
Sample_Plate = BCR barcode (6th field in the barcode)
Sample_Group = sample type (normal/tumor)
Pool_ID = processing batch (TCGA archive name)
Array = array on a slide
Slide = slide that contains 12 arrays
Basename = path to the files, without the _Grn.idat or _Red.idat suffix
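The mapping from .sdrf columns to targets columns can be sketched as follows. This is only an illustration under several assumptions: the toy sdrf rows, the second barcode, the .idat file names (assumed to follow the slide_array_Grn.idat / slide_array_Red.idat pattern) and the directory idat_dir are all hypothetical, and the tumor/normal split assumes the usual TCGA sample type codes (01-09 tumor, 10-19 normal).

```r
# Toy stand-in for the .sdrf table; barcodes, file names and archive are examples
sdrf <- data.frame(
  Extract.Name = rep(c("TCGA-A6-4107-01A-02D-1407-05",
                       "TCGA-A6-2679-11A-01D-1551-05"), each = 2),
  Label = rep(c("Cy5", "Cy3"), times = 2),
  Array.Data.File = c("6042324037_R01C01_Red.idat", "6042324037_R01C01_Grn.idat",
                      "6042324037_R02C01_Red.idat", "6042324037_R02C01_Grn.idat"),
  Comment..TCGA.Archive.Name. = "jhu-usc.edu_COAD.HumanMethylation450.Level_1.1.1.0",
  stringsAsFactors = FALSE
)
idat_dir <- "/path/to/idat"  # hypothetical directory holding the .idat files

# The red and green files of one array share a prefix; keep one row per array
prefix <- sub("_(Grn|Red)\\.idat$", "", sdrf$Array.Data.File)
one <- sdrf[!duplicated(prefix), ]
prefix <- unique(prefix)

# Split the barcode on "-" and pick fields by position
fields <- strsplit(one$Extract.Name, "-", fixed = TRUE)
field <- function(i) vapply(fields, `[`, character(1), i)
sample_code <- as.integer(substr(field(4), 1, 2))  # control codes 20-29 not handled

targets <- data.frame(
  Sample_Name  = one$Extract.Name,
  Sample_Well  = field(2),                    # 2nd barcode field
  Sample_Plate = field(6),                    # 6th barcode field
  Sample_Group = ifelse(sample_code < 10, "tumor", "normal"),
  Pool_ID      = one$Comment..TCGA.Archive.Name.,
  Array        = sub("^[^_]+_", "", prefix),  # e.g. R01C01
  Slide        = sub("_.*$", "", prefix),     # e.g. 6042324037
  Basename     = file.path(idat_dir, prefix),
  stringsAsFactors = FALSE
)
print(targets)
```

Before reading, a quick check such as file.exists(paste0(targets$Basename, "_Grn.idat")) can catch a wrong directory early.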
3. Read the files
> library(minfi)
> RGset<-read.450k.exp(targets=targets)
# Get the MethylSet, from which the methylated and unmethylated signals can be extracted:
> Mset.raw <- preprocessRaw(RGset) # this applies no normalization, although the package offers several methods, including those implemented in GenomeStudio
Reading all the files took about 10 min on Belltown (R-2.14.1). This gave a total of 247 patients.
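In minfi, the methylated and unmethylated intensity matrices can be pulled from the raw object with getMeth(Mset.raw) and getUnmeth(Mset.raw). The sketch below uses small made-up matrices in their place so that it runs without the data, and shows one common way to turn them into beta values. The formula and the offset of 100 follow the Illumina convention; they are an assumption added here, not something taken from the steps above.

```r
# Toy intensities standing in for getMeth(Mset.raw) and getUnmeth(Mset.raw);
# probe and sample names are made up
meth <- matrix(c(900, 100, 500, 0), nrow = 2,
               dimnames = list(c("cg_a", "cg_b"), c("s1", "s2")))
unmeth <- matrix(c(0, 900, 500, 0), nrow = 2,
                 dimnames = list(c("cg_a", "cg_b"), c("s1", "s2")))

# Beta value: fraction of methylated signal; the offset keeps the
# denominator away from zero for probes with no signal at all
offset <- 100
beta <- meth / (meth + unmeth + offset)
print(round(beta, 3))
```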