Dataset availability as of December 13th, 2011
DNA methylation data availability for 27k and 450k arrays for the selected cancer types (December 20th, 2011):
What can we call a batch in TCGA data?
COAD
Manual comparison for colon cancer:
#Exact correspondence between the number in the derived data matrix (single number before
#lvl_x in the file name) and TCGA Archive Name after Level_X.
#derb1 = number in the derived data matrix
#newFiles[,4] = TCGA Archive Name
> table(newFiles$derb1,newFiles[,4])
    1.1.0 2.2.0 3.0.0 4.0.0 5.2.0 6.0.0 7.0.0 8.0.0
  1    49     0     0     0     0     0     0     0
  2     0    13     0     0     0     0     0     0
  3     0     0    16     0     0     0     0     0
  4     0     0     0    18     0     0     0     0
  5     0     0     0     0    32     0     0     0
  6     0     0     0     0     0    53     0     0
  7     0     0     0     0     0     0     7     0
  8     0     0     0     0     0     0     0    25
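The diagonal pattern above can also be checked programmatically instead of by eye. This is a minimal sketch, not the check actually used in these notes: `is_one_to_one` is a hypothetical helper, and the two toy vectors only stand in for `newFiles$derb1` and `newFiles[,4]`.

```r
# Hypothetical helper: TRUE when every row and every column of a
# contingency table has exactly one non-zero cell (a one-to-one mapping).
is_one_to_one <- function(tab) {
  nonzero <- tab > 0
  all(rowSums(nonzero) == 1) && all(colSums(nonzero) == 1)
}

# Toy data standing in for newFiles$derb1 and newFiles[, 4] (assumption)
derb1   <- c(1, 1, 2, 3, 3)
archive <- c("1.1.0", "1.1.0", "2.2.0", "3.0.0", "3.0.0")

one_to_one <- is_one_to_one(table(derb1, archive))
```

With the real columns the call would be `is_one_to_one(table(newFiles$derb1, newFiles[,4]))`.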
However, I think the TCGA Archive Name might be a better surrogate for the processing group; see the COAD table below. TODO: verify this.
#Correlation between processing groups (TCGA Archive Name) and BCR batches:
> table(newFiles$bcr,newFiles[,4])
       1.1.0 2.2.0 3.0.0 4.0.0 5.2.0 6.0.0 7.0.0 8.0.0
  0820    32     0     0     0     0     0     0     0
  0825     0     8     0     0     0     0     0     0
  0904     0     0     0     0    32     0     0     0
  1020     0     0     0     0     0    53     0     0
  1110     0     0     0     0     0     0     7     0
  1551    12     5     0     0     0     0     0     0
  1552     5     0     0     0     0     0     0     0
  A004     0     0    16     0     0     0     0     0
  A00B     0     0     0    18     0     0     0     0
  A081     0     0     0     0     0     0     0    25
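In the table above the mapping is not one-to-one: BCR batch 1551 is split between archives 1.1.0 and 2.2.0, and archive 1.1.0 pools BCR batches 0820, 1551 and 1552. A sketch of how such cases could be listed automatically follows; the toy vectors are assumptions standing in for `newFiles$bcr` and `newFiles[,4]`.

```r
# Toy data standing in for newFiles$bcr and newFiles[, 4] (assumption)
bcr     <- c("0820", "0820", "1551", "1551", "1552")
archive <- c("1.1.0", "1.1.0", "1.1.0", "2.2.0", "1.1.0")

tab <- table(bcr, archive)

# Number of archives in which each BCR batch appears
n_archives <- rowSums(tab > 0)
# BCR batches that are spread over more than one archive
split_batches <- names(n_archives)[n_archives > 1]

# Number of BCR batches pooled into each archive
n_batches <- colSums(tab > 0)
```

Any entry in `split_batches`, or any archive with `n_batches` above 1, marks a place where BCR batch and processing group are not interchangeable.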
* Processing group/data/BCR batch distribution as of January 19th, 2012
Data downloaded on January 19th, 2012. DNA methylation 27k Level 1 data: 424 patients. DNA methylation 450k Level 1 data: 448. Even the 27k arrays are provided as .idat files, which makes them impossible to process at the moment because minfi (Bioconductor package) handles only 450k arrays.
BCR batch/processing batch distributions for 27k arrays:
Correlation between BCR batch and the Archive Name, based on the sdrf file that was downloaded together with the data:
> table(KIRC27k$TCGAARchiveNameL1, KIRC27k$BCRbatch)
        0859 1186 1284 1287 1303 1332
  1.2.0   71    0    0    0    0    0
  2.2.0    0   61    0    0    0    0
  3.2.0    0    0   96    6    0    0
  4.2.0    0    0    0    0   95    0
  5.2.0    0    0    0    0    0   95
This shows the same correlation as the table above, which was compiled by comparing the files available per batch. In this case there is also a perfect correlation between the Derived Array Data Matrix number (the number before "lvl") and the TCGA Archive Name:
> table(KIRC27k$TCGAARchiveNameL1, KIRC27k$ArrayMatNumber)
          1   2   3   4   5
  1.2.0  71   0   0   0   0
  2.2.0   0  61   0   0   0
  3.2.0   0   0 102   0   0
  4.2.0   0   0   0  95   0
  5.2.0   0   0   0   0  95
#Also the same correlation with BCR batch:
> table(KIRC27k$ArrayMatNumber, KIRC27k$BCRbatch)
    0859 1186 1284 1287 1303 1332
  1   71    0    0    0    0    0
  2    0   61    0    0    0    0
  3    0    0   96    6    0    0
  4    0    0    0    0   95    0
  5    0    0    0    0    0   95
Now look at the correlation between the processing batches and BCR batches in 450k arrays.
> table(KIRC450k$TCGAARchiveNameL1_450, KIRC450k$BCRbatch450)
        1275 1418 1424 1500 1536 1670
  1.3.0   73    0    0    0    0    0
  2.3.0    0   89    0    0    0    0
  3.3.0    0    0   95    0    0    0
  4.3.0    0    0    0   84    0    0
  5.3.0    0    0    0    0   62    0
  6.3.0    0    0    0    0    0   45
These are different BCR batches from those of the 27k arrays. Why would they do that? Are they completely different patients? Extract the patient barcodes from the sdrf file:
> pat450[1:5]
[1] "TCGA-B0-4847-01A-01D-1275-05" "TCGA-B0-4714-01A-01D-1275-05" "TCGA-B0-4819-11A-01D-1275-05" "TCGA-AK-3431-01A-02D-1275-05"
[5] "TCGA-B0-4828-01A-01D-1275-05"
> pat27[1:5]
[1] "TCGA-A3-3326-11A-01D-0859-05" "TCGA-A3-3382-11A-01D-0859-05" "TCGA-A3-3317-11A-01D-0859-05" "TCGA-A3-3322-01A-01D-0859-05"
[5] "TCGA-A3-3316-01A-01D-0859-05"
> length(pat450)
[1] 448
> length(pat27)
[1] 424
#Extract only the patient information regardless of the tissue type, sample type, BCR batch and the processing center.
> shortPat27[1:4]
[1] "TCGA-A3-3326" "TCGA-A3-3382" "TCGA-A3-3317" "TCGA-A3-3322"
> shortPat450[1:4]
[1] "TCGA-B0-4847" "TCGA-B0-4714" "TCGA-B0-4819" "TCGA-AK-3431"
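How `shortPat27` and `shortPat450` were derived is not recorded here. One way to do it, sketched under that caveat, relies on the fixed layout of the TCGA barcode: the first 12 characters identify the patient, the first two characters of the fourth field give the sample type, and the sixth field is what these notes call the BCR batch. The variable names below are illustrative only.

```r
# Three barcodes copied from the pat450 / pat27 output above
barcodes <- c("TCGA-B0-4847-01A-01D-1275-05",
              "TCGA-B0-4819-11A-01D-1275-05",
              "TCGA-A3-3326-11A-01D-0859-05")

# Patient identifier: the first 12 characters, e.g. "TCGA-B0-4847"
patient <- substr(barcodes, 1, 12)

# Split each barcode on "-" into its seven fields (one row per barcode)
fields <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

# Sample type code: first two characters of the fourth field
sample_type <- substr(fields[, 4], 1, 2)

# Sixth field, treated in these notes as the BCR batch
bcr_batch <- fields[, 6]
```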
> kIn<-match(shortPat450,shortPat27)
> table(is.na(kIn))
FALSE  TRUE
    8   440
Apparently there are only 8 patients for whom the DNA methylation analysis was repeated. I suspect that those are normal samples. So, to analyze the data, or at least to decide whether it is worth analyzing, I need to see whether these new BCR batches, which are the same as the processing batches, are correlated with biology. I really hope not.
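Both open questions could be approached directly from the barcodes. Note that `match(shortPat450, shortPat27)` counts 450k entries whose patient also occurs in the 27k set, so the number of unique shared patients may be lower than 8 if a patient contributed several 450k samples. The sketch below uses invented toy barcodes (not real TCGA samples) in place of `pat27` and `pat450`, and reads the sample type from characters 14 to 15 of the barcode, where "01" is primary solid tumor and "11" is solid tissue normal.

```r
# Invented toy barcodes standing in for pat27 and pat450 (assumption)
pat27  <- c("TCGA-A3-3326-11A-01D-0859-05", "TCGA-A3-3322-01A-01D-0859-05",
            "TCGA-B0-4819-01A-01D-1186-05")
pat450 <- c("TCGA-B0-4847-01A-01D-1275-05", "TCGA-B0-4819-11A-01D-1275-05",
            "TCGA-A3-3326-11A-01D-1418-05")

short27  <- substr(pat27, 1, 12)
short450 <- substr(pat450, 1, 12)

# Unique patients profiled on both platforms
shared <- intersect(short450, short27)

# Sample type codes of the 450k samples that come from shared patients
type450 <- substr(pat450, 14, 15)
shared_types <- table(type450[short450 %in% shared])

# BCR batch (sixth barcode field) against sample type for all 450k samples
batch450 <- sapply(strsplit(pat450, "-", fixed = TRUE), `[`, 6)
batch_by_type <- table(batch450, type450)
```

If `shared_types` contained only "11", that would support the suspicion that the repeated patients are normal samples. A `batch_by_type` table in which tumor and normal samples are spread across batches would suggest that batch is not confounded with tissue type, at least for this one variable.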
Data was downloaded on January 20th. 27k arrays (old, semi-processed data format): 166 samples; 450k arrays: 144 samples.
Batch/processing batch distribution for 27k arrays:
It seems that the TCGA Archive Names differ between the Level 1, 2 and 3 data; however, the samples correspond exactly. Makes sense, right?
> head(ref27aped)
                             Comment..TCGA.Archive.Name. Comment..TCGA.Archive.Name..1 Comment..TCGA.Archive.Name..2
TCGA-07-0227-20A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1070-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1071-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1072-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1075-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1076-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
> table(ref27aped[,1],ref27aped[,3])
        1.4.0 2.4.0 3.0.0 4.0.0 5.0.0
  1.3.0    24     0     0     0     0
  2.2.0     0    47     0     0     0
  3.1.0     0     0    64     0     0
  4.0.0     0     0     0    12     0
  5.0.0     0     0     0     0    19
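The correspondence between archive names across the three columns can also be summarized as a small lookup table of distinct combinations. This is only a sketch: the toy matrix stands in for `ref27aped`, the column names `A1`, `A2`, `A3` are made up, and which column belongs to which data level is an assumption.

```r
# Toy matrix standing in for ref27aped: one row per sample, one column per
# archive-name field (column names are hypothetical)
arch <- cbind(A1 = c("1.3.0", "1.3.0", "2.2.0"),
              A2 = c("1.3.0", "1.3.0", "2.2.0"),
              A3 = c("1.4.0", "1.4.0", "2.4.0"))

# Distinct combinations of archive names across the three columns
lookup <- unique(arch)

# The mapping is consistent if no first-column archive name occurs in
# more than one combination
consistent <- !any(duplicated(lookup[, "A1"]))
```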
BCR batch and processing batch are highly correlated, as already indicated in the table above:
> table(ref27aped[,2],bcrB27)
       bcrB27
        0689 0848 0979 1096 1198
  1.3.0   24    0    0    0    0
  2.2.0    0   47    0    0    0
  3.1.0    0    0   64    0    0
  4.0.0    0    0    0   12    0
  5.0.0    0    0    0    0   19
For the 450k arrays, BCR batch is again highly correlated with the processing batch (TCGA Archive Name), and it is a completely different set of patients (based on BCR batch):
> table(ref450l[,1],batch450)
       batch450
        1440 1551 1633 1818 1947
  1.1.0    0    3    0    0    0
  2.1.0   51    0    0    0    0
  3.1.0    0    0   39    0    0
  4.1.0    0    0    0   33    0
  5.1.0    0    0    0    0   18