Datasets availability as of December 13th, 2011

DNA methylation data availability for 27k and 450k arrays for the selected cancer types (December 20th, 2011):

What can we call batch in TCGA data?

COAD

Manual comparison for colon cancer:

{code:collapse=true}#Exact correspondence between the number in the derived data matrix (single number before
#lvl_x in the file name) and TCGA Archive Name after Level_X.
#derb1 = number in the derived data matrix
#newFiles[,4] = TCGA Archive Name
> table(newFiles$derb1,newFiles[,4])

    1.1.0 2.2.0 3.0.0 4.0.0 5.2.0 6.0.0 7.0.0 8.0.0
  1    49     0     0     0     0     0     0     0
  2     0    13     0     0     0     0     0     0
  3     0     0    16     0     0     0     0     0
  4     0     0     0    18     0     0     0     0
  5     0     0     0     0    32     0     0     0
  6     0     0     0     0     0    53     0     0
  7     0     0     0     0     0     0     7     0
  8     0     0     0     0     0     0     0    25{code}
However, I think that the TCGA Archive Name might be a better surrogate for the processing group; see the table for COAD below. TODO: verify this.
{code:collapse=true}#Correlation between processing groups (TCGA Archive Name) and BCR batches:
> table(newFiles$bcr,newFiles[,4])

       1.1.0 2.2.0 3.0.0 4.0.0 5.2.0 6.0.0 7.0.0 8.0.0
  0820    32     0     0     0     0     0     0     0
  0825     0     8     0     0     0     0     0     0
  0904     0     0     0     0    32     0     0     0
  1020     0     0     0     0     0    53     0     0
  1110     0     0     0     0     0     0     7     0
  1551    12     5     0     0     0     0     0     0
  1552     5     0     0     0     0     0     0     0
  A004     0     0    16     0     0     0     0     0
  A00B     0     0     0    18     0     0     0     0
  A081     0     0     0     0     0     0     0    25{code}
{csv}Batch on the download page,TCGA Archive code Level 1 from sdrf file,Comment for TCGA Archive code from sdrf file,"# after ""HumanMethylation27k"" in the file name, Level 1",Batch as the sixth field in the patient barcode,Comments
Batch 28,1.1.0,,1,"0820, 1552, 1551",
Batch 29,2.2.0,,2,"0825, 1551",
Batch 30,3.0.0,,3,A004,
Batch 33,4.0.0,,4,A00B,
Batch 36,5.2.0,1110 is not there probably because sdrf file I have is for the tumor samples only,"5, 7","1110, 0904","all 0904 have 5, 1110 (one sample) is 7"
Batch 41,6.0.0,1110 is not there probably because sdrf file I have is for the tumor samples only,"6,7","1020, 1110","all 1020 have 6, 1110 (one sample) is 7"
Batch 45,7.0.0,,7,1110,
Batch 66,8.0.0,,8,A081,
Batch 76,,,no data,,
Batch 89,,,no data,,
Batch 116,,,no data,,
Batch 123,,,no data,,
Batch 132,,,no data,,
Batch 138,,,no data,,
Batch 154,,,no data,,
Batch 157,,,no data,,
Batch 172 ,,,no data,,{csv}\* Processing group/data/BCR batch distribution as of January 19th, 2011
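
The COAD tables above show that the mapping between BCR batch and TCGA Archive Name is close to, but not exactly, one-to-one. A minimal sketch of how the exceptions could be listed automatically is given below. The counts are copied from the BCR batch by Archive Name table above; the helper `splitLevels` and the object name `counts` are hypothetical and are not part of the original analysis.

```r
# Non-zero cells copied from the COAD table above
# (rows = BCR batch, columns = TCGA Archive Name).
bcr     <- c("0820", "0825", "0904", "1020", "1110", "1551", "1551",
             "1552", "A004", "A00B", "A081")
archive <- c("1.1.0", "2.2.0", "5.2.0", "6.0.0", "7.0.0", "1.1.0", "2.2.0",
             "1.1.0", "3.0.0", "4.0.0", "8.0.0")
n       <- c(32, 8, 32, 53, 7, 12, 5, 5, 16, 18, 25)

# Rebuild the contingency table as a plain named matrix.
counts <- matrix(0,
                 nrow = length(unique(bcr)),
                 ncol = length(unique(archive)),
                 dimnames = list(bcr = unique(bcr),
                                 archive = sort(unique(archive))))
counts[cbind(bcr, archive)] <- n

# Hypothetical helper: names of the rows (margin = 1) or columns (margin = 2)
# that have more than one non-zero cell, i.e. the levels that break a
# one-to-one mapping.
splitLevels <- function(tab, margin) {
  nonZero <- apply(tab > 0, margin, sum)
  names(nonZero)[nonZero > 1]
}

splitLevels(counts, 1)  # BCR batches spread over several archives
splitLevels(counts, 2)  # archives that pool several BCR batches
```

The same helper could be applied to any of the `table()` results on this page, since a `table` object has the same dimnames structure as the matrix used here.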

h5. Did TCGA shoot themselves in the foot? Compare sample by batch distribution for 27k and 450k arrays in KIRC

Data were downloaded on January 19th, 2012. DNA methylation 27k Level 1 data: 424 patients. DNA methylation 450k Level 1 data: 448. Even the 27k arrays are now provided as .idat files, which makes them impossible to process at the moment because minfi (Bioconductor package) handles only 450k arrays.

BCR batch/processing batch distributions for 27k arrays:
{csv}Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 2 data",Batch as the sixth field in the patient barcode,Comments
Batch 32,1,0859,"Level 1 data is uploaded again as .idat files split into green and red probes, I can't figure out how to get batch from the file names. Now, however, they provide slide number and the array letter!"
Batch 50,2,1186,
Batch 63,no data,,
Batch 64,3,"1287, 1284",
Batch 65,4,1303,
Batch 68,no data,,
Batch 69,5,1332,
Batch 70,no data,,
Batch 82,no data,,
Batch 90,no data,,
Batch 105,no data,,{csv}
Correlation between BCR batch and the TCGA Archive Name, based on the sdrf file that was downloaded together with the data:
{code:collapse=true}table(KIRC27k$TCGAARchiveNameL1, KIRC27k$BCRbatch)

        0859 1186 1284 1287 1303 1332
  1.2.0   71    0    0    0    0    0
  2.2.0    0   61    0    0    0    0
  3.2.0    0    0   96    6    0    0
  4.2.0    0    0    0    0   95    0
  5.2.0    0    0    0    0    0   95{code}
It shows the same correlation as the table above, which was compiled by comparing the files available per batch. In this case there is also a perfect correlation between the Derived Array Data Matrix (number before "lvl") and the TCGA Archive:
{code:collapse=true}> table(KIRC27k$TCGAARchiveNameL1, KIRC27k$ArrayMatNumber)

           1    2    3    4    5
  1.2.0   71    0    0    0    0
  2.2.0    0   61    0    0    0
  3.2.0    0    0  102    0    0
  4.2.0    0    0    0   95    0
  5.2.0    0    0    0    0   95

#Also the same correlation with BCR batch:
> table(KIRC27k$ArrayMatNumber, KIRC27k$BCRbatch)

    0859 1186 1284 1287 1303 1332
  1   71    0    0    0    0    0
  2    0   61    0    0    0    0
  3    0    0   96    6    0    0
  4    0    0    0    0   95    0
  5    0    0    0    0    0   95
{code}
Now look at the correlation between the processing batches and BCR batches in 450k arrays.
{code:collapse=true}
> table(KIRC450k$TCGAARchiveNameL1_450, KIRC450k$BCRbatch450)

        1275 1418 1424 1500 1536 1670
  1.3.0   73    0    0    0    0    0
  2.3.0    0   89    0    0    0    0
  3.3.0    0    0   95    0    0    0
  4.3.0    0    0    0   84    0    0
  5.3.0    0    0    0    0   62    0
  6.3.0    0    0    0    0    0   45
{code}

These are different BCR batches from those of the 27k arrays. Why would they do that? Are they completely different patients? Extract the patient barcodes from the sdrf file:
{code:collapse=true}> pat450[1:5]
[1] "TCGA-B0-4847-01A-01D-1275-05" "TCGA-B0-4714-01A-01D-1275-05" "TCGA-B0-4819-11A-01D-1275-05" "TCGA-AK-3431-01A-02D-1275-05"
[5] "TCGA-B0-4828-01A-01D-1275-05"
> pat27[1:5]
[1] "TCGA-A3-3326-11A-01D-0859-05" "TCGA-A3-3382-11A-01D-0859-05" "TCGA-A3-3317-11A-01D-0859-05" "TCGA-A3-3322-01A-01D-0859-05"
[5] "TCGA-A3-3316-01A-01D-0859-05"
> length(pat450)
[1] 448
> length(pat27)
[1] 424
#Extract only the patient information regardless of the tissue type, sample type, BCR batch and the processing center.
> shortPat27[1:4]
[1] "TCGA-A3-3326" "TCGA-A3-3382" "TCGA-A3-3317" "TCGA-A3-3322"
> shortPat450[1:4]
[1] "TCGA-B0-4847" "TCGA-B0-4714" "TCGA-B0-4819" "TCGA-AK-3431"
> kIn<-match(shortPat450,shortPat27)
> table(is.na(kIn))

FALSE  TRUE
    8   440
{code}
Apparently there are only 8 patients for whom the DNA methylation analysis was repeated. I suspect that those are normal samples. So to analyze the data, or at least to decide whether it is worth analyzing, I need to see whether these new BCR batches, which are the same as the processing batches, are correlated with biology. I really hope not.
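
The transcript above does not show how `shortPat27` and `shortPat450` were derived, so the following is only a sketch of one way to do it. It assumes the standard TCGA barcode layout, in which the first three dash-separated fields identify the patient, the first two characters of the fourth field give the sample type ("01" primary tumor, "11" solid tissue normal) and the sixth field is what this page calls the BCR batch. The helper `parseBarcode` is hypothetical, and the barcodes are the first five of each platform copied from the output above.

```r
# Example barcodes copied from the pat450 / pat27 output above.
pat450 <- c("TCGA-B0-4847-01A-01D-1275-05", "TCGA-B0-4714-01A-01D-1275-05",
            "TCGA-B0-4819-11A-01D-1275-05", "TCGA-AK-3431-01A-02D-1275-05",
            "TCGA-B0-4828-01A-01D-1275-05")
pat27  <- c("TCGA-A3-3326-11A-01D-0859-05", "TCGA-A3-3382-11A-01D-0859-05",
            "TCGA-A3-3317-11A-01D-0859-05", "TCGA-A3-3322-01A-01D-0859-05",
            "TCGA-A3-3316-01A-01D-0859-05")

# Hypothetical helper: split each barcode on "-" and name the pieces used here.
parseBarcode <- function(barcodes) {
  fields <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))
  data.frame(
    patient    = paste(fields[, 1], fields[, 2], fields[, 3], sep = "-"),
    sampleType = substr(fields[, 4], 1, 2),  # "01" tumor, "11" normal
    bcrBatch   = fields[, 6],                # sixth field of the barcode
    stringsAsFactors = FALSE
  )
}

info450 <- parseBarcode(pat450)
info27  <- parseBarcode(pat27)

# Patients measured on both platforms (none among these ten example barcodes).
shared <- intersect(info450$patient, info27$patient)

# Sample types of the shared patients, to check whether they are normals.
table(info450$sampleType[info450$patient %in% shared])

# First look at batch versus biology: BCR batch against sample type.
table(info450$bcrBatch, info450$sampleType)
```

Run on the full `pat450` and `pat27` vectors, the last two tables would show directly whether the 8 repeated patients are normal samples and whether sample type is unevenly spread across the new BCR batches.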

h5. Did TCGA shoot themselves in the foot? Compare sample by batch distribution for 27k and 450k arrays in LUSC

Data were downloaded on January 20th. 27k arrays (old, semi-processed data format): 166 samples; 450k arrays: 144 samples.

Batch/processing batch distribution for 27k arrays:
{csv}Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 1",Batch as the sixth field in the patient barcode
Batch 23,1,0689
Batch 31,2,0848
Batch 39,3,0979
Batch 53,4,1096
Batch 60,5,1198
Batch 77,no data,
Batch 101,no data,
Batch 140,no data,
Batch 159,no data,
Batch 181,no data,
Batch 193,no data,
{csv}

It seems that the TCGA Archive Names differ between the Level 1, 2 and 3 data; however, the samples correspond exactly. Makes sense, right?
{code:collapse=true}


> head(ref27aped)
                             Comment..TCGA.Archive.Name. Comment..TCGA.Archive.Name..1 Comment..TCGA.Archive.Name..2
TCGA-07-0227-20A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1070-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1071-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1072-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1075-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1076-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"


> table(ref27aped[,1],ref27aped[,3])

        1.4.0 2.4.0 3.0.0 4.0.0 5.0.0
  1.3.0    24     0     0     0     0
  2.2.0     0    47     0     0     0
  3.1.0     0     0    64     0     0
  4.0.0     0     0     0    12     0
  5.0.0     0     0     0     0    19
{code}

BCR batch and the processing batch are highly correlated, as was already indicated in the table above:
{code:collapse=true}
> table(ref27aped[,2],bcrB27)
       bcrB27
        0689 0848 0979 1096 1198
  1.3.0   24    0    0    0    0
  2.2.0    0   47    0    0    0
  3.1.0    0    0   64    0    0
  4.0.0    0    0    0   12    0
  5.0.0    0    0    0    0   19
{code}

For 450k arrays, BCR batch is again highly correlated with the processing batch (TCGA ...

Version	Old Version 15	New Version Current
Changes made by	Vitalina Komashko (Unlicensed)	Vitalina Komashko (Unlicensed)
Saved on	Jan 20, 2012	Jan 20, 2012
