Dataset availability as of December 13th, 2011
Wiki Markup |
---|
{csv}Cancer,Code,Methylation,Expression,miRNA,Sequence
Acute Myeloid Leukemia,LAML,193/0/0,0/0/0,0/0/0,200/200/0
Bladder Urothelial Carcinoma,BLCA,38/11/0,0/0/0,0/0/0,38/40/0
Brain Lower Grade Glioma,LGG,24/2/0,27/0/0,0/0/0,80/80/0
Breast Invasive Carcinoma,BRCA,816/117/8,536/62/2,0/0/0,807/834/28
Cervical squamous cell carcinoma,CESC,0/0/0,0/0/0,0/0/0,37/37/2
Colon adenocarcinoma,COAD,166/37/3,154/12/9,0/0/0,421/474/19
Glioblastoma multiforme,GBM,541/0/1,572/0/11,492/0/,580/515/28
Head and neck squamous cell carcinoma,HNSC,0/0/0,0/0/0,0/0/0,170/191/6
Kidney renal cell carcinoma,KIRC,391/346/6,72/0/1,0/0/0,494/504/11
Kidney renal papillary cell carcinoma,KIRP,16/5/1,16/0/0,0/0/0,43/46/0
Liver hepatocellular carcinoma,LIHC,0/0/0,0/0/0,0/0/0,55/69/1
Lung adenocarcinoma,LUAD,127/24/1,32/0/1,0/0/0,205/234/2
Lung squamous cell carcinoma,LUSC,228/67/3,154/0/1,0/0/0,229/246/23
Lymphoid neoplasm diffuse large B-cell Lymphoma,DLBC,0/0/0,0/0/0,0/0/0,0/0/0
Lymphoid neoplasm Non-Hodgkin Lymphoma,LNNH,0/0/0,0/0/0,0/0/0,0/0/0
Ovarian serous adenocarcinoma,OV,573/4/9,589/0/9,586/0/8,590/574/30
Pancreatic adenocarcinoma,PAAD,0/0/0,0/0/0,0/0/0,14/14/14
Prostate adenocarcinoma,PRAD,0/0/0,0/0/0,0/0/0,83/96/1
Rectum adenocarcinoma,READ,69/5/3,69/3/2,0/0/0,165/170/4
Skin cutaneous melanoma,SKCM,0/0/0,0/0/0,0/0/0,0/0/0
Stomach adenocarcinoma,STAD,118/45/1,0/0/0,0/0/0,135/148/2
Thyroid carcinoma,THCA,0/0/0,0/0/0,0/0/0,86/96/1
Uterine corpus endometrioid carcinoma,UCEC,373/16/12,54/0/0,0/0/0,371/376/14{csv} |
DNA methylation data availability for 27k and 450k arrays for the selected cancer types (December 20th, 2011):
Wiki Markup |
---|
{csv}Cancer Type,Code,Methylation/# patients,27k,450 k Level 1 type,450 k Level 1 # tumors,450 k Level 2 # tumors,Number of batches,Barcode # batches
Acute Myeloid Leukemia,LAML,193/0/0,188,txt (GenomeStudio export),190,didn't download,2,3
Breast Invasive Carcinoma,BRCA,816/117/8,26,idat (bead level),didn't download,91,,24
COLON (COAD+READ),butt,166/37/3,237,not available,not available,not available,,9
Glioblastoma Multiforme,GBM,541/0/1,294,txt (GenomeStudio export),74,didn't download,19,19
Kidney Renal Cell Carcinoma,KIRC,391/346/6,219,idat (bead level),didn't download,173,12,16
Lung Adenocarcinoma,LUAD,127/24/1,32,not available,not available,not available,11,8
Lung Squamous Cell Carcinoma,LUSC,228/67/3,134,idat (bead level),didn't download,95,11,12
Stomach Adenocarcinoma,STAD,118/45/1,66,idat (bead level),didn't download,52,6,6
Uterine Corpus Endometrioid Carcinoma,UCEC,373/16/12,117,idat (bead level),didn't download,256,19,6{csv} |
What can we call a batch in TCGA data?
COAD
Manual comparison for colon cancer:
Code Block | ||
---|---|---|
| ||
#Exact correspondence between the number in the derived data matrix (single number before
#lvl_x in the file name) and TCGA Archive Name after Level_X.
#derb1 = number in the derived data matrix
#newFiles[,4] = TCGA Archive Name
> table(newFiles$derb1,newFiles[,4])
    1.1.0 2.2.0 3.0.0 4.0.0 5.2.0 6.0.0 7.0.0 8.0.0
  1    49     0     0     0     0     0     0     0
  2     0    13     0     0     0     0     0     0
  3     0     0    16     0     0     0     0     0
  4     0     0     0    18     0     0     0     0
  5     0     0     0     0    32     0     0     0
  6     0     0     0     0     0    53     0     0
  7     0     0     0     0     0     0     7     0
  8     0     0     0     0     0     0     0    25{code}
|
However, I think that the TCGA Archive Name might be a better surrogate for the processing group. See the table for COAD below. TODO: verify it.
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
#Correlation between processing groups (TCGA Archive Name) and BCR batches:
> table(newFiles$bcr,newFiles[,4])
       1.1.0 2.2.0 3.0.0 4.0.0 5.2.0 6.0.0 7.0.0 8.0.0
  0820    32     0     0     0     0     0     0     0
  0825     0     8     0     0     0     0     0     0
  0904     0     0     0     0    32     0     0     0
  1020     0     0     0     0     0    53     0     0
  1110     0     0     0     0     0     0     7     0
  1551    12     5     0     0     0     0     0     0
  1552     5     0     0     0     0     0     0     0
  A004     0     0    16     0     0     0     0     0
  A00B     0     0     0    18     0     0     0     0
  A081     0     0     0     0     0     0     0    25{code}
|
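The "TODO: verify it" note can be approached by asking R directly whether the mapping between BCR batch and archive name is one-to-one. The sketch below is only an illustration: it rebuilds the contingency table from the non-zero counts printed above rather than from `newFiles`, and the names `cells`, `split_batches` and `pooled_archives` are made up for this example.

```r
# Non-zero cells copied from the BCR batch vs. archive name table above.
cells <- data.frame(
  bcr     = c("0820", "0825", "0904", "1020", "1110", "1551",
              "1551", "1552", "A004", "A00B", "A081"),
  archive = c("1.1.0", "2.2.0", "5.2.0", "6.0.0", "7.0.0", "1.1.0",
              "2.2.0", "1.1.0", "3.0.0", "4.0.0", "8.0.0"),
  n       = c(32, 8, 32, 53, 7, 12, 5, 5, 16, 18, 25)
)

# Rebuild the contingency table from the counts.
tab <- xtabs(n ~ bcr + archive, data = cells)

# BCR batches whose samples are spread over more than one archive.
split_batches <- names(which(rowSums(tab > 0) > 1))

# Archives that pool samples from more than one BCR batch.
pooled_archives <- names(which(colSums(tab > 0) > 1))

print(split_batches)
print(pooled_archives)
```

With the real data the same two lines can be run on `table(newFiles$bcr, newFiles[,4])`. For the counts shown here, batch 1551 is split between archives 1.1.0 and 2.2.0, so the correspondence is close but not strictly one-to-one.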
Wiki Markup |
---|
{csv}Batch on the download page,TCGA Archive code Level 1 from sdrf file,Comment for TCGA Archive code from sdrf file,"# after ""HumanMethylation27k"" in the file name, Level 1",Batch as the sixth field in the patient barcode,Comments
Batch 28,1.1.0,,1,"0820, 1552, 1551",
Batch 29,2.2.0,,2,"0825, 1551",
Batch 30,3.0.0,,3,A004,
Batch 33,4.0.0,,4,A00B,
Batch 36,5.2.0,1110 is not there probably because sdrf file I have is for the tumor samples only,"5, 7","1110, 0904","all 0904 have 5, 1110 (one sample) is 7"
Batch 41,6.0.0,1110 is not there probably because sdrf file I have is for the tumor samples only,"6,7","1020, 1110","all 1020 have 6, 1110 (one sample) is 7"
Batch 45,7.0.0,,7,1110,
Batch 66,8.0.0,,8,A081,
Batch 76,,,no data,,
Batch 89,,,no data,,
Batch 116,,,no data,,
Batch 123,,,no data,,
Batch 132,,,no data,,
Batch 138,,,no data,,
Batch 154,,,no data,,
Batch 157,,,no data,,
Batch 172 ,,,no data,,{csv} |
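The "sixth field in the patient barcode" used in the table can be pulled out by splitting each barcode on dashes. A minimal sketch, assuming the barcodes are available as a character vector; `barcode_field` is a hypothetical helper written for this page, and the three example barcodes are copied from the sdrf excerpts further down.

```r
# Aliquot barcodes copied from the sdrf excerpts on this page.
barcodes <- c("TCGA-B0-4847-01A-01D-1275-05",
              "TCGA-A3-3326-11A-01D-0859-05",
              "TCGA-21-1070-01A-01D-0689-05")

# Hypothetical helper: return the n-th dash-separated field of each barcode.
barcode_field <- function(x, n) {
  vapply(strsplit(x, "-", fixed = TRUE), function(p) p[n],
         character(1), USE.NAMES = FALSE)
}

bcr_batch <- barcode_field(barcodes, 6)  # sixth field = BCR batch
patient   <- substr(barcodes, 1, 12)     # first three fields = patient ID

print(data.frame(barcode = barcodes, patient = patient, bcr_batch = bcr_batch))
```

Splitting on the dash rather than using fixed character positions keeps the helper usable for any field of the barcode.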
Processing group/data/BCR batch distribution as of January 19th, 2012
Did
...
TCGA
...
shoot
...
themselves
...
in
...
the
...
foot?
...
Compare
...
sample
...
by
...
batch
...
distribution
...
for
...
27
...
and
...
450k
...
arrays
...
in
...
KIRC
...
Data
...
downloaded
...
on
...
January
...
19th,
...
2012.
...
DNA
...
methylation
...
Level
...
1
...
data
...
27k:
...
424 patients.
...
DNA
...
methylation
...
450k
...
Level
...
1
...
data:
...
448
...
.
...
Even
...
27k
...
arrays
...
are
...
.idat
...
files
...
which
...
makes
...
them
...
impossible
...
to
...
process
...
at
...
this
...
moment
...
because
...
minfi
...
(Bioconductor
...
package)
...
handles
...
only
...
450
...
k
...
arrays.
...
BCR batch/processing
...
batch
...
distributions
...
for
...
27
...
k
...
arrays:
...
Wiki Markup |
---|
{csv}Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 2 data",Batch as the sixth field in the patient barcode,Comments
Batch 32,1,0859,"Level 1 data is uploaded again as .idat files split into green and red probes, I can't figure out how to get batch from the file names. Now, however, they provide slide number and the array letter!"
Batch 50,2,1186,
Batch 63,no data,,
Batch 64,3,"1287, 1284",
Batch 65,4,1303,
Batch 68,no data,,
Batch 69,5,1332,
Batch 70,no data,,
Batch 82,no data,,
Batch 90,no data,,
Batch 105,no data,,{csv} |
Correlation between BCR batch and the Archive Name, based on the sdrf file that was downloaded together with the data:
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> table(KIRC27k$TCGAARchiveNameL1, KIRC27k$BCRbatch)
        0859 1186 1284 1287 1303 1332
  1.2.0   71    0    0    0    0    0
  2.2.0    0   61    0    0    0    0
  3.2.0    0    0   96    6    0    0
  4.2.0    0    0    0    0   95    0
  5.2.0    0    0    0    0    0   95{code}
|
It shows the same correlation as the table above, which was compiled by comparing the files available per batch.
In this case there is also a perfect correlation between the Derived Array Data Matrix number (the number before "lvl") and the TCGA Archive Name:
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> table(KIRC27k$TCGAARchiveNameL1, KIRC27k$ArrayMatNumber)
          1   2   3   4   5
  1.2.0  71   0   0   0   0
  2.2.0   0  61   0   0   0
  3.2.0   0   0 102   0   0
  4.2.0   0   0   0  95   0
  5.2.0   0   0   0   0  95
#Also the same correlation with BCR batch:
> table(KIRC27k$ArrayMatNumber, KIRC27k$BCRbatch)
    0859 1186 1284 1287 1303 1332
  1   71    0    0    0    0    0
  2    0   61    0    0    0    0
  3    0    0   96    6    0    0
  4    0    0    0    0   95    0
  5    0    0    0    0    0   95
{code}
|
Now look at the correlation between the processing batches and BCR batches in the 450k arrays.
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> table(KIRC450k$TCGAARchiveNameL1_450, KIRC450k$BCRbatch450)
        1275 1418 1424 1500 1536 1670
  1.3.0   73    0    0    0    0    0
  2.3.0    0   89    0    0    0    0
  3.3.0    0    0   95    0    0    0
  4.3.0    0    0    0   84    0    0
  5.3.0    0    0    0    0   62    0
  6.3.0    0    0    0    0    0   45
{code}
|
Those are different BCR batches from the 27k arrays. Why would they do that? Are they completely different patients?
Extract patient barcodes from the sdrf file:
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> pat450[1:5]
[1] "TCGA-B0-4847-01A-01D-1275-05" "TCGA-B0-4714-01A-01D-1275-05" "TCGA-B0-4819-11A-01D-1275-05" "TCGA-AK-3431-01A-02D-1275-05"
[5] "TCGA-B0-4828-01A-01D-1275-05"
> pat27[1:5]
[1] "TCGA-A3-3326-11A-01D-0859-05" "TCGA-A3-3382-11A-01D-0859-05" "TCGA-A3-3317-11A-01D-0859-05" "TCGA-A3-3322-01A-01D-0859-05"
[5] "TCGA-A3-3316-01A-01D-0859-05"
> length(pat450)
[1] 448
> length(pat27)
[1] 424
#Extract only the patient information, regardless of the tissue type, sample type, BCR batch and the processing center.
> shortPat27[1:4]
[1] "TCGA-A3-3326" "TCGA-A3-3382" "TCGA-A3-3317" "TCGA-A3-3322"
> shortPat450[1:4]
[1] "TCGA-B0-4847" "TCGA-B0-4714" "TCGA-B0-4819" "TCGA-AK-3431"
> kIn<-match(shortPat450,shortPat27)
> table(is.na(kIn))
FALSE  TRUE
    8   440
{code}
|
Apparently there are only 8 patients for whom they repeated the DNA methylation analysis. I suspect that those are normal samples. So, to analyze the data, or at least to decide whether it is worth analyzing, I need to see whether these new BCR batches, which are the same as the processing batches, are correlated with biology. I really hope not.
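Both open questions (are the repeated patients normal samples, and do the batches track biology?) can be checked from the barcodes alone. The sketch below uses invented toy barcodes, not real TCGA samples, and the names `sample_class`, `shared` and `batch450` are assumptions made for this illustration. It relies on the TCGA barcode convention that the first two characters of the fourth field give the sample type: 01-09 for tumor, 10-19 for normal, 20-29 for control.

```r
# Toy barcodes, invented for illustration only (not real TCGA samples).
pat27  <- c("TCGA-XX-0001-01A-01D-0859-05",
            "TCGA-XX-0002-11A-01D-0859-05")
pat450 <- c("TCGA-XX-0002-11A-01D-1275-05",
            "TCGA-XX-0003-01A-01D-1275-05",
            "TCGA-XX-0004-01A-01D-1418-05")

# Patient IDs are the first 12 characters of the barcode.
shortPat27  <- substr(pat27, 1, 12)
shortPat450 <- substr(pat450, 1, 12)

# Sample-type code: characters 14-15 (start of the fourth field).
type_code    <- as.integer(substr(pat450, 14, 15))
sample_class <- ifelse(type_code < 10, "tumor",
                       ifelse(type_code < 20, "normal", "control"))

# Which 450k samples come from patients also present on the 27k arrays?
shared <- shortPat450 %in% shortPat27
print(table(shared, sample_class))

# BCR batch (sixth field) against sample class: a first look at whether
# batch is confounded with tumor/normal status.
batch450 <- sapply(strsplit(pat450, "-", fixed = TRUE), `[`, 6)
print(table(batch450, sample_class))
```

On the real `pat27` and `pat450` vectors, the first table would show directly whether the 8 matched samples are normals, and the second whether any batch is dominated by one sample class.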
Did TCGA shoot themselves in the foot? Compare sample-by-batch distribution for 27k and 450k arrays in LUSC
Data was downloaded on January 20th.
27k arrays (old, semi-processed data format): 166 samples; 450k arrays: 144 samples.
Batch/processing batch distribution for 27k arrays:
Wiki Markup |
---|
{csv}Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 1",Batch as the sixth field in the patient barcode
Batch 23,1,0689
Batch 31,2,0848
Batch 39,3,0979
Batch 53,4,1096
Batch 60,5,1198
Batch 77,no data,
Batch 101,no data,
Batch 140,no data,
Batch 159,no data,
Batch 181,no data,
Batch 193,no data,
{csv} |
It seems that the TCGA Archive Names differ between the Level 1, 2 and 3 data; however, the samples correspond exactly. Makes sense, right?
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> head(ref27aped)
                             Comment..TCGA.Archive.Name. Comment..TCGA.Archive.Name..1 Comment..TCGA.Archive.Name..2
TCGA-07-0227-20A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1070-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1071-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1072-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1075-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
TCGA-21-1076-01A-01D-0689-05 "1.3.0"                     "1.3.0"                       "1.4.0"
> table(ref27aped[,1],ref27aped[,3])
        1.4.0 2.4.0 3.0.0 4.0.0 5.0.0
  1.3.0    24     0     0     0     0
  2.2.0     0    47     0     0     0
  3.1.0     0     0    64     0     0
  4.0.0     0     0     0    12     0
  5.0.0     0     0     0     0    19
{code}
|
BCR batch and the processing batch are highly correlated, as already indicated in the table above:
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> table(ref27aped[,2],bcrB27)
       bcrB27
        0689 0848 0979 1096 1198
  1.3.0   24    0    0    0    0
  2.2.0    0   47    0    0    0
  3.1.0    0    0   64    0    0
  4.0.0    0    0    0   12    0
  5.0.0    0    0    0    0   19
{code}
|
For the 450k arrays, BCR batch is again highly correlated with the processing batch (TCGA Archive Name), and it is a completely different set of patients (based on BCR batch):
Code Block | ||||||
---|---|---|---|---|---|---|
| =
| }|||||
> table(ref450l[,1],batch450)
       batch450
        1440 1551 1633 1818 1947
  1.1.0    0    3    0    0    0
  2.1.0   51    0    0    0    0
  3.1.0    0    0   39    0    0
  4.1.0    0    0    0   33    0
  5.1.0    0    0    0    0   18
{code} |
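The "completely different set" statement rests on the BCR batch codes of the two platforms not overlapping. A quick sketch of that comparison, using only the batch codes printed in the two LUSC tables above; the variable names are made up for this example.

```r
# BCR batch codes copied from the LUSC 27k and 450k tables above.
bcr27  <- c("0689", "0848", "0979", "1096", "1198")
bcr450 <- c("1440", "1551", "1633", "1818", "1947")

# Batches that appear on both platforms.
shared_batches <- intersect(bcr27, bcr450)
print(length(shared_batches))
```

Disjoint BCR batches do not by themselves prove disjoint patients, since the same patient could be re-plated in a later batch. Comparing the first 12 characters of the barcodes, as done for KIRC, would settle it.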