...
Important
...
update
...
(January
...
20th,
...
2011):
...
the
...
data
...
below
...
have
...
been
...
corrected
...
for
...
the
...
BCR
...
batch
...
which
...
is
...
not
...
necessarily
...
the
...
processing
...
batch.
...
The
...
dataset
...
needs
...
to
...
be
...
reanalyzed.
...
Correlation between BCR batch and the processing batch for 27k arrays in COAD (January 20, 2012)
Wiki Markup |
---|
{csv}Batch on the download page,TCGA Archive code Level 1 from sdrf file,Comment for TCGA Arcive code from sdrf file,"# after ""HumanMethylation27k"" in the file name, Level 1",Batch as the sixth field in the patient barcode,Comments
Batch 28,1.1.0,,1,"0820, 1552, 1551",
Batch 29,2.2.0,,2,"0825, 1551",
Batch 30,3.0.0,,3,A004,
Batch 33,4.0.0,,4,A00B,
Batch 36,5.2.0,1110 is not there probably because sdrf file I have is for the tumor samples only,"5, 7","1110, 0904","all 0904 have 5, 1110 (one sample) is 7"
Batch 41,6.0.0,1110 is not there probably because sdrf file I have is for the tumor samples only,"6,7","1020, 1110","all 1020 have 6, 1110 (one sample) is 7"
Batch 45,7.0.0,,7,1110,
Batch 66,8.0.0,,8,A081,
Batch 76,,,no data,,
Batch 89,,,no data,,
Batch 116,,,no data,,
Batch 123,,,no data,,
Batch 132,,,no data,,
Batch 138,,,no data,,
Batch 154,,,no data,,
Batch 157,,,no data,,
Batch 172 ,,,no data,,{csv} |
...
Correlation
...
between
...
BCR
...
batch
...
and
...
the
...
processing
...
batch
...
for
...
27k
...
arrays
...
in
...
READ
...
(January
...
20,
...
2012):
...
Wiki Markup |
---|
{csv}Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 1",Batch as the sixth field in the patient barcode,Comments
Batch 42,"1, 2, 5, 6","0820, 1552, 0825, 0904, 1116",1 corresponds to 0820 and 1552; 2 corresponds to 0825; 1 sample that corresponds to 5 is 0904; 1 samples that corresponds to 6 is 1116
Batch 43,"3, 4","A004, A00B","A004 corresponds to 3, A00B corresponds to 4"
Batch 46,6,1116,
Batch 67,4,A00B,only one sample is available
Batch 102,no data,,
Batch 122,no data,,
Batch 133,no data,,
Batch 139,no data,,
Batch 158,no data,,{csv |
...
} |
Batch vs clinical traits,
...
data
...
collection
...
Correlation
...
of
...
batch
...
with
...
clinical
...
traits
...
(READ:
...
can
...
be
...
found
...
here,
...
COAD:
...
can
...
be
...
found
...
here).
...
READ,
...
number
...
of
...
batches:
...
11,
...
COAD,
...
number
...
of
...
batches:
...
15
...
Additional
...
technical
...
variables
...
to be considered in normalization:
...
batch
...
(6th
...
field
...
in
...
the
...
patient's
...
barcode),
...
center
...
(2nd
...
field
...
in
...
the
...
patients'
...
barcode),
...
amount,
...
barcode
...
bottom
...
(?),
...
concentration,
...
day,
...
month,
...
year
...
(to
...
be
...
concatenated
...
and
...
considered
...
as
...
a
...
single
...
variable),
...
plate
...
row,
...
plate
...
column.
...
This
...
information
...
is
...
available
...
from
...
clinical_aliquot_public_CANCERTYPE.txt
...
files.
DNA methylation:
...
27k,
...
level1
...
downloaded
...
on
...
December
...
19th,
...
2011.
...
Total number of COAD patients (tumor/matched
...
normal/unmatched
...
normal)
...
=
...
212
...
Total
...
number
...
of
...
READ
...
patients
...
(tumor/matched
...
normal/unmatched
...
normal)
...
=
...
81
...
Place
...
them
...
in
...
the
...
same
...
directory
...
and
...
combine
...
in
...
a
...
single
...
matrix
...
of
...
M
...
values
...
(log2(methylated/unmethylated))
...
Total
...
number
...
of
...
tumor
...
samples:
...
237
...
Extracted
...
batch
...
and
...
center
...
information,
...
relationship
...
between
...
them:
...
Code Block |
---|
> table(center,batch)
batch
center 0820 0825 0904 1020 1110 1116 A004 A00B A081
A6 4 2 0 4 0 0 0 0 0
AA 26 5 31 44 4 0 15 17 13
AF 4 0 0 0 0 1 0 0 0
AG 12 10 8 0 0 17 9 9 0
AY 0 0 0 0 2 0 0 0 0
{code}
|
For
...
the
...
combined
...
DNA
...
methylation
...
dataset
...
(237
...
patients)
...
batch
...
vs
...
clinical
...
traits
...
significant
...
correlations
...
(P
...
value
...
=
...
0.05)
...
are
...
listed.
...
Note:
...
traits
...
1
...
and
...
2
...
should
...
be
...
used
...
for
...
the
...
calculation
...
of
...
the
...
survival
...
and
...
then
...
Cox
...
model
...
should
...
be
...
used
...
to
...
show
...
that
...
batch
...
is
...
predictive
...
of
...
survival.
...
Complete
...
list
...
can
...
be
...
found
...
Wiki Markup |
---|
{csv}"COLON_clinical traits","DataType","NumberOfNAs","Test","Pvalue"
"days_to_last_followup","integer",0,"Kruskal-Wallis rank sum test",1.19E-23
"days_to_last_known_alive","integer",0,"Kruskal-Wallis rank sum test",2.63E-19
"days_to_form_completion","integer",0,"Kruskal-Wallis rank sum test",1.32E-14
"tumor_tissue_site","factor",2,"Pearson's Chi-squared test",3.41E-14
"histological_type","factor",7,"Pearson's Chi-squared test",9.69E-11
"year_of_initial_pathologic_diagnosis","integer",0,"Kruskal-Wallis rank sum test",3.77E-09
"anatomic_organ_subdivision","factor",2,"Pearson's Chi-squared test",8.00E-07
"vascular_invasion_present","factor",27,"Pearson's Chi-squared test",8.85E-06
"vital_status","factor",1,"Pearson's Chi-squared test",3.20E-05
"lymphatic_invasion_present","factor",12,"Pearson's Chi-squared test",3.53E-05
"residual_tumor","factor",6,"Pearson's Chi-squared test",5.04E-04
"number_of_lymphnodes_examined","integer",2,"Kruskal-Wallis rank sum test",1.48E-03
"kras_gene_analysis_performed","factor",1,"Pearson's Chi-squared test",5.44E-02
"gender","factor",0,"Pearson's Chi-squared test",6.25E-02
"person_neoplasm_cancer_status","factor",0,"Pearson's Chi-squared test",8.27E-02
"tumor_stage","factor",27,"Pearson's Chi-squared test",8.89E-02
{csv} |
...
Correlation
...
with
...
survival
...
Survival
...
object,
...
relevant
...
clinical
...
variables:
...
days
...
to
...
death
...
(8),
...
days
...
to
...
the
...
last
...
follow
...
up
...
(11),
...
days
...
to
...
the
...
last
...
know
...
alive
...
(12),
...
vitalin
...
status
...
(48).
...
Summaries:
...
Code Block |
---|
> table(bio[,48]) # vital status
DECEASED LIVING
22 214
> summary(bio[,12]) # days to the last known alive
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 31.0 204.3 332.0 1581.0
> summary(bio[,8]) # days to death
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 37.25 105.50 374.10 535.00 1581.00 215.00
> summary(bio[,11]) # days to the last follow - up
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 30.0 197.1 332.0 1581.0{code}
|
It
...
is
...
interesting
...
to
...
see
...
that
...
days
...
to
...
the
...
last
...
follow
...
up
...
and
...
the
...
days
...
to
...
the
...
last
...
know
...
alive
...
are
...
almost
...
identical.
...
In
...
some
...
cases
...
when
...
days
...
to
...
the
...
last
...
known
...
alive
...
has
...
a
...
number
...
(not
...
0)
...
days
...
to
...
the
...
last
...
follow
...
up
...
is
...
showing
...
0.
...
I
...
decided
...
to
...
use
...
days
...
to
...
the
...
last
...
known
...
alive
...
instead
...
of
...
the
...
days
...
to
...
the
...
last
...
follow
...
up
...
for
...
the
...
construction
...
of
...
the
...
survival
...
object.
...
There are no patients that would have no information for the days to the last follow up and vital status. Create the survival object:
Code Block | ||
---|---|---|
| ||
> head(x)
VitalStatus DaysKnownAlive DaysLastFollowUp DaysDeath
TCGA-A6-2674 2 523 523 NA
TCGA-A6-2677 2 541 541 NA
TCGA-A6-2678 2 437 437 NA
TCGA-A6-2683 2 472 472 NA
TCGA-AA-3514 2 31 0 NA
TCGA-AA-3517 2 31 0 NA
> dim(x[is.na(x[, 3]) & is.na(x[, 4]), ])
[1] 0 4
> status<-rep(1,237)
> status[which(is.na(x[, 4]))] <- 0
> x[is.na(x[, 4]), 4] <- x[which(is.na(x[, 4]), 4), 2]
> x1<-cbind(x,status)
> head(x1)
VitalStatus DaysKnownAlive DaysLastFollowUp DaysDeath status
TCGA-A6-2674 2 523 523 523 0
TCGA-A6-2677 2 541 541 541 0
TCGA-A6-2678 2 437 437 437 0
TCGA-A6-2683 2 472 472 472 0
TCGA-AA-3514 2 31 0 31 0
TCGA-AA-3517 2 31 0 31 0
> surv <- Surv(as.numeric(x1[, 4]), as.numeric(x1[, 5]))
# Kaplan Meier curve:
> plot(survfit(surv ~ 1), xlab = "days to death", ylab = "Probability", main = "Kaplan-Meier survival curve for TCGA \n COLON (COAD+READ) (237 patients)")
# Correlation of survival with batch:
> plot(survfit(surv ~ k[, 10]), xlab = "days to death", ylab = "Probability", col = 1:9, main = "TCGA COLON survival correlation with batch")
> legend(750, 0.4, levels(as.factor(k[,10])), text.col = 1:9,ncol=3)
> summary(coxph(surv ~ k[,10]))
Call:
coxph(formula = surv ~ k[, 10])
n= 237, number of events= 22
coef exp(coef) se(coef) z Pr(>|z|)
k[, 10]0825 -1.802e+01 1.499e-08 1.522e+04 -0.001 0.9991
k[, 10]0904 2.521e+00 1.244e+01 1.238e+00 2.036 0.0417 *
k[, 10]1020 1.686e+00 5.395e+00 1.170e+00 1.441 0.1497
k[, 10]1110 2.491e+00 1.207e+01 1.228e+00 2.028 0.0425 *
k[, 10]1116 1.584e+00 4.875e+00 1.429e+00 1.109 0.2676
k[, 10]A004 -1.790e+01 1.684e-08 4.977e+03 -0.004 0.9971
k[, 10]A00B 7.077e-01 2.029e+00 1.109e+00 0.638 0.5233
k[, 10]A081 4.812e-01 1.618e+00 1.197e+00 0.402 0.6876
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
k[, 10]0825 1.499e-08 6.671e+07 0.0000 Inf
k[, 10]0904 1.244e+01 8.038e-02 1.0989 140.86
k[, 10]1020 5.395e+00 1.853e-01 0.5447 53.44
k[, 10]1110 1.207e+01 8.282e-02 1.0874 134.07
k[, 10]1116 4.875e+00 2.051e-01 0.2962 80.22
k[, 10]A004 1.684e-08 5.940e+07 0.0000 Inf
k[, 10]A00B 2.029e+00 4.928e-01 0.2309 17.83
k[, 10]A081 1.618e+00 6.181e-01 0.1550 16.89
Concordance= 0.743 (se = 0.082 )
Rsquare= 0.083 (max possible= 0.521 )
Likelihood ratio test= 20.5 on 8 df, p=0.008608
Wald test = 9.14 on 8 df, p=0.331
Score (logrank) test = 20.77 on 8 df, p=0.007792{code}
|
Kaplan
...
-
...
Meier
...
curve
...
and
...
correlation
...
with
...
batch:
Data normalization
For the normalization I didn't split the probes into "red" and "green".
...
Approach:
...
perform
...
SVD,
...
identify
...
tech
...
variables
...
that
...
are
...
highly
...
correlated
...
with
...
the
...
first
...
PCs
...
and
...
remove
...
the
...
variables.
...
Use
...
M
...
value
...
to
...
express
...
methylation
...
status
...
of
...
each
...
probes.
...
The
...
M
...
value
...
was
...
calculated
...
in
...
the
...
following
...
way
...
(the
...
code
...
was
...
take
...
from
...
Bioconductor
...
lumi
...
package,
...
functions
...
for
...
the
...
analysis
...
of
...
Illumina
...
Human
...
Methylation
...
27k
...
data):
...
SVD
...
on
...
the
...
whole
...
matrix,
...
relative variance before normalization:
Created a file with technical variables, merged day, month and year into a single variable. Tech file, head:
Code Block |
---|
> head(k)
row.names bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom
1 63 TCGA-A6-2674-01A TCGA-A6-2674-01A-02D-0820-05 26.7 uL 0096500431
2 150 TCGA-A6-2677-01A TCGA-A6-2677-01A-01D-0820-05 26.7 uL 0096500430
3 182 TCGA-A6-2678-01A TCGA-A6-2678-01A-01D-0820-05 26.7 uL 0096500407
4 334 TCGA-A6-2683-01A TCGA-A6-2683-01A-01D-0820-05 26.7 uL 0096500406
5 1186 TCGA-AA-3514-01A TCGA-AA-3514-01A-02D-0820-05 26.7 uL 0096500383
6 1254 TCGA-AA-3517-01A TCGA-AA-3517-01A-01D-0820-05 26.7 uL 0096500382
concentration shipment plate_column plate_row batch shortID
1 0.14 ug/uL 22-2-2010 1 B 0820 TCGA-A6-2674
2 0.15 ug/uL 22-2-2010 1 C 0820 TCGA-A6-2677
3 0.15 ug/uL 22-2-2010 1 D 0820 TCGA-A6-2678
4 0.16 ug/uL 22-2-2010 1 E 0820 TCGA-A6-2683
5 0.15 ug/uL 22-2-2010 1 F 0820 TCGA-AA-3514
6 0.13 ug/uL 22-2-2010 1 G 0820 TCGA-AA-3517{code}
|
Ignore
...
bcr_sample_barcode,
...
bcr_aliquot_barcode,
...
these
...
are
...
not
...
tech
...
variables.
...
Here
...
are
...
the
...
correlation
...
of
...
the
...
tech
...
variables
...
with
...
the
...
first
...
4
...
PCs.
...
Code Block |
---|
> k
bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom concentration shipment
V1 0.4700026 0.4877315 0.04672452 0.3017151 1.601134e-07 8.334984e-34
V2 0.4510048 0.4877315 0.48950693 0.5316963 3.392774e-01 5.699460e-03
V3 0.4510048 0.4877315 0.18777379 0.6925323 1.141026e-01 7.733551e-02
V4 0.4513158 0.4877315 0.58958142 0.6705972 6.211816e-01 6.664406e-02
plate_column plate_row batch shortID
V1 3.009061e-05 0.001491552 2.491659e-33 0.4702436
V2 1.947964e-01 0.989450736 1.349671e-04 0.4511082
V3 9.369050e-01 0.463424193 1.437189e-01 0.4511082
V4 3.395272e-01 0.688070623 1.012497e-01 0.4514159
{code}
|
Outliers
...
of
...
the
...
first
...
principal
...
component:
...
Seems that the batch has the biggest effect on the data. Remove the batch, relative variance after that:
Calculate correlation with the clinical variables after removing the batch effect:
Code Block |
---|
> k
bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom concentration shipment
V1 0.4691690 0.4877315 0.6106733 0.5489680 0.4263794 0.9956323
V2 0.4534001 0.4877315 0.8880134 0.7069529 0.4187061 0.9950998
V3 0.4536645 0.4877315 0.8589996 0.7271536 0.4000974 0.9856611
V4 0.4543147 0.4877315 0.8880134 0.6459688 0.9405449 0.9999056
plate_column plate_row batch shortID
V1 0.04191285 0.5627797 0.9981536 0.4690814
V2 0.97317059 0.6821160 0.9999954 0.4536160
V3 0.55783761 0.0703874 0.9992998 0.4537400
V4 0.52248494 0.8718225 0.9999849 0.4543834{code}
|
Seems
...
that
...
this
...
took
...
care
...
of
...
all
...
the
...
problems,
...
there
...
is
...
still
...
a
...
borderline significant correlation of the PC1 with the plate column, not going to worry about it.
Look at the outliers of the first principal component after removing the batch:
Conclusion: consider the data to be normalized. eSet object for this data set is available.