Important update (January 20th, 2011): the data below have been corrected for the BCR batch, which is not necessarily the processing batch. The dataset needs to be reanalyzed.
Correlation between BCR batch and the processing batch for 27k arrays (January 20, 2012):
Batch vs clinical traits
Number of clinical traits: 31
Number of batches based on DNA methylation data: 19
Relationship between batch and the center:
```
> table(batchID, two)
       two
batchID 02 06 08 12 14 15 16 19 26 27 28 32 41 74 76 81 87
   0186 25  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0199 17  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0218  0 28  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
# rows for batches 0242 through 1481 are unreadable in the exported page
   1697  0 13  0  0  1  1  0 10  1  0  3  2  1  0  5  1  1
   1844  0 10  0  0  5  0  0  1  2  0  1  0  1  6 10  1  0
   2004  0  7  0  0  3  0  0  4  0  0  0  0  0  0  0  0  0
```
Relationship between batch and clinical variables, significant correlations (the entire table is in the attachment BatchClinicalInfoCorrelationsGBM.txt):

GBM_clinical | DataType | NumberOfNAs | Test | Pvalue
---|---|---|---|---
year_of_initial_pathologic_diagnosis | integer | 35 | Kruskal-Wallis rank sum test | 1.34E-64
pretreatment_history | factor | 35 | Pearson's Chi-squared test | 2.98E-29
histological_type | factor | 35 | Pearson's Chi-squared test | 6.11E-20
initial_pathologic_diagnosis_method | factor | 37 | Pearson's Chi-squared test | 5.24E-15
vital_status | factor | 36 | Pearson's Chi-squared test | 8.36E-15
hormonal_therapy | factor | 59 | Pearson's Chi-squared test | 6.50E-14
targeted_molecular_therapy | factor | 65 | Pearson's Chi-squared test | 5.80E-10
additional_pharmaceutical_therapy | factor | 72 | Pearson's Chi-squared test | 4.19E-05
additional_drug_therapy | factor | 73 | Pearson's Chi-squared test | 5.08E-05
days_to_last_followup | integer | 35 | Kruskal-Wallis rank sum test | 2.90E-04
person_neoplasm_cancer_status | factor | 88 | Pearson's Chi-squared test | 4.81E-04
additional_chemo_therapy | factor | 106 | Pearson's Chi-squared test | 5.18E-03
days_to_death | integer | 169 | Kruskal-Wallis rank sum test | 6.30E-03
days_to_birth | integer | 35 | Kruskal-Wallis rank sum test | 1.13E-02
age_at_initial_pathologic_diagnosis | integer | 35 | Kruskal-Wallis rank sum test | 1.20E-02
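The tests named for the clinical traits follow the data type: Kruskal-Wallis for integer traits and Pearson's chi-squared for factors. A minimal sketch of that rule is below. The helper `batch_trait_pvalue` and the simulated data are hypothetical and are not taken from the attached analysis script.

```r
# Hypothetical helper: choose the test from the trait's type.
batch_trait_pvalue <- function(trait, batch) {
  keep <- !is.na(trait) & !is.na(batch)   # drop patients with missing values
  trait <- trait[keep]
  batch <- droplevels(factor(batch[keep]))
  if (is.numeric(trait)) {
    # integer/numeric trait: Kruskal-Wallis rank sum test across batches
    kruskal.test(trait, batch)$p.value
  } else {
    # factor trait: Pearson's chi-squared test on the contingency table;
    # sparse tables can trigger a warning about small expected counts
    suppressWarnings(
      chisq.test(table(droplevels(factor(trait)), batch))$p.value
    )
  }
}

# Simulated stand-in for the clinical table (not real TCGA data)
set.seed(1)
batch <- factor(rep(c("0186", "0199", "0218"), each = 20))
clinical <- data.frame(
  year_of_diagnosis = c(rnorm(20, 2003), rnorm(20, 2006), rnorm(20, 2009)),
  vital_status = factor(sample(c("LIVING", "DECEASED"), 60, replace = TRUE))
)
pvals <- sapply(clinical, batch_trait_pvalue, batch = batch)
print(pvals)
```

In the simulated data the diagnosis year drifts with batch, so its p-value is very small, which is the same pattern as the first row of the real table.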
Survival vs batch
Code for the automatic analysis of survival and correlation with clinical traits is in the attachment SurvivalBasicAnalysis.R.
Kaplan-Meier curve and survival by batch: see KaplanMeierCurveGBM.png and SurvivalByBatchGBM.png.
A summary of the Cox proportional hazards model is in the attachment SurvivalBatchSummaryStatisticsGBM.txt. Batch is significantly associated with survival (likelihood ratio test = 31 on 17 df, p = 0.02; Wald test = 28.17 on 17 df, p = 0.04297; score (logrank) test = 29.48 on 17 df, p = 0.03035).
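The three statistics quoted for the Cox model are the ones that `summary()` reports for a `coxph` fit from the `survival` package. The sketch below shows where they come from, using simulated survival times with three made-up batches. It is an illustration only and may differ from what SurvivalBasicAnalysis.R does.

```r
library(survival)

# Simulated data (not TCGA): the hazard increases from batch A to batch C
set.seed(2)
n <- 90
batch <- factor(rep(c("A", "B", "C"), each = n / 3))
rate <- c(A = 0.01, B = 0.02, C = 0.04)[as.character(batch)]
time <- rexp(n, rate)
status <- rbinom(n, 1, 0.8)   # 1 = death observed, 0 = censored

# Cox proportional hazards model with batch as the only covariate
fit <- coxph(Surv(time, status) ~ batch)
s <- summary(fit)
print(s$logtest)    # likelihood ratio test: statistic, df, p-value
print(s$waldtest)   # Wald test
print(s$sctest)     # score (logrank) test

# Kaplan-Meier estimate per batch; plot(km, col = 1:3) would draw the curves
km <- survfit(Surv(time, status) ~ batch)
```

The degrees of freedom equal the number of batch levels minus one, so 17 df in the summary above corresponds to 18 batches in the model.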
Note, for this and every other TCGA cancer type that I have analyzed or will analyze: the p-values for the association of batch with the clinical traits are computed over ALL batches. However, actual DNA methylation (or other) data may not be available for all batches yet. For GBM, for example, the 286 downloaded patients cover 9 batches. The correlation between these batches and the clinical traits is still significant, although the p-values may be larger; for example, the p-value for histological type vs batch is 3.9e-11.
DNA methylation
Downloaded the data in the last week of December: 27k, Level 1, 294 patients. The file format is unusual: methylated and unmethylated probe intensities are in the first and second columns, unlike the format used for the other datasets.
Compared the list of DNA methylation patients with the technical info: tech info is available for only 286 patients, so stick with those for the analysis. Did not split the probes into "green" and "red" subsets.
Figures: GBM_Mvalue_unnorm_distrib.png, GBM_Mvalue_unnorm_relativeVariance.png, GBM_Mvalue_noNorm_PC1outliers.png.
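The figure names for this section refer to M-values, which are commonly defined as the log2 ratio of methylated to unmethylated intensity. A small sketch under that assumption is below. The column names, the intensities, and the offset of 1 are all hypothetical, and the page does not state the exact formula that was used.

```r
# Hypothetical Level 1 layout: methylated intensity in column 1,
# unmethylated intensity in column 2 (three made-up probes)
level1 <- data.frame(
  methylated   = c(1023, 3, 511),
  unmethylated = c(255, 3, 2047)
)

offset <- 1  # assumed small constant that avoids taking the log of zero
m_value <- log2((level1[[1]] + offset) / (level1[[2]] + offset))
print(m_value)
```

Positive values mean the methylated signal dominates, negative values mean the unmethylated signal dominates, and zero means the two are equal.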
Technical variables:
```
# Correlation of DNA methylation (real data) batches and the center:
> table(two, shortMeth$batchID)

two  0186 0199 0218 0392 0521 0595 0788 0915 1228
  02   25   17    0    0    0    0    5    0    0
  06    0    0   20    7    4    4   12    0    1
  12    0    0    0   10   12    4    0    9    0
  14    0    0    0    6   12   13    3    2    2
  15    0    0    0    0    3    0    0    0    0
  16    0    0    0    5    7    1    0    0    1
  19    0    0    0    0    6   10    0    9    3
  26    0    0    0    0    3    1    0    0    1
  27    0    0    0    0    0    4   13    0    0
  28    0    0    0    0    0    9   10    0    0
  32    0    0    0    0    0    0    4    5   15
  41    0    0    0    0    0    0    0    4    4

# Concentration can probably be treated as a continuous variable:
> shortMeth[1:10,3]
 [1] 0.15 ug/uL  0.15 ug/uL  0.140 ug/uL 0.14 ug/uL  0.14 ug/uL  0.140 ug/uL
 [7] 0.14 ug/uL  0.189 ug/uL 0.167 ug/uL 0.164 ug/uL
81 Levels: 0.071 ug/uL 0.12514558 ug/uL 0.134 ug/uL ... .19 ug/uL

> table(shortMeth$plate_column)
 1  2  3  4  5  6
66 68 64 43 24 21

> table(shortMeth$plate_row)
 A  B  C  D  E  F  G  H
37 36 37 38 37 35 35 31

> table(shortMeth$shortDay)
13-5-2009 13-9-2010 14-4-2010 18-9-2008 20-1-2010 24-8-2009 29-6-2007  3-5-2007
       47        27        29        28        47        46        20        17
 4-4-2007
       25

# Also, the day of shipment is the same as the batch, which I have seen before
> table(shortMeth$batchID, shortMeth$shortDay)
       13-5-2009 13-9-2010 14-4-2010 18-9-2008 20-1-2010 24-8-2009 29-6-2007
  0186         0         0         0         0         0         0         0
  0199         0         0         0         0         0         0         0
  0218         0         0         0         0         0         0        20
  0392         0         0         0        28         0         0         0
  0521        47         0         0         0         0         0         0
  0595         0         0         0         0         0        46         0
  0788         0         0         0         0        47         0         0
  0915         0         0        29         0         0         0         0
  1228         0        27         0         0         0         0         0

       3-5-2007 4-4-2007
  0186        0       25
  0199       17        0
  0218        0        0
  0392        0        0
  0521        0        0
  0595        0        0
  0788        0        0
  0915        0        0
  1228        0        0
```
Correlations with the first principal components:
```
        batchID       amount concentration plate_column  plate_row     shortDay
V1 2.675388e-34 2.305474e-24  0.0000837995  0.002205121 0.33206728 2.675388e-34
V2 3.542743e-02 9.080686e-02  0.8088404440  0.160738153 0.28798646 3.542743e-02
V3 1.232019e-01 9.583923e-02  0.2891811969  0.711142850 0.04714336 1.232019e-01
V4 2.845743e-01 2.493076e-01  0.2479220305  0.353589457 0.45597681 2.845743e-01
V5 2.697899e-02 2.963406e-02  0.4862742532  0.956135668 0.51387714 2.697899e-02
V6 1.639557e-39 6.765313e-13  0.0005379532  0.397065857 0.73169940 1.639557e-39
```
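A table of this shape, with one p-value per principal component and technical variable, can be produced roughly as follows. This is a sketch on simulated data. The use of `prcomp` and of the Kruskal-Wallis test for every variable is an assumption, since the page does not say which test produced the values above, and the rows come out labelled PC1 to PC6 rather than V1 to V6.

```r
set.seed(3)
n <- 60
# Simulated technical annotation (not the real shortMeth table)
tech <- data.frame(
  batchID   = factor(rep(c("0186", "0199", "0218"), each = 20)),
  plate_row = factor(sample(LETTERS[1:8], n, replace = TRUE))
)

# Simulated probes x samples matrix with a batch shift added to every probe
shift <- c(0, 2, 4)[as.integer(tech$batchID)]
mvals <- matrix(rnorm(200 * n), nrow = 200) + rep(shift, each = 200)

# Principal components of the samples (prcomp expects samples in rows)
pcs <- prcomp(t(mvals))$x[, 1:6]

# One Kruskal-Wallis p-value per (principal component, technical variable)
pc_pvalues <- sapply(tech, function(v)
  apply(pcs, 2, function(pc) kruskal.test(pc, v)$p.value))
print(signif(pc_pvalues, 3))
```

Because batch is simulated as the dominant source of variation, the PC1 entry for `batchID` is tiny, which mirrors the V1 row of the real table.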
Converting concentration to a continuous variable and correlating it with the principal components showed that it is correlated with PC1 (p-value = 1.117e-06) but not with PC2 (p-value = 0.5189).
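Concentration is stored as a factor with labels such as "0.15 ug/uL", so it has to be parsed before it can be treated as continuous. A minimal sketch using the first ten values printed in the technical-variables output:

```r
# First ten concentration values as they appear in the technical annotation
conc <- factor(c("0.15 ug/uL", "0.15 ug/uL", "0.140 ug/uL", "0.14 ug/uL",
                 "0.14 ug/uL", "0.140 ug/uL", "0.14 ug/uL", "0.189 ug/uL",
                 "0.167 ug/uL", "0.164 ug/uL"))

# Convert via as.character first: as.numeric() applied directly to a factor
# returns the internal level codes, not the printed numbers
conc_num <- as.numeric(sub(" ug/uL$", "", as.character(conc)))
print(conc_num)
```

With the numeric vector in hand, something like `cor.test(conc_num, pc1)` would test the association with a principal component. Here `pc1` is a hypothetical vector of PC1 scores, and the choice of test is an assumption, since the page does not name the one that was used.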
Begin by removing batch:
Correlation with technical variables:
```
     batchID    amount concentration plate_column  plate_row  shortDay
V1 0.9078327 0.8534161     0.4472358   0.35474247 0.01031398 0.9078327
V2 0.9999772 0.9989092     0.7858814   0.05639337 0.35637921 0.9999772
V3 0.9729759 0.6323455     0.2021242   0.66153708 0.05678137 0.9729759
V4 0.9999986 0.9803822     0.3390278   0.26653534 0.40375589 0.9999986
V5 0.9999998 0.9891846     0.7570124   0.86562477 0.46551409 0.9999998
V6 0.9999999 0.9830042     0.6796671   0.25051874 0.95548597 0.9999999
```
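The page does not say how batch was removed. One common approach, sketched here on simulated data, is to regress each probe on batch and keep the residuals plus the probe's overall mean. Treat this as an assumption about the method, not a description of it.

```r
set.seed(5)
n <- 60
batch <- factor(rep(c("0186", "0199", "0218"), each = 20))

# Simulated probes x samples matrix with a batch shift on every probe
shift <- c(0, 2, 4)[as.integer(batch)]
mvals <- matrix(rnorm(50 * n), nrow = 50) + rep(shift, each = 50)

# Fit probe ~ batch for all probes at once (lm accepts a matrix response,
# with samples in rows), then keep residuals plus each probe's overall mean
fit <- lm(t(mvals) ~ batch)
adjusted <- t(residuals(fit)) + rowMeans(mvals)

# After adjustment every batch has the same mean for every probe
group_means <- sapply(levels(batch),
                      function(b) rowMeans(adjusted[, batch == b]))
print(max(abs(group_means - rowMeans(mvals))))
```

Removing a second variable, such as plate row, would only change the model formula, for example `t(mvals) ~ batch + plate_row` with a hypothetical `plate_row` factor.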
In addition, removing batch also removed the correlation with the concentration (the p-value was calculated by treating the concentration as a continuous variable). It seems that there is still some correlation with the plate row (PC1).
Remove batch and the plate row:
It looks like there is a slight decrease in the relative variance after removing both variables. Also, I swear! Unnormalized data look better than normalized. Correlation with the tech variables:
```
     batchID    amount concentration plate_column plate_row  shortDay
V1 0.9783321 0.8868898     0.4505665   0.27982900 0.9995098 0.9783321
V2 0.9999991 0.9990549     0.7543368   0.04823674 0.9998402 0.9999991
V3 0.9930916 0.7217562     0.2847028   0.68888360 0.9906491 0.9930916
V4 0.9999936 0.9967183     0.3954235   0.24689131 0.9999548 0.9999936
V5 0.9999911 0.9685936     0.7412973   0.95834136 0.9999678 0.9999911
V6 0.9999999 0.9905920     0.6786770   0.22633527 1.0000000 0.9999999
```
In addition, after removing the batch and the plate row I tested for correlation with the center: no correlation (p-value = 0.2305).
Accept this normalization. ExpressionSet is available.