Important update (January 20th, 2011): the data below have been corrected for the BCR batch which is not necessarily the processing batch. The dataset needs to be reanalyzed.
Correlation between BCR batch and the processing batch for 27k arrays (January 20, 2012)
Analysis of batch vs clinical traits
Number of clinical traits: 84
Number of batches based on tumor DNA methylation data (samples retrieved according to this pattern: "TCGA-..-....-0..-..D-....-05"): 24
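The selection step above can be sketched in R as follows. This is a minimal illustration, not the code used for the analysis: the barcodes are made up, and taking the batch ID from the sixth barcode field is an assumption (the notes only state that the center is the second field).

```r
# Hypothetical barcodes for illustration only (not real samples).
barcodes <- c("TCGA-A1-A0SB-01A-11D-A142-05",
              "TCGA-BH-A0B3-11A-23D-A032-05",  # sample code 11: not matched by "-0.."
              "TCGA-A2-A04N-01A-11R-A115-07",  # RNA aliquot from another center: not matched
              "TCGA-E2-A14P-01A-31D-A12R-05")

# Pattern quoted in the notes: tumor sample (0..), DNA analyte (D), center 05.
pattern <- "TCGA-..-....-0..-..D-....-05"
tumor <- grep(pattern, barcodes, value = TRUE)

fields  <- strsplit(tumor, "-", fixed = TRUE)
two     <- sapply(fields, `[`, 2)  # center: second field of the barcode
batchID <- sapply(fields, `[`, 6)  # ASSUMPTION: batch ID is the sixth field

# Cross-tabulate batch against center.
table(batchID, two)
```

Only the first and last barcodes survive the filter here, so the resulting table has two batches and two centers.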
Correlation between center and batches ('two' = center (second field in the patient barcode)):
```
> table(batchID,two)
       two
batchID A1 A2 A7 A8 AC AN AO AQ AR B6 BH C8 D8 E2 E9 EW GI GM HN
   A00Y  0  3  7 66  0 12  2  0  0  0  4  0  0  0  0  0  0  0  0
   A032  0 22  0 14  0 19 10  1  0 16 10  0  0  0  0  0  0  0  0
   A058  0  2  1  3  0  0 11  0  0  4 26  0  0  0  0  0  0  0  0
   A088  7 14  0  1  0  0  1  0  8  9  7  0  0  0  0  0  0  0  0
   A10A  0 12  0  0  0  9  0  0  5  9  4  0  0  0  0  0  0  0  0
   A10N  0  1  0  0  0  1 12  2  0  1  1  0  0  9  0  0  0  0  0
   A10P  7 16  1  4  0  0 12  0  8 13 32  0  0  0  0  0  0  0  0
   A112  1 11  1  0  0  0  5  0  3  5 18 20  9 15  0  0  0  0  0
   A12E  0  0  0  0  0  0  0  0  1  0 20  2  0 21  0  0  0  0  0
   A12R  0  0  3  0  0  0  0  0 15  0 14  0  0 11  0  0  0  0  0
   A138  0  0  0  0  0  0  0  0  1  0  7  6  0  2  0  0  0  0  0
   A13K  0  7  1  0  0  0  5  2  0  3 18  4 19  6  0  8  0  0  0
   A145  6  0  0  0  0  0  1  0  0  0  0  0  0 10  4 19  0  0  0
   A148  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   A14H  0  1  0  0  0  0  0  0  0  0  1  0  7  5 10  1  0  0  0
   A14N  0  0  0  0  0  0  0  1  0  1  2  0 20  1  6  1  0  0  0
   A161  0  0  0  0  2  0  0  0  0  1  3  0  3  2 17  0  0  0  0
   A16A  0  4  6  0  1  0  0  0 22  0  1  3  0  0  9  0  0  0  0
   A16G  0  3  0  0  0  0  0  0  0  0  1  8 13  0  3  0  1  0  0
   A17F  0  0  0  0  4  0  0  0  0  0  0  0  1  0  0  3  0  0  0
   A17Z  0  0  0  0  2  0  0  0  4  0  0  0  0  0  1  0  0  6  0
   A18O  0  0  0  0  1  0  0  0  5  1  1  0  0  0  1  0  0  7  1
   A19F  0  2  0  1  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0
   A19Z  0  0  0  0  5  0  0  0  1  0  1  0  0  1  0  0  0  0  0
```

Significant batch-clinical traits correlations (the entire list can be found in the attachment BatchClinicalInfoCorrelationsBRCA.csv):

| BRCA_clinical_traits | DataType | NumberOfNAs | Test | Pvalue |
|---|---|---|---|---|
| tissue_prospective_collection_indicator | factor | 35 | Pearson's Chi-squared test | 4.47E-62 |
| tissue_retrospective_collection_indicator | factor | 35 | Pearson's Chi-squared test | 4.47E-62 |
| year_of_initial_pathologic_diagnosis | integer | 34 | Kruskal-Wallis rank sum test | 3.15E-32 |
| breast_carcinoma_first_surgical_procedure_name | factor | 54 | Pearson's Chi-squared test | 5.45E-32 |
| days_to_last_followup | integer | 73 | Kruskal-Wallis rank sum test | 3.07E-31 |
| days_to_form_completion | integer | 34 | Kruskal-Wallis rank sum test | 5.70E-31 |
| first_pathologic_diagnosis_biospecimen_acquisition_method_type | factor | 123 | Pearson's Chi-squared test | 3.39E-28 |
| breast_tumor_clinical_m_stage | factor | 35 | Pearson's Chi-squared test | 1.06E-22 |
| axillary_lymph_node_stage_method_type | factor | 223 | Pearson's Chi-squared test | 9.33E-19 |
| breast_tumor_pathologic_n_stage | factor | 34 | Pearson's Chi-squared test | 2.19E-17 |
| lab_proc_her2_neu_immunohistochemistry_receptor_status | factor | 41 | Pearson's Chi-squared test | 6.22E-16 |
| breast_carcinoma_estrogen_receptor_status | factor | 34 | Pearson's Chi-squared test | 1.85E-13 |
| breast_carcinoma_progesterone_receptor_status | factor | 34 | Pearson's Chi-squared test | 8.87E-13 |
| vital_status | factor | 34 | Pearson's Chi-squared test | 2.38E-09 |
| anatomic_site_location_descriptor | factor | 119 | Pearson's Chi-squared test | 1.03E-07 |
| age_at_initial_pathologic_diagnosis | integer | 34 | Kruskal-Wallis rank sum test | 5.87E-06 |
| days_to_birth | integer | 34 | Kruskal-Wallis rank sum test | 6.68E-06 |
| lab_procedure_her2_neu_in_situ_hybrid_outcome_type | factor | 194 | Pearson's Chi-squared test | 3.18E-05 |
| person_menopause_status | factor | 161 | Pearson's Chi-squared test | 5.70E-05 |
| breast_tumor_pathologic_grouping_stage | factor | 40 | Pearson's Chi-squared test | 7.40E-05 |
| her2_immunohistochemistry_level_result | factor | 351 | Pearson's Chi-squared test | 1.72E-04 |
| breast_tumor_pathologic_t_stage | factor | 34 | Pearson's Chi-squared test | 2.82E-04 |
| pos_finding_lymph_node_hematoxylin_and_eosin_staining_microscopy_count | integer | 177 | Kruskal-Wallis rank sum test | 6.49E-04 |
| cytokeratin_immunohistochemistry_staining_method_micrometastasis_indicator | factor | 324 | Pearson's Chi-squared test | 8.61E-04 |
| person_neoplasm_cancer_status | factor | 284 | Pearson's Chi-squared test | 7.95E-03 |
| breast_cancer_optical_measurement_histologic_type | factor | 34 | Pearson's Chi-squared test | 1.47E-02 |
| disease_surgical_margin_status | factor | 82 | Pearson's Chi-squared test | 3.70E-02 |
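For reference, a table of this shape (trait, data type, number of NAs, test, p-value) could be produced along the following lines. This is only a sketch on simulated data: the helper `test_trait` and the simulated `batch` and `clinical` objects are hypothetical, and the rule "chi-squared test for factors, Kruskal-Wallis for integer traits" is inferred from the Test column rather than taken from the original script.

```r
set.seed(1)

# Simulated data for illustration only.
batch <- factor(sample(c("A00Y", "A032", "A058"), 120, replace = TRUE))
clinical <- data.frame(
  vital_status = factor(sample(c("LIVING", "DECEASED"), 120, replace = TRUE)),
  age_at_initial_pathologic_diagnosis = sample(30:90, 120, replace = TRUE)
)

# Hypothetical helper: test one clinical trait against batch.
test_trait <- function(trait, batch) {
  ok <- !is.na(trait)
  if (is.factor(trait)) {
    # Categorical trait: chi-squared test on the trait-by-batch table.
    tab <- table(droplevels(trait[ok]), droplevels(batch[ok]))
    res <- suppressWarnings(chisq.test(tab))
  } else {
    # Numeric trait: Kruskal-Wallis test across batches.
    res <- kruskal.test(trait[ok], batch[ok])
  }
  data.frame(DataType = class(trait)[1], NumberOfNAs = sum(!ok),
             Test = res$method, Pvalue = res$p.value)
}

results <- do.call(rbind, lapply(clinical, test_trait, batch = batch))
results[order(results$Pvalue), ]
```

With many traits tested at once (84 here), the raw p-values would normally be adjusted for multiple testing, for example with `p.adjust`, before calling a correlation significant.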
Correlation with survival
Relevant clinical traits (column index in parentheses): days to the last follow-up (27), vital status (83), days to death (24), days to last known alive (28). Summaries:
```
> summary(clinical[,27]) # days to the last follow up
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0   140.0   457.0   815.8  1194.0  6795.0    73.0
> table(clinical[,83]) # vital status
DECEASED   LIVING
      93      725
> summary(clinical[,24]) # days to death
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    157     811    1563    1744    2520    4456     759
> summary(clinical[,28]) # days to last known alive
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0   293.5   607.5  1068.0  1442.0  6795.0   508.0
```
As with the combined colon cancer datasets, days to last known alive looks similar to days to the last follow-up. However, days to the last follow-up contains more information (fewer NAs), so I use it to construct the survival object. No patient is missing both days to the last follow-up and days to death. The survival object was created in the same way as for the analyses of the other TCGA cancer datasets. Info is available (here and here)
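A minimal sketch of how such a survival object could be built is shown below. The details are assumptions, since the notes only point to other pages for the procedure: here deceased patients are timed by days to death and living patients are censored at days to the last follow-up. The toy data frame `clin` and its values are made up.

```r
library(survival)

# Toy clinical records for illustration only.
clin <- data.frame(
  vital_status          = c("LIVING", "DECEASED", "LIVING", "DECEASED"),
  days_to_last_followup = c(457, NA, 1194, 300),
  days_to_death         = c(NA, 811, NA, 1563)
)

# Event indicator: 1 = death observed, 0 = censored.
event <- as.integer(clin$vital_status == "DECEASED")

# ASSUMPTION: time is days to death for deceased patients,
# otherwise days to the last follow-up.
time <- ifelse(event == 1, clin$days_to_death, clin$days_to_last_followup)

surv <- Surv(time, event)
surv
```

Because no patient lacks both time fields, this rule leaves no missing survival times in the toy data; the same check (`any(is.na(time))`) is worth running on the real table.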
Kaplan-Meier curve and survival plots broken down by batch:
Here is the summary of the survival vs batch:
```
> summary(coxph(surv~methM[,2]))
Call:
coxph(formula = surv ~ methM[, 2])

  n= 818, number of events= 93
   (34 observations deleted due to missingness)

                     coef  exp(coef)   se(coef)      z Pr(>|z|)
methM[, 2]A032 -6.961e-01  4.985e-01  5.148e-01 -1.352   0.1763
methM[, 2]A058 -2.321e+00  9.819e-02  1.084e+00 -2.140   0.0323 *
methM[, 2]A088 -5.512e-01  5.763e-01  5.413e-01 -1.018   0.3086
methM[, 2]A10A -2.238e-01  7.995e-01  5.565e-01 -0.402   0.6876
methM[, 2]A10N -1.455e+00  2.334e-01  8.217e-01 -1.771   0.0766 .
methM[, 2]A112 -9.643e-01  3.812e-01  5.847e-01 -1.649   0.0991 .
methM[, 2]A12E  9.108e-01  2.486e+00  5.015e-01  1.816   0.0694 .
methM[, 2]A12R -1.794e+00  1.663e-01  1.082e+00 -1.657   0.0975 .
methM[, 2]A138  8.921e-01  2.440e+00  5.487e-01  1.626   0.1040
methM[, 2]A13K  2.542e-01  1.289e+00  4.825e-01  0.527   0.5983
methM[, 2]A145 -9.748e-01  3.773e-01  1.081e+00 -0.902   0.3673
methM[, 2]A14H  3.164e-01  1.372e+00  8.215e-01  0.385   0.7002
methM[, 2]A14N  9.591e-01  2.609e+00  1.089e+00  0.880   0.3786
methM[, 2]A161 -1.382e-01  8.709e-01  7.183e-01 -0.192   0.8475
methM[, 2]A16A -1.347e+00  2.600e-01  8.206e-01 -1.642   0.1007
methM[, 2]A16G -1.714e+01  3.615e-08  5.030e+03 -0.003   0.9973
methM[, 2]A17F -1.731e+01  3.039e-08  1.338e+04 -0.001   0.9990
methM[, 2]A17Z -1.725e+01  3.216e-08  4.025e+03 -0.004   0.9966
methM[, 2]A18O -1.434e+00  2.383e-01  1.088e+00 -1.318   0.1875
methM[, 2]A19Z         NA         NA  0.000e+00     NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

               exp(coef) exp(-coef) lower .95 upper .95
methM[, 2]A032 4.985e-01  2.006e+00   0.18176    1.3672
methM[, 2]A058 9.819e-02  1.018e+01   0.01172    0.8223
methM[, 2]A088 5.763e-01  1.735e+00   0.19946    1.6649
methM[, 2]A10A 7.995e-01  1.251e+00   0.26858    2.3797
methM[, 2]A10N 2.334e-01  4.284e+00   0.04664    1.1685
methM[, 2]A112 3.812e-01  2.623e+00   0.12121    1.1991
methM[, 2]A12E 2.486e+00  4.022e-01   0.93037    6.6440
methM[, 2]A12R 1.663e-01  6.012e+00   0.01993    1.3879
methM[, 2]A138 2.440e+00  4.098e-01   0.83256    7.1522
methM[, 2]A13K 1.289e+00  7.755e-01   0.50082    3.3197
methM[, 2]A145 3.773e-01  2.651e+00   0.04531    3.1409
methM[, 2]A14H 1.372e+00  7.288e-01   0.27424    6.8653
methM[, 2]A14N 2.609e+00  3.832e-01   0.30851   22.0693
methM[, 2]A161 8.709e-01  1.148e+00   0.21309    3.5598
methM[, 2]A16A 2.600e-01  3.846e+00   0.05206    1.2985
methM[, 2]A16G 3.615e-08  2.766e+07   0.00000       Inf
methM[, 2]A17F 3.039e-08  3.290e+07   0.00000       Inf
methM[, 2]A17Z 3.216e-08  3.109e+07   0.00000       Inf
methM[, 2]A18O 2.383e-01  4.196e+00   0.02825    2.0108
methM[, 2]A19Z        NA         NA        NA        NA

Rsquare= 0.069   (max possible= 0.67 )
Likelihood ratio test= 58.61  on 19 df,   p=6.4e-06
Wald test            = 50  on 19 df,   p=0.0001311
Score (logrank) test = 69.46  on 19 df,   p=1.129e-07

Warning messages:
1: In fitter(X, Y, strats, offset, init, control, weights = weights,  :
  Loglik converged before variable  16,17,18 ; beta may be infinite.
2: In coxph(surv ~ methM[, 2]) :
  X matrix deemed to be singular; variable 20
```
The output seems to contain a lot of errors, and I wonder why. I also don't understand where the observations that are "deleted due to missingness" come from. Need to ask someone to help clarify this output. Update (January 5, 2012): there were NAs for some batches because factor levels were left in the batch vector with no data for those levels. Fixed that problem. "Deleted due to missingness" is also fixed: I figured out that I need to be more careful about using 'match' for subsetting.
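The two pitfalls mentioned in the update can be reproduced on toy data. The sketch below is one plausible reading of the fix, not the original code; the patient IDs and the variable names (`batch_sub`, `meth_ids`, `clin_ids`) are hypothetical.

```r
# 1. Unused factor levels survive subsetting and give NA coefficients in a model.
batch <- factor(c("A00Y", "A032", "A19Z", "A032"))
keep  <- batch != "A19Z"
batch_sub <- batch[keep]
levels(batch_sub)                  # "A19Z" is still listed although no data remain
batch_sub <- droplevels(batch_sub) # drop levels that have no observations
levels(batch_sub)

# 2. match() returns NA for IDs without a partner; indexing with those NAs
#    creates NA rows, which are later "deleted due to missingness".
meth_ids <- c("P1", "P2", "P3", "P4")  # hypothetical IDs with methylation data
clin_ids <- c("P3", "P1", "P5")        # hypothetical IDs with clinical data
idx <- match(meth_ids, clin_ids)       # NA where no clinical record exists
has_clin <- !is.na(idx)

# Keep only patients present in both tables, in the same order.
meth_ids[has_clin]
clin_ids[idx[has_clin]]
```

After both steps the batch factor has only levels with data, and the two ID vectors line up element by element.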
DNA methylation data
December 21st, 2011: 27k and 450k arrays are available. Downloaded the Level 1 450k data. It seems that they have started splitting green and red probes into 2 separate files, and they now also provide Illumina's idat files, which contain the bead-level data (these are not tab-delimited files). I need to find a way to process them; it seems that the Bioconductor beadarray package can be used to read these files and do some bead-level normalization (summarization too?). The Level 2 data are already summarized and normalized (tab-delimited files with CpG ID, value for methylated and value for unmethylated probes), but they are available for only 91 patients. I also tried to download the 27k arrays available for breast cancer, but those data are available for only ~26 patients (they stopped running those arrays?). I guess I need to figure out how to process the Level 1 data.