Important update (January 20th, 2011): the data below have been corrected for the BCR batch, which is not necessarily the processing batch. The dataset needs to be reanalyzed.
Correlation between BCR batch and the processing batch for 27k arrays (January 20, 2012):
Batch 1,2,0186
Batch 2,3,0199
Batch 3,4,0218
Batch 4,no data,
Batch 5,no data,
Batch 6,no data,
Batch 7,no data,
Batch 8,no data,
Batch 10,1,0392
Batch 16,5,0521
Batch 20,6,0595
Batch 26,7,0788
Batch 38,8,0915
Batch 62,9,1228
Batch 79,no data,
Batch 111,no data,
Batch 130,no data,
Batch 174,no data,
Batch vs clinical traits
Number of clinical traits: 31
Number of batches based on DNA methylation data: 19
Relationship between batch and the center:
Relationship between batch and clinical variable, significant correlations (entire table can be found here)
year_of_initial_pathologic_diagnosis,integer,35,Kruskal-Wallis rank sum test,1.34E-64
pretreatment_history,factor,35,Pearson's Chi-squared test,2.98E-29
histological_type,factor,35,Pearson's Chi-squared test,6.11E-20
initial_pathologic_diagnosis_method,factor,37,Pearson's Chi-squared test,5.24E-15
vital_status,factor,36,Pearson's Chi-squared test,8.36E-15
hormonal_therapy,factor,59,Pearson's Chi-squared test,6.50E-14
targeted_molecular_therapy,factor,65,Pearson's Chi-squared test,5.80E-10
additional_pharmaceutical_therapy,factor,72,Pearson's Chi-squared test,4.19E-05
additional_drug_therapy,factor,73,Pearson's Chi-squared test,5.08E-05
days_to_last_followup,integer,35,Kruskal-Wallis rank sum test,2.90E-04
person_neoplasm_cancer_status,factor,88,Pearson's Chi-squared test,4.81E-04
additional_chemo_therapy,factor,106,Pearson's Chi-squared test,5.18E-03
days_to_death,integer,169,Kruskal-Wallis rank sum test,6.30E-03
days_to_birth,integer,35,Kruskal-Wallis rank sum test,1.13E-02
age_at_initial_pathologic_diagnosis,integer,35,Kruskal-Wallis rank sum test,1.20E-02
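The table above pairs each trait type with a test: Kruskal-Wallis for integer traits and Pearson's chi-squared for factor traits. A minimal sketch of that dispatch follows, using only the Python standard library. The function names, the use of None for missing values, and the decision to drop missing values are assumptions made for illustration; this is not the code behind the table. Each function returns the test statistic and its degrees of freedom; the p-value is the upper tail of a chi-squared distribution with that many degrees of freedom.

```python
from collections import Counter, defaultdict


def kruskal_wallis(values, batches):
    """Return (H, df) for a numeric trait across batches, with tie correction."""
    pairs = sorted(zip(values, batches), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied values share the average of their 1-based ranks i+1..j.
        mid_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = mid_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sums = defaultdict(float)
    counts = Counter()
    for (_, batch), rank in zip(pairs, ranks):
        rank_sums[batch] += rank
        counts[batch] += 1
    h = 12 / (n * (n + 1)) * sum(rank_sums[b] ** 2 / counts[b] for b in counts)
    h -= 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)  # undefined if every value is identical
    return h, len(counts) - 1


def pearson_chi2(levels, batches):
    """Return (statistic, df) for a factor trait cross-tabulated with batch."""
    observed = Counter(zip(levels, batches))
    row_totals = Counter(levels)
    col_totals = Counter(batches)
    n = len(levels)
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)


def batch_association(trait_values, batches, trait_type):
    """Pick the test by trait type; missing values (None) are dropped first."""
    kept = [(v, b) for v, b in zip(trait_values, batches) if v is not None]
    values = [v for v, _ in kept]
    groups = [b for _, b in kept]
    if trait_type == "integer":
        return ("Kruskal-Wallis rank sum test",) + kruskal_wallis(values, groups)
    if trait_type == "factor":
        return ("Pearson's Chi-squared test",) + pearson_chi2(values, groups)
    raise ValueError("unsupported trait type: %r" % (trait_type,))
```

With many batches and sparse factor levels, the expected counts in the chi-squared test can be small, so the resulting p-values should be read with some caution.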
Survival vs batch
Code for automatic analysis of survival and correlation with clinical traits can be found here.
Kaplan Meier Curve and survival by batch:
A summary of the Cox proportional hazards model can be found here. Batch is significantly associated with survival (likelihood ratio test = 31 on 17 df, p = 0.02; Wald test = 28.17 on 17 df, p = 0.04297; score (logrank) test = 29.48 on 17 df, p = 0.03035).
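As a sanity check on the three tests quoted above, each statistic is referred to a chi-squared distribution with 17 degrees of freedom. The sketch below recomputes the p-values from the reported statistics using the closed-form series for integer degrees of freedom. The helper name chi2_sf is hypothetical, and this only reproduces the last step of the test, not the Cox model fit itself.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term = 1.0
        total = 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term = chi
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(chi / math.sqrt(2))
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


# Statistics as reported in the model summary, all on 17 df.
reported = {"likelihood ratio": 31.0, "Wald": 28.17, "score (logrank)": 29.48}
p_values = {name: chi2_sf(stat, 17) for name, stat in reported.items()}
for name, p in p_values.items():
    print(name, round(p, 5))
```

Because the statistics are quoted to two decimals (and the likelihood ratio statistic to none), the recomputed p-values can only be expected to agree with the reported ones to about that precision.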
A note for this and every other TCGA cancer type that I have analyzed or will analyze: the p-values for the association between batch and the clinical traits are computed over ALL batches, but DNA methylation (or other) data may not yet be available for every batch. For example, for GBM I have 9 batches for the 286 downloaded patients. The association between these batches and the clinical traits is still significant, although the p-values may be larger; for example, the p-value for the association between histological type and batch is 3.9e-11.
DNA methylation
Downloaded the data in the last week of December: 27k arrays, Level 1, 294 patients. The file format is unusual: the methylated and unmethylated probe intensities are in the first and second columns, unlike the format used for the other datasets.
Compared the list of patients with DNA methylation data against the technical info; technical info is available for only 286 patients, so the analysis is restricted to those. Did not split the probes into "green" and "red" subsets.
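For the two-column layout described above, a minimal parsing sketch is shown here. It assumes tab-separated text with the methylated intensity in the first field and the unmethylated intensity in the second; the header names and sample values are made up. Converting to beta values with an offset of 100 follows the usual Illumina convention and is an assumption, since the notes do not say which methylation measure was used.

```python
def read_intensities(text):
    """Return (methylated, unmethylated) pairs from the first two fields of each row."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        try:
            meth, unmeth = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip header or malformed rows
        rows.append((meth, unmeth))
    return rows


def beta_value(meth, unmeth, offset=100.0):
    """Beta = M / (M + U + offset); the offset stabilizes low-intensity probes."""
    return meth / (meth + unmeth + offset)


# Hypothetical file content, for illustration only.
sample = "Methylated\tUnmethylated\n900\t0\n300\t600\n0\t900\n"
betas = [beta_value(m, u) for m, u in read_intensities(sample)]
```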
Technical variables:
Correlations with the first principal components:
Converting concentration to a continuous variable and correlating it with the principal components showed that it is correlated with PC1 (p-value = 1.117e-06) but not with PC2 (p-value = 0.5189).
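The check described above can be sketched as follows: compute PC1 scores for the samples, then correlate them with concentration. This is an illustrative standard-library version that finds PC1 by power iteration on the sample-by-sample Gram matrix. The function names and toy numbers are assumptions, and the test behind the reported p-values is not stated in these notes, so only the correlation coefficient is computed here.

```python
import math


def pc1_scores(matrix, iterations=200):
    """PC1 scores (up to sign and scale) for rows = samples, columns = probes."""
    n = len(matrix)
    p = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # The top eigenvector of the n x n Gram matrix is proportional to the PC1 scores.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            return [0.0] * n
        vec = [x / norm for x in new]
    return vec


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: 4 samples x 2 probes, and a made-up concentration per sample.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
concentration = [10.0, 20.0, 30.0, 40.0]
r = pearson_r(pc1_scores(data), concentration)
```

The sign of a principal component is arbitrary, so only the magnitude of r is meaningful. For real 27k data a numerical library would be the practical choice; the Gram-matrix trick is used here only because the number of samples is far smaller than the number of probes.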
Begin by removing batch:
Correlation with technical variables:
In addition, removing batch also removed the correlation with concentration (the p-value was calculated by treating concentration as a continuous variable). There still seems to be some correlation with the plate row (PC1). Remove batch and the plate row:
It looks like there is a slight decrease in the relative variance after removing both variables. Also, I swear, the unnormalized data look better than the normalized data! Correlation with the tech variables:
In addition, after removing the batch and the plate row I tested for correlation with the center: no correlation (p-value = 0.2305).
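The notes do not say how batch and plate row were removed. One simple possibility, sketched here as an assumption, is to take the residuals of an additive model with both factors for each probe. The code below does this by repeatedly centering the values within each factor (backfitting), which converges to those residuals. The function names and toy values are hypothetical.

```python
from collections import defaultdict


def center_within(values, groups):
    """Subtract each group's mean from its members."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def remove_factors(values, factors, sweeps=50):
    """Residuals of one probe after removing additive effects of several factors.

    Alternating the within-group centering over all factors converges to the
    least-squares residuals of the additive model; `sweeps` caps the number of
    passes rather than testing for convergence.
    """
    resid = list(values)
    for _ in range(sweeps):
        for groups in factors:
            resid = center_within(resid, groups)
    return resid


# Toy probe measured on four samples: batch effect + plate-row effect, no noise.
batch = ["b1", "b1", "b2", "b2"]
plate_row = ["A", "B", "A", "B"]
probe = [1.1, 1.3, 5.1, 5.3]
cleaned = remove_factors(probe, [batch, plate_row])
```

In this toy case the probe is purely additive in batch and plate row, so the cleaned values are all close to zero. With a single factor the procedure reduces to subtracting the per-group mean.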
Accept this normalization. ExpressionSet is available.