Important update (January 20th, 2011): the data below have been corrected for the BCR batch which is not necessarily the processing batch. The dataset needs to be reanalyzed.
Batch vs clinical traits
Batch vs center:
> table(batchID,center)
       center
batchID B7 BR CD CG D7 EQ F1
   1129  0 31  0  0  0  0  0
   1156  0 12  0 23  0  0  0
   1601  2  0  7 16  3  1  0
   1801  0  9  0  3 10  0  1
   1883  0 11  0  0  5  0  1
Most significant correlations (complete list can be found here)
residual_tumor,factor,8,Pearson's Chi-squared test,8.06E-17
year_of_initial_pathologic_diagnosis,integer,1,Kruskal-Wallis rank sum test,2.88E-13
days_to_form_completion,integer,1,Kruskal-Wallis rank sum test,3.72E-11
days_to_last_followup,integer,1,Kruskal-Wallis rank sum test,5.15E-11
primary_tumor_pathologic_spread,factor,1,Pearson's Chi-squared test,1.68E-06
histological_type,factor,6,Pearson's Chi-squared test,5.82E-06
lymphnode_pathologic_spread,factor,1,Pearson's Chi-squared test,6.09E-05
number_of_lymphnodes_examined,integer,53,Kruskal-Wallis rank sum test,1.25E-04
vital_status,factor,1,Pearson's Chi-squared test,2.95E-03
tumor_stage,factor,31,Pearson's Chi-squared test,3.49E-02
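The test names in the list above suggest that the test was chosen by column type: Pearson's Chi-squared test for factors and the Kruskal-Wallis rank sum test for integer traits. A minimal sketch of such a screen follows, assuming that dispatch rule. The helper `batch_trait_test` and the data frame `clin` are hypothetical, and the values are simulated, not the TCGA data.

```r
# Hypothetical helper: choose the test from the column type.
# Factors/characters -> Pearson's Chi-squared test; numeric -> Kruskal-Wallis.
batch_trait_test <- function(batch, trait) {
  keep  <- !is.na(batch) & !is.na(trait)      # drop incomplete pairs
  batch <- droplevels(factor(batch[keep]))    # drop unused batch levels
  trait <- trait[keep]
  if (is.factor(trait) || is.character(trait)) {
    res <- suppressWarnings(chisq.test(table(batch, droplevels(factor(trait)))))
  } else {
    res <- kruskal.test(trait, batch)
  }
  data.frame(test = res$method, p.value = res$p.value)
}

# Simulated stand-in for the clinical table (assumption, not real values).
set.seed(1)
clin <- data.frame(
  batchID = factor(rep(c("1129", "1156", "1601"), each = 20)),
  residual_tumor = factor(sample(c("R0", "R1", "RX"), 60, replace = TRUE)),
  year_of_initial_pathologic_diagnosis =
    rep(c(2005L, 2007L, 2009L), each = 20) + sample(0:2, 60, replace = TRUE)
)

# Test every trait against batch and sort by p-value.
traits  <- setdiff(names(clin), "batchID")
results <- do.call(rbind, lapply(traits, function(v) {
  data.frame(trait = v, class = class(clin[[v]])[1],
             batch_trait_test(clin$batchID, clin[[v]]))
}))
results <- results[order(results$p.value), ]
print(results)
```

Missing values are dropped pair-wise, so each trait may be tested on a different number of samples.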
Batch vs survival
No correlation with survival. For some reason I got NAs and an error for the last batch, although it is definitely not caused by unused factor levels.
DNA methylation
27k arrays, 66 patients. Create M values; don't split between red and green. SVD:
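The M value and SVD steps can be sketched as below. This is only an illustration under stated assumptions: the intensity matrices `meth` and `unmeth` are simulated, the offset `alpha = 1` is an assumed constant, and probes are treated together regardless of color channel, which is how I read "don't split between red and green".

```r
# Simulated methylated / unmethylated intensities (probes x samples).
# These are stand-ins, not the actual 27k array data.
set.seed(2)
n_probes  <- 500
n_samples <- 66
meth   <- matrix(rexp(n_probes * n_samples, rate = 1 / 2000), n_probes, n_samples)
unmeth <- matrix(rexp(n_probes * n_samples, rate = 1 / 2000), n_probes, n_samples)

# M value: log2 ratio of methylated to unmethylated intensity.
alpha <- 1  # assumed offset to avoid log of zero
mval  <- log2((pmax(meth, 0) + alpha) / (pmax(unmeth, 0) + alpha))

# Center each probe across samples, then decompose.
centered <- mval - rowMeans(mval)
s <- svd(centered)

# Fraction of variance explained by each component.
var_explained <- s$d^2 / sum(s$d^2)

# Columns of s$v hold the per-sample scores of each component.
pcs <- s$v
print(round(var_explained[1:5], 3))
```

Centering by probe before the SVD makes the components describe variation between samples rather than the average methylation level of each probe.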
Summary of the technical variables:
> summary(methS)
 batchID       amount       concentration  plate_column   plate_row
 1129:31   16.9 uL: 1   0.13 ug/uL: 6   1:16         A      :10
 1156:35   26.7 uL:65   0.14 ug/uL:27   2:13         C      : 9
                        0.15 ug/uL:25   3:13         D      : 9
                        0.16 ug/uL: 7   4:10         F      : 9
                        0.17 ug/uL: 1   5: 9         B      : 8
                                        6: 5         E      : 8
                                                     (Other):13
      shortDay
 21-7-2010:31
 28-7-2010:35
So this dataset has only 2 batches. Let's see if they have any correlation with the principal components:
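One way to check each component against batch is a Kruskal-Wallis test on the per-sample scores. The sketch below is an assumption about how such a check could be done, not necessarily what was run here. The score matrix `pcs` is simulated, with a batch shift planted in the second component only.

```r
set.seed(3)
n <- 66
batch <- factor(rep(c("1129", "1156"), times = c(31, 35)))

# Simulated per-sample scores for 10 components (samples x components).
pcs <- matrix(rnorm(n * 10), n, 10)
pcs[, 2] <- pcs[, 2] + 2 * (batch == "1156")  # planted batch effect in PC2

# One Kruskal-Wallis test per component: do the scores differ by batch?
pc_batch_p <- apply(pcs, 2, function(score) kruskal.test(score, batch)$p.value)
names(pc_batch_p) <- paste0("PC", seq_along(pc_batch_p))
print(signif(pc_batch_p, 3))
```

With only two batches the Kruskal-Wallis test is equivalent to a Wilcoxon rank sum test. The same loop works unchanged for technical variables with more levels, such as plate row.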
It looks like the second PC is highly correlated with the batch, as are the 4th and 8th. The second PC explains 10% of the variance in the data. Remove the batch:
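The notes do not say which method was used to remove the batch. A simple option, shown here purely as an assumed sketch, is to regress each probe on the batch factor and keep the residuals plus the probe mean. The matrix `mval` is simulated.

```r
set.seed(4)
n_probes <- 200
batch <- factor(rep(c("1129", "1156"), times = c(31, 35)))

# Simulated M values (probes x samples) with a batch shift in the first 50 probes.
mval <- matrix(rnorm(n_probes * length(batch)), n_probes)
mval[1:50, batch == "1156"] <- mval[1:50, batch == "1156"] + 1.5

# Fit probe ~ batch for all probes at once (samples in rows for lm.fit).
design <- model.matrix(~ batch)
fit    <- lm.fit(design, t(mval))

# Residuals have the batch means removed; add back each probe's overall mean.
adjusted <- t(fit$residuals) + rowMeans(mval)

# After adjustment the two batch means agree for every probe.
batch_means <- sapply(levels(batch), function(b) rowMeans(adjusted[, batch == b]))
print(max(abs(batch_means[, 1] - batch_means[, 2])))
```

This adjustment ignores every other covariate. If batch is associated with a clinical trait, part of that trait's signal is removed along with the batch.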
Removing the batch took care of all the other correlations. I was also wondering about the correlation of batch with the clinical traits in this smaller dataset (actual DNA methylation data, not potential). P-values for the association of batch with histological type: 0.001488 (Chi-square test) and 3.0e-05 (Fisher test); with residual tumor: 7.465e-07 (Chi-square test) and 6.536e-09 (Fisher test); with tumor stage: 0.04773 (Chi-square test) and 0.009894 (Fisher test). There was no significant correlation with tumor grade.
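Both tests are reported because the Chi-square approximation can be unreliable when some cells of a small table have low expected counts, while the Fisher test is exact. A minimal sketch of the comparison on a made-up 2 x 3 table follows. The counts and the labels `typeA` to `typeC` are hypothetical, and only the row totals (31 and 35) match the batch sizes above.

```r
# Hypothetical batch x histological type counts (not the real data).
tab <- matrix(c(20,  8,  3,
                 6, 14, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(batchID = c("1129", "1156"),
                              histological_type = c("typeA", "typeB", "typeC")))

chi <- suppressWarnings(chisq.test(tab))  # Pearson's Chi-squared test
fis <- fisher.test(tab)                   # Fisher's exact test

print(signif(c(chisq = chi$p.value, fisher = fis$p.value), 3))

# Expected counts under independence; small values make the
# Chi-square approximation less trustworthy.
print(round(chi$expected, 2))
```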
Consider the data to be normalized.
Expression set object is available.