}> table(center,batch)
batch
center 0820 0825 0904 1020 1110 1116 A004 A00B A081
A6 4 2 0 4 0 0 0 0 0
AA 26 5 31 44 4 0 15 17 13
AF 4 0 0 0 0 1 0 0 0
AG 12 10 8 0 0 17 9 9 0
AY 0 0 0 0 2 0 0 0 0
{code}
For the combined DNA methylation dataset (237 patients) batch vs clinical traits significant correlations (P value = 0.05) are listed. Note: traits 1 and 2 should be used for the calculation of the survival and then Cox model should be used to show that batch is predictive of survival.
{csv}"COLON_clinical traits","DataType","NumberOfNAs","Test","Pvalue"
"days_to_last_followup","integer",0,"Kruskal-Wallis rank sum test",1.19E-23
"days_to_last_known_alive","integer",0,"Kruskal-Wallis rank sum test",2.63E-19
"days_to_form_completion","integer",0,"Kruskal-Wallis rank sum test",1.32E-14
"tumor_tissue_site","factor",2,"Pearson's Chi-squared test",3.41E-14
"histological_type","factor",7,"Pearson's Chi-squared test",9.69E-11
"year_of_initial_pathologic_diagnosis","integer",0,"Kruskal-Wallis rank sum test",3.77E-09
"anatomic_organ_subdivision","factor",2,"Pearson's Chi-squared test",8.00E-07
"vascular_invasion_present","factor",27,"Pearson's Chi-squared test",8.85E-06
"vital_status","factor",1,"Pearson's Chi-squared test",3.20E-05
"lymphatic_invasion_present","factor",12,"Pearson's Chi-squared test",3.53E-05
"residual_tumor","factor",6,"Pearson's Chi-squared test",5.04E-04
"number_of_lymphnodes_examined","integer",2,"Kruskal-Wallis rank sum test",1.48E-03
"kras_gene_analysis_performed","factor",1,"Pearson's Chi-squared test",5.44E-02
"gender","factor",0,"Pearson's Chi-squared test",6.25E-02
"person_neoplasm_cancer_status","factor",0,"Pearson's Chi-squared test",8.27E-02
"tumor_stage","factor",27,"Pearson's Chi-squared test",8.89E-02
{csv}
h4. Data normalization
For the normalization I didn't split the probes into "red" and "green". Approach: perform SVD, identify tech variables that are highly correlated with the first PCs and remove the variables. Use M value to express methylation status of each probes. The M value was calculated in the following way (the code was take from Bioconductor lumi package, functions for the analysis of Illumina Human Methylation 27k data):
SVD on the whole matrix, relative variance before normalization: !Colon_RelativeVarianceNoNorm.png|thumbnail!
Created a file with technical variables, merged day, month and year into a single variable. Tech file, head:
{code}> head(k)
row.names bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom
1 63 TCGA-A6-2674-01A TCGA-A6-2674-01A-02D-0820-05 26.7 uL 0096500431
2 150 TCGA-A6-2677-01A TCGA-A6-2677-01A-01D-0820-05 26.7 uL 0096500430
3 182 TCGA-A6-2678-01A TCGA-A6-2678-01A-01D-0820-05 26.7 uL 0096500407
4 334 TCGA-A6-2683-01A TCGA-A6-2683-01A-01D-0820-05 26.7 uL 0096500406
5 1186 TCGA-AA-3514-01A TCGA-AA-3514-01A-02D-0820-05 26.7 uL 0096500383
6 1254 TCGA-AA-3517-01A TCGA-AA-3517-01A-01D-0820-05 26.7 uL 0096500382
concentration shipment plate_column plate_row batch shortID
1 0.14 ug/uL 22-2-2010 1 B 0820 TCGA-A6-2674
2 0.15 ug/uL 22-2-2010 1 C 0820 TCGA-A6-2677
3 0.15 ug/uL 22-2-2010 1 D 0820 TCGA-A6-2678
4 0.16 ug/uL 22-2-2010 1 E 0820 TCGA-A6-2683
5 0.15 ug/uL 22-2-2010 1 F 0820 TCGA-AA-3514
6 0.13 ug/uL 22-2-2010 1 G 0820 TCGA-AA-3517{code}
Ignore bcr_sample_barcode, bcr_aliquot_barcode, these are not tech variables.
Here are the correlation of the tech variables with the first 4 PCs.
{code}> k
bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom concentration shipment
V1 0.4700026 0.4877315 0.04672452 0.3017151 1.601134e-07 8.334984e-34
V2 0.4510048 0.4877315 0.48950693 0.5316963 3.392774e-01 5.699460e-03
V3 0.4510048 0.4877315 0.18777379 0.6925323 1.141026e-01 7.733551e-02
V4 0.4513158 0.4877315 0.58958142 0.6705972 6.211816e-01 6.664406e-02
plate_column plate_row batch shortID
V1 3.009061e-05 0.001491552 2.491659e-33 0.4702436
V2 1.947964e-01 0.989450736 1.349671e-04 0.4511082
V3 9.369050e-01 0.463424193 1.437189e-01 0.4511082
V4 3.395272e-01 0.688070623 1.012497e-01 0.4514159
{code}
Outliers of the first principal component: !Colon_Mval_noNorm_PC1outliers.png|thumbnail!
Seems that the batch has the biggest effect on the data. Remove the batch, relative variance after that:
!Colon_RelativeVariance_noBatch.png|thumbnail!
Calculate correlation with the clinical variables after removing the batch effect:
{code}> k
bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom concentration shipment
V1 0.4691690 0.4877315 0.6106733 0.5489680 0.4263794 0.9956323
V2 0.4534001 0.4877315 0.8880134 0.7069529 0.4187061 0.9950998
V3 0.4536645 0.4877315 0.8589996 0.7271536 0.4000974 0.9856611
V4 0.4543147 0.4877315 0.8880134 0.6459688 0.9405449 0.9999056
plate_column plate_row batch shortID
V1 0.04191285 0.5627797 0.9981536 0.4690814
V2 0.97317059 0.6821160 0.9999954 0.4536160
V3 0.55783761 0.0703874 0.9992998 0.4537400
V4 0.52248494 0.8718225 0.9999849 0.4543834{code}
Seems that this took care of all the problems, there is still a borderline significant correlation of the PC1 with the plate column |