Wiki Markup |
---|
h4. Batch vs clinical traits, data collection Correlation of batch with clinical traits (READ: can be found here, COAD: can be found here). READ, number of batches: 11, COAD, number of batches: 15 Additional technical variables to be considered in normalization: batch (6th field in the patient's barcode), center (2nd field in the patients' barcode), amount, barcode bottom \(?), concentration, day, month, year (to be concatenated and considered as a single variable), plate row, plate column. This information is available from clinical_aliquot_public_CANCERTYPE.txt files. DNA methylation: 27k, level1 downloaded on December 19th, 2011. Total number of COAD patients (tumor/matched normal/unmatched normal) = 212 Total number of READ patients (tumor/matched normal/unmatched normal) = 81 Place them in the same directory and combine in a single matrix of *M values (log2(methylated/unmethylated))* Total number of tumor samples: 237 Extracted batch and center information, relationship between them: {code}> table(center,batch) batch center 0820 0825 0904 1020 1110 1116 A004 A00B A081 A6 4 2 0 4 0 0 0 0 0 AA 26 5 31 44 4 0 15 17 13 AF 4 0 0 0 0 1 0 0 0 AG 12 10 8 0 0 17 9 9 0 AY 0 0 0 0 2 0 0 0 0 {code} For the combined DNA methylation dataset (237 patients) batch vs clinical traits significant correlations (P value = 0.05) are listed. Note: traits 1 and 2 should be used for the calculation of the survival and then Cox model should be used to show that batch is predictive of survival. Complete list can be found here {csv}"COLON_clinical traits","DataType","NumberOfNAs","Test","Pvalue" "days_to_last_followup","integer",0,"Kruskal-Wallis rank sum test",1.19E-23 "days_to_last_known_alive","integer",0,"Kruskal-Wallis rank sum test",2.63E-19 "days_to_form_completion","integer",0,"Kruskal-Wallis rank sum test",1.32E-14 "tumor_tissue_site","factor",2,"Pearson's Chi-squared test",3.41E-14 "histological_type","factor",7,"Pearson's Chi-squared test",9.69E-11 "year_of_initial_pathologic_diagnosis","integer",0,"Kruskal-Wallis rank sum test",3.77E-09 "anatomic_organ_subdivision","factor",2,"Pearson's Chi-squared test",8.00E-07 "vascular_invasion_present","factor",27,"Pearson's Chi-squared test",8.85E-06 "vital_status","factor",1,"Pearson's Chi-squared test",3.20E-05 "lymphatic_invasion_present","factor",12,"Pearson's Chi-squared test",3.53E-05 "residual_tumor","factor",6,"Pearson's Chi-squared test",5.04E-04 "number_of_lymphnodes_examined","integer",2,"Kruskal-Wallis rank sum test",1.48E-03 "kras_gene_analysis_performed","factor",1,"Pearson's Chi-squared test",5.44E-02 "gender","factor",0,"Pearson's Chi-squared test",6.25E-02 "person_neoplasm_cancer_status","factor",0,"Pearson's Chi-squared test",8.27E-02 "tumor_stage","factor",27,"Pearson's Chi-squared test",8.89E-02 {csv} Correlation with survival. Survival object, relevant clinical variables: days to death (8), days to the last follow up (11), days to the last know alive (12), vitalin status (48). Summaries: {code}> table(bio[,48]) # vital status DECEASED LIVING 22 214 > summary(bio[,12]) # days to the last known alive Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0 0.0 31.0 204.3 332.0 1581.0 > summary(bio[,8]) # days to death Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.00 37.25 105.50 374.10 535.00 1581.00 215.00 > summary(bio[,11]) # days to the last follow - up Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0 0.0 30.0 197.1 332.0 1581.0{code} It is interesting to see that days to the last follow up and the days to the last know alive are almost identical. In some cases when days to the last known alive has a number (not 0) days to the last follow up is showing 0. I decided to use days to the last known alive instead of the days to the last follow up for the construction of the survival object. !COLON_daysalive_vs_daysFollowUp.png|thumbnail! There are no patients that would have no information for the days to the last follow up and vital status. Create the survival object: {code:collapse=true}> head(x) VitalStatus DaysKnownAlive DaysLastFollowUp DaysDeath TCGA-A6-2674 2 523 523 NA TCGA-A6-2677 2 541 541 NA TCGA-A6-2678 2 437 437 NA TCGA-A6-2683 2 472 472 NA TCGA-AA-3514 2 31 0 NA TCGA-AA-3517 2 31 0 NA > dim(x[is.na(x[, 3]) & is.na(x[, 4]), ]) [1] 0 4 > status<-rep(1,237) > status[which(is.na(x[, 4]))] <- 0 > x[is.na(x[, 4]), 4] <- x[which(is.na(x[, 4]), 4), 2] > x1<-cbind(x,status) > head(x1) VitalStatus DaysKnownAlive DaysLastFollowUp DaysDeath status TCGA-A6-2674 2 523 523 523 0 TCGA-A6-2677 2 541 541 541 0 TCGA-A6-2678 2 437 437 437 0 TCGA-A6-2683 2 472 472 472 0 TCGA-AA-3514 2 31 0 31 0 TCGA-AA-3517 2 31 0 31 0 > surv <- Surv(as.numeric(x1[, 4]), as.numeric(x1[, 5])) # Kaplan Meier curve: > plot(survfit(surv ~ 1), xlab = "days to death", ylab = "Probability", main = "Kaplan-Meier survival curve for TCGA \n COLON (COAD+READ) (237 patients)") # Correlation of survival with batch: > plot(survfit(surv ~ k[, 10]), xlab = "days to death", ylab = "Probability", col = 1:9, main = "TCGA COLON survival correlation with batch") > legend(750, 0.4, levels(as.factor(k[,10])), text.col = 1:9,ncol=3) > summary(coxph(surv ~ k[,10])) Call: coxph(formula = surv ~ k[, 10]) n= 237, number of events= 22 coef exp(coef) se(coef) z Pr(>|z|) k[, 10]0825 -1.802e+01 1.499e-08 1.522e+04 -0.001 0.9991 k[, 10]0904 2.521e+00 1.244e+01 1.238e+00 2.036 0.0417 * k[, 10]1020 1.686e+00 5.395e+00 1.170e+00 1.441 0.1497 k[, 10]1110 2.491e+00 1.207e+01 1.228e+00 2.028 0.0425 * k[, 10]1116 1.584e+00 4.875e+00 1.429e+00 1.109 0.2676 k[, 10]A004 -1.790e+01 1.684e-08 4.977e+03 -0.004 0.9971 k[, 10]A00B 7.077e-01 2.029e+00 1.109e+00 0.638 0.5233 k[, 10]A081 4.812e-01 1.618e+00 1.197e+00 0.402 0.6876 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 exp(coef) exp(-coef) lower .95 upper .95 k[, 10]0825 1.499e-08 6.671e+07 0.0000 Inf k[, 10]0904 1.244e+01 8.038e-02 1.0989 140.86 k[, 10]1020 5.395e+00 1.853e-01 0.5447 53.44 k[, 10]1110 1.207e+01 8.282e-02 1.0874 134.07 k[, 10]1116 4.875e+00 2.051e-01 0.2962 80.22 k[, 10]A004 1.684e-08 5.940e+07 0.0000 Inf k[, 10]A00B 2.029e+00 4.928e-01 0.2309 17.83 k[, 10]A081 1.618e+00 6.181e-01 0.1550 16.89 Concordance= 0.743 (se = 0.082 ) Rsquare= 0.083 (max possible= 0.521 ) Likelihood ratio test= 20.5 on 8 df, p=0.008608 Wald test = 9.14 on 8 df, p=0.331 Score (logrank) test = 20.77 on 8 df, p=0.007792{code} Kaplan - Meier curve and correlation with batch: !COLON_KaplanMeierCurve.png|thumbnail! !COLON_SurvivalBatchPlot.png|thumbnail! h4. Data normalization For the normalization I didn't split the probes into "red" and "green". Approach: perform SVD, identify tech variables that are highly correlated with the first PCs and remove the variables. Use M value to express methylation status of each probes. The M value was calculated in the following way (the code was take from Bioconductor lumi package, functions for the analysis of Illumina Human Methylation 27k data): SVD on the whole matrix, relative variance before normalization: !Colon_RelativeVarianceNoNorm.png|thumbnail! Created a file with technical variables, merged day, month and year into a single variable. Tech file, head: {code}> head(k) row.names bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom 1 63 TCGA-A6-2674-01A TCGA-A6-2674-01A-02D-0820-05 26.7 uL 0096500431 2 150 TCGA-A6-2677-01A TCGA-A6-2677-01A-01D-0820-05 26.7 uL 0096500430 3 182 TCGA-A6-2678-01A TCGA-A6-2678-01A-01D-0820-05 26.7 uL 0096500407 4 334 TCGA-A6-2683-01A TCGA-A6-2683-01A-01D-0820-05 26.7 uL 0096500406 5 1186 TCGA-AA-3514-01A TCGA-AA-3514-01A-02D-0820-05 26.7 uL 0096500383 6 1254 TCGA-AA-3517-01A TCGA-AA-3517-01A-01D-0820-05 26.7 uL 0096500382 concentration shipment plate_column plate_row batch shortID 1 0.14 ug/uL 22-2-2010 1 B 0820 TCGA-A6-2674 2 0.15 ug/uL 22-2-2010 1 C 0820 TCGA-A6-2677 3 0.15 ug/uL 22-2-2010 1 D 0820 TCGA-A6-2678 4 0.16 ug/uL 22-2-2010 1 E 0820 TCGA-A6-2683 5 0.15 ug/uL 22-2-2010 1 F 0820 TCGA-AA-3514 6 0.13 ug/uL 22-2-2010 1 G 0820 TCGA-AA-3517{code} Ignore bcr_sample_barcode, bcr_aliquot_barcode, these are not tech variables. Here are the correlation of the tech variables with the first 4 PCs. {code}> k bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom concentration shipment V1 0.4700026 0.4877315 0.04672452 0.3017151 1.601134e-07 8.334984e-34 V2 0.4510048 0.4877315 0.48950693 0.5316963 3.392774e-01 5.699460e-03 V3 0.4510048 0.4877315 0.18777379 0.6925323 1.141026e-01 7.733551e-02 V4 0.4513158 0.4877315 0.58958142 0.6705972 6.211816e-01 6.664406e-02 plate_column plate_row batch shortID V1 3.009061e-05 0.001491552 2.491659e-33 0.4702436 V2 1.947964e-01 0.989450736 1.349671e-04 0.4511082 V3 9.369050e-01 0.463424193 1.437189e-01 0.4511082 V4 3.395272e-01 0.688070623 1.012497e-01 0.4514159 {code} Outliers of the first principal component: !Colon_Mval_noNorm_PC1outliers.png|thumbnail! Seems that the batch has the biggest effect on the data. Remove the batch, relative variance after that: !Colon_RelativeVariance_noBatch.png|thumbnail! Calculate correlation with the clinical variables after removing the batch effect: {code}> k bcr_sample_barcode bcr_aliquot_barcode amount biospecimen_barcode_bottom concentration shipment V1 0.4691690 0.4877315 0.6106733 0.5489680 0.4263794 0.9956323 V2 0.4534001 0.4877315 0.8880134 0.7069529 0.4187061 0.9950998 V3 0.4536645 0.4877315 0.8589996 0.7271536 0.4000974 0.9856611 V4 0.4543147 0.4877315 0.8880134 0.6459688 0.9405449 0.9999056 plate_column plate_row batch shortID V1 0.04191285 0.5627797 0.9981536 0.4690814 V2 0.97317059 0.6821160 0.9999954 0.4536160 V3 0.55783761 0.0703874 0.9992998 0.4537400 V4 0.52248494 0.8718225 0.9999849 0.4543834{code} Seems that this took care of all the problems, there is still a borderline significant correlation of the PC1 with the plate column, not going to worry about it. Look at the outliers of the first principal component after removing the batch: !Colon_batchRemoved_PC1outliers.png|thumbnail! Conclusion: consider the data to be normalized. |
Page Comparison
General
Content
Integrations
App links