Batch vs clinical traits
Clinical traits: 36, number of batches: 13
Batch vs center:
Significant batch/trait correlations (complete table can be found here):
KIRC_clinical_traits,DataType,NumberOfNAs,Test,Pvalue
white_cell_count_result,factor,82,Pearson's Chi-squared test,2.09E-13
serum_calcium_result,factor,160,Pearson's Chi-squared test,8.31E-13
tumor_stage,factor,21,Pearson's Chi-squared test,2.11E-11
tumor_grade,factor,5,Pearson's Chi-squared test,6.43E-09
vital_status,factor,0,Pearson's Chi-squared test,9.62E-09
days_to_form_completion,integer,0,Kruskal-Wallis rank sum test,1.16E-07
year_of_initial_pathologic_diagnosis,integer,0,Kruskal-Wallis rank sum test,1.38E-07
days_to_last_known_alive,integer,10,Kruskal-Wallis rank sum test,8.41E-07
days_to_last_followup,integer,4,Kruskal-Wallis rank sum test,1.94E-06
distant_metastasis_pathologic_spread,factor,11,Pearson's Chi-squared test,2.23E-06
primary_tumor_pathologic_spread,factor,0,Pearson's Chi-squared test,3.63E-06
person_neoplasm_cancer_status,factor,28,Pearson's Chi-squared test,4.26E-06
hemoglobin_result,factor,71,Pearson's Chi-squared test,2.66E-04
lymphnode_pathologic_spread,factor,2,Pearson's Chi-squared test,7.85E-04
lymphnodes_examined_prior_presentation,factor,43,Pearson's Chi-squared test,2.05E-03
gender,factor,0,Pearson's Chi-squared test,2.10E-02
age_at_initial_pathologic_diagnosis,integer,0,Kruskal-Wallis rank sum test,2.51E-02
days_to_birth,integer,8,Kruskal-Wallis rank sum test,2.87E-02
prior_diagnosis,factor,0,Pearson's Chi-squared test,4.75E-02
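For the factor traits in the table, the test pairs batch with the trait in a contingency table after dropping NAs. The following is a minimal sketch of the Pearson chi-squared statistic in plain Python, not the code used to produce the table. The function name and the toy data are hypothetical, and no continuity correction is applied.

```python
from collections import Counter

def chi_squared_statistic(batches, traits):
    """Pearson chi-squared statistic and degrees of freedom for two factors.

    Samples whose trait value is None (an NA) are dropped before counting.
    """
    pairs = [(b, t) for b, t in zip(batches, traits) if t is not None]
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(b for b, _ in pairs)
    col_totals = Counter(t for _, t in pairs)

    stat = 0.0
    for b, row_n in row_totals.items():
        for t, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            stat += (observed[(b, t)] - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical toy data: two batches, one factor trait with one NA.
batches = ["b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]
gender = ["M", "M", "M", "F", "F", "F", "F", None]
stat, df = chi_squared_statistic(batches, gender)
print(stat, df)
```

With 13 batches, many cells can have small expected counts, in which case the chi-squared approximation is less reliable.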
Survival vs Batch
The summary can be found here. Batch is significantly associated with survival:
Likelihood ratio test= 61.35 on 10 df, p=2.007e-09
Wald test = 64.35 on 10 df, p=5.39e-10
Score (logrank) test = 75.35 on 10 df, p=4.066e-12
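Each of the three statistics above is referred to a chi-square distribution with 10 degrees of freedom. As a sanity check, the p-values can be recomputed from the reported statistics. The sketch below uses the closed-form upper tail that holds for even degrees of freedom, so it needs only the standard library. The function name is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the finite-sum closed form, which is valid only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Statistics and p-values as reported above (all on 10 df).
reported = {
    "Likelihood ratio": (61.35, 2.007e-09),
    "Wald": (64.35, 5.39e-10),
    "Score (logrank)": (75.35, 4.066e-12),
}
for name, (stat, p_reported) in reported.items():
    p = chi2_sf_even_df(stat, 10)
    print(f"{name}: recomputed p = {p:.3e}, reported p = {p_reported:.3e}")
```

Small differences from the reported values are expected because the statistics are rounded to two decimals.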
DNA methylation data analysis
27k dataset, downloaded on December 28, 2011; 219 samples. Technical variables available: batch, amount, concentration, day of shipment, month of shipment, year of shipment, plate row, plate column. Day, month and year of shipment are combined into a single variable (dateCombined). Info about technical variables:
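Combining the three shipment fields into one variable can be sketched as follows. This is an illustration only: the function name, the tuple layout and the example dates are hypothetical, and missing parts are assumed to be represented as None.

```python
from datetime import date

def combine_shipment_date(day, month, year):
    """Merge day, month and year into one date; None if any part is missing."""
    if day is None or month is None or year is None:
        return None
    return date(year, month, day)

# Hypothetical (day, month, year) triples for three samples.
shipments = [(14, 7, 2010), (3, 11, 2009), (14, 7, 2010)]
date_combined = [combine_shipment_date(d, m, y) for d, m, y in shipments]

# ISO strings sort chronologically, so they work as ordered factor levels.
levels = sorted({d.isoformat() for d in date_combined if d is not None})
print(levels)
```

Treating the combined date as a factor makes it usable in the same kind of group-wise tests as batch.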
"amount" is excluded when correlating the first principal components of the data with the technical variables.
Created a matrix of M values, without splitting the red and green channels. Relative variance and outliers, without normalization:
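The notes do not give the formula used for the M values. The sketch below shows the commonly used definition, the log2 ratio of methylated to unmethylated intensity, together with the equivalent conversion from a beta value. The offset of 1 and the toy intensities are assumptions.

```python
import math

def m_value(methylated, unmethylated, offset=1.0):
    """log2 ratio of methylated to unmethylated signal intensity.

    The offset (assumed to be 1 here) avoids taking the log of zero
    and damps ratios formed from very low intensities.
    """
    return math.log2((methylated + offset) / (unmethylated + offset))

def m_value_from_beta(beta):
    """Logit (base 2) of a beta value strictly between 0 and 1."""
    return math.log2(beta / (1.0 - beta))

# Hypothetical intensities: rows are probes, columns are samples,
# each cell is a (methylated, unmethylated) pair.
intensities = [
    [(3.0, 1.0), (7.0, 7.0)],
    [(0.0, 15.0), (31.0, 0.0)],
]
m_matrix = [[m_value(m, u) for m, u in row] for row in intensities]
print(m_matrix)
```

Keeping probes from both color channels in one matrix, as described above, means any channel-specific offset remains in the data at this stage.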
Based on the plot, the first 8 principal components are examined:
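Relative variance is assumed here to mean the share of total variance carried by each principal component, that is, each squared singular value divided by the sum of all squared singular values. A minimal sketch with hypothetical singular values:

```python
from itertools import accumulate

def relative_variance(singular_values):
    """Fraction of total variance explained by each component."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    return [v / total for v in squared]

# Hypothetical singular values, sorted in decreasing order as an SVD returns them.
singular_values = [4.0, 2.0, 2.0, 1.0]
rel_var = relative_variance(singular_values)
cumulative = list(accumulate(rel_var))
print(rel_var)
print(cumulative)
```

Plotting the relative variance against the component index gives the kind of plot from which a cutoff such as 8 components can be read off.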
Batch and dateCombined are highly correlated with the first principal components (V1 - V8 are the principal components obtained from an SVD of the unnormalized matrix).
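The notes do not state which statistic relates a principal component to a categorical technical variable. One option, assumed here because it is the test used for the numeric clinical traits above, is the Kruskal-Wallis rank sum test applied to the component scores grouped by batch. The sketch computes the H statistic without a tie correction. All names and data are hypothetical.

```python
from collections import defaultdict

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def kruskal_wallis_h(values, groups):
    """Kruskal-Wallis H (no tie correction) and its degrees of freedom."""
    n = len(values)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(average_ranks(values), groups):
        rank_sums[g] += r
        counts[g] += 1
    spread = sum(rank_sums[g] ** 2 / counts[g] for g in counts)
    h = 12.0 / (n * (n + 1)) * spread - 3.0 * (n + 1)
    return h, len(counts) - 1

# Hypothetical scores of one component for six samples in two batches.
pc_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
h, df = kruskal_wallis_h(pc_scores, batch)
print(h, df)
```

H is compared against a chi-square distribution with the returned degrees of freedom. Because component scores are continuous, ties are rare and the missing tie correction matters little.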
Start by removing the batch effect. Relative variance and outliers after removing batch:
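How batch was removed is not stated in these notes. One simple possibility, shown purely as an assumption, is to center each probe's M values within each batch, which is equivalent to taking residuals from a linear model with batch as the only factor. The function name and values are hypothetical.

```python
from collections import defaultdict

def center_by_batch(row, batches):
    """Subtract the per-batch mean from one probe's values across samples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, b in zip(row, batches):
        sums[b] += value
        counts[b] += 1
    means = {b: sums[b] / counts[b] for b in sums}
    return [value - means[b] for value, b in zip(row, batches)]

# Hypothetical M values for one probe across four samples in two batches.
probe = [1.0, 3.0, 10.0, 14.0]
batches = ["b1", "b1", "b2", "b2"]
adjusted = center_by_batch(probe, batches)
print(adjusted)
```

Because batch is associated with several clinical traits and with survival in this dataset, centering by batch may also remove biological signal that is confounded with batch.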