Important update (January 20th, 2012): the data below were corrected for the BCR batch, which is not necessarily the same as the processing batch. The dataset needs to be reanalyzed.
Correlation between BCR batch and the processing batch for 27k arrays (January 20, 2012):
Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 2 data",Batch as the sixth field in the patient barcode,Comments
Batch 32,1,0859,"Level 1 data have been uploaded again as .idat files split into green and red channels; I can't figure out how to get the batch from the file names. However, they now provide the slide number and the array letter!"
Batch 50,2,1186,
Batch 63,no data,,
Batch 64,3,"1287, 1284",
Batch 65,4,1303,
Batch 68,no data,,
Batch 69,5,1332,
Batch 70,no data,,
Batch 82,no data,,
Batch 90,no data,,
Batch 105,no data,,
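The mapping in the table above can be sketched in code. This is a minimal illustration, assuming the barcode is dash-separated so that the "sixth field" is the sixth dash-delimited token; the function name and the example barcode are made up, and the lookup only contains the rows of the table that have data.

```python
# Sixth barcode field -> batch on the download page, copied from the table above.
PLATE_TO_BCR_BATCH = {
    "0859": 32,
    "1186": 50,
    "1287": 64,
    "1284": 64,
    "1303": 65,
    "1332": 69,
}


def batch_field_from_barcode(barcode: str) -> str:
    """Return the sixth dash-separated field of a patient barcode."""
    fields = barcode.split("-")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}: {barcode!r}")
    return fields[5]


# Hypothetical barcode, invented for illustration only.
example_barcode = "TCGA-XX-0000-01A-01D-0859-05"
field = batch_field_from_barcode(example_barcode)
print(field, PLATE_TO_BCR_BATCH.get(field, "unknown"))
```

Fields missing from the lookup fall back to "unknown" rather than being guessed.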
Batch vs clinical traits
Clinical traits: 36, number of batches: 13
Batch vs center:
Significant batch/trait correlations (complete table can be found here):
KIRC_clinical_traits,DataType,NumberOfNAs,Test,Pvalue
white_cell_count_result,factor,82,Pearson's Chi-squared test,2.09E-13
serum_calcium_result,factor,160,Pearson's Chi-squared test,8.31E-13
tumor_stage,factor,21,Pearson's Chi-squared test,2.11E-11
tumor_grade,factor,5,Pearson's Chi-squared test,6.43E-09
vital_status,factor,0,Pearson's Chi-squared test,9.62E-09
days_to_form_completion,integer,0,Kruskal-Wallis rank sum test,1.16E-07
year_of_initial_pathologic_diagnosis,integer,0,Kruskal-Wallis rank sum test,1.38E-07
days_to_last_known_alive,integer,10,Kruskal-Wallis rank sum test,8.41E-07
days_to_last_followup,integer,4,Kruskal-Wallis rank sum test,1.94E-06
distant_metastasis_pathologic_spread,factor,11,Pearson's Chi-squared test,2.23E-06
primary_tumor_pathologic_spread,factor,0,Pearson's Chi-squared test,3.63E-06
person_neoplasm_cancer_status,factor,28,Pearson's Chi-squared test,4.26E-06
hemoglobin_result,factor,71,Pearson's Chi-squared test,2.66E-04
lymphnode_pathologic_spread,factor,2,Pearson's Chi-squared test,7.85E-04
lymphnodes_examined_prior_presentation,factor,43,Pearson's Chi-squared test,2.05E-03
gender,factor,0,Pearson's Chi-squared test,2.10E-02
age_at_initial_pathologic_diagnosis,integer,0,Kruskal-Wallis rank sum test,2.51E-02
days_to_birth,integer,8,Kruskal-Wallis rank sum test,2.87E-02
prior_diagnosis,factor,0,Pearson's Chi-squared test,4.75E-02
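For reference, the two test statistics named in the table can be sketched with the standard library alone. This is not the code that produced the p-values above; it is a rough illustration that assumes missing values (the NumberOfNAs column) are dropped pairwise, and the toy inputs are invented. P-values would come from the chi-squared distribution with the returned degrees of freedom, for example via `scipy.stats.chi2_contingency` and `scipy.stats.kruskal`.

```python
from collections import Counter


def chi_squared_statistic(trait, batches):
    """Pearson chi-squared statistic and df for a factor trait vs batch."""
    pairs = [(t, b) for t, b in zip(trait, batches) if t is not None]
    n = len(pairs)
    rows = Counter(t for t, _ in pairs)
    cols = Counter(b for _, b in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for t in rows:
        for b in cols:
            expected = rows[t] * cols[b] / n
            stat += (observed[(t, b)] - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(values, batches):
    """Tie-corrected Kruskal-Wallis H and df for an integer trait vs batch."""
    pairs = [(v, b) for v, b in zip(values, batches) if v is not None]
    vals = [v for v, _ in pairs]
    n = len(vals)
    rank_sums, sizes = Counter(), Counter()
    for rank, (_, b) in zip(average_ranks(vals), pairs):
        rank_sums[b] += rank
        sizes[b] += 1
    h = 12 / (n * (n + 1)) * sum(rank_sums[b] ** 2 / sizes[b] for b in sizes)
    h -= 3 * (n + 1)
    # Undefined (division by zero) if every value is identical.
    tie_correction = 1 - sum(t ** 3 - t for t in Counter(vals).values()) / (n ** 3 - n)
    return h / tie_correction, len(sizes) - 1


# Toy data, not from the KIRC dataset.
print(chi_squared_statistic(["x", "x", "y", "y"], ["a", "a", "b", "b"]))
print(kruskal_wallis_h([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"]))
```

No continuity correction is applied to the chi-squared statistic here.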
Survival vs Batch
Summary can be found here. Batch is significantly associated with survival:
Likelihood ratio test= 61.35 on 10 df, p=2.007e-09
Wald test = 64.35 on 10 df, p=5.39e-10
Score (logrank) test = 75.35 on 10 df, p=4.066e-12
DNA methylation data analysis
27k dataset, downloaded on December 28, 2011; 219 samples. Note: TCGA is terrible about their standards. I am extracting the values for methylated and unmethylated probes from each patient's file. For this dataset they are in the 1st and 4th columns; for GBM, however, they are in the 1st and 2nd columns! Unreliable. It seems that the GBM data were processed differently, because the standard deviation and the number of beads are missing for GBM. However, I noticed that they actually provide negative control intensities for the green and red dyes.
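The methylated and unmethylated intensities extracted here feed the M-value matrix used in the rest of the analysis. The notes do not spell out the formula, so the sketch below assumes the common definition, the log2 ratio of methylated to unmethylated intensity with a small offset to avoid division by zero. The function name, the offset of 1 and the toy intensities are all assumptions.

```python
import math


def m_value(methylated, unmethylated, offset=1.0):
    """log2 ratio of methylated to unmethylated intensity (assumed definition)."""
    # Negative intensities are clipped to zero before the offset is added.
    meth = max(methylated, 0.0) + offset
    unmeth = max(unmethylated, 0.0) + offset
    return math.log2(meth / unmeth)


# Toy intensities for two probes in two samples (not real data):
# each inner list is one probe, each tuple is (methylated, unmethylated).
intensities = [
    [(7.0, 1.0), (3.0, 3.0)],
    [(0.0, 3.0), (15.0, 0.0)],
]

# Probes in rows, samples in columns.
m_matrix = [[m_value(m, u) for m, u in probe] for probe in intensities]
for row in m_matrix:
    print(row)
```

Positive values indicate more methylated than unmethylated signal, and zero indicates equal signal.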
Technical variables available: batch, amount, concentration, day of shipment, month of shipment, year of shipment, plate row, plate column. Combine day, month and year into a single variable (dateCombined). Info about technical variables:
Exclude "amount" when correlating the first principal components of the data with the technical variables.
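Combining the day, month and year of shipment into one variable could look like the following. This is only a sketch: the function name and the example values are made up, and turning the date into an ordinal is one possible way to get a numeric variable, not necessarily what was done here.

```python
from datetime import date


def combine_shipment_date(day, month, year):
    """Merge separate day/month/year fields into a single date value."""
    return date(int(year), int(month), int(day))


# Made-up shipment fields for two samples (not taken from the dataset).
first = combine_shipment_date(day="3", month="11", year="2010")
second = combine_shipment_date(day=17, month=2, year=2011)

# The ordinal (days since 0001-01-01) is a numeric stand-in for the date.
print(first.isoformat(), first.toordinal())
print((second - first).days)
```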
Created a matrix of M values without splitting the red and green channels. Relative variance, no normalization and the outliers:
Based on the plot will look at the first 8 principal components:
Batch and dateCombined are highly correlated with the first principal components (V1 - V8 are the principal components obtained by performing an SVD on the unnormalized matrix).
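The relative variance used to pick the number of components can be computed from the singular values. The sketch below assumes "relative variance" means each squared singular value divided by the sum of all squared singular values; the function name and the toy values are invented. The singular values themselves would come from an SVD routine such as `numpy.linalg.svd(matrix, compute_uv=False)`.

```python
def relative_variance(singular_values):
    """Share of total variance attributed to each component (assumed definition)."""
    squared = [d * d for d in singular_values]
    total = sum(squared)
    return [s / total for s in squared]


# Toy singular values, not taken from the dataset.
shares = relative_variance([3.0, 4.0])
print(shares)
```

Plotting these shares against the component index gives the kind of plot from which the first 8 components were chosen.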
Start by removing the batch. Relative variance and the outliers after removing the batch.
Yikes.
Correlation with the first principal components:
Consider the data to be normalized.
eSet object is available.