...
Important update (January 20th, 2011): the data below have been corrected for the BCR batch which is not necessarily the processing batch. The dataset needs to be reanalyzed.
Correlation between BCR batch and the processing batch for 27k arrays (January 20, 2012):
Batch vs clinical traits
Clinical traits: 36, number of batches: 13
Batch vs center:
```
> table(batchID, two)
       two
batchID A3 AK AS B0 B2 B4 B8 BP CJ CW CZ DV EU
   0859 31  8  2  0  0  0  0  0  0  0  0  0  0
   1186  4  6  0  0  5  0  6  5  9  0  0  0  0
   1275  0 12  0 29  1  0  1  0  0  0  0  0  0
   1284  0  0  0  0  0  0  0 50  0  0  0  0  0
   1303  0  0  0  6  0  0  0 11 24  0  6  0  0
   1323 18  7  0  0  4  0  3  5  9  0  0  0  0
   1332  0  0  0  6  0  0  0 39  2  0  0  0  0
   1418  6  0  0 27  0  0  6  8  0  0  0  0  0
   1424  0  0  0  0  0  0  0 28 16  0  3  0  0
   1500  0  1  0 15  0  2  1  1  0  0 24  0  0
   1536  2  0  0 18  5  0  5  0 13  9  0  9  0
   1551  0  0  0  0  0  0  3  0  0  0  0  0  0
   1670  0  0  0  6  0  7  4  0  7  6  7  0  4
```
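A minimal sketch of how a batch-by-center cross-tabulation like the one above could be produced. The source does not show how `two` was derived; this assumes it is the two-character site code at positions 6-7 of the TCGA barcode. The barcodes and batch assignments below are made up for illustration.

```r
# Hypothetical barcodes and batch IDs, for illustration only.
barcode <- c("TCGA-A3-0001-01A", "TCGA-A3-0002-01A",
             "TCGA-AK-0003-01A", "TCGA-B0-0004-01A")
batchID <- c("0859", "0859", "1186", "1275")

# Assumption: `two` is the two-character site code at positions 6-7.
two <- substr(barcode, 6, 7)

# Cross-tabulate batch against center.
tab <- table(batchID, two)
print(tab)
```

A cell with a large count and zeros elsewhere in its row, as for batch 1284 above, means that batch and center cannot be separated for those samples.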
Significant batch/trait correlations (complete table can be found ...):
Survival vs Batch
A summary can be found here; batch is significantly correlated with survival:
Likelihood ratio test= 61.35 on 10 df, p=2.007e-09
Wald test = 64.35 on 10 df, p=5.39e-10
Score (logrank) test = 75.35 on 10 df, p=4.066e-12
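The three statistics above are the global tests that a Cox proportional hazards fit reports. The sketch below shows how they could be obtained, assuming batch enters the model as a factor; the source does not show the actual call. The data are synthetic and the variable names are hypothetical. With a factor predictor, the degrees of freedom equal the number of batch levels in the model minus one.

```r
library(survival)  # recommended package shipped with standard R installs

set.seed(1)
# Synthetic data for illustration only: 3 hypothetical batches.
n <- 150
batch  <- factor(sample(c("b1", "b2", "b3"), n, replace = TRUE))
rate   <- c(b1 = 0.05, b2 = 0.10, b3 = 0.20)[as.character(batch)]
time   <- rexp(n, rate = rate)
status <- rbinom(n, 1, 0.8)   # 1 = event observed, 0 = censored

fit <- coxph(Surv(time, status) ~ batch)
s <- summary(fit)

# Global tests; each is a vector with elements test, df and pvalue.
s$logtest    # likelihood ratio test
s$waldtest   # Wald test
s$sctest     # score (logrank) test
```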
DNA methylation data analysis
27k dataset, downloaded on December 28, 2011. 219 samples. Note: TCGA is terrible about its standards. I am extracting the values for methylated and unmethylated probes from the file for each patient. For this dataset they are in the 1st and 4th columns. For GBM, however, they are in the 1st and 2nd columns! Unreliable. It seems that the GBM data were processed differently, because the standard deviation and the number of beads are missing for GBM. However, I noticed that they actually provide negative control intensities for the green and red dyes.
Technical variables available: batch, amount, concentration, day of shipment, month of shipment, year of shipment, plate row, plate column. Day, month and year are combined into a single variable (dateCombined). Info about technical variables:
```
> head(methNew)
   batchID  amount concentration plate_column plate_row dateCombined
2     0859 26.7 uL    0.14 ug/uL            1         A    17-3-2010
32    0859 26.7 uL    0.17 ug/uL            1         C    17-3-2010
59    0859 26.7 uL    0.15 ug/uL            1         D    17-3-2010
84    0859 26.7 uL    0.15 ug/uL            1         E    17-3-2010
> table(methNew$batchID)
0859 1186 1284 1303 1332
  40   35   50   47   47
> table(methNew$amount)
26.7 uL
    219
> table(methNew$concentration)
0.13 ug/uL 0.14 ug/uL 0.15 ug/uL 0.16 ug/uL 0.17 ug/uL
         7         50        122         30         10
> table(methNew$plate_column)
 1  2  3  4  5  6  7
39 40 40 40 35 23  2
> table(methNew$plate_row)
 A  B  C  D  E  F  G  H
30 28 28 27 27 27 27 25
> table(methNew$plate_column, methNew$plate_row)
    A B C D E F G H
  1 5 4 5 5 5 5 5 5
  2 5 5 5 5 5 5 5 5
  3 5 5 5 5 5 5 5 5
  4 5 5 5 5 5 5 5 5
  5 5 5 5 4 4 4 4 4
  6 4 3 3 3 3 3 3 1
  7 1 1 0 0 0 0 0 0
> table(methNew$dateCombined)
11-10-2010  17-3-2010  25-8-2010  27-9-2010  6-10-2010
        47         40         35         50         47
> table(methNew$dateCombined, methNew$batchID)
             0859 1186 1284 1303 1332
  11-10-2010    0    0    0    0   47
  17-3-2010    40    0    0    0    0
  25-8-2010     0   35    0    0    0
  27-9-2010     0    0   50    0    0
  6-10-2010     0    0    0   47    0
> table(methNew$amount)
10.30 uL    10 uL  11.2 uL    11 uL  12.4 uL  13.2 uL  13.3 uL  15.7 uL    15 uL  16.1 uL
       0        0        0        0        0        0        0        0        0        0
 16.3 uL  16.7 uL    16 uL    17 uL    19 uL  20.9 uL    20 uL  21.5 uL    22 uL  23.7 uL
       0        0        0        0        0        0        0        0        0        0
   25 uL  26.7 uL    30 uL    40 uL     5 uL    60 uL    61 uL  63.1 uL  66.6 uL  6.67 uL
       0      219        0        0        0        0        0        0        0        0
 66.7 uL   6.7 uL   7.2 uL    80 uL   8.9 uL
       0        0        0        0        0
# It seems that all values of the amount are 26.7 uL (although the factor has levels from
# all the values available for future DNA methylation datasets, for patients whose samples
# are already collected)
> table(methNew$concentration)
  0.01 ug/uL   0.03 ug/uL   0.04 ug/uL 0.0500 ug/uL  0.050 ug/uL   0.05 ug/uL
           0            0            0            0            0            0
  0.09 ug/uL  0.100 ug/uL   0.10 ug/uL   0.11 ug/uL   0.12 ug/uL   0.13 ug/uL
           0            0            0            0            0            7
  0.14 ug/uL   0.15 ug/uL   0.16 ug/uL   0.17 ug/uL    0.1 ug/uL   0.50 ug/uL
          50          122           30           10            0            0
   .05 ug/uL    0.5 ug/uL   .100 ug/uL   .150 ug/uL     .1 ug/uL    .50 ug/uL
           0            0            0            0            0            0
    .5 ug/uL
           0
> table(methNew$plate_column)
 1  2  3  4  5  6  7
39 40 40 40 35 23  2
> table(methNew$plate_row)
 A  B  C  D  E  F  G  H
30 28 28 27 27 27 27 25
> table(methNew$plate_row, methNew$plate_column)
    1 2 3 4 5 6 7
  A 5 5 5 5 5 4 1
  B 4 5 5 5 5 3 1
  C 5 5 5 5 5 3 0
  D 5 5 5 5 4 3 0
  E 5 5 5 5 4 3 0
  F 5 5 5 5 4 3 0
  G 5 5 5 5 4 3 0
  H 5 5 5 5 4 1 0
> table(methNew$dateCombined)
11-10-2010  17-3-2010  25-8-2010  27-9-2010  6-10-2010
        47         40         35         50         47
> table(methNew$dateCombined, methNew$batchID)
             0859 1186 1275 1284 1303 1323 1332 1418 1424 1500 1536 1551 1670
  11-10-2010    0    0    0    0    0    0   47    0    0    0    0    0    0
  17-3-2010    40    0    0    0    0    0    0    0    0    0    0    0    0
  25-8-2010     0   35    0    0    0    0    0    0    0    0    0    0    0
  27-9-2010     0    0    0   50    0    0    0    0    0    0    0    0    0
  6-10-2010     0    0    0    0   47    0    0    0    0    0    0    0    0
```
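The combined date variable in the output above could be built roughly as follows. This is only a sketch: the column names `day`, `month` and `year` are assumptions, since the source does not show the construction. The second half illustrates why some of the tables above carry zero-count entries: a factor keeps levels that no longer occur, and `droplevels()` removes them.

```r
# Hypothetical shipment columns; the names are assumptions, not from the source.
ship <- data.frame(day = c(17, 25, 6), month = c(3, 8, 10),
                   year = c(2010, 2010, 2010))

# Combine into one variable in the day-month-year form seen above.
ship$dateCombined <- factor(paste(ship$day, ship$month, ship$year, sep = "-"))
ship

# A factor keeps unused levels, which show up as zero counts in table().
x <- factor(c("26.7 uL", "26.7 uL"), levels = c("10 uL", "26.7 uL", "30 uL"))
table(x)               # three entries, two of them zero
table(droplevels(x))   # only the level that actually occurs
```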
Exclude "amount" (it is constant at 26.7 uL for all 219 samples) from the calculation of correlations between the first principal components of the data and the technical variables.
Created a matrix of M values without splitting the red and green channels. Relative variance and the outliers, no normalization:
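For reference, a minimal sketch of an M-value matrix computed from methylated and unmethylated intensities. It assumes the common definition, the log2 ratio of the two intensities with a small offset; the source does not state the formula or offset that was used. The function name `m_value` and the toy intensities are hypothetical.

```r
# Assumption: M-value = log2 ratio of methylated to unmethylated intensity,
# with an offset to avoid log(0). The offset of 1 is illustrative.
m_value <- function(meth, unmeth, offset = 1) {
  log2((meth + offset) / (unmeth + offset))
}

# Toy intensities: rows are probes, columns are samples.
meth   <- matrix(c(1023, 255, 63, 511), nrow = 2)
unmeth <- matrix(c(255, 1023, 63, 127), nrow = 2)

M <- m_value(meth, unmeth)
M
```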
Based on the plot, I will look at the first 8 principal components:
```
        batchID concentration plate_column plate_row dateCombined
V1 2.024556e-22     0.5182919   0.22249235 0.9371285 2.024556e-22
V2 1.777673e-18     0.2878497   0.40175378 0.6195123 1.777673e-18
V3 3.196508e-01     0.3802798   0.27628233 0.5517096 3.196508e-01
V4 1.693859e-30     0.2449447   0.50367703 0.9672545 1.693859e-30
V5 2.435091e-03     0.1812444   0.08644977 0.5581507 2.435091e-03
V6 4.437547e-03     0.9473683   0.15938639 0.8458098 4.437547e-03
V7 1.271181e-03     0.3644802   0.79816984 0.7038321 1.271181e-03
V8 1.051940e-05     0.5905213   0.28713862 0.2173504 1.051940e-05
```
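batchID and dateCombined are strongly associated with the first principal components: their values in the table are below 0.005 for every component except V3 (V1 - V8 are the principal components after performing an SVD on the unnormalized matrix). The two columns are identical because each batch was shipped on a single date.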
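A table like the one above could be produced along the following lines. This is a sketch under assumptions: the source does not say which test generated the values, so a Kruskal-Wallis test of each component against each technical variable stands in for it, and the data are synthetic with a batch shift built in.

```r
set.seed(42)
# Synthetic M-value matrix: 200 probes x 30 samples, 3 hypothetical batches.
batch     <- factor(rep(c("b1", "b2", "b3"), each = 10))
plate_row <- factor(rep(c("A", "B"), times = 15))
M <- matrix(rnorm(200 * 30), nrow = 200)
# Shift the first 100 probes by batch so that batch drives the top component.
M[1:100, ] <- sweep(M[1:100, ], 2, c(0, 2, 4)[as.integer(batch)], "+")

# Center each probe, then SVD; columns of v hold per-sample component scores.
Mc <- M - rowMeans(M)
sv <- svd(Mc)
rel_var <- sv$d^2 / sum(sv$d^2)   # relative variance per component

# p-value of each of the first k components against each technical variable.
# Assumption: Kruskal-Wallis is a stand-in for the unstated test.
k <- 3
tech <- data.frame(batch = batch, plate_row = plate_row)
pvals <- sapply(tech, function(f)
  sapply(seq_len(k), function(i) kruskal.test(sv$v[, i], f)$p.value))
rownames(pvals) <- paste0("V", seq_len(k))
pvals
```

Plotting `rel_var` against the component index gives the kind of relative variance plot used to decide how many components to inspect.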
Start by removing the batch. Relative variance and the outliers after removing the batch.
Yikes.
Correlation with the first principal components:
```
     batchID concentration plate_column plate_row dateCombined
V1 0.9717423     0.8262431   0.18591881 0.8304766    0.9717423
V2 0.9976239     0.4612353   0.34203646 0.3816463    0.9976239
V3 0.9578584     0.9056604   0.12948457 0.1792408    0.9578584
V4 0.9043202     0.4152433   0.02150515 0.6264030    0.9043202
V5 0.9991262     0.8505841   0.19052765 0.6834312    0.9991262
V6 0.8956311     0.1123490   0.55257726 0.7618414    0.8956311
V7 0.9991696     0.7699433   0.84761783 0.2805982    0.9991696
V8 0.9939025     0.6395495   0.44489016 0.6334089    0.9939025
```
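The source does not say how the batch was removed. One simple possibility, shown here purely as an assumption, is to keep the residuals of a per-probe linear model on batch. The sketch uses synthetic data and checks that the batch means of the adjusted values vanish, which is consistent with the batch association disappearing from the leading components.

```r
set.seed(7)
# Synthetic data: 50 probes x 30 samples with a shift per hypothetical batch.
batch <- factor(rep(c("b1", "b2", "b3"), each = 10))
M <- matrix(rnorm(50 * 30), nrow = 50)
M <- sweep(M, 2, c(0, 2, 4)[as.integer(batch)], "+")

# Assumption: "removing the batch" = residuals of probe ~ batch.
# lm() accepts a matrix response; t(M) puts samples in rows.
fit <- lm(t(M) ~ batch)
M_adj <- t(residuals(fit))

# Per-batch means of every adjusted probe are now zero (up to rounding).
batch_means <- apply(M_adj, 1, function(x) tapply(x, batch, mean))
max(abs(batch_means))

# Re-test the first component against batch after adjustment.
pc1 <- svd(M_adj - rowMeans(M_adj))$v[, 1]
kruskal.test(pc1, batch)$p.value
```

Residualizing on batch also removes any biological signal that is confounded with batch, so the batch-by-trait tables matter when interpreting the adjusted data.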
Consider the data to be normalized.
eSet object is available.