{color:#ff0000}{_}Important update: the BCR batch considered here is not the processing batch, and it should not even be used as a surrogate variable for data normalization. All three batches were processed simultaneously._{color}

h4. Correlation between BCR batch and the processing batch for 27k arrays (January 20, 2012)


{csv}Batch on the download page,"# after ""HumanMethylation27k"" in the file name, Level 2 data",Batch as the sixth field in the patient barcode,Comments
Batch 25,1,"0741, 0742, 0743","Level 1 data has been uploaded again as .idat files split into green and red channels; I can't figure out how to get the batch from the file names. Now, however, they provide the slide number and the array letter!"{csv}


h5. Analysis of batch vs clinical traits

Number of clinical traits: 23

Number of batches based on the pattern for DNA methylation samples: 3

Correlation between batch and center in AML:
{code}> table(batchID,two)
       two
batchID AB
   0741 96
   0742 73
   0743 25{code}
All patients come from a single center (AB).
Significant trait-batch associations (the results for all other traits are in [this table|^BatchClinicalInfoCorrelationsLAML.csv]):
{csv}"LAML_clinical_traits","DataType","NumberOfNAs","Test","Pvalue"
"days_to_form_completion","integer",2,"Kruskal-Wallis rank sum test",1.22E-26
"year_of_initial_pathologic_diagnosis","integer",2,"Kruskal-Wallis rank sum test",2.53E-26
"days_to_death","integer",82,"Kruskal-Wallis rank sum test",3.94E-04
"prior_diagnosis","factor",2,"Pearson's Chi-squared test",1.63E-03
"vital_status","factor",2,"Pearson's Chi-squared test",5.25E-03
"age_at_initial_pathologic_diagnosis","integer",2,"Kruskal-Wallis rank sum test",1.66E-02
"days_to_birth","integer",2,"Kruskal-Wallis rank sum test",1.90E-02
"hydroxyurea_administration_prior_registration_clinical_study_indicator","factor",2,"Pearson's Chi-squared test",3.05E-02
"pretreatment_history","factor",2,"Pearson's Chi-squared test",3.05E-02
{csv}
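
The table above pairs each trait with a test according to its data type: Kruskal-Wallis for integer traits, Pearson's chi-squared for factors. Below is a minimal sketch of how that selection could be automated. The helper name {{test_trait_vs_batch}} and the toy data are hypothetical and are not the code that produced the table.

```r
# Hypothetical helper: choose the test from the trait's data type.
test_trait_vs_batch <- function(trait, batch) {
  batch <- as.factor(batch)
  if (is.numeric(trait)) {
    # Integer/numeric trait: Kruskal-Wallis rank sum test across batches
    res <- kruskal.test(x = trait, g = batch)
  } else {
    # Categorical trait: Pearson's chi-squared test on the contingency table
    # (table() drops NA values by default)
    res <- suppressWarnings(chisq.test(table(as.factor(trait), batch)))
  }
  data.frame(DataType = class(trait)[1],
             NumberOfNAs = sum(is.na(trait)),
             Test = res$method,
             Pvalue = res$p.value,
             stringsAsFactors = FALSE)
}

# Toy data only: batch sizes follow the table(batchID, two) output above,
# the trait values are simulated.
set.seed(1)
batch <- rep(c("0741", "0742", "0743"), times = c(96, 73, 25))
clinical_toy <- data.frame(
  age = as.integer(round(rnorm(194, mean = 55, sd = 15))),
  vital_status = factor(sample(c("LIVING", "DECEASED"), 194, replace = TRUE))
)

results <- do.call(rbind, lapply(clinical_toy, test_trait_vs_batch, batch = batch))
results <- results[order(results$Pvalue), ]
print(results)
```

With 23 traits tested, the raw P values would normally also be adjusted for multiple testing (for example with {{p.adjust}}); the table above reports unadjusted values.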

h5. Survival analysis
{code:collapse=true}> death<-clinical[,4]
> vital<-clinical[,22]
> fup<-clinical[,7]
> x<-cbind(vital,death,fup)
> rownames(x)<-rownames(clinical)
> dim(x[is.na(x[, 2]) & is.na(x[, 3]), ])
[1] 14  3
> mask <- is.na(x[, 2]) & is.na(x[, 3]) #Exclude patients for whom there is no information for days to death or days to the last follow-up, total of 14 patients
> x1 <- x[!mask, ]
> dim(x1)
[1] 188   3
> status <- rep(1, nrow(x1)) #create censoring indicator: 1 = death observed
> status[is.na(x1[, 2])] <- 0 #no days to death recorded, so the patient is censored
> x1[is.na(x1[, 2]), 2] <- x1[is.na(x1[, 2]), 3] #Patients that don't have days to death get days to the last follow-up and status is 0
> x2 <- cbind(x1, status)
> k<-match(rownames(x2),meth[,1])
> methK<-meth[k,]
> library(survival)
Loading required package: splines
> surv <- Surv(as.numeric(x2[, 2]), as.numeric(x2[, 4])) #Create survival object
> plot(survfit(surv ~ 1), xlab = "days to death", ylab = "Probability", main = "Kaplan-Meier survival curve \n for TCGA AML (188 patients)")
> plot(survfit(surv ~ methK[, 2]), xlab = "days to death", ylab = "Probability", col = 1:3, main = "TCGA AML survival correlation with batch")
> legend(2000, 0.6, levels(as.factor(methK[, 2])), text.col = 1:3)
> summary(coxph(surv ~ methK[, 2]))
Call:
coxph(formula = surv ~ methK[, 2])

  n= 182, number of events= 116
   (6 observations deleted due to missingness)

                  coef exp(coef) se(coef)      z Pr(>|z|)
methK[, 2]0742 -0.2311    0.7937   0.2139 -1.080 0.279990
methK[, 2]0743  1.1784    3.2492   0.3541  3.328 0.000874 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

               exp(coef) exp(-coef) lower .95 upper .95
methK[, 2]0742    0.7937     1.2599    0.5219     1.207
methK[, 2]0743    3.2492     0.3078    1.6233     6.504

Concordance= 0.563  (se = 0.026 )
Rsquare= 0.06   (max possible= 0.997 )
Likelihood ratio test= 11.17  on 2 df,   p=0.003754
Wald test            = 14.35  on 2 df,   p=0.0007656
Score (logrank) test = 16.2  on 2 df,   p=0.0003037{code}


!LAML_KaplanMeierCurve.png|thumbnail! !LAML_survivalVSbatch.png|thumbnail!

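Batch appears to be associated with patient survival: relative to batch 0741, batch 0743 has a hazard ratio of 3.25 (95% CI 1.62 to 6.50, p = 0.00087), while batch 0742 does not differ significantly (p = 0.28). The overall log-rank test gives p = 0.0003.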
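
The censoring rules used in the session above (drop patients with neither days to death nor days to last follow-up; otherwise use days to death with status 1, or days to last follow-up with status 0) can be condensed into one helper. This is a sketch in base R; the function name {{make_survival_table}} and the toy values are hypothetical.

```r
# Hypothetical helper: build follow-up time and event indicator.
make_survival_table <- function(days_to_death, days_to_last_followup) {
  # Patients with neither value carry no survival information
  keep <- !(is.na(days_to_death) & is.na(days_to_last_followup))
  # 1 = death observed, 0 = censored at last follow-up
  status <- as.integer(!is.na(days_to_death))
  time <- ifelse(is.na(days_to_death), days_to_last_followup, days_to_death)
  data.frame(time = time, status = status)[keep, , drop = FALSE]
}

# Toy values only
toy <- make_survival_table(
  days_to_death         = c(120, NA, NA, 365),
  days_to_last_followup = c(NA, 800, NA, 300)
)
print(toy)

# With the survival package loaded, the object used above would then be
# Surv(toy$time, toy$status)
```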


h5. DNA methylation data analysis

The data were downloaded on December 19th: 450k arrays, 190 patients. At that point the data were still delivered the old way: the files were processed through Genome Studio (or Bead Studio) rather than provided in idat format, so the delivered values are already somewhat normalized (my understanding is that this is within-array normalization, not between-array normalization). I combined the values for the methylated and unmethylated probes into the M value. For AML I don't have any technical variables that may affect the data besides the batch effect. For some reason the information on amount, concentration, day-month-year, and plate row and column is completely missing from the clinical_aliquot_public.txt file. Next step: do SVD and check whether batch is correlated with the first principal component.
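
For reference, a minimal sketch of the M value calculation as the log2 ratio of methylated to unmethylated intensity. The offset {{alpha = 1}} and the helper name {{m_value}} are assumptions; these notes do not record which offset, if any, was used.

```r
# Hypothetical helper; the offset alpha is an assumption, not taken from the notes.
m_value <- function(meth, unmeth, alpha = 1) {
  log2((meth + alpha) / (unmeth + alpha))
}

# Toy intensities: probes in rows, samples in columns
ids <- list(c("cg_a", "cg_b"), c("s1", "s2"))
meth   <- matrix(c(1023,  99,  499, 3999), nrow = 2, dimnames = ids)
unmeth <- matrix(c(1023, 399, 1999,  999), nrow = 2, dimnames = ids)

M <- m_value(meth, unmeth)
print(M)
```

An M value of 0 means equal methylated and unmethylated signal; positive values indicate more methylated signal and negative values more unmethylated signal.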

Relative variance before normalization:

!LAML_Mvalue_RelVariance_beforeNorm.png|thumbnail!

P values of the Kruskal-Wallis test for the correlation between the batch and the first 7 principal components:
|| || PC1 || PC2 || PC3 || PC4 || PC5 || PC6 || PC7 ||
| P value | 0.8677 | 0.3977 | 0.5291 | 0.4125 | 0.9755 | 0.5249 | 0.2309 |

Batch is not significantly associated with any of the first 7 principal components (all P values > 0.2), which together explain 42% of the total variance.
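
A sketch of the SVD check described above: relative variance per component, followed by a Kruskal-Wallis test of each component against batch. The matrix here is simulated noise, and centering each probe before the SVD is an assumption about the preprocessing, so the numbers will not match the table.

```r
set.seed(42)
n_probes <- 500
n_samples <- 60
# Toy M values (probes x samples) and a toy batch label per sample
M <- matrix(rnorm(n_probes * n_samples), nrow = n_probes)
batch <- factor(rep(c("0741", "0742", "0743"), each = 20))

# Assumption: center each probe so the SVD describes variation between samples
M_centered <- M - rowMeans(M)
sv <- svd(M_centered)

# Relative variance explained by each component
rel_var <- sv$d^2 / sum(sv$d^2)

# Per-sample scores for component i are in column i of sv$v
n_pc <- 7
p_values <- sapply(seq_len(n_pc), function(i) {
  kruskal.test(x = sv$v[, i], g = batch)$p.value
})
names(p_values) <- paste0("PC", seq_len(n_pc))

print(round(p_values, 4))
print(sum(rel_var[seq_len(n_pc)]))  # fraction of variance in the first 7 components
```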

Outliers of the first principal component:
!LAML_Mval_noNorm_PC1_outliers.png|thumbnail!
Wow.
However, the data won't be corrected for any technical variables.

An eSet object was created for this data set. Clinical traits were included as phenoData, and CpG annotation based on the TCGA IlluminaHumanMethylation450k adf file (which can be downloaded from [here|http://tcga-data.nci.nih.gov/tcga/tcgaPlatformDesign.jsp]) was included as featureData. Strangely, 3103 probes in the data matrix were missing from the adf file, although the adf file contains more probes in total. I may want to try using a Bioconductor package to annotate these arrays.
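
A small sketch of how the probes missing from the annotation can be listed and how the annotation can be aligned to the data matrix. The probe IDs and the column names {{ID}} and {{CHR}} are made up for illustration and are not taken from the adf file.

```r
# Toy stand-ins: row names of the data matrix and an annotation table
probe_ids <- c("cg0001", "cg0002", "cg0003", "ch.1.1")
adf <- data.frame(ID  = c("cg0001", "cg0003", "cg0004", "cg0005", "rs0001"),
                  CHR = c("1", "2", "2", "X", NA),
                  stringsAsFactors = FALSE)

# Probes present in the data but absent from the annotation
missing_probes <- setdiff(probe_ids, adf$ID)
print(length(missing_probes))

# Annotation in data-matrix order; unannotated probes become rows of NA
feature_data <- adf[match(probe_ids, adf$ID), ]
rownames(feature_data) <- probe_ids
print(feature_data)
```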