Normalization of cancer DNA methylation data (27k array platform)

Downloaded from TCGA: level 1, public, tumor matched and unmatched. No normal samples. Number of patients: 168. One patient, TCGA-A6-2670 (TCGA-A6-2670-01A-02D-0820-05), had no clinical information. Another patient, TCGA-AA-3970, seems to have been processed twice in 2 different batches (TCGA-AA-3970-01A-01D-1020-05; TCGA-AA-3970-01A-01D-1110-05); I removed TCGA-AA-3970-01A-01D-1020-05. Ended up with 166 patients for normalization. Put everything in an ExpressionSet object, together with the clinical information and the processing batch (obtained from the SDRF file).
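A minimal sketch of how repeated patients could be spotted from the aliquot barcodes, assuming the usual TCGA barcode layout (first 12 characters identify the patient, the sixth dash-separated field is the plate). The variable names are illustrative only, and the plate field is just a hint: the batch used in these notes came from the SDRF file.

```r
# Aliquot barcodes quoted in the notes above
barcodes <- c("TCGA-A6-2670-01A-02D-0820-05",
              "TCGA-AA-3970-01A-01D-1020-05",
              "TCGA-AA-3970-01A-01D-1110-05")

# Patient ID = first 12 characters; plate = 6th dash-separated field
patient <- substr(barcodes, 1, 12)
plate   <- sapply(strsplit(barcodes, "-"), `[`, 6)

# Patients that appear on more than one aliquot
dup_patients <- unique(patient[duplicated(patient)])
print(data.frame(barcode = barcodes, patient = patient, plate = plate))
print(dup_patients)

# Drop the aliquots excluded in the notes (no clinical data, duplicate run)
excluded <- c("TCGA-A6-2670-01A-02D-0820-05",
              "TCGA-AA-3970-01A-01D-1020-05")
kept <- barcodes[!(barcodes %in% excluded)]
print(kept)
```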

Goal: normalize the data and create 2 datasets: one with the batch effect removed, and one with the batch, age, and gender effects removed.

Unnormalized data, correlation with the PCs:

PC    Processing Batch    Lymphatic Invasion    Tumor Stage
1     2.2e-16             0.00138               0.6281
2     0.0001087           0.02128               0.0227
3     0.06903             0.1802                0.2403
4     0.06703             0.805                 0.4586

Since the biological variables are not highly correlated with the first few principal components, I will build 2 model matrices: one with only batch, and one with batch, age, and gender. These give two normalized datasets.
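The notes do not record how the PC associations in the table were computed. One way such values could be obtained is sketched below on simulated data: take the SVD of the probe-centered matrix and test each of the first four PCs against a covariate with a one-way ANOVA. The data, the covariates, and the choice of test are all assumptions for illustration, not the procedure used here.

```r
set.seed(1)
n_probes  <- 200
n_samples <- 40

# Simulated covariates (hypothetical stand-ins for batch and tumor stage)
batch <- factor(rep(c("b1", "b2"), each = n_samples / 2))
stage <- factor(sample(c("I", "II", "III"), n_samples, replace = TRUE))

# Simulated probes x samples matrix with a strong batch shift in half the probes
meth <- matrix(rnorm(n_probes * n_samples), n_probes, n_samples)
meth[1:100, batch == "b2"] <- meth[1:100, batch == "b2"] + 2

# Center each probe, then take the SVD; columns of sv$v are the sample-space PCs
centered <- meth - rowMeans(meth)
sv <- svd(centered)

# p-value of a one-way ANOVA of one PC against one covariate
pc_pvalue <- function(pc, covariate) {
  fit <- lm(pc ~ covariate)
  anova(fit)[["Pr(>F)"]][1]
}

pvals <- sapply(1:4, function(k) {
  c(batch = pc_pvalue(sv$v[, k], batch),
    stage = pc_pvalue(sv$v[, k], stage))
})
colnames(pvals) <- paste0("PC", 1:4)
print(signif(t(pvals), 3))
```

In this simulation the batch shift dominates PC1, which is the same pattern as the very small Processing Batch value for PC 1 in the table.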

Relative variance of the raw data:

#Remove the batch only
> X<-model.matrix(~factor(pData(exprs)$batchComb))
> bch<-solve(t(X) %*% X) %*% t(X) %*% t(exprs(exprs))
> resBatch<-exprs(exprs) - t(X %*% bch)
#Remove batch and age/gender
> X<-model.matrix(~factor(pData(exprs)$batchComb)+pData(exprs)$age_at_initial_pathologic_diagnosis+factor(pData(exprs)$gender))
> crap<-solve(t(X) %*% X) %*% t(X) %*% t(exprs(exprs))
> resCrap<-exprs(exprs) - t(X %*% crap)
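The code above solves the normal equations explicitly and keeps the residuals. As a cross-check, the same residuals can be obtained from a QR decomposition of the design matrix, which avoids forming the inverse of t(X) %*% X. The sketch below uses simulated data with made-up covariates; it is an alternative route to the same quantity, not the code that produced the datasets.

```r
set.seed(2)
n_probes  <- 50
n_samples <- 30

# Simulated covariates (hypothetical stand-ins for the pData columns)
batch  <- factor(rep(c("b1", "b2", "b3"), each = 10))
age    <- round(runif(n_samples, 40, 85))
gender <- factor(sample(c("FEMALE", "MALE"), n_samples, replace = TRUE))

# Simulated probes x samples data matrix
Y <- matrix(rnorm(n_probes * n_samples), n_probes, n_samples)

X <- model.matrix(~ batch + age + gender)

# Explicit normal equations, as in the notes
beta       <- solve(t(X) %*% X) %*% t(X) %*% t(Y)
res_normal <- Y - t(X %*% beta)

# QR-based residuals: project t(Y) off the column space of X
res_qr <- t(qr.resid(qr(X), t(Y)))

# The two routes should agree up to rounding error
max_diff <- max(abs(res_normal - res_qr))
print(max_diff)

# Residuals are orthogonal to every column of the design matrix
max_proj <- max(abs(res_qr %*% X))
print(max_proj)
```

Because the model includes an intercept, the residuals are also centered per probe, so no separate centering step is needed before an SVD.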

Relative variance after removing the batch and batch/age/gender:
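The notes do not show how the relative variance or the objects u1 and u2 used below were computed. A plausible reading is that they come from an SVD of the residual matrices, with the relative variance of each component given by its squared singular value divided by the sum of squares. A minimal sketch under that assumption, with a simulated matrix standing in for resBatch:

```r
set.seed(3)

# Simulated probes x samples matrix, a stand-in for resBatch
res <- matrix(rnorm(100 * 20), 100, 20)

# SVD: columns of sv$v are the sample-space principal components
sv <- svd(res)

# Fraction of total variance carried by each component
rel_var <- sv$d^2 / sum(sv$d^2)
print(round(head(rel_var), 4))

# To visualize: plot(rel_var, type = "b", xlab = "PC", ylab = "Relative variance")
```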

Found a few outliers in both datasets:

#In the dataset where only batch is removed:
> x<-u1$v[,1]
> names(x)<-colnames(resBatch)
> names(boxplot(x,main="Colon Tumor 166 pat, batch removed, PC1")$out)
[1] "TCGA-AA-A00J" "TCGA-AA-A02R" "TCGA-AA-3877" "TCGA-AA-3947"
#In the dataset where age/gender/batch are removed
> x<-u2$v[,1]
> names(x)<-colnames(resCrap)
> names(boxplot(x)$out)
[1] "TCGA-AA-3556" "TCGA-AA-A00J" "TCGA-AA-A02R" "TCGA-AA-3947"

Visual representation of the outliers, box plot of the first PC of the normalized data:

Result: created 2 ExpressionSets with the normalized data.