Normalization of cancer DNA methylation data (27k array platform)

Downloaded from TCGA: level 1, public, tumor matched and unmatched. No normal samples. Number of patients: 168. One patient had no clinical information: TCGA-A6-2670 (TCGA-A6-2670-01A-02D-0820-05). Another patient seems to have been processed twice, in 2 different batches: TCGA-AA-3970 (TCGA-AA-3970-01A-01D-1110-05; TCGA-AA-3970-01A-01D-1020-05). Removed this one: TCGA-AA-3970-01A-01D-1020-05. Ended up with 166 patients for normalization. Put everything in an ExpressionSet object: data + clinical information + processing batch (obtained from the SDRF file).

Goal: normalize the data and create 2 datasets: one with batch removed, the other with batch, age and gender removed.

Unnormalized data, association of each variable with the first four PCs (p-values):

PC	Processing Batch	Lymphatic Invasion	Tumor Stage
1	2.2e-16	0.00138	0.6281
2	0.0001087	0.02128	0.0227
3	0.06903	0.1802	0.2403
4	0.06703	0.805	0.4586
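
A minimal sketch of how a table like this might be produced, on simulated data. The test behind the values above is not stated in these notes; the Kruskal-Wallis test used here is an assumption, and `dat`, `batch` and `stage` are hypothetical stand-ins for the real matrix and clinical columns.

```r
# Simulated stand-in for the methylation matrix: 200 probes x 30 samples.
set.seed(1)
n_probes <- 200
n_samples <- 30
batch <- factor(rep(c("b1", "b2", "b3"), each = 10))   # hypothetical batches
stage <- factor(rep(c("I", "II", "III"), times = 10))  # hypothetical stages, balanced across batches
dat <- matrix(rnorm(n_probes * n_samples), n_probes, n_samples)
# Shift every probe by a batch-specific amount so that batch drives PC1.
dat <- sweep(dat, 2, c(0, 1, 2)[as.integer(batch)], "+")

# Center each probe (row), then take the SVD; columns of v are the sample-level PCs.
centered <- dat - rowMeans(dat)
sv <- svd(centered)

# Assumed test: Kruskal-Wallis of one PC against one categorical covariate.
pc_pvalue <- function(pc, covariate) kruskal.test(pc, covariate)$p.value

# One row per PC, one column per covariate.
pvals <- t(sapply(1:4, function(k) {
  c(batch = pc_pvalue(sv$v[, k], batch),
    stage = pc_pvalue(sv$v[, k], stage))
}))
rownames(pvals) <- paste0("PC", 1:4)
print(signif(pvals, 3))
```

For a continuous covariate such as age, a correlation test (`cor.test`) would be the natural replacement for `kruskal.test`.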

Since the biological variables are not highly correlated with the first few principal components, I will build 2 model matrices, one with batch only and the other with batch, age and gender, and create two normalized datasets.

Relative variance of the raw data:

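#Remove the batch only: regress every probe on batch and keep the residuals
> X<-model.matrix(~factor(pData(exprs)$batchComb))
#OLS coefficients via the normal equations, one column per probe
> bch<-solve(t(X) %*% X) %*% t(X) %*% t(exprs(exprs))
#Residuals, transposed back to probes x samples
> resBatch<-exprs(exprs) - t(X %*% bch)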


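#Remove batch and age/gender: same regression with age and gender added to the design
> X<-model.matrix(~factor(pData(exprs)$batchComb)+pData(exprs)$age_at_initial_pathologic_diagnosis+factor(pData(exprs)$gender))
#OLS coefficients via the normal equations, one column per probe
> crap<-solve(t(X) %*% X) %*% t(X) %*% t(exprs(exprs))
#Residuals, transposed back to probes x samples
> resCrap<-exprs(exprs) - t(X %*% crap)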
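
The residual computation above is ordinary least squares written out through the normal equations. A small sketch on simulated data, checking that it agrees with the QR-based `lm.fit`, which is usually the numerically safer route. The variables `dat`, `batch`, `age` and `gender` are hypothetical stand-ins, not the real data.

```r
# Simulated stand-in data: 50 probes x 24 samples.
set.seed(2)
n_probes <- 50
n_samples <- 24
batch  <- factor(rep(c("b1", "b2", "b3"), each = 8))      # hypothetical batches
age    <- round(runif(n_samples, 40, 85))                 # hypothetical ages
gender <- factor(rep(c("FEMALE", "MALE"), times = 12))    # hypothetical genders
dat <- matrix(rnorm(n_probes * n_samples), n_probes, n_samples)

X <- model.matrix(~ batch + age + gender)

# Normal-equation form, as in the notes: beta = (X'X)^-1 X'Y, residuals = Y - X beta.
beta <- solve(t(X) %*% X) %*% t(X) %*% t(dat)
res_normal <- dat - t(X %*% beta)

# Same fit through lm.fit, which uses a QR decomposition instead of inverting X'X.
res_qr <- t(lm.fit(X, t(dat))$residuals)

# The two sets of residuals agree up to floating-point error.
print(all.equal(res_normal, res_qr, check.attributes = FALSE))
```

Because the design matrix contains an intercept, the residuals of every probe average to zero, so the per-probe mean level is removed along with the batch effect.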

Relative variance after removing the batch and batch/age/gender:
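
A minimal sketch of how relative variance might be computed, assuming it means the share of total variance carried by each PC (squared singular values divided by their sum). The data are simulated and `rel_var` is a hypothetical helper, not a function from these notes.

```r
# Simulated stand-in data with a strong batch shift: 100 probes x 20 samples.
set.seed(3)
batch <- factor(rep(c("b1", "b2"), each = 10))   # hypothetical batches
dat <- matrix(rnorm(100 * 20), 100, 20)
dat <- sweep(dat, 2, c(0, 2)[as.integer(batch)], "+")

# Share of total variance per PC: d^2 / sum(d^2) on the row-centered matrix.
rel_var <- function(m) {
  d <- svd(m - rowMeans(m))$d
  d^2 / sum(d^2)
}

# Remove batch by regression and compare the variance profile before and after.
X <- model.matrix(~ batch)
res <- t(lm.fit(X, t(dat))$residuals)

rv_raw <- rel_var(dat)
rv_res <- rel_var(res)
print(round(rbind(raw = rv_raw[1:5], batch_removed = rv_res[1:5]), 3))
```

In this simulation the first PC dominates the raw data and drops sharply once batch is regressed out, which is the pattern a successful batch removal would be expected to show.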

Found a few outliers in both datasets:

#In the dataset where only batch is removed:
> x<-u1$v[,1]
> names(x)<-colnames(resBatch)
> names(boxplot(x,main="Colon Tumor 166 pat, batch removed, PC1")$out)
[1] "TCGA-AA-A00J" "TCGA-AA-A02R" "TCGA-AA-3877" "TCGA-AA-3947"
#In the dataset where age/gender/batch are removed
> x<-u2$v[,1]
> names(x)<-colnames(resCrap)
> names(boxplot(x)$out)
[1] "TCGA-AA-3556" "TCGA-AA-A00J" "TCGA-AA-A02R" "TCGA-AA-3947"
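
The objects `u1` and `u2` are not defined in these notes; a sketch under the assumption that they are SVDs of the residual matrices. It uses `boxplot.stats`, which applies the same 1.5 x IQR rule as `boxplot` without drawing a plot. The data and sample IDs are simulated.

```r
# Simulated residual matrix: 300 probes x 40 samples, with hypothetical sample IDs.
set.seed(4)
n_probes <- 300
n_samples <- 40
res <- matrix(rnorm(n_probes * n_samples), n_probes, n_samples)
colnames(res) <- sprintf("SAMPLE-%02d", seq_len(n_samples))

# Plant one aberrant sample with much larger variance than the rest.
res[, 7] <- res[, 7] + rnorm(n_probes, sd = 4)

# Assumed definition of u1/u2 in the notes: SVD of the row-centered residuals.
u <- svd(res - rowMeans(res))

# First PC across samples, labeled by sample ID. The sign of a PC is arbitrary.
pc1 <- u$v[, 1]
names(pc1) <- colnames(res)

# Samples beyond 1.5 x IQR from the quartiles, the same rule boxplot() uses.
out_ids <- names(boxplot.stats(pc1)$out)
print(out_ids)
```

Other samples can occasionally be flagged by chance, so the list is best read as candidates to inspect rather than samples to drop automatically.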

Visual representation of the outliers, box plot of the first PC of the normalized data:

Result: created 2 ExpressionSets with the normalized data.