Normalization of cancer DNA methylation data (27k array platform)
Download from TCGA: level 1, public, tumor matched and unmatched; no normal samples. Number of patients: 168. One patient had no clinical information: TCGA-A6-2670 (TCGA-A6-2670-01A-02D-0820-05). Another patient seems to have been processed twice, in 2 different batches: TCGA-AA-3970 (TCGA-AA-3970-01A-01D-1110-05; TCGA-AA-3970-01A-01D-1020-05). Removed this one: TCGA-AA-3970-01A-01D-1020-05. Ended up with 166 patients for normalization. Put everything in an ExpressionSet object, together with the clinical information and the processing batch (obtained from the SDRF file).
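The sample filtering above can be sketched as follows. This is only an illustration, not the code used for this analysis: the function names `patient_id` and `filter_aliquots` are hypothetical, and the sketch relies on the patient ID being the first three dash-separated fields of a TCGA aliquot barcode.

```python
def patient_id(barcode: str) -> str:
    """Return the patient part of a TCGA aliquot barcode (first three fields)."""
    return "-".join(barcode.split("-")[:3])


def filter_aliquots(barcodes, no_clinical, drop_aliquots):
    """Drop patients without clinical data and explicitly excluded aliquots,
    then check that every remaining patient appears exactly once."""
    kept = [
        b for b in barcodes
        if patient_id(b) not in no_clinical and b not in drop_aliquots
    ]
    patients = [patient_id(b) for b in kept]
    duplicated = sorted({p for p in patients if patients.count(p) > 1})
    if duplicated:
        raise ValueError(f"patients with more than one aliquot: {duplicated}")
    return kept


# Only the three barcodes mentioned in these notes are used here.
barcodes = [
    "TCGA-A6-2670-01A-02D-0820-05",
    "TCGA-AA-3970-01A-01D-1110-05",
    "TCGA-AA-3970-01A-01D-1020-05",
]
kept = filter_aliquots(
    barcodes,
    no_clinical={"TCGA-A6-2670"},
    drop_aliquots={"TCGA-AA-3970-01A-01D-1020-05"},
)
```

Raising an error on leftover duplicates, rather than silently keeping the first aliquot, makes the choice of which aliquot to drop explicit.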
Goal: normalize the data and create 2 datasets: one with the batch effect removed, and one with the batch, age, and gender effects removed.
Unnormalized data, correlation of each variable with the first four PCs (p-values):
| PC | Processing Batch | Lymphatic Invasion | Tumor Stage |
|---|---|---|---|
| 1 | 2.2e-16 | 0.00138 | 0.6281 |
| 2 | 0.0001087 | 0.02128 | 0.0227 |
| 3 | 0.06903 | 0.1802 | 0.2403 |
| 4 | 0.06703 | 0.805 | 0.4586 |
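A check of this kind could look roughly like the sketch below. It is not the test used to produce the table above (these notes do not say which test that was); it assumes a samples x probes matrix, computes PC scores by SVD, and uses a permutation test on the fraction of variance explained by the group labels. All function names are hypothetical.

```python
import numpy as np


def pc_scores(data, n_pcs=4):
    """Scores of the first n_pcs principal components.

    data: samples x probes matrix (assumed layout).
    """
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def variance_explained_by_groups(values, groups):
    """Fraction of the variance of `values` explained by group means."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    total = np.sum((values - values.mean()) ** 2)
    between = sum(
        np.sum(groups == g) * (values[groups == g].mean() - values.mean()) ** 2
        for g in np.unique(groups)
    )
    return between / total


def permutation_pvalue(values, groups, n_perm=999, seed=0):
    """Permutation p-value for the association between values and labels."""
    rng = np.random.default_rng(seed)
    observed = variance_explained_by_groups(values, groups)
    hits = sum(
        variance_explained_by_groups(values, rng.permutation(groups)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(pc_scores(data)[:, 0], batch)` would give one cell of such a table. The smallest p-value this test can report is 1 / (n_perm + 1), so very strong effects need many permutations or a parametric test instead.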
Since the biological variables are not highly correlated with the first few principal components (unlike processing batch, which dominates PC1 and PC2), I will build 2 model matrices, one with batch only and one with batch, age, and gender, and create two normalized datasets from them.
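To make the two model matrices concrete, here is a minimal sketch that removes covariate effects by ordinary least squares on each probe. This is an assumption for illustration; these notes do not state which adjustment method was actually applied. The names `one_hot` and `remove_covariates`, and the samples x probes layout, are hypothetical.

```python
import numpy as np


def one_hot(labels):
    """Indicator columns for a categorical variable, dropping the first level.

    Assumes the variable has at least two levels.
    """
    labels = np.asarray(labels)
    levels = np.unique(labels)
    return np.column_stack([(labels == lv).astype(float) for lv in levels[1:]])


def remove_covariates(data, covariates):
    """Subtract the fitted covariate effects from each probe.

    data: samples x probes; covariates: samples x k columns (no intercept).
    Covariates are centered so that each probe keeps its overall mean.
    """
    covariates = np.asarray(covariates, dtype=float)
    centered = covariates - covariates.mean(axis=0)
    design = np.column_stack([np.ones(len(data)), centered])
    coef, *_ = np.linalg.lstsq(design, data, rcond=None)
    return data - centered @ coef[1:]


def build_designs(batch, age, gender):
    """Return the two covariate matrices: batch only, and batch + age + gender."""
    batch_only = one_hot(batch)
    batch_age_gender = np.column_stack([one_hot(batch), age, one_hot(gender)])
    return batch_only, batch_age_gender
```

Centering the covariates before the fit is a design choice: the adjusted values stay on the original scale of each probe instead of being shifted to the level of a reference batch.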
Relative variance of the raw data:
Relative variance after removing the batch and batch/age/gender:
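For reference, one way to compute such a relative-variance profile is sketched below. It assumes that "relative variance" means the share of total variance carried by each principal component, and that the data matrix is samples x probes; the function name is hypothetical.

```python
import numpy as np


def relative_variance(data):
    """Share of the total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    # Squared singular values are proportional to the per-component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()
```

Applying the same function to the raw matrix and to each adjusted matrix makes the profiles directly comparable.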
Found a few outliers in both datasets:
Visual representation of the outliers, box plot of the first PC of the normalized data:
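The points a box plot draws as outliers can also be listed programmatically. The sketch below uses the usual box-plot rule (more than 1.5 interquartile ranges beyond the quartiles); whether this is the criterion used for the outliers mentioned above is an assumption, and the function name is hypothetical.

```python
import numpy as np


def boxplot_outliers(values, whisker=1.5):
    """Indices of values beyond whisker * IQR from the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return np.flatnonzero((values < low) | (values > high))
```

Passing the first PC scores of a normalized dataset returns the positions of the samples to inspect.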
Result: created 2 ExpressionSet objects with the normalized data.