DNA methylation and gene expression analysis for RAS AZ

DNA methylation and gene expression analysis for RAS AZ

Summary and and analysis plan

The goal is to curate DNA methylation data (matched to gene expression), identify differentially methylated regions (come up or research a method) and identify cis- and trans- correlations with gene expression.

  1. Justin shared his normalized gene expression eSet with me (Agilent, corrected for batch, age and gender)

  2. Obtain raw DNA methylation data for the same number of patients (27k arrays)

  3. Normalize it: remove batch, remove outliers

  4. Identify differentially methylated CpGs (F-test or as a mixture of 3 distributions (Need to find the paper))

  5. Do all pairwise correlation with gene expression. Calculated FDR separately for cis and trans associations

Data preprocessing

The data was downloaded to one folder, total of 255 samples. Both READ and COAD have the same file format for the Level 1 data: methylated intensities in the first column, unmethylated intensities are in the forth column. Use MakesMethyLumiRobject.R to process the files and create a single matrix. Tumor samples (based on tCGA barcode) are 224, normal samples are 31. Data distribution by sample type (normal/tumor) and by BCR batch:

It is clear that BCR batch has a strong effect of the data intensity. Do SVD:

Looks like both batch and the sample type are highly correlated with the PC1: 

> kruskal.test(methU$v[,1],as.factor(plotColor)) #plotColor is the vector I used to label tumor and normal samples for the data distribution Kruskal-Wallis rank sum test data: methU$v[, 1] and as.factor(plotColor) Kruskal-Wallis chi-squared = 43.0382, df = 1, p-value = 5.368e-11 > kruskal.test(methU$v[,1],as.factor(batch)) Kruskal-Wallis rank sum test data: methU$v[, 1] and as.factor(batch) Kruskal-Wallis chi-squared = 193.7021, df = 9, p-value < 2.2e-16 > table(batch,plotColor) # relationship between normal samples and the batch       plotColor batch  blue red   0820    0  46   0825    0  17   0904    0  39   1020    4  48   1110    0   6   1116    1  18   1551   17   0   1552    9   0   A004    0  24   A00B    0  26

Also, it looks like normal samples have one outlier. I think the procedure that should be applied here is snm because we definitely want to retain normal/tumor and remove batch. In addition it seems that BCR batch and the sample type are correlated p value= 2.2e-16 (performed chi-square test, although p values are not exact since for two batches I have only 4 and 1 samples). Next I need to investigate to see how the processing batch affects the data (processing batch information will be obtained from .sdrf files). 

Relationship for the batches based on TCGA download page for READ:


Relationship between BCR batches and TCGA Archive name for READ:

 

> table(read$Comment..TCGA.Archive.Name.,read$readBatch) 0820 0825 0904 1116 1552 A004 A00B 1.2.0 17 0 0 0 4 0 0 2.1.0 0 11 0 0 0 0 0 3.1.0 0 0 0 0 0 10 0 4.1.0 0 0 0 0 0 0 10 5.1.0 0 0 9 0 0 0 0 6.0.0 0 0 0 20 0 0 0

TCGA archive name doesn't correspond exactly to the Batch numbers on the download page. The batch numbers on the download page (Data Access Matrix) seem to correspond to the sample batches sent out by BCR centers.

 


And here is a table again that shows relationship between TCGA Archive name and BCR batches:

 

> table(coad$Comment..TCGA.Archive.Name.,coad$coadBatch) 0820 0825 0904 1020 1110 1551 1552 A004 A00B A081 1.1.0 32 0 0 0 0 12 5 0 0 0 2.2.0 0 8 0 0 0 5 0 0 0 0 3.0.0 0 0 0 0 0 0 0 16 0 0 4.0.0 0 0 0 0 0 0 0 0 18 0 5.2.0 0 0 32 0 0 0 0 0 0 0 6.0.0 0 0 0 53 0 0 0 0 0 0 7.0.0 0 0 0 0 7 0 0 0 0 0 8.0.0 0 0 0 0 0 0 0 0 0 25

Use TCGA Archive Name as a surrogate for the batch effect in this colon dataset. Next: normalize the data and identify differentially methylated loci. In addition, TCGA Archive Names are unique and specific to different types of cancer. 

 Normalization and identification of differentially methylated loci

First, extract TCGA Archive name from sdrf files for READ and COAD and recode them and them merge the sdrf files together. A-F are from READ, G-M are from COAD. Correlation between BCR batches and the processing batches (=TCGA Archive Names):

> table(colonM[,5]) A B C D E F G H I J K L M 20 10 9 9 8 19 47 12 15 17 31 52 6 > table(colonM$BCRbatch,colonM$ArchiveBatchCode) A B C D E F G H I J K L M 0820 16 0 0 0 0 0 30 0 0 0 0 0 0 0825 0 10 0 0 0 0 0 7 0 0 0 0 0 0904 0 0 0 0 8 0 0 0 0 0 31 0 0 1116 0 0 0 0 0 19 0 0 0 0 0 0 0 1552 4 0 0 0 0 0 5 0 0 0 0 0 0 A004 0 0 9 0 0 0 0 0 15 0 0 0 0 A00B 0 0 0 9 0 0 0 0 0 17 0 0 0 1020 0 0 0 0 0 0 0 0 0 0 0 52 0 1110 0 0 0 0 0 0 0 0 0 0 0 0 6 1551 0 0 0 0 0 0 12 5 0 0 0 0 0

Perform SVD on the combined DNA methylation data and look at the correlation with the processing batch:

PC

P value

PC

P value

1

2.2e-16

2

0.0002451

3

1.051e-10

4

0.02574

In addition processing batch is also highly correlated with the sample type (tumor/normal), Chi-square test p value: 4.727e-07. However it looks like in the batch G there is a decent number of normal samples. If the batch effect is very strong I may need to work only with the batch G to identify differentially methylated regions (DMRs). First, I will remove all batches that don't have any normal samples (B,C, D, E, I, J, K, L) and identify effect of the processing batch on the remaining dataset. Distribution of normal and tumor samples among remained batches:

>table(colonNormal$ArchiveBatchCode,samTypeNorm) samTypeNorm normal tumor A 4 16 F 1 18 G 17 30 H 5 7 L 4 48

Plotting PC1 of this dataset by batch showed that the batch F is an outlier.


Also, there remains a high correlation of the batch with the sample type, Chi-square test: 0.002922, Fisher test: 0.001395. Remove the batch F (left with 131 samples)


SVD, PCs correlation with batch:

PC

P value for batch

P value for sample type

PC

P value for batch

P value for sample type

1

0.0007655

1.467e-14

2

2.422e-07

3.261e-08

3

7.927e-09

 

5

0.09505