DNA methylation and gene expression analysis for RAS AZ

Summary and and analysis plan

The goal is to curate DNA methylation data (matched to gene expression), identify differentially methylated regions (come up or research a method) and identify cis- and trans- correlations with gene expression.

Justin shared his normalized gene expression eSet with me (Agilent, corrected for batch, age and gender)
Obtain raw DNA methylation data for the same number of patients (27k arrays)
Normalize it: remove batch, remove outliers
Identify differentially methylated CpGs (F-test or as a mixture of 3 distributions (Need to find the paper))
Do all pairwise correlation with gene expression. Calculated FDR separately for cis and trans associations

Data preprocessing

The data was downloaded to one folder, total of 255 samples. Both READ and COAD have the same file format for the Level 1 data: methylated intensities in the first column, unmethylated intensities are in the forth column. Use MakesMethyLumiRobject.R to process the files and create a single matrix. Tumor samples (based on tCGA barcode) are 224, normal samples are 31. Data distribution by sample type (normal/tumor) and by BCR batch:

It is clear that BCR batch has a strong effect of the data intensity. Do SVD:

Looks like both batch and the sample type are highly correlated with the PC1:

> kruskal.test(methU$v[,1],as.factor(plotColor)) #plotColor is the vector I used to label tumor and normal samples for the data distribution

        Kruskal-Wallis rank sum test

data:  methU$v[, 1] and as.factor(plotColor)
Kruskal-Wallis chi-squared = 43.0382, df = 1, p-value = 5.368e-11

> kruskal.test(methU$v[,1],as.factor(batch))

        Kruskal-Wallis rank sum test

data:  methU$v[, 1] and as.factor(batch)
Kruskal-Wallis chi-squared = 193.7021, df = 9, p-value < 2.2e-16
> table(batch,plotColor) # relationship between normal samples and the batch
      plotColor
batch  blue red
  0820    0  46
  0825    0  17
  0904    0  39
  1020    4  48
  1110    0   6
  1116    1  18
  1551   17   0
  1552    9   0
  A004    0  24
  A00B    0  26

Also, it looks like normal samples have one outlier. I think the procedure that should be applied here is snm because we definitely want to retain normal/tumor and remove batch. In addition it seems that BCR batch and the sample type are correlated p value= 2.2e-16 (performed chi-square test, although p values are not exact since for two batches I have only 4 and 1 samples). Next I need to investigate to see how the processing batch affects the data (processing batch information will be obtained from .sdrf files).

Relationship for the batches based on TCGA download page for READ:

Relationship between BCR batches and TCGA Archive name for READ:

> table(read$Comment..TCGA.Archive.Name.,read$readBatch)

        0820 0825 0904 1116 1552 A004 A00B
  1.2.0   17    0    0    0    4    0    0
  2.1.0    0   11    0    0    0    0    0
  3.1.0    0    0    0    0    0   10    0
  4.1.0    0    0    0    0    0    0   10
  5.1.0    0    0    9    0    0    0    0
  6.0.0    0    0    0   20    0    0    0

TCGA archive name doesn't correspond exactly to the Batch numbers on the download page. The batch numbers on the download page (Data Access Matrix) seem to correspond to the sample batches sent out by BCR centers.

And here is a table again that shows relationship between TCGA Archive name and BCR batches:

> table(coad$Comment..TCGA.Archive.Name.,coad$coadBatch)

        0820 0825 0904 1020 1110 1551 1552 A004 A00B A081
  1.1.0   32    0    0    0    0   12    5    0    0    0
  2.2.0    0    8    0    0    0    5    0    0    0    0
  3.0.0    0    0    0    0    0    0    0   16    0    0
  4.0.0    0    0    0    0    0    0    0    0   18    0
  5.2.0    0    0   32    0    0    0    0    0    0    0
  6.0.0    0    0    0   53    0    0    0    0    0    0
  7.0.0    0    0    0    0    7    0    0    0    0    0
  8.0.0    0    0    0    0    0    0    0    0    0   25

Use TCGA Archive Name as a surrogate for the batch effect in this colon dataset. Next: normalize the data and identify differentially methylated loci. In addition, TCGA Archive Names are unique and specific to different types of cancer.

Normalization and identification of differentially methylated loci

First, extract TCGA Archive name from sdrf files for READ and COAD and recode them and them merge the sdrf files together. A-F are from READ, G-M are from COAD. Correlation between BCR batches and the processing batches (=TCGA Archive Names):

> table(colonM[,5])

 A  B  C  D  E  F  G  H  I  J  K  L  M
20 10  9  9  8 19 47 12 15 17 31 52  6

> table(colonM$BCRbatch,colonM$ArchiveBatchCode)

        A  B  C  D  E  F  G  H  I  J  K  L  M
  0820 16  0  0  0  0  0 30  0  0  0  0  0  0
  0825  0 10  0  0  0  0  0  7  0  0  0  0  0
  0904  0  0  0  0  8  0  0  0  0  0 31  0  0
  1116  0  0  0  0  0 19  0  0  0  0  0  0  0
  1552  4  0  0  0  0  0  5  0  0  0  0  0  0
  A004  0  0  9  0  0  0  0  0 15  0  0  0  0
  A00B  0  0  0  9  0  0  0  0  0 17  0  0  0
  1020  0  0  0  0  0  0  0  0  0  0  0 52  0
  1110  0  0  0  0  0  0  0  0  0  0  0  0  6
  1551  0  0  0  0  0  0 12  5  0  0  0  0  0

Perform SVD on the combined DNA methylation data and look at the correlation with the processing batch:

PC	P value
1	2.2e-16
2	0.0002451
3	1.051e-10
4	0.02574

In addition processing batch is also highly correlated with the sample type (tumor/normal), Chi-square test p value: 4.727e-07. However it looks like in the batch G there is a decent number of normal samples. If the batch effect is very strong I may need to work only with the batch G to identify differentially methylated regions (DMRs). First, I will remove all batches that don't have any normal samples (B,C, D, E, I, J, K, L) and identify effect of the processing batch on the remaining dataset. Distribution of normal and tumor samples among remained batches:

>table(colonNormal$ArchiveBatchCode,samTypeNorm)
   samTypeNorm
    normal tumor
  A      4    16
  F      1    18
  G     17    30
  H      5     7
  L      4    48

Plotting PC1 of this dataset by batch showed that the batch F is an outlier.

Also, there remains a high correlation of the batch with the sample type, Chi-square test: 0.002922, Fisher test: 0.001395. Remove the batch F (left with 131 samples)

SVD, PCs correlation with batch:

PC	P value for batch	P value for sample type
1	0.0007655	1.467e-14
2	2.422e-07	3.261e-08
3	7.927e-09
5	0.09505

Now remove the batch L since it seems to be an outlier. 79 samples remain (batches A (READ), G (COAD) and H (COAD)): 26 normal and 53 tumor. Found no correlation between the processing batch and the tumor type, Fisher test P value: 0.3092.

PC	P value for batch	P value for sample type
1	0.194	6.549e-13
2	0.0004891	0.715
3	0.7124	0.8104

We want to remove age, gender and batch from the data. Age and gender summaries:

> summary(clinM[,1]) #age
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  41.00   61.50   71.00   69.96   81.00   90.00

#Summary of the gender distribution of the AGH patients:

> table(clinM[,15])

FEMALE   MALE
    40     39

Removing the batch:

> X<-model.matrix(~ colonAGH$ArchiveBatchCode+clinM[,15]+clinM[,1])
> Xall<-solve(t(X) %*% X) %*% t(X) %*% t(mvalAGH)
> resMvalAGH <- mvalAGH-t(X %*% Xall)

Data distribution before and after normalization:

It is interesting to see what removing batch completely removed the effect of bimodality. Brig suggested to put back the intercept term in the data.

Identification of differentially methylated loci. Two methods exist both proposed by the creators of the package "lumi". One is to use F test (which is probably appropriate in this case) or to apply Gamma mixture model (which would work if I had a bimodal distribution). I ran F test and identified ~20000 loci with P value <=0.05. It seems that a lot of them changes their methylation status. Here is an example of 4 and their methylation levels in tumor vs normal samples. It seems that the changes are significant. Looking closer at the M values (y axis) on the graph I can see that some loci changed their methylation from moderately methylated to completely unmethylated and from completely unmethylated to somewhat methylated. I wonder how many of they actually flipped the sign. We might split the sites by the magnitude of changes and work with them separately.

Also, before we agree of the list of DMRs I will look only at the batch G (has the biggest number of normal and tumor samples) and without removing batch but removing age and gender look at the number of DMRs within that batch. Some standard analyses. Notice that in this case I didn't remove the batch (duh!) and although the second mode of the distribution almost disappeared after removed age and gender you can see that some tumor sample have something that looks like a shoulder if not a mode.

F test on the obtained data between normal and tumor samples:

> summary(bgFt)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0000000 0.0001161 0.0132400 0.1616000 0.2153000 1.0000000
> table(bgFt<=0.05)

FALSE  TRUE
10930 16648

So there are about 16k CpGs which methylation status is significantly different between normal and tumor samples.

In this 2011 review it was mentioned that "Based on a number of relatively large case-control and prospective cohort studies, ~30-40% of sporadic proximal-site colon cancers are CIMP positive, compared to 3-12% of distal colon and rectal cancers". This made me think whether it was a good idea to combine rectal and colon cancer together for the analysis of DNA methylation data. I took batch A which is entirely composed of the rectal samples. It has only 20 patients, 4 normal and 16 tumor. I corrected the data for gender and age and did F test to identify differentially methylated CpGs. Here is the plot of the corrected data and also the histogram of the p values obtained as the result of the test:

Number of CpG which p value is <=0.001: 222. This is significantly different from the number of DMRs obtained for the batch G (colon cancer): 9584. I think it will be appropriate to present a thorough analysis of both datasets normalized separately.

I have also been contemplating about how we would figure out a cutoff for differentially methylated regions after removing the adjustment variables. The problem is that I used linear regression and the residuals seems to be almost normally distributed (to verify that I need to show a quantile plot). In additional, I loose the second mode in the distribution. One idea is to convert the M value residuals to Beta values only for the purpose of figuring out the cutoff. Here is the formula: beta = 2^M/(2^M+1). Although this transformed data might not be appropriate for modeling (severe heteroscedasticity in the lower and upper range. Actually, is it?) but the range of the data will now have a biological meaning: 0% methylation, 50% methylation, 100% methylation. I transformed rectal cancer data used above to calculated DMRs to beta value. Here are the distributions for each sample:

(Color has no meaning) It remarkable to see that samples are very different. On the graph I can see a few samples that are significantly skewed toward the higher methylation level. And yet, when I do differential methylation there are only ~200 that are significantly methylated between normal and tumor samples. I wonder if this graph is a good demonstration of the statement made in many publications that overall amount of DNA methylation is different among different samples. Can I apply F test to beta transformed value? It will be interesting to look at DMR and their respective beta values in normal and tumor samples. It is worth doing either some simple cluster analysis of PCA plots to verify that normal and tumor in rectal cancer separate just as well as it was for colon cancer.