Normalization

There were several ways in which I could normalize the data:

First: Use Level 2 data from TCGA (normalized by TCGA in some way). No description of their methods was available when I started the analyses; presumably it was done the same way as for the glioblastoma paper. TCGA's big paper on ovarian cancer just came out, and I think it explains how they did it, but I haven't read it yet. TCGA also has an advantage: they hold all the Illumina control files from the arrays (hybridization, labeling, DNA concentration), which they did not upload to the central repository. Their clinical files either don't include this technical information or present it ambiguously. However, their paper definitely contains useful information, such as the exclusion of stage I patients because they might be very different from other OC patients.

Second: Use the normalization approaches implemented in the Bioconductor lumi package. It offers quantile normalization for color balance adjustment and for normalization between patients, forcing all samples to have "the same" distribution. Here are a few images from the normalization. I collected Level 1 data, extracted the M (methylated) and U (unmethylated) values, and calculated log2(M/U), the measure of the methylation status of a particular locus proposed by the authors of the package in their paper. They discuss several advantages of this M value over Illumina's conventional beta value; I summarized this information in the first presentation I gave to the project team. The first three pictures below show the three steps of data normalization performed in lumi: raw data (M value; a distinctive bimodal distribution, where the right mode represents methylated CpGs and the left mode represents unmethylated CpGs), color adjusted (Cy5 and Cy3 CpGs are processed together), and quantile normalized. The last image shows two distributions: TCGA-processed Level 2 data (beta value converted to M value) and quantile-normalized Level 1 data.
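To make the quantities above concrete, here is a minimal Python sketch of the M value, the beta-to-M conversion, and a basic quantile normalization. It is not the lumi implementation. The function names are hypothetical, the `offset` argument is my own assumption to guard against zero intensities, and the conversion assumes beta is defined as M/(M+U) without Illumina's additive constant.

```python
import math

def m_value(meth, unmeth, offset=1.0):
    """log2 ratio of methylated to unmethylated intensity.

    `offset` is an assumption added to avoid log2(0) and division by zero;
    with offset=0 this is exactly log2(M/U) as written in the text.
    """
    return math.log2((meth + offset) / (unmeth + offset))

def beta_to_m(beta):
    """Convert a beta value (0 < beta < 1) to an M value (base-2 logit)."""
    return math.log2(beta / (1.0 - beta))

def quantile_normalize(samples):
    """Force every sample to have the same distribution.

    `samples` is a list of equal-length lists, one list per patient.
    Each value is replaced by the mean, across samples, of the values
    holding the same rank. Ties are broken by position (a simplification).
    """
    n = len(samples[0])
    ordered = [sorted(s) for s in samples]
    rank_means = [sum(o[i] for o in ordered) / len(samples) for i in range(n)]
    result = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result
```

After `quantile_normalize`, every sample contains exactly the same set of values, only in a different order, which is what "the same" means here.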

Third: Finally, with Brig's help I performed supervised normalization of the DNA methylation data. First, he suggested focusing only on the methylated probes, because the unmethylated probes, like the mismatch probes on the old Affymetrix arrays, are confusing and don't provide any additional information.

In fact, the unmethylated probes show overall higher intensity than the methylated probes (this was first pointed out by Josh Millstein in our discussion of normalization approaches).

Next, he suggested splitting the data into two parts, Cy5-labeled probes and Cy3-labeled probes, processing them separately, and then combining them for further analyses. This figure is a good explanation of why it makes sense to process red and green probes separately (it shows a plot of all 27k probes against the first eigenvector of the SVD, X = UDV^T). However, I have not adjusted my data for the CpG content of the probes. I wonder: if I do that, will it get rid of this weird data distribution? Something to think about.

One can read my Sweave file (filename) for these analyses to understand step by step what I did and how. In short, I performed PCA on each sub-dataset and identified the technical variables that had the biggest influence on my data. The main one was batch, which was also highly correlated with month and center. After I removed batch, I could still see strange patterns in my data, so I ended up also removing the first principal component. Since I was not sure whether that was correct for my further analyses, I proceeded with two datasets: one from which only batch was removed (called mb) and one from which both batch and the first principal component were removed (called mbc). One more thing: I centered the red and green channels before combining them into the final datasets.
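The two removal steps can be illustrated with a small Python sketch. This is not the code from the Sweave file. It assumes that "removing batch" means subtracting each batch's mean per probe (the residuals of a one-way model with batch as the only factor) and that "removing the first principal component" means subtracting the rank-one term along the leading sample-space direction. The function names and the power-iteration approach are my own choices for illustration.

```python
from collections import defaultdict

def remove_batch(values, batches):
    """Subtract each batch's mean from one probe's values across samples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, b in zip(values, batches):
        sums[b] += v
        counts[b] += 1
    return [v - sums[b] / counts[b] for v, b in zip(values, batches)]

def remove_first_pc(rows, n_iter=100):
    """Remove the leading principal direction from a probes x samples matrix.

    `rows` are assumed to be centered already. The leading right singular
    vector v is found by power iteration on X^T X, then X - (X v) v^T is returned.
    """
    n = len(rows[0])
    # Non-constant start vector: a constant one is orthogonal to centered rows.
    v = [float(j + 1) for j in range(n)]
    for _ in range(n_iter):
        scores = [sum(r[j] * v[j] for j in range(n)) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(n)]  # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # start vector lies in the null space; return a copy unchanged
            return [list(r) for r in rows]
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(n)) for r in rows]
    return [[r[j] - s * v[j] for j in range(n)] for r, s in zip(rows, scores)]
```

With `batches` holding one label per sample, a dataset analogous to mb would be `[remove_batch(row, batches) for row in rows]`, and one analogous to mbc would be `remove_first_pc` applied to that result.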

This is all great, and I agree with Brig's approach because it is very intuitive and makes few assumptions. However, now that I am proceeding with the data analysis, it is critical for me to figure out which probes are actually methylated and which are not, especially because I don't have any control data. How should I approach it? For the M value, methods have been developed for drawing a cutoff (because the data have a distinctive bimodal shape). Should I take the unmethylated probes, process them the same way as the methylated probes, combine the two to make M values, and apply the existing method (described here) to figure out what is methylated and what is not? Also, Bin has built comethylation networks based on the mb and mbc normalizations. Should I rebuild them with a new M value? Something to think about.
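As a purely illustrative stand-in for drawing a cutoff on bimodal M values, the sketch below splits the values with a one-dimensional two-means clustering and returns the midpoint between the two cluster means. This is not the existing method mentioned above, and `two_means_cutoff` is a hypothetical name. It only shows the general idea of placing a threshold between the two modes.

```python
def two_means_cutoff(m_values, max_iter=100):
    """Return a threshold between the two modes of a list of M values.

    Starts with the midpoint of the minimum and maximum, then alternates
    between assigning values to the low or high group and moving the
    cutoff to the midpoint of the two group means.
    """
    cutoff = (min(m_values) + max(m_values)) / 2.0
    for _ in range(max_iter):
        low = [v for v in m_values if v <= cutoff]
        high = [v for v in m_values if v > cutoff]
        if not high:  # all values identical: no second mode to separate
            break
        new_cutoff = (sum(low) / len(low) + sum(high) / len(high)) / 2.0
        if new_cutoff == cutoff:  # assignments are stable
            break
        cutoff = new_cutoff
    return cutoff

# Usage: call a probe methylated when its M value exceeds the cutoff.
values = [-4.0, -3.5, -3.0, 3.0, 3.5, 4.0]
cutoff = two_means_cutoff(values)
calls = [v > cutoff for v in values]
```

A mixture model would handle unequal mode widths better; the two-means split is only the simplest thing that respects the bimodal shape.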

Important to remember: I didn't adjust the data for age or stage/grade. In the comethylation networks we need to see whether there is any association with age/stage/grade (this is the only biology that is available to us). I would also like to see a comethylation network built with only stage III patients, because it is the largest group and I am not sure how much novel information we gain by keeping a few stage I, II and IV outliers.