Normalization

There were several ways in which I could normalize the data:

The first option is to use Level 2 data from TCGA, which TCGA has already normalized in some way. No description of their method was available when I started the analyses; presumably it was done the same way as for the glioblastoma paper. TCGA's big paper on ovarian cancer has just come out, and I think it explains how they did it, but I haven't read it yet. TCGA also has an advantage: they hold all the Illumina control files from the arrays (hybridization, labeling, DNA concentration), which they didn't upload to the central repository. Their clinical files either don't include this technical information or include it ambiguously. Their paper definitely contains useful information, though, such as the decision to exclude stage I patients because they might be very different from other ovarian cancer (OC) patients.
The second option is to use the normalization approaches implemented in the Bioconductor lumi package. It offers quantile normalization both for color balance adjustment and for normalization between patients, which forces all the distributions to be "the same". Here are a few images from the normalization. I collected Level 1 data, extracted the M and U values (methylated and unmethylated intensities), and calculated log2(M/U), the measure of the methylation status of a particular locus that the authors of the package propose in their paper. They discuss several advantages of this M value over Illumina's conventional beta value; I summarized this information in the first presentation I gave to the project team.
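To make the two ideas in this option concrete, here is a minimal Python sketch of the log2(M/U) measure and of quantile normalization between patients. It is not the lumi implementation and not the code I ran. The function names, the small `offset` that guards against zero intensities, and the constant 100 in the beta value (Illumina's usual convention, as far as I know) are assumptions made for illustration.

```python
import math

def m_value(meth, unmeth, offset=1.0):
    """log2 ratio of methylated (M) to unmethylated (U) intensity.

    `offset` is an assumed small constant that avoids log(0) and division by zero.
    """
    return math.log2((max(meth, 0.0) + offset) / (max(unmeth, 0.0) + offset))

def beta_value(meth, unmeth, offset=100.0):
    """Conventional beta value: methylated fraction of the total intensity.

    The default offset of 100 is an assumption based on Illumina's usual convention.
    """
    m, u = max(meth, 0.0), max(unmeth, 0.0)
    return m / (m + u + offset)

def quantile_normalize(samples):
    """Force every sample (one list of values per patient) onto one distribution.

    The k-th smallest value of each sample is replaced by the mean of the
    k-th smallest values across all samples. Ties are broken by position.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Toy example: two "patients" with three loci each.
normalized = quantile_normalize([[1.0, 2.0, 3.0], [4.0, 6.0, 5.0]])
```

After normalization both toy samples contain exactly the same set of values, only in a different order, which is what forcing everything to be "the same" means in practice.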

Finally, with Brig's help I performed a supervised normalization of the DNA methylation data. First, he suggested focusing only on the methylated probes, because the unmethylated probes, like the mismatch probes on old Affymetrix arrays, are confusing and don't provide any additional information. Next, he suggested splitting the data into two parts, Cy5-labeled probes and Cy3-labeled probes, processing them separately, and then combining them for further analyses. My Sweave file (filename) for these analyses shows step by step what I did and how. In short, I performed PCA on each sub-dataset and identified the technical variables that had the biggest influence on my data. The biggest was batch, which was also highly correlated with month and center. After I removed batch, I could still see strange patterns in my data, so I ended up also removing the first principal component. Since I was not sure whether that was correct for my further analyses, I proceeded with two datasets: one from which only batch was removed (called mb) and one from which both batch and the first principal component were removed (called mbc).
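The difference between mb and mbc can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the Sweave file. It assumes that "removing batch" means subtracting each probe's per-batch mean, and that "removing the first principal component" means projecting it out of the batch-corrected data. The function names and the toy numbers are hypothetical.

```python
import math
import random

def remove_batch(data, batches):
    """Subtract each probe's per-batch mean (assumed meaning of 'removing batch').

    data: list of samples, each a list of probe values; batches: one label per sample.
    """
    n_probes = len(data[0])
    sums, counts = {}, {}
    for row, b in zip(data, batches):
        acc = sums.setdefault(b, [0.0] * n_probes)
        for j, v in enumerate(row):
            acc[j] += v
        counts[b] = counts.get(b, 0) + 1
    return [[v - sums[b][j] / counts[b] for j, v in enumerate(row)]
            for row, b in zip(data, batches)]

def first_pc(data, iters=200, seed=0):
    """Unit vector of the first principal axis, found by power iteration.

    Assumes every probe column is already centered (true after remove_batch).
    """
    n_probes = len(data[0])
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n_probes)]
    norm = math.sqrt(sum(w * w for w in v))
    v = [w / norm for w in v]
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in data]  # X v
        new_v = [sum(s * row[j] for s, row in zip(scores, data))       # X^T (X v)
                 for j in range(n_probes)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:  # no variance left in the data
            break
        v = [w / norm for w in new_v]
    return v

def remove_first_pc(data):
    """Project the first principal component out of every sample."""
    v = first_pc(data)
    out = []
    for row in data:
        score = sum(x * w for x, w in zip(row, v))
        out.append([x - score * w for x, w in zip(row, v)])
    return out

# Toy data: four samples, two probes, two batches.
data = [[1.0, 2.0], [3.0, 6.0], [10.0, 11.0], [12.0, 15.0]]
batches = ["A", "A", "B", "B"]
mb = remove_batch(data, batches)   # batch removed
mbc = remove_first_pc(mb)          # batch and first principal component removed
```

In this toy case all the variation left after batch removal lies along a single direction, so mbc is essentially zero. On real data the first principal component may carry biological signal as well as technical noise, which is why I kept both mb and mbc.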