Quote from the vignette for the minfi package:
"The 450k array has a complicated design. What follows is a quick overview. Each sample is measured on a single array, in two different color channels (red and green). Each array measures roughly 450,000 CpG positions. Each CpG is associated with two measurements: a methylated measurement and an un-methylated measurement. These two values can be measured in one of two ways: using a "Type I" design or a "Type II" design". CpGs measured using a Type I design are measured using a single color, with two different probes in the same color channel providing the methylated and the unmethylated measurements. CpGs measured using a Type II design are measured using a single probe, and two different colors provide the methylated and the unmethylated measurements. Practically, this implies that on this array there is not a one-to-one correspondence between probes and CpG positions. We have therefore tried to be precise about this and we refer to a "methylation position" (or "CpG") when we refer to a single-base genomic locus. The previous generation 27k methylation array uses only the Type I design."
Steps outline and decisions:
Split the probes into type I and type II probes because they will be normalized separately as single color and two-color arrays. Demonstrate their differences. Demonstrate batch effect on probe level.
The package has a built in function for calculating the M value but it produced a matrix with NA which is because log2 of negative values was attempted (I think). I extracted methylated and unmethylated probes and calculated the M value as log2((meth+c)/(unmeth+c)) where c is a constant.SVD on the entire matrix and correlation with the technical variables. Outliers look very "special":
- Split type I into four datasets: unmethylated red, methylated red, unmethylated green, methylated green. Remove intensity effects using snm package. Scale the datasets, combine into M value (log2(meth/unmeth))
- Normalized type II probes using snm package and adjusting intensity and color dependent effects. Scale and combine into the M value
- Combine type I and type II probes into a single matrix. Identify technical batches and their effect. Use snm package to retain important biological variables (sample type) and remove technical variables as well as age and gender.