DNA mehylation data normalization

Quote from the vignette for the minfi package:

"The 450k array has a complicated design. What follows is a quick overview. Each sample is measured on a single array, in two different color channels (red and green). Each array measures roughly 450,000 CpG positions. Each CpG is associated with two measurements: a methylated measurement and an un-methylated measurement. These two values can be measured in one of two ways: using a "Type I" design or a "Type II" design". CpGs measured using a Type I design are measured using a single color, with two different probes in the same color channel providing the methylated and the unmethylated measurements. CpGs measured using a Type II design are measured using a single probe, and two different colors provide the methylated and the unmethylated measurements. Practically, this implies that on this array there is not a one-to-one correspondence between probes and CpG positions. We have therefore tried to be precise about this and we refer to a "methylation position" (or "CpG") when we refer to a single-base genomic locus. The previous generation 27k methylation array uses only the Type I design."

Steps outline and decisions:

Split the probes into type I and type II probes because they will be normalized separately as single color and two-color arrays. Demonstrate their differences. Demonstrate batch effect on probe level.
The package has a built in function for calculating the M value but it produced a matrix with NA which is because log2 of negative values was attempted (I think). I extracted methylated and unmethylated probes and calculated the M value as log2((meth+c)/(unmeth+c)) where c is a constant.
Expand source
```
> m<-getMeth(Mset.raw)
> u<-getUnmeth(Mset.raw)
> dim(m)
[1] 485577    247
> dim(u)
[1] 485577    247
> mval<-log2((m+0.01)/(u+0.01))
> x<-apply(mval,2,function(x) which(is.na(x)))
> length(x)
[1] 0 
```
SVD on the entire matrix and correlation with the technical variables. The outliers of PC1 look rather peculiar:
Expand source
```
> techs
    Sample_Well Sample_Plate Sample_Group      Pool_ID        Array        Slide
V1 9.329435e-05 1.591332e-03 7.122791e-25 1.528817e-01 3.744530e-01 1.407002e-02
V2 6.135025e-03 3.229857e-03 1.906136e-04 5.395826e-02 1.004567e-01 8.358575e-03
V3 4.071548e-01 3.717899e-02 8.622844e-07 4.499818e-02 1.489702e-01 2.593915e-01
V4 1.022541e-06 3.755171e-26 1.785824e-01 8.097790e-27 4.772079e-01 9.604426e-24
V5 4.162074e-05 7.830607e-07 9.710665e-04 6.727818e-06 4.763077e-01 1.466018e-03
V6 6.086921e-02 1.004642e-05 1.838061e-04 1.980604e-05 2.080647e-12 7.643723e-06
V7 9.850379e-04 4.815462e-06 6.861262e-02 3.796909e-06 2.863657e-01 1.645806e-03
V8 3.030042e-04 2.025094e-12 3.895650e-07 1.657376e-11 2.965493e-02 2.441389e-07
```
Pool_ID (processing batch) is highly correlated with the 4th principal component which is responsible for only 4% of the total variance. Good news but I do have to split the probes into single and double color. Below are 3 examples of the intensity distribution (M values) for the type I and type II probes for the same patients:
They obviously look very different which was also noted in the paper about the 450k arrays published by Illumina (Figure 3)
Next I was planning on normalizing each type and methylated/unmethylated probes separately and completely (i.e. remove intensity, batch, gender, age and color) however I found out that batch effect is considerably stronger on the probe level. Type I unmethylated probes:
Expand source
```
#Unmethylated type I, correlation of PCs with the technical variables:
   Sample_Well Sample_Plate Sample_Group      Pool_ID        Array       Slide
V1 2.470072e-02 2.225795e-14 2.623117e-01 7.263874e-14 3.330435e-16 3.515287e-13
V2 5.636983e-04 2.986398e-05 2.993088e-23 2.297371e-02 8.221251e-01 1.164875e-03
V3 3.916829e-02 2.033036e-11 1.125417e-02 5.979067e-12 6.966765e-07 9.014080e-12
V4 2.478553e-06 2.533717e-07 1.566910e-05 9.576394e-07 4.558558e-02 1.353229e-04
V5 5.429293e-03 1.741677e-02 7.995511e-01 1.928311e-02 3.146610e-01 1.506602e-01
V6 2.680434e-08 2.803999e-21 5.782815e-01 1.985964e-21 5.400452e-01 3.044027e-18
V7 8.382729e-06 1.512734e-07 4.341712e-04 1.957076e-07 6.198122e-01 2.228316e-04
V8 1.693751e-04 2.474007e-39 6.771534e-01 5.941162e-40 9.937301e-01 9.416174e-32
```
Now the processing batch is highly correlated with PC1 which explain 60% of variance in the data. I performed similar analyses for the type I methylated probes and found PC1 explains 40% of the variance and is highly correlated with batch and not with the sample group which is normal/tumor. For the type II probes Array seemed to have a bigger effect on the first principal component than the processing batch but again PC1 wasn't correlated with my biology at all.
Decision: split type I probes into methylated and unmethylated and those in turn into "red" and "green". Normalize each of these 4 datasets by removing intensity dependent effects. Then combine them and calculated the M value because batch seems to have a smaller effect on the combined data. For the type II probes normalize them using the snm package and remove intensity and dye effects. Combine and calculate the M value. Combine ("stack") methylated and unmethylated probes in a single matrix.

Split type I into four datasets: unmethylated red, methylated red, unmethylated green, methylated green. Remove intensity effects using snm package. Scale the datasets, combine into M value (log2(meth/unmeth))
Normalized everything on my work computer. Normalization of the type I probes:

#Created an int.var for the array intensity effects
> int.var<-data.frame(array=factor(1:247)) 
> mred<-snm(as.matrix(mtypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(mtypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> mgreen<-snm(as.matrix(mtypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(mtypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> ugreen<-snm(as.matrix(utypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
  Warning message:
  In snm(as.matrix(utypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> ured<-snm(as.matrix(utypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
 In snm(as.matrix(utypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.

Here is how I scaled and calculated the M values for the type I probes:

> range(mred$norm.dat)
[1] -5318.258 40564.597
> range(ured$norm.dat)
[1] -5011.863 34392.538
> mvalred<-log2((mred$norm.dat+6000)/(ured$norm.dat+6000))
> range(ugreen$norm.dat)
[1] -7209.145 30610.116
> range(mgreen$norm.dat)
[1] -7440.55 36960.34
> mvalgreen<-log2((mgreen$norm.dat+8000)/(ugreen$norm.dat+8000))
> mvalredScaled<-scale(mvalred)
> mvalgreenScaled<-scale(mvalgreen)
> mval<-rbind(mvalredScaled,mvalgreenScaled)

Normalized type II probes using snm package and adjusting intensity and color dependent effects. Scale and combine into the M value

I learned from the minfi vignette that methylated probes are always measured in green color, i.e. Cy5. Good to know for normalization. In order to correctly normalize the arrays I need to put them in one big matrix. Lets have the red (u) first and the the green (m) next.

> comb<-cbind(u,m)
> dim(comb)
[1] 350076    494 
> red<-data.frame(array=factor(1:ncol(u)),dye=rep("CY3",ncol(u)))
> green<-data.frame(array=factor(1:ncol(m)),dye=rep("CY5",ncol(m)))
> int.var<-rbind(red,green)
> dim(int.var)
[1] 494   2
  > snm.fit<-snm(as.matrix(comb),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(comb), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.

On my work computer it took several hours to finish the normalization.

Combine type I and type II probes into a single matrix. Identify technical batches and their effect. Use snm package to retain important biological variables (sample type) and remove technical variables as well as age and gender.