DNA mehylation data normalization

Quote from the vignette for the minfi package:

"The 450k array has a complicated design. What follows is a quick overview. Each sample is measured on a single array, in two different color channels (red and green). Each array measures roughly 450,000 CpG positions. Each CpG is associated with two measurements: a methylated measurement and an un-methylated measurement. These two values can be measured in one of two ways: using a "Type I" design or a "Type II" design". CpGs measured using a Type I design are measured using a single color, with two different probes in the same color channel providing the methylated and the unmethylated measurements. CpGs measured using a Type II design are measured using a single probe, and two different colors provide the methylated and the unmethylated measurements. Practically, this implies that on this array there is not a one-to-one correspondence between probes and CpG positions. We have therefore tried to be precise about this and we refer to a "methylation position" (or "CpG") when we refer to a single-base genomic locus. The previous generation 27k methylation array uses only the Type I design."

Steps outline and decisions:

Split the probes into type I and type II probes because they will be normalized separately as single color and two-color arrays. Demonstrate their differences. Demonstrate batch effect on probe level.
The package has a built in function for calculating the M value but it produced a matrix with NA which is because log2 of negative values was attempted (I think). I extracted methylated and unmethylated probes and calculated the M value as log2((meth+c)/(unmeth+c)) where c is a constant.
Expand source
```
> m<-getMeth(Mset.raw)
> u<-getUnmeth(Mset.raw)
> dim(m)
[1] 485577    247
> dim(u)
[1] 485577    247
> mval<-log2((m+0.01)/(u+0.01))
> x<-apply(mval,2,function(x) which(is.na(x)))
> length(x)
[1] 0 
```
SVD on the entire matrix and correlation with the technical variables. The outliers of PC1 look rather peculiar:
Expand source
```
> techs
    Sample_Well Sample_Plate Sample_Group      Pool_ID        Array        Slide
V1 9.329435e-05 1.591332e-03 7.122791e-25 1.528817e-01 3.744530e-01 1.407002e-02
V2 6.135025e-03 3.229857e-03 1.906136e-04 5.395826e-02 1.004567e-01 8.358575e-03
V3 4.071548e-01 3.717899e-02 8.622844e-07 4.499818e-02 1.489702e-01 2.593915e-01
V4 1.022541e-06 3.755171e-26 1.785824e-01 8.097790e-27 4.772079e-01 9.604426e-24
V5 4.162074e-05 7.830607e-07 9.710665e-04 6.727818e-06 4.763077e-01 1.466018e-03
V6 6.086921e-02 1.004642e-05 1.838061e-04 1.980604e-05 2.080647e-12 7.643723e-06
V7 9.850379e-04 4.815462e-06 6.861262e-02 3.796909e-06 2.863657e-01 1.645806e-03
V8 3.030042e-04 2.025094e-12 3.895650e-07 1.657376e-11 2.965493e-02 2.441389e-07
```
Pool_ID (processing batch) is highly correlated with the 4th principal component which is responsible for only 4% of the total variance. Good news but I do have to split the probes into single and double color. Below are 3 examples of the intensity distribution (M values) for the type I and type II probes for the same patients:
They obviously look very different which was also noted in the paper about the 450k arrays published by Illumina (Figure 3)
Next I was planning on normalizing each type and methylated/unmethylated probes separately and completely (i.e. remove intensity, batch, gender, age and color) however I found out that batch effect is considerably stronger on the probe level. Type I unmethylated probes:
Expand source
```
#Unmethylated type I, correlation of PCs with the technical variables:
   Sample_Well Sample_Plate Sample_Group      Pool_ID        Array       Slide
V1 2.470072e-02 2.225795e-14 2.623117e-01 7.263874e-14 3.330435e-16 3.515287e-13
V2 5.636983e-04 2.986398e-05 2.993088e-23 2.297371e-02 8.221251e-01 1.164875e-03
V3 3.916829e-02 2.033036e-11 1.125417e-02 5.979067e-12 6.966765e-07 9.014080e-12
V4 2.478553e-06 2.533717e-07 1.566910e-05 9.576394e-07 4.558558e-02 1.353229e-04
V5 5.429293e-03 1.741677e-02 7.995511e-01 1.928311e-02 3.146610e-01 1.506602e-01
V6 2.680434e-08 2.803999e-21 5.782815e-01 1.985964e-21 5.400452e-01 3.044027e-18
V7 8.382729e-06 1.512734e-07 4.341712e-04 1.957076e-07 6.198122e-01 2.228316e-04
V8 1.693751e-04 2.474007e-39 6.771534e-01 5.941162e-40 9.937301e-01 9.416174e-32
```
Now the processing batch is highly correlated with PC1 which explain 60% of variance in the data. I performed similar analyses for the type I methylated probes and found PC1 explains 40% of the variance and is highly correlated with batch and not with the sample group which is normal/tumor. For the type II probes Array seemed to have a bigger effect on the first principal component than the processing batch but again PC1 wasn't correlated with my biology at all.
Decision: split type I probes into methylated and unmethylated and those in turn into "red" and "green". Normalize each of these 4 datasets by removing intensity dependent effects. Then combine them and calculated the M value because batch seems to have a smaller effect on the combined data. For the type II probes normalize them using the snm package and remove intensity and dye effects. Combine and calculate the M value. Combine ("stack") methylated and unmethylated probes in a single matrix.

Split type I into four datasets: unmethylated red, methylated red, unmethylated green, methylated green. Remove intensity effects using snm package. Scale the datasets, combine into M value (log2(meth/unmeth))
Normalized everything on my work computer. Normalization of the type I probes:

#Created an int.var for the array intensity effects
> int.var<-data.frame(array=factor(1:247)) 
> mred<-snm(as.matrix(mtypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(mtypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> mgreen<-snm(as.matrix(mtypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(mtypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> ugreen<-snm(as.matrix(utypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
  Warning message:
  In snm(as.matrix(utypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> ured<-snm(as.matrix(utypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
 In snm(as.matrix(utypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.

Here is how I scaled and calculated the M values for the type I probes:

> range(mred$norm.dat)
[1] -5318.258 40564.597
> range(ured$norm.dat)
[1] -5011.863 34392.538
> mvalred<-log2((mred$norm.dat+6000)/(ured$norm.dat+6000))
> range(ugreen$norm.dat)
[1] -7209.145 30610.116
> range(mgreen$norm.dat)
[1] -7440.55 36960.34
> mvalgreen<-log2((mgreen$norm.dat+8000)/(ugreen$norm.dat+8000))
> mvalredScaled<-scale(mvalred)
> mvalgreenScaled<-scale(mvalgreen)
> mval<-rbind(mvalredScaled,mvalgreenScaled)

Normalized type II probes using snm package and adjusting intensity and color dependent effects. Scale and combine into the M value

I learned from the minfi vignette that methylated probes are always measured in green color, i.e. Cy5. Good to know for normalization. In order to correctly normalize the arrays I need to put them in one big matrix. Lets have the red (u) first and the the green (m) next.

> comb<-cbind(u,m)
> dim(comb)
[1] 350076    494 
> red<-data.frame(array=factor(1:ncol(u)),dye=rep("CY3",ncol(u)))
> green<-data.frame(array=factor(1:ncol(m)),dye=rep("CY5",ncol(m)))
> int.var<-rbind(red,green)
> dim(int.var)
[1] 494   2
  > snm.fit<-snm(as.matrix(comb),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(comb), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> utypeII<-snm.fit$norm.dat[,1:247]
> mtypeII<-snm.fit$norm.dat[,248:494]
#Scaling and calculation of the M value:
> range(utypeII)
[1] -2719.159 39726.383
> range(mtypeII)
[1] -6187.47 33669.09
> mvaltypeII<-log2((mtypeII+7000)/(utypeII+7000))

On my work computer it took several hours to finish the normalization.

Combine type I and type II probes into a single matrix. Identify technical batches and their effect. Use snm package to retain important biological variables (sample type) and remove technical variables as well as age and gender.
Combine type I and type II data in one matrix. Do SVD. The outliers of PC1 look slightly better comparing to the original:

 Sample_Well Sample_Plate Sample_Group      Pool_ID        Array       Slide
V1 2.918564e-02 2.541558e-06 3.532357e-02 9.422801e-07 1.765521e-19 6.932534e-07
V2 1.625108e-04 2.660718e-04 3.008828e-24 1.149216e-01 1.886011e-01 7.380575e-03
V3 5.518528e-02 7.475338e-03 6.355672e-02 2.339687e-02 2.712102e-01 7.562814e-02
V4 3.445400e-03 1.791384e-02 5.322822e-01 5.526711e-02 2.310760e-03 1.441949e-01
V5 9.032115e-05 2.573920e-14 3.013233e-08 2.880643e-14 3.240300e-01 2.831130e-11
V6 1.436765e-04 1.547973e-04 8.790014e-02 8.015588e-04 3.450516e-01 1.456371e-03
V7 1.305277e-03 1.156332e-01 8.619749e-01 2.110003e-01 1.269173e-02 7.318850e-01
V8 3.375560e-02 8.054324e-03 4.038703e-07 1.190591e-02 1.744036e-01 1.080657e-01

Now I see that PC1 is correlated significantly with Array and then also with the Slide and the Pool_ID. Array doesn't correlate with any biological variables so start by removing it. Few other things: I found that I have 7 cell culture samples, so remove those and this leaves me with 238 samples. In addition, I want to remove age and gender and two patients didn't have any gender information so remove those too. I also wanted to remove BMI but unfortunately there are too many NAs.
Build the model matrix. Normalization was done using the snm package:

#Create the biological variables "bucket":
> bio.var<-model.matrix(~factor(sampleMatchNNA$Sample_Group))
> adj.var<-model.matrix(~factor(sampleMatchNNA$gender)+sampleMatchNNA$age_at_initial_pathologic_diagnosis+factor(sampleMatchNNA$Array))
> mvalNorm<-snm(mvalTissueNNA, bio.var=bio.var,adj.var=adj.var, int.var=NULL, rm.adj=TRUE)

SVD and correlation with the technical variables:

> tech
    Sample_Well Sample_Plate Sample_Group      Pool_ID     Array        Slide
V1 5.654163e-03 1.300140e-05 1.025313e-20 5.615517e-04 0.9981125 9.029768e-06
V2 8.798479e-03 2.122247e-14 2.692146e-05 8.443727e-13 0.9999954 9.401617e-17
V3 2.519435e-01 4.527614e-01 4.522880e-01 3.650234e-01 0.9999678 4.158308e-01
V4 1.809132e-01 1.077572e-03 4.386365e-01 9.374620e-04 0.9698694 2.280764e-02
V5 5.529533e-06 5.152086e-13 1.966378e-11 5.863403e-13 0.9997771 2.522627e-10
V6 2.477964e-01 9.624819e-03 2.139038e-06 8.175024e-02 0.9999946 4.538778e-02
V7 3.867747e-01 7.110760e-02 5.214108e-02 2.278815e-01 0.9986385 4.399445e-02
V8 2.028927e-02 6.467366e-04 6.316455e-02 3.395680e-03 0.5685680 9.088878e-04

I still see a strong correlation with the slide and the pool_id. Remove the slide effect.

> adj.var<-model.matrix(~factor(sampleMatchNNA$gender)+sampleMatchNNA$age_at_initial_pathologic_diagnosis+factor(sampleMatchNNA$Array)+factor(sampleMatchNNA$Slide))
> mvalNorm<-snm(mvalTissueNNA, bio.var=bio.var,adj.var=adj.var, int.var=NULL, rm.adj=TRUE)

> tech
    Sample_Well Sample_Plate Sample_Group    Pool_ID     Array     Slide
V1 0.0007407836 0.0002202235 1.167682e-22 0.05614181 0.9832427 0.0991713
V2 0.7121442086 0.9985967341 1.370995e-01 0.99988478 0.9999998 1.0000000
V3 0.8872558733 0.9983135653 4.048457e-01 0.99666249 0.9981009 1.0000000
V4 0.8342893562 0.9940254760 4.077207e-01 0.99424726 0.9947813 1.0000000
V5 0.0075332014 0.0303129288 1.830997e-12 0.30065288 0.9993607 0.8292775
V6 0.7399454196 0.6379940483 6.765006e-05 0.85893411 0.9999992 0.9999641
V7 0.5987213698 0.9819645768 3.531294e-02 0.99995988 0.9986432 0.9999682
V8 0.7564265374 0.9959541522 6.756497e-01 0.98944744 0.8980192 0.9999792

So we successfully got rid of the slide, array and the processing batch. At the same time the first PC is highly correlated with the sample type and we also see some correlation with the Sample_Well (center) and Sample_Plate (BCR barcode) which means that we still have some signal in our data!!

The end. Consider the data to be normalized