Page Comparison

...

"The 450k array has a complicated design. What follows is a quick overview. Each sample is measured on a single array, in two different color channels (red and green). Each array measures roughly 450,000 CpG positions. Each CpG is associated with two measurements: a methylated measurement and an un-methylated measurement. These two values can be measured in one of two ways: using a "Type I" design or a "Type II" design. CpGs measured using a Type I design are measured using a single color, with two different probes in the same color channel providing the methylated and the unmethylated measurements. CpGs measured using a Type II design are measured using a single probe, and two different colors provide the methylated and the unmethylated measurements. Practically, this implies that on this array there is not a one-to-one correspondence between probes and CpG positions. We have therefore tried to be precise about this and we refer to a "methylation position" (or "CpG") when we refer to a single-base genomic locus. The previous generation 27k methylation array uses only the Type I design."

Steps outline and decisions:

Split the probes into type I and type II probes because they will be normalized separately, as single-color and two-color arrays respectively. Demonstrate their differences. Demonstrate the batch effect at the probe level.
The minfi package has a built-in function for calculating the M value, but it produced a matrix containing NAs, presumably because log2 of zero or negative values was attempted (I think). I extracted the methylated and unmethylated intensities and calculated the M value as log2((meth+c)/(unmeth+c)), where c is a constant.

> m<-getMeth(Mset.raw)
> u<-getUnmeth(Mset.raw)
> dim(m)
[1] 485577    247
> dim(u)
[1] 485577    247
> mval<-log2((m+0.01)/(u+0.01))
> x<-apply(mval,2,function(x) which(is.na(x)))
> length(x)
[1] 0
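
A toy illustration of why the offset helps is sketched below. The intensities are made up, and `meth`, `unmeth` and `offset` are illustrative names rather than objects from the analysis; only the 0.01 constant is taken from the transcript above.

```r
# Toy intensities: 3 probes x 2 samples (made-up values, not real data)
meth   <- matrix(c(0, 1200,  0, 300, 0, 5000), nrow = 3)
unmeth <- matrix(c(0,  800, 50,   0, 0, 2500), nrow = 3)

# Without an offset, 0/0 gives NaN and a zero in only one of the
# two intensities gives +Inf or -Inf
m_plain <- log2(meth / unmeth)

# With a small positive offset every ratio is positive and finite
offset   <- 0.01
m_offset <- log2((meth + offset) / (unmeth + offset))

sum(is.na(m_plain))        # count of NaN entries without the offset
sum(!is.finite(m_offset))  # count of non-finite entries with the offset
```

On a full matrix, `sum(is.na(mval))` is a more direct check than collecting indices with `apply()` and `which()`.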

SVD on the entire matrix and correlation with the technical variables. The outliers of PC1 look rather peculiar:
[7 images]

> techs
    Sample_Well Sample_Plate Sample_Group      Pool_ID        Array        Slide
V1 9.329435e-05 1.591332e-03 7.122791e-25 1.528817e-01 3.744530e-01 1.407002e-02
V2 6.135025e-03 3.229857e-03 1.906136e-04 5.395826e-02 1.004567e-01 8.358575e-03
V3 4.071548e-01 3.717899e-02 8.622844e-07 4.499818e-02 1.489702e-01 2.593915e-01
V4 1.022541e-06 3.755171e-26 1.785824e-01 8.097790e-27 4.772079e-01 9.604426e-24
V5 4.162074e-05 7.830607e-07 9.710665e-04 6.727818e-06 4.763077e-01 1.466018e-03
V6 6.086921e-02 1.004642e-05 1.838061e-04 1.980604e-05 2.080647e-12 7.643723e-06
V7 9.850379e-04 4.815462e-06 6.861262e-02 3.796909e-06 2.863657e-01 1.645806e-03
V8 3.030042e-04 2.025094e-12 3.895650e-07 1.657376e-11 2.965493e-02 2.441389e-07

Pool_ID (processing batch) is highly correlated with the 4th principal component, which accounts for only 4% of the total variance. That is good news, but I still have to split the probes into single-color and two-color sets. Below are 3 examples of the intensity distribution (M values) for the type I and type II probes for the same patients:

[3 images]

They obviously look very different, which was also noted in the paper about the 450k arrays published by Illumina (Figure 3).
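
A comparison like this could be drawn roughly as follows. This is only a sketch on simulated numbers: `probe_type` and `mvals` are hypothetical stand-ins, the means and standard deviations are arbitrary, and in the real analysis the probe type would have to come from the array annotation.

```r
set.seed(1)

# Hypothetical M values for one sample, with a probe-type label per probe
probe_type <- factor(rep(c("I", "II"), times = c(300, 700)))
mvals <- c(rnorm(300, mean = 0, sd = 3), rnorm(700, mean = 0, sd = 2))

# Split the M values by probe type and estimate one density per type
by_type <- split(mvals, probe_type)
dens <- lapply(by_type, density)

# Overlay the two density curves on a common y range
plot(dens[["I"]], main = "M values by probe type", xlab = "M value",
     ylim = range(0, dens[["I"]]$y, dens[["II"]]$y))
lines(dens[["II"]], lty = 2)
legend("topright", legend = c("Type I", "Type II"), lty = c(1, 2))
```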

Next I was planning to normalize each probe type, and the methylated and unmethylated probes within it, separately and completely (i.e. remove intensity, batch, gender, age and color effects). However, I found that the batch effect is considerably stronger at the probe level. Type I unmethylated probes:

[3 images]

#Unmethylated type I, correlation of PCs with the technical variables:
   Sample_Well Sample_Plate Sample_Group      Pool_ID        Array       Slide
V1 2.470072e-02 2.225795e-14 2.623117e-01 7.263874e-14 3.330435e-16 3.515287e-13
V2 5.636983e-04 2.986398e-05 2.993088e-23 2.297371e-02 8.221251e-01 1.164875e-03
V3 3.916829e-02 2.033036e-11 1.125417e-02 5.979067e-12 6.966765e-07 9.014080e-12
V4 2.478553e-06 2.533717e-07 1.566910e-05 9.576394e-07 4.558558e-02 1.353229e-04
V5 5.429293e-03 1.741677e-02 7.995511e-01 1.928311e-02 3.146610e-01 1.506602e-01
V6 2.680434e-08 2.803999e-21 5.782815e-01 1.985964e-21 5.400452e-01 3.044027e-18
V7 8.382729e-06 1.512734e-07 4.341712e-04 1.957076e-07 6.198122e-01 2.228316e-04
V8 1.693751e-04 2.474007e-39 6.771534e-01 5.941162e-40 9.937301e-01 9.416174e-32

Now the processing batch is highly correlated with PC1, which explains 60% of the variance in the data. I performed similar analyses for the type I methylated probes and found that PC1 explains 40% of the variance and is highly correlated with batch but not with the sample group (normal/tumor). For the type II probes, Array seemed to have a bigger effect on the first principal component than the processing batch, but again PC1 was not correlated with the biology at all.
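
The kind of check described here (variance explained per component, plus a p-value for each component against each technical variable) can be sketched on simulated data. The test actually used for the tables above is not stated; the one-way ANOVA below is an assumption, and `pheno`, `dat` and the injected batch shift are all made up for illustration.

```r
set.seed(42)
n_probes <- 500
n_samples <- 24

# Made-up covariates; the names mirror two sample sheet columns
pheno <- data.frame(
  Pool_ID      = factor(rep(c("A", "B", "C"), each = 8)),
  Sample_Group = factor(rep(c("normal", "tumor"), times = 12))
)

# Simulated data with an artificial batch shift on the first 100 probes
dat <- matrix(rnorm(n_probes * n_samples), nrow = n_probes)
shift <- c(0, 2, 4)[as.integer(pheno$Pool_ID)]
dat[1:100, ] <- sweep(dat[1:100, ], 2, shift, "+")

# Center each probe, then take the SVD; columns of s$v are sample-level PCs
centered <- dat - rowMeans(dat)
s <- svd(centered)
var_explained <- s$d^2 / sum(s$d^2)

# One-way ANOVA p-value of each of the first 8 PCs on each covariate
n_pcs <- 8
pvals <- sapply(pheno, function(covariate) {
  sapply(seq_len(n_pcs), function(k) {
    anova(lm(s$v[, k] ~ covariate))[["Pr(>F)"]][1]
  })
})
rownames(pvals) <- paste0("V", seq_len(n_pcs))

round(var_explained[seq_len(n_pcs)], 3)
pvals
```

In this simulation the batch shift dominates the first component, so the `Pool_ID` entry for `V1` comes out very small while `Sample_Group` does not.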

Decision: split type I probes into methylated and unmethylated, and those in turn into "red" and "green". Normalize each of these 4 datasets by removing intensity-dependent effects. Then combine them and calculate the M value, because batch seems to have a smaller effect on the combined data. For the type II probes, normalize them using the snm package and remove intensity and dye effects. Combine and calculate the M value. Combine ("stack") methylated and unmethylated probes in a single matrix.

Split type I into four datasets: unmethylated red, methylated red, unmethylated green, methylated green. Remove intensity effects using the snm package. Scale the datasets and combine into the M value (log2(meth/unmeth)).
Normalized everything on my work computer. Normalization of the type I probes:

#Created an int.var for the array intensity effects
> int.var<-data.frame(array=factor(1:247)) 
> mred<-snm(as.matrix(mtypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(mtypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> mgreen<-snm(as.matrix(mtypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(mtypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> ugreen<-snm(as.matrix(utypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(utypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.
> ured<-snm(as.matrix(utypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(utypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.

Here is how I scaled and calculated the M values for the type I probes:

> range(mred$norm.dat)
[1] -5318.258 40564.597
> range(ured$norm.dat)
[1] -5011.863 34392.538
> mvalred<-log2((mred$norm.dat+6000)/(ured$norm.dat+6000))
> range(ugreen$norm.dat)
[1] -7209.145 30610.116
> range(mgreen$norm.dat)
[1] -7440.55 36960.34
> mvalgreen<-log2((mgreen$norm.dat+8000)/(ugreen$norm.dat+8000))
> mvalredScaled<-scale(mvalred)
> mvalgreenScaled<-scale(mvalgreen)
> mval<-rbind(mvalredScaled,mvalgreenScaled)
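
The shift constants above (6000 and 8000) are round numbers just above the most negative normalized value in each color. One possible way to pick such a constant programmatically is sketched below; `pick_shift` is a hypothetical helper, not part of snm or minfi, and the intensities are simulated.

```r
# Hypothetical helper: smallest multiple of `step` that is strictly
# larger than the magnitude of the most negative value
pick_shift <- function(min_val, step = 1000) {
  ceiling((max(0, -min_val) + 1) / step) * step
}

set.seed(7)
# Made-up normalized intensities that include negative values
meth_norm   <- matrix(rnorm(40, mean = 3000, sd = 4000), nrow = 10)
unmeth_norm <- matrix(rnorm(40, mean = 2000, sd = 4000), nrow = 10)

# Use one shift for both matrices so the ratio stays comparable
shift <- pick_shift(min(meth_norm, unmeth_norm))
mval <- log2((meth_norm + shift) / (unmeth_norm + shift))

# scale() centers each column (sample) to mean 0 and scales it to sd 1
mval_scaled <- scale(mval)
```

Applied to the minima reported above, `pick_shift(-5318.258)` gives 6000 and `pick_shift(-7440.55)` gives 8000, matching the constants in the transcript.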

Normalize the type II probes using the snm package, adjusting for intensity- and color-dependent effects. Scale and combine into the M value.

I learned from the minfi vignette that for type II probes the methylated signal is always measured in the green channel, i.e. Cy3, and the unmethylated signal in the red channel (Cy5). Good to know for normalization. In order to normalize the arrays correctly I need to put them in one big matrix. Let's have the red (u) first and then the green (m) next.

> comb<-cbind(u,m)
> dim(comb)
[1] 350076    494
> # dye labels: red channel = Cy5 (unmethylated), green channel = Cy3 (methylated)
> red<-data.frame(array=factor(1:ncol(u)),dye=rep("CY5",ncol(u)))
> green<-data.frame(array=factor(1:ncol(m)),dye=rep("CY3",ncol(m)))
> int.var<-rbind(red,green)
> dim(int.var)
[1] 494   2
> snm.fit<-snm(as.matrix(comb),bio.var=NULL,adj.var=NULL,int.var=int.var)
Warning message:
In snm(as.matrix(comb), bio.var = NULL, adj.var = NULL, int.var = int.var) :
  bio.var=NULL, so all probes will be treated as 'null' in the normalization.

On my work computer it took several hours to finish the normalization.
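
Once the stacked matrix has been normalized, the two channels have to be separated again before the M value can be computed. A minimal sketch of that step follows, using a made-up matrix in place of `snm.fit$norm.dat`; `norm_dat`, `int_var`, `u_norm` and `m_norm` are illustrative names, and the column layout (red first, green second) follows the stacking order described above.

```r
n <- 4          # number of arrays in this toy example
n_probes <- 6
set.seed(3)

# Stand-in for snm.fit$norm.dat (made-up positive values):
# columns 1..n are unmethylated (red), columns n+1..2n are methylated (green)
norm_dat <- matrix(runif(n_probes * 2 * n, min = 100, max = 5000),
                   nrow = n_probes)

# The matching intensity-variable table: each array appears once per channel
int_var <- data.frame(
  array = factor(rep(seq_len(n), times = 2)),
  dye   = factor(rep(c("Cy5", "Cy3"), each = n))
)
stopifnot(nrow(int_var) == ncol(norm_dat))

# Pull the two channels apart again using the dye label
u_norm <- norm_dat[, int_var$dye == "Cy5", drop = FALSE]
m_norm <- norm_dat[, int_var$dye == "Cy3", drop = FALSE]

# M value per probe and array, then column-wise scaling
mval_typeII <- scale(log2(m_norm / u_norm))
dim(mval_typeII)
```

Real normalized intensities can be negative, in which case a shift constant would be needed before taking the log, as for the type I probes.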

Combine type I and type II probes into a single matrix. Identify technical batches and their effect. Use the snm package to retain important biological variables (sample type) and remove technical variables as well as age and gender.
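
How the model matrices for this final step might be set up is sketched below. Everything here is an assumption for illustration: the `pheno` table is made up, `age` and `gender` are hypothetical column names, `combined_mval` is a placeholder for the stacked type I and type II matrix, and the `snm()` call is switched off by default so the sketch runs without the package or the data.

```r
n <- 12
set.seed(11)

# Made-up sample annotation (age and gender column names are hypothetical)
pheno <- data.frame(
  Sample_Group = factor(rep(c("normal", "tumor"), times = n / 2)),
  Pool_ID      = factor(rep(c("P1", "P2", "P3"), each = n / 3)),
  age          = round(runif(n, min = 40, max = 80)),
  gender       = factor(rep(c("F", "F", "M", "M"), times = n / 4))
)

# Biological variable to retain: sample type
bio_var <- model.matrix(~ Sample_Group, data = pheno)

# Variables to remove: processing batch, age and gender
adj_var <- model.matrix(~ Pool_ID + age + gender, data = pheno)

run_snm <- FALSE  # set to TRUE only with snm installed and real data loaded
if (run_snm && requireNamespace("snm", quietly = TRUE)) {
  # combined_mval: placeholder for the probes x samples matrix
  fit <- snm::snm(combined_mval, bio.var = bio_var, adj.var = adj_var,
                  rm.adj = TRUE)
  normalized <- fit$norm.dat
}
```

With `rm.adj = TRUE` the adjustment variables are removed from the returned data while the biological variable is kept in the model, which matches the stated goal of retaining sample type.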

...

Version	Old Version 2	New Version 3
Changes made by	Vitalina Komashko (Unlicensed)	Vitalina Komashko (Unlicensed)
Saved on	Feb 29, 2012	Feb 29, 2012
