Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

"The 450k array has a complicated design. What follows is a quick overview. Each sample is measured on a single array, in two different color channels (red and green). Each array measures roughly 450,000 CpG positions. Each CpG is associated with two measurements: a methylated measurement and an un-methylated measurement. These two values can be measured in one of two ways: using a "Type I" design or a "Type II"  design". CpGs measured using a Type I design are measured using a single color, with two different probes in the same color channel providing the methylated and the unmethylated measurements. CpGs measured using a Type II design are measured using a single probe, and two different colors provide the methylated and the unmethylated measurements. Practically, this implies that on this array there is not a one-to-one correspondence between probes and CpG positions. We have therefore tried to be precise about this and we refer to a "methylation position" (or "CpG") when we refer to a single-base genomic locus. The previous generation 27k methylation array uses only the Type I design."


Steps outline and decisions:

  1. Split the probes into type I and type II probes because they will be normalized separately as single color and two-color arrays. Demonstrate their differences. Demonstrate batch effect on probe level. 
    The package has a built in function for calculating the M value but it produced a matrix with NA which is because log2 of negative values was attempted (I think). I extracted methylated and unmethylated probes and calculated the M value as log2((meth+c)/(unmeth+c)) where c is a constant. 

    Code Block
    collapsetrue
    > m<-getMeth(Mset.raw)
    > u<-getUnmeth(Mset.raw)
    > dim(m)
    [1] 485577    247
    > dim(u)
    [1] 485577    247
    > mval<-log2((m+0.01)/(u+0.01))
    > x<-apply(mval,2,function(x) which(is.na(x)))
    > length(x)
    [1] 0 

    SVD on the entire matrix and correlation with the technical variables. Outliers look very "special":
    Image RemovedImage RemovedImage RemovedThe outliers of PC1 look rather peculiar:
    Image AddedImage AddedImage AddedImage AddedImage AddedImage AddedImage Added

    Code Block
    collapsetrue
    > techs
        Sample_Well Sample_Plate Sample_Group      Pool_ID        Array        Slide
    V1 9.329435e-05 1.591332e-03 7.122791e-25 1.528817e-01 3.744530e-01 1.407002e-02
    V2 6.135025e-03 3.229857e-03 1.906136e-04 5.395826e-02 1.004567e-01 8.358575e-03
    V3 4.071548e-01 3.717899e-02 8.622844e-07 4.499818e-02 1.489702e-01 2.593915e-01
    V4 1.022541e-06 3.755171e-26 1.785824e-01 8.097790e-27 4.772079e-01 9.604426e-24
    V5 4.162074e-05 7.830607e-07 9.710665e-04 6.727818e-06 4.763077e-01 1.466018e-03
    V6 6.086921e-02 1.004642e-05 1.838061e-04 1.980604e-05 2.080647e-12 7.643723e-06
    V7 9.850379e-04 4.815462e-06 6.861262e-02 3.796909e-06 2.863657e-01 1.645806e-03
    V8 3.030042e-04 2.025094e-12 3.895650e-07 1.657376e-11 2.965493e-02 2.441389e-07

    Pool_ID (processing batch) is highly correlated with the 4th principal component which is responsible for only 4% of the total variance. Good news but I do have to split the probes into single and double color. Below are 3 examples of the intensity distribution (M values) for the type I and type II probes for the same patients:

    Image AddedImage AddedImage Added

    They obviously look very different which was also noted in the paper about the 450k arrays published by Illumina (Figure 3)

    Next I was planning on normalizing each type and methylated/unmethylated probes separately and completely (i.e. remove intensity, batch, gender, age and color) however I found out that batch effect is considerably stronger on the probe level. Type I unmethylated probes:

    Image AddedImage AddedImage Added

    Code Block
    collapsetrue
    #Unmethylated type I, correlation of PCs with the technical variables:
       Sample_Well Sample_Plate Sample_Group      Pool_ID        Array       Slide
    V1 2.470072e-02 2.225795e-14 2.623117e-01 7.263874e-14 3.330435e-16 3.515287e-13
    V2 5.636983e-04 2.986398e-05 2.993088e-23 2.297371e-02 8.221251e-01 1.164875e-03
    V3 3.916829e-02 2.033036e-11 1.125417e-02 5.979067e-12 6.966765e-07 9.014080e-12
    V4 2.478553e-06 2.533717e-07 1.566910e-05 9.576394e-07 4.558558e-02 1.353229e-04
    V5 5.429293e-03 1.741677e-02 7.995511e-01 1.928311e-02 3.146610e-01 1.506602e-01
    V6 2.680434e-08 2.803999e-21 5.782815e-01 1.985964e-21 5.400452e-01 3.044027e-18
    V7 8.382729e-06 1.512734e-07 4.341712e-04 1.957076e-07 6.198122e-01 2.228316e-04
    V8 1.693751e-04 2.474007e-39 6.771534e-01 5.941162e-40 9.937301e-01 9.416174e-32

    Now the processing batch is highly correlated with PC1 which explain 60% of variance in the data. I performed similar analyses for the type I methylated probes and found PC1 explains 40% of the variance and is highly correlated with batch and not with the sample group which is normal/tumor. For the type II probes Array seemed to have a bigger effect on the first principal component than the processing batch but again PC1 wasn't correlated with my biology at all.

    Decision: split type I probes into methylated and unmethylated and those in turn into "red" and "green". Normalize each of these 4 datasets by removing intensity dependent effects. Then combine them and calculated the M value because batch seems to have a smaller effect on the combined data. For the type II probes normalize them using the snm package and remove intensity and dye effects. Combine and calculate the M value. Combine ("stack") methylated and unmethylated probes in a single matrix. 

  2. Split type I into four datasets: unmethylated red, methylated red, unmethylated green, methylated green. Remove intensity effects using snm package. Scale the datasets, combine into M value (log2(meth/unmeth))
    Normalized everything on my work computer.  Normalization of the type I probes:

    Code Block
    collapsetrue
    #Created an int.var for the array intensity effects
    > int.var<-data.frame(array=factor(1:247)) 
    > mred<-snm(as.matrix(mtypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
    Warning message:
    In snm(as.matrix(mtypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
      bio.var=NULL, so all probes will be treated as 'null' in the normalization.
    > mgreen<-snm(as.matrix(mtypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
    Warning message:
    In snm(as.matrix(mtypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
      bio.var=NULL, so all probes will be treated as 'null' in the normalization.
    > ugreen<-snm(as.matrix(utypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var)
      Warning message:
      In snm(as.matrix(utypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) :
      bio.var=NULL, so all probes will be treated as 'null' in the normalization.
    > ured<-snm(as.matrix(utypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var)
    Warning message:
     In snm(as.matrix(utypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) :
      bio.var=NULL, so all probes will be treated as 'null' in the normalization.

    Here is how I scaled and calculated the M values for the type I probes:

    Code Block
    collapsetrue
    > range(mred$norm.dat)
    [1] -5318.258 40564.597
    > range(ured$norm.dat)
    [1] -5011.863 34392.538
    > mvalred<-log2((mred$norm.dat+6000)/(ured$norm.dat+6000))
    > range(ugreen$norm.dat)
    [1] -7209.145 30610.116
    > range(mgreen$norm.dat)
    [1] -7440.55 36960.34
    > mvalgreen<-log2((mgreen$norm.dat+8000)/(ugreen$norm.dat+8000))
    > mvalredScaled<-scale(mvalred)
    > mvalgreenScaled<-scale(mvalgreen)
    > mval<-rbind(mvalredScaled,mvalgreenScaled) 
  3. Normalized type II probes using snm package and adjusting intensity and color dependent effects. Scale and combine into the M value

    I learned from the minfi vignette that methylated probes are always measured in green color, i.e. Cy5. Good to know for normalization. In order to correctly normalize the arrays I need to put them in one big matrix. Lets have the red (u) first and the the green (m) next.

    Code Block
    collapsetrue
    > comb<-cbind(u,m)
    > dim(comb)
    [1] 350076    494 
    > red<-data.frame(array=factor(1:ncol(u)),dye=rep("CY3",ncol(u)))
    > green<-data.frame(array=factor(1:ncol(m)),dye=rep("CY5",ncol(m)))
    > int.var<-rbind(red,green)
    > dim(int.var)
    [1] 494   2
      > snm.fit<-snm(as.matrix(comb),bio.var=NULL,adj.var=NULL,int.var=int.var)
    Warning message:
    In snm(as.matrix(comb), bio.var = NULL, adj.var = NULL, int.var = int.var) :
      bio.var=NULL, so all probes will be treated as 'null' in the normalization.

    On my work computer it took several hours to finish the normalization. 

  4. Combine type I and type II probes into a single matrix. Identify technical batches and their effect. Use snm package to retain important biological variables (sample type) and remove technical variables as well as age and gender.

...