...
Split the probes into type I and type II probes because they will be normalized separately as single color and two-color arrays. Demonstrate their differences. Demonstrate batch effect on probe level.
The package has a built in function for calculating the M value but it produced a matrix with NA which is because log2 of negative values was attempted (I think). I extracted methylated and unmethylated probes and calculated the M value as log2((meth+c)/(unmeth+c)) where c is a constant.Code Block collapse true > m<-getMeth(Mset.raw) > u<-getUnmeth(Mset.raw) > dim(m) [1] 485577 247 > dim(u) [1] 485577 247 > mval<-log2((m+0.01)/(u+0.01)) > x<-apply(mval,2,function(x) which(is.na(x))) > length(x) [1] 0
SVD on the entire matrix and correlation with the technical variables. The outliers of PC1 look rather peculiar:
Code Block collapse true > techs Sample_Well Sample_Plate Sample_Group Pool_ID Array Slide V1 9.329435e-05 1.591332e-03 7.122791e-25 1.528817e-01 3.744530e-01 1.407002e-02 V2 6.135025e-03 3.229857e-03 1.906136e-04 5.395826e-02 1.004567e-01 8.358575e-03 V3 4.071548e-01 3.717899e-02 8.622844e-07 4.499818e-02 1.489702e-01 2.593915e-01 V4 1.022541e-06 3.755171e-26 1.785824e-01 8.097790e-27 4.772079e-01 9.604426e-24 V5 4.162074e-05 7.830607e-07 9.710665e-04 6.727818e-06 4.763077e-01 1.466018e-03 V6 6.086921e-02 1.004642e-05 1.838061e-04 1.980604e-05 2.080647e-12 7.643723e-06 V7 9.850379e-04 4.815462e-06 6.861262e-02 3.796909e-06 2.863657e-01 1.645806e-03 V8 3.030042e-04 2.025094e-12 3.895650e-07 1.657376e-11 2.965493e-02 2.441389e-07
Pool_ID (processing batch) is highly correlated with the 4th principal component which is responsible for only 4% of the total variance. Good news but I do have to split the probes into single and double color. Below are 3 examples of the intensity distribution (M values) for the type I and type II probes for the same patients:
They obviously look very different which was also noted in the paper about the 450k arrays published by Illumina (Figure 3)
Next I was planning on normalizing each type and methylated/unmethylated probes separately and completely (i.e. remove intensity, batch, gender, age and color) however I found out that batch effect is considerably stronger on the probe level. Type I unmethylated probes:
Code Block collapse true #Unmethylated type I, correlation of PCs with the technical variables: Sample_Well Sample_Plate Sample_Group Pool_ID Array Slide V1 2.470072e-02 2.225795e-14 2.623117e-01 7.263874e-14 3.330435e-16 3.515287e-13 V2 5.636983e-04 2.986398e-05 2.993088e-23 2.297371e-02 8.221251e-01 1.164875e-03 V3 3.916829e-02 2.033036e-11 1.125417e-02 5.979067e-12 6.966765e-07 9.014080e-12 V4 2.478553e-06 2.533717e-07 1.566910e-05 9.576394e-07 4.558558e-02 1.353229e-04 V5 5.429293e-03 1.741677e-02 7.995511e-01 1.928311e-02 3.146610e-01 1.506602e-01 V6 2.680434e-08 2.803999e-21 5.782815e-01 1.985964e-21 5.400452e-01 3.044027e-18 V7 8.382729e-06 1.512734e-07 4.341712e-04 1.957076e-07 6.198122e-01 2.228316e-04 V8 1.693751e-04 2.474007e-39 6.771534e-01 5.941162e-40 9.937301e-01 9.416174e-32
Now the processing batch is highly correlated with PC1 which explain 60% of variance in the data. I performed similar analyses for the type I methylated probes and found PC1 explains 40% of the variance and is highly correlated with batch and not with the sample group which is normal/tumor. For the type II probes Array seemed to have a bigger effect on the first principal component than the processing batch but again PC1 wasn't correlated with my biology at all.
Decision: split type I probes into methylated and unmethylated and those in turn into "red" and "green". Normalize each of these 4 datasets by removing intensity dependent effects. Then combine them and calculated the M value because batch seems to have a smaller effect on the combined data. For the type II probes normalize them using the snm package and remove intensity and dye effects. Combine and calculate the M value. Combine ("stack") methylated and unmethylated probes in a single matrix.
Split type I into four datasets: unmethylated red, methylated red, unmethylated green, methylated green. Remove intensity effects using snm package. Scale the datasets, combine into M value (log2(meth/unmeth))
Normalized everything on my work computer. Normalization of the type I probes:Code Block collapse true #Created an int.var for the array intensity effects > int.var<-data.frame(array=factor(1:247)) > mred<-snm(as.matrix(mtypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var) Warning message: In snm(as.matrix(mtypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) : bio.var=NULL, so all probes will be treated as 'null' in the normalization. > mgreen<-snm(as.matrix(mtypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var) Warning message: In snm(as.matrix(mtypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) : bio.var=NULL, so all probes will be treated as 'null' in the normalization. > ugreen<-snm(as.matrix(utypeIgreen),bio.var=NULL,adj.var=NULL,int.var=int.var) Warning message: In snm(as.matrix(utypeIgreen), bio.var = NULL, adj.var = NULL, int.var = int.var) : bio.var=NULL, so all probes will be treated as 'null' in the normalization. > ured<-snm(as.matrix(utypeIred),bio.var=NULL,adj.var=NULL,int.var=int.var) Warning message: In snm(as.matrix(utypeIred), bio.var = NULL, adj.var = NULL, int.var = int.var) : bio.var=NULL, so all probes will be treated as 'null' in the normalization.
Here is how I scaled and calculated the M values for the type I probes:
Code Block collapse true > range(mred$norm.dat) [1] -5318.258 40564.597 > range(ured$norm.dat) [1] -5011.863 34392.538 > mvalred<-log2((mred$norm.dat+6000)/(ured$norm.dat+6000)) > range(ugreen$norm.dat) [1] -7209.145 30610.116 > range(mgreen$norm.dat) [1] -7440.55 36960.34 > mvalgreen<-log2((mgreen$norm.dat+8000)/(ugreen$norm.dat+8000)) > mvalredScaled<-scale(mvalred) > mvalgreenScaled<-scale(mvalgreen) > mval<-rbind(mvalredScaled,mvalgreenScaled)
- Normalized type II probes using snm package and adjusting intensity and color dependent effects. Scale and combine into the M value
I learned from the minfi vignette that methylated probes are always measured in green color, i.e. Cy5. Good to know for normalization. In order to correctly normalize the arrays I need to put them in one big matrix. Lets have the red (u) first and the the green (m) next.
Code Block collapse true > comb<-cbind(u,m) > dim(comb) [1] 350076 494 > red<-data.frame(array=factor(1:ncol(u)),dye=rep("CY3",ncol(u))) > green<-data.frame(array=factor(1:ncol(m)),dye=rep("CY5",ncol(m))) > int.var<-rbind(red,green) > dim(int.var) [1] 494 2 > snm.fit<-snm(as.matrix(comb),bio.var=NULL,adj.var=NULL,int.var=int.var) Warning message: In snm(as.matrix(comb), bio.var = NULL, adj.var = NULL, int.var = int.var) : bio.var=NULL, so all probes will be treated as 'null' in the normalization. > utypeII<-snm.fit$norm.dat[,1:247] > mtypeII<-snm.fit$norm.dat[,248:494] #Scaling and calculation of the M value: > range(utypeII) [1] -2719.159 39726.383 > range(mtypeII) [1] -6187.47 33669.09 > mvaltypeII<-log2((mtypeII+7000)/(utypeII+7000))
On my work computer it took several hours to finish the normalization.
Combine type I and type II probes into a single matrix. Identify technical batches and their effect. Use snm package to retain important biological variables (sample type) and remove technical variables as well as age and gender.
Combine type I and type II data in one matrix. Do SVD. The outliers of PC1 look slightly better comparing to the original:Code Block collapse true Sample_Well Sample_Plate Sample_Group Pool_ID Array Slide V1 2.918564e-02 2.541558e-06 3.532357e-02 9.422801e-07 1.765521e-19 6.932534e-07 V2 1.625108e-04 2.660718e-04 3.008828e-24 1.149216e-01 1.886011e-01 7.380575e-03 V3 5.518528e-02 7.475338e-03 6.355672e-02 2.339687e-02 2.712102e-01 7.562814e-02 V4 3.445400e-03 1.791384e-02 5.322822e-01 5.526711e-02 2.310760e-03 1.441949e-01 V5 9.032115e-05 2.573920e-14 3.013233e-08 2.880643e-14 3.240300e-01 2.831130e-11 V6 1.436765e-04 1.547973e-04 8.790014e-02 8.015588e-04 3.450516e-01 1.456371e-03 V7 1.305277e-03 1.156332e-01 8.619749e-01 2.110003e-01 1.269173e-02 7.318850e-01 V8 3.375560e-02 8.054324e-03 4.038703e-07 1.190591e-02 1.744036e-01 1.080657e-01
Now I see that PC1 is correlated significantly with Array and then also with the Slide and the Pool_ID. Array doesn't correlate with any biological variables so start by removing it. Few other things: I found that I have 7 cell culture samples, so remove those and this leaves me with 238 samples. In addition, I want to remove age and gender and two patients didn't have any gender information so remove those too. I also wanted to remove BMI but unfortunately there are too many NAs.
Build the model matrix. Normalization was done using the snm package:Code Block collapse true #Create the biological variables "bucket": > bio.var<-model.matrix(~factor(sampleMatchNNA$Sample_Group)) > adj.var<-model.matrix(~factor(sampleMatchNNA$gender)+sampleMatchNNA$age_at_initial_pathologic_diagnosis+factor(sampleMatchNNA$Array)) > mvalNorm<-snm(mvalTissueNNA, bio.var=bio.var,adj.var=adj.var, int.var=NULL, rm.adj=TRUE)
SVD and correlation with the technical variables:
Code Block collapse true > tech Sample_Well Sample_Plate Sample_Group Pool_ID Array Slide V1 5.654163e-03 1.300140e-05 1.025313e-20 5.615517e-04 0.9981125 9.029768e-06 V2 8.798479e-03 2.122247e-14 2.692146e-05 8.443727e-13 0.9999954 9.401617e-17 V3 2.519435e-01 4.527614e-01 4.522880e-01 3.650234e-01 0.9999678 4.158308e-01 V4 1.809132e-01 1.077572e-03 4.386365e-01 9.374620e-04 0.9698694 2.280764e-02 V5 5.529533e-06 5.152086e-13 1.966378e-11 5.863403e-13 0.9997771 2.522627e-10 V6 2.477964e-01 9.624819e-03 2.139038e-06 8.175024e-02 0.9999946 4.538778e-02 V7 3.867747e-01 7.110760e-02 5.214108e-02 2.278815e-01 0.9986385 4.399445e-02 V8 2.028927e-02 6.467366e-04 6.316455e-02 3.395680e-03 0.5685680 9.088878e-04
I still see a strong correlation with the slide and the pool_id. Remove the slide effect.
Code Block collapse true > adj.var<-model.matrix(~factor(sampleMatchNNA$gender)+sampleMatchNNA$age_at_initial_pathologic_diagnosis+factor(sampleMatchNNA$Array)+factor(sampleMatchNNA$Slide)) > mvalNorm<-snm(mvalTissueNNA, bio.var=bio.var,adj.var=adj.var, int.var=NULL, rm.adj=TRUE)
Code Block collapse true > tech Sample_Well Sample_Plate Sample_Group Pool_ID Array Slide V1 0.0007407836 0.0002202235 1.167682e-22 0.05614181 0.9832427 0.0991713 V2 0.7121442086 0.9985967341 1.370995e-01 0.99988478 0.9999998 1.0000000 V3 0.8872558733 0.9983135653 4.048457e-01 0.99666249 0.9981009 1.0000000 V4 0.8342893562 0.9940254760 4.077207e-01 0.99424726 0.9947813 1.0000000 V5 0.0075332014 0.0303129288 1.830997e-12 0.30065288 0.9993607 0.8292775 V6 0.7399454196 0.6379940483 6.765006e-05 0.85893411 0.9999992 0.9999641 V7 0.5987213698 0.9819645768 3.531294e-02 0.99995988 0.9986432 0.9999682 V8 0.7564265374 0.9959541522 6.756497e-01 0.98944744 0.8980192 0.9999792
So we successfully got rid of the slide, array and the processing batch. At the same time the first PC is highly correlated with the sample type and we also see some correlation with the Sample_Well (center) and Sample_Plate (BCR barcode) which means that we still have some signal in our data!!
- The end. Consider the data to be normalized. The complete transcript of my analyses and musings can be found here.
...