Network and network analysis (M value)

Network and network diagnostics with M value based data (center, batch, plate row and column variables removed)

Check if the network is scale free. To do that I need to plot clustering coefficient against network connectivity. To calculate clustering coefficient:

Use Bioconductor "graph" library
Calculate positive adjacency matrix
```
# data: rows = CpG loci, columns = samples, so cor(t(data)) is CpG by CpG
adjM <- abs(cor(t(data)))
```

Taking a chunk of the adjacency matrix gives a good idea of the clustering coefficient and significantly speeds up the calculations, so Chris suggested taking a 4k by 4k submatrix and using hard thresholding. I used the first 4k rows and columns.

Take a look at the distribution of the small adjacency matrix:

From the conversation with Chris, it seems that I have very few strong correlations; most of them are weak. Therefore, when we try to build a scale-free network (raising each correlation to the power of beta), the values become even smaller, which impedes clustering into modules.
The hard threshold was 0.46 on the 4k by 4k matrix, chosen to keep about 1% of the nodes (adjMat4000).
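To make the shrinking effect concrete, here is a small worked example of soft thresholding. It takes beta = 6 (the value from the WGCNA run further down in these notes) and a few illustrative correlations, including the 0.46 hard cutoff; the other correlation values are made up for illustration.

```r
# Soft thresholding: each absolute correlation is raised to the power beta.
beta <- 6                       # value reported for the WGCNA run in these notes
r <- c(0.2, 0.46, 0.7, 0.9)     # illustrative correlations (0.46 = hard cutoff)
soft <- r^beta
round(soft, 4)
# 0.0001 0.0095 0.1176 0.5314
```

So a correlation sitting right at the hard cutoff contributes less than 0.01 to the weighted network, and only correlations near 0.9 keep a weight above 0.5.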

To use it with the graph library, set the diagonal of the matrix to 0 (no self-looping nodes!).
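The subsetting, hard thresholding and diagonal steps could look roughly like the sketch below. It runs on simulated data (data_sim, n_sub, adjM_sub and adj_hard are illustrative names, not objects from my session), and it assumes that the 1% refers to the share of pairwise correlations that survive the cutoff.

```r
# Simulated stand-in for the M-value matrix (rows = CpGs, columns = samples).
set.seed(1)
data_sim <- matrix(rnorm(200 * 30), nrow = 200)

n_sub <- 50                                   # the notes use 4000
adjM_sub <- abs(cor(t(data_sim[seq_len(n_sub), ])))

# Assumption: choose the cutoff so roughly the top 1% of off-diagonal
# correlations are kept.
off_diag <- adjM_sub[upper.tri(adjM_sub)]
cutoff <- quantile(off_diag, probs = 0.99)

# Hard thresholding: 1 where the correlation reaches the cutoff, 0 otherwise.
adj_hard <- (adjM_sub >= cutoff) * 1
diag(adj_hard) <- 0                           # no self-loops

mean(adj_hard[upper.tri(adj_hard)])           # fraction of pairs kept as edges
```

A quick hist(off_diag) on the same vector is one way to look at the distribution before settling on a cutoff.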

Create an instance of the class graph and calculate clustering coefficient:

```
g4000 <- new("graphAM", adjMat = adjMat4000)
cc4000 <- clusteringCoefficient(g4000)
```
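As a cross-check that does not depend on the graph package, the local clustering coefficient can also be computed directly from a 0/1 adjacency matrix. This is a minimal sketch: local_cc and adj_toy are hypothetical names, and nodes with fewer than two neighbours are set to 0 here, which may not match how clusteringCoefficient() reports them.

```r
# Local clustering coefficient from a symmetric 0/1 adjacency matrix
# with a zero diagonal: C_i = (A^3)_ii / (k_i * (k_i - 1)).
local_cc <- function(adj) {
  k <- rowSums(adj)                      # connectivity of each node
  closed <- diag(adj %*% adj %*% adj)    # 2 * number of triangles through node
  possible <- k * (k - 1)                # 2 * number of neighbour pairs
  ifelse(possible > 0, closed / possible, 0)
}

# Toy graph: triangle 1-2-3 plus a pendant node 4 attached to node 3.
adj_toy <- matrix(0, 4, 4)
adj_toy[1, 2] <- adj_toy[1, 3] <- adj_toy[2, 3] <- adj_toy[3, 4] <- 1
adj_toy <- adj_toy + t(adj_toy)

local_cc(adj_toy)
# 1 1 0.3333333 0
```

The matrix product is fine for a small check, but on a 4k by 4k matrix it is dense and slow compared with a dedicated graph routine.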

To calculate connectivity, take the column sums of the hard-thresholded matrix (adjMat4000):

```
# connectivity = number of edges per node (column sums of the 0/1 matrix)
kk <- apply(adjMat4000, 2, sum)
```

Finally, remove all values in cc4000 and kk that are equal to 0 (which I did not do for this plot) and make a scatter plot:

If I average cc over k, I get an approximately straight line, which is evidence of a scale-free network (Barabasi, 2004). However, this plot is only part of the evaluation. I actually need to plot P(k) vs k to get a definitive answer.
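A sketch of the remaining steps (dropping zeros, averaging cc over k, and the P(k) vs k check) is below. The vectors kk_toy and cc_toy are constructed toy data with an exact power law, not my results, and the log-log regression without binning is a simplification that I am assuming is good enough for a first look.

```r
# Toy connectivity and clustering coefficient vectors (constructed, not real).
kk_toy <- rep(c(0, 2, 4, 8, 16, 32), times = c(10, 64, 32, 16, 8, 4))
cc_toy <- ifelse(kk_toy > 0, 1 / kk_toy, 0)

# Drop nodes whose connectivity or clustering coefficient is 0.
keep <- kk_toy > 0 & cc_toy > 0
kk_use <- kk_toy[keep]
cc_use <- cc_toy[keep]

# Average clustering coefficient at each connectivity value, C(k).
cc_by_k <- tapply(cc_use, kk_use, mean)

# Empirical connectivity distribution P(k).
k_counts <- table(kk_use)
k_values <- as.numeric(names(k_counts))
p_k <- as.numeric(k_counts) / sum(k_counts)

# A straight line on log-log axes suggests a power law, P(k) ~ k^slope.
fit <- lm(log10(p_k) ~ log10(k_values))
slope <- unname(coef(fit)[2])
r_squared <- cor(log10(k_values), log10(p_k))^2

plot(k_values, p_k, log = "xy", xlab = "k", ylab = "P(k)")
```

On real data the connectivity values would normally be binned before fitting, since most individual k values occur only once or twice.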

Another interesting thing to check is the variance of CpG loci vs connectivity. Apparently, in gene expression there is a linear relationship between these properties: more variable genes tend to be more connected. Take the same 4k by 4k matrix, get the var() value for each CpG and plot it against connectivity. Also, plot the mean value of each CpG across all patients in relation to connectivity:
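The variance and mean comparison could be set up along these lines. Everything here runs on simulated noise (data_sim and the 0.95 quantile cutoff are illustrative choices, not the real data or the 0.46 threshold), so the sketch shows the mechanics only and says nothing about the direction of the relationship.

```r
# Simulated stand-in: 50 CpGs (rows) by 30 patients (columns).
set.seed(2)
data_sim <- matrix(rnorm(50 * 30), nrow = 50)

# Hard-thresholded network, as in the earlier steps.
adj <- abs(cor(t(data_sim)))
diag(adj) <- 0
cutoff <- quantile(adj[upper.tri(adj)], probs = 0.95)   # illustrative cutoff
adj_hard <- (adj >= cutoff) * 1
kk <- colSums(adj_hard)                  # connectivity per CpG

# Per-CpG variance and mean across patients.
cpg_var <- apply(data_sim, 1, var)
cpg_mean <- rowMeans(data_sim)

op <- par(mfrow = c(1, 2))
plot(kk, cpg_var, xlab = "connectivity", ylab = "variance")
plot(kk, cpg_mean, xlab = "connectivity", ylab = "mean M value")
par(op)

# A rank correlation gives a one-number summary of the trend.
rho_var <- cor(kk, cpg_var, method = "spearman")
```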

It is interesting to see that with DNA methylation the most connected genes are actually the least variable across patients. Does it really make biological sense? Do we expect this with this type of data? Could it be a reflection of the technology rather than biology? (DNA methylation in itself is not a continuous trait, although it might become one if we have many cells and a heterogeneous population of them.) What if we take the most variable CpGs and build the network out of them rather than all 27k?

Cytoscape view of the 4k network. Color: CpG variance (darker = more variable); size: number of connections

WGCNA

R version 2.13.0 64 bit, SageBionetworksCoex 0.11, running time ~4.5 hours (~27k by 486)

beta=6, R=0.89. Number of modules: 18 + 1 (grey):

Module	# of probes	Module	# of probes
black	108	pink	97
blue	1206	purple	89
brown	615	red	113
cyan	43	salmon	46
greenyellow	86	tan	85
grey	22115	turquoise	2151
grey60	33	yellow	469
lightcyan	34	lightgreen	32
magenta	92	green	121
midnightblue	43

Analysis of the first PC of each module:

With M values, almost no modules were composed of CpGs from a single chromosome (chromosomal density plots can be found here)

In modules lightcyan, lightgreen, salmon and yellow, >55% of the loci came from the same chromosome (plots are here)
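The per-module chromosome share can be tabulated as in the sketch below. The annot data frame is a made-up toy annotation (module and chromosome per locus), so the numbers are not my results; only the 55% cutoff comes from the notes above.

```r
# Toy annotation: one row per CpG with its module and chromosome (made up).
annot <- data.frame(
  module = c(rep("salmon", 5), rep("blue", 5)),
  chr    = c("chr6", "chr6", "chr6", "chr6", "chr1",
             "chr1", "chr1", "chr2", "chr2", "chr3"),
  stringsAsFactors = FALSE
)

# Module by chromosome counts.
tab <- table(annot$module, annot$chr)

# Share of each module's loci that sit on its most frequent chromosome.
top_frac <- apply(tab, 1, max) / rowSums(tab)
top_chr <- colnames(tab)[apply(tab, 1, which.max)]

# Modules where more than 55% of loci come from a single chromosome.
flagged <- names(top_frac)[top_frac > 0.55]
flagged
# "salmon"
```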

Gene Ontology analyses of each module (the top 10 categories were selected; KEGG pathways didn't work for me on that day):

Network and network diagnostics based on M value data (normalization is similar to the one above but center was retained)

The reason for going to the trouble of another round of normalization without removing the center is a comment from Justin, who suggested that center may be associated with patients who come from different genetic backgrounds (different CNV profiles). If this is the case, removing the center would suck out the "genetic" component, and this could be the reason why we don't see the characteristic clustering of CpGs within almost every module defined by the comethylation network. This clustering was seen with the data for which only the methylated probes were used (see here)

Beta is 7, R^2=0.89. Number of modules = 8

Module	# of loci
black	41
blue	938
brown	408
green	135
pink	34
red	45
turquoise	1750
yellow	159

Percent variance explained by the first PC of each module and variability of the first PC of each module:
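For reference, the first PC of a module and the percent variance it explains can be obtained from an SVD of the scaled module data. This is a sketch on simulated modules (make_module, first_pc, data_sim and module_labels are hypothetical names), and I am assuming each locus is centered and scaled before the decomposition, which may differ from what SageBionetworksCoex does internally.

```r
# Simulate two modules of 10 loci each, driven by a shared per-module signal.
set.seed(3)
n_samples <- 40
make_module <- function(n_loci) {
  driver <- rnorm(n_samples)
  t(sapply(seq_len(n_loci),
           function(i) 0.8 * driver + rnorm(n_samples, sd = 0.5)))
}
data_sim <- rbind(make_module(10), make_module(10))   # loci x samples
module_labels <- rep(c("green", "blue"), each = 10)

# First PC scores and percent variance explained for one module.
first_pc <- function(x) {
  scaled <- scale(t(x))                 # samples x loci, centered and scaled
  sv <- svd(scaled)
  list(pc1 = sv$u[, 1] * sv$d[1],       # PC1 score per sample
       pve = sv$d[1]^2 / sum(sv$d^2))   # share of variance on PC1
}

pcs <- lapply(split(seq_len(nrow(data_sim)), module_labels),
              function(idx) first_pc(data_sim[idx, , drop = FALSE]))

pve_by_module <- sapply(pcs, function(p) p$pve)       # variance explained
pc1_var <- sapply(pcs, function(p) var(p$pc1))        # variability of PC1
```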

I think it is interesting that in both networks a module has been identified that is highly enriched in categories such as keratinization, epidermal cell differentiation, epithelial cell differentiation, epidermis development, etc. (module "green" in both cases).