Forcing clusters in colon cancer using Shen or Laird markers for CIMP phenotype

The purpose of this exercise is to check whether, when we take the CIMP markers in colon cancer and force the data into the number of clusters reported in the literature, the resulting clusters correlate with mutations in BRAF, KRAS or TP53 (p53).

Test Shen markers in colon cancer using the 27k platform. Data: M values with batch effect removed, or Beta values (converted from the normalized M values using the equation in Pan, 2010).
Test Laird markers in colon cancer using the 27k platform. Data: M values with batch effect removed. Beta values were not tested because Beta and M values gave similar results with the Shen markers.
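
The M to Beta conversion mentioned above can be sketched as follows. This assumes the equation referred to is the standard logistic relation between the two scales, Beta = 2^M / (2^M + 1), with no offset term. The function names and the example values are hypothetical.

```r
# Convert M values to Beta values and back (assumed standard relation).
# Beta = 2^M / (2^M + 1);  M = log2(Beta / (1 - Beta))
m_to_beta <- function(m) 2^m / (2^m + 1)
beta_to_m <- function(beta) log2(beta / (1 - beta))

# Hypothetical M values for three probes
m <- c(-3, 0, 2)
beta <- m_to_beta(m)
print(round(beta, 3))

# Round trip back to the M scale
print(beta_to_m(beta))
```

An M value of 0 maps to a Beta value of 0.5, and the M scale stretches the values near 0 and 1, which is one reason the two scales can behave differently in clustering.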

Shen markers

Clustering was performed following a short K-means tutorial. The Shen paper describes 3 clusters: CIMP1, CIMP2 and CIMP-negative. CIMP1 is characterized by MSI (80%) and BRAF mutations (53%), with rare KRAS and p53 mutations (16% and 11%, respectively). CIMP2 is associated with KRAS mutations (92%) and rare MSI, BRAF, or p53 mutations (0%, 4%, and 31%, respectively). CIMP-negative cases have a high rate of p53 mutations (71%) and lower rates of MSI (12%) and of BRAF (2%) or KRAS (33%) mutations.

Note: data weren't scaled before clustering

Attempt to determine the number of clusters:

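# Elbow plot: within-cluster sum of squares for k = 1..15
kmeansEval <- function(data, plotTitle) {
  # Transpose so samples are rows and markers are columns, then scale each marker
  dataScale <- scale(t(data))
  # k = 1: total sum of squares around the column means
  wss <- (nrow(dataScale) - 1) * sum(apply(dataScale, 2, var))
  for (i in 2:15) {
    # k-means starts from random centers, so the curve differs between runs
    wss[i] <- sum(kmeans(dataScale, centers = i)$withinss)
  }
  plot(1:15, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares", main = plotTitle)
}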

Plot of the number of clusters vs the within groups sum of squares (the analysis was run 1-7 times, graph shows the results from a single run). Left: M values, Right: Beta values
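
Since k-means starts from random centers, the curve changes from run to run. A minimal sketch of one way to make it reproducible is to fix the random seed and let kmeans keep the best of several random starts via its nstart argument. This is not what was run for the plots above. The simulated matrix x and the function name elbow_wss are hypothetical.

```r
# Elbow curve with a fixed seed and multiple random starts.
# 'x' stands in for a samples-by-markers matrix; here it is simulated.
set.seed(1)
x <- scale(matrix(rnorm(60 * 8), nrow = 60, ncol = 8))

elbow_wss <- function(x, max_k = 15, nstart = 25) {
  sapply(1:max_k, function(k) {
    # tot.withinss is the sum of the per-cluster within sums of squares
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

wss <- elbow_wss(x, max_k = 10)
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

With a fixed seed the same curve is obtained on every run, which makes it easier to compare the M value and Beta value plots.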

The paper identified 3 clusters, so the data were forced into 3 clusters. Code:

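> # Force k-means into 3 clusters (shenScale: samples in rows, Shen markers in columns)
> fit <- kmeans(shenScale, 3)
> library(cluster)
> # Plot the clusters on the first two principal components
> clusplot(shenScale, fit$cluster, color=TRUE, shade=TRUE, labels=4, lines=0, main="Colon, batch removed, Shen markers, M value, 3 clusters")
> # Append the cluster assignment to the data
> shenScale <- data.frame(shenScale, fit$cluster)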

Three clusters based on Shen markers. Left: M value, Right: beta value

For the correlation with the mutation status of the BRAF, KRAS and TP53 genes I used the all_mut.txt table constructed by Qingying. All mutation types were merged into one, so each gene is simply mutated or not mutated in a patient. The association between cluster membership and mutation status was tested with a Chi-square test. The table shows the P values of the test.

	Shen, batch removed; M value; 3 clusters	Shen, batch removed; Beta value; 3 clusters
BRAF	0.1958	0.3837
KRAS	0.00395	0.007297
TP53	0.5645	0.5769
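
The test behind the P values above can be sketched as follows. The layout of all_mut.txt is not shown here, so the inputs are simulated and every variable name is hypothetical. The sketch only illustrates collapsing mutation types to a binary status and running chisq.test on the resulting status-by-cluster table.

```r
# Hypothetical inputs: cluster assignment and mutation type per patient.
set.seed(1)
cluster <- sample(1:3, 90, replace = TRUE)
mut_type <- sample(c(NA, "missense", "nonsense"), 90, replace = TRUE,
                   prob = c(0.6, 0.3, 0.1))

# Collapse all mutation types into mutated (1) / not mutated (0)
mutated <- as.integer(!is.na(mut_type))

# 2 x 3 contingency table and Chi-square test of independence
tab <- table(mutated, cluster)
res <- chisq.test(tab)
print(tab)
print(res$p.value)
```

For a 2 x 3 table the test has 2 degrees of freedom. When some cells have small expected counts, fisher.test on the same table is a common alternative.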

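Only KRAS shows a significant association with the clusters (P = 0.00395 for M values, P = 0.007297 for Beta values); BRAF and TP53 do not. The table below gives the fraction of patients in each cluster with (1) or without (0) a given mutation.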

		M value			Beta value
Gene	Mutated (0/1)	1	2	3	1	2	3
BRAF	0	0.17	0.26	0.22	0.29	0.19	0.17
BRAF	1	0.11	0.15	0.06	0.11	0.10	0.11
KRAS	0	0.14	0.21	0.23
KRAS	1	0.13	0.20	0.05
TP53	0	0.17	0.22	0.14	0.21	0.15	0.17
TP53	1	0.10	0.19	0.14	0.19	0.14	0.10
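
A table like the one above can be produced with prop.table. This is a minimal sketch with made-up data, assuming that each cell is the share of all patients with that combination of mutation status and cluster. The vectors cluster and mutated are hypothetical.

```r
# Hypothetical cluster labels and binary mutation status for 10 patients
cluster <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1)
mutated <- c(0, 1, 0, 0, 1, 0, 0, 1, 0, 0)

# Each cell is the share of all patients in that status/cluster combination
props <- prop.table(table(mutated, cluster))
print(round(props, 2))
```

Passing margin = 2 to prop.table would instead give the mutated fraction within each cluster, which is easier to compare with the per-cluster mutation rates quoted from the paper.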

I also looked at the dataset with batch, age and gender effects removed and did not find anything interesting there either.

Laird markers

The Laird panel consists of five genes, and the paper identified 2 clusters. The analysis was not repeated with Beta values because Beta and M values gave similar results for the Shen markers. The table shows the Chi-square P values.

	BRAF	KRAS	TP53
Laird markers	0.4662	0.5753	0.1088