Unrelated to the project, but useful stuff I learn and very often need

How to filter the lines in a file that contain the right format for the chromosome number

For example I have chr6_hpga_..., but I need to retain only those that have chrN or chrNN or chrX and chrY. In shell:

grep "chr[XY1-9][0-9]*\>" Reference_UCSC_hg18 > output

[XY1-9] - matches one character that is either X, Y or 1-9

[0-9]* - matches 0 or more times integers 0 to 9

\> - matches the end of the word

How to create eSet object from DNA methylation data

Example: eSet object with TCGA LAML data (see analysis here)

> library(Biobase)
> class(laml)
[1] "matrix" #has to be a matrix
# Useful to add clinical data = information about the samples, which will got to pData. Colnames of which should match colnames of the exprs. 
# Need to reformat the columns of the assayData (laml) to match clinical file
> dim(clinical)
[1] 202  23
> one<-sapply(newc,"[",1) #I split the col names by "-" before, now I am about to assemble a short patient ID used in the clinical file
> two<-sapply(newc,"[",2)
> three<-sapply(newc,"[",3)
> shortID<-paste(one, two, three, sep="-")
> colnames(laml)<-shortID
> k<-match(colnames(laml),rownames(clinical))
> length(k)
[1] 190
> clinicalMeth<-clinical[k,] # Get clinical info for the patients for whom DNA methylation data is available
[1] 202  23
> one<-sapply(newc,"[",1)
> two<-sapply(newc,"[",2)
> three<-sapply(newc,"[",3)
> shortID<-paste(one, two, three, sep="-")
> colnames(laml)<-shortID
> k<-match(colnames(laml),rownames(clinical))
> length(k)
[1] 190
> clinicalMeth<-clinical[k,]
> all(rownames(clinicalMeth) == colnames(laml)) #critical step
[1] TRUE
> phenoData <- new("AnnotatedDataFrame", data = clinicalMeth) #critical step
#For the features (CpGs) annotation I download adf files from TCGA: http://tcga-data.nci.nih.gov/tcga/tcgaPlatformDesign.jsp. Read the file into R and
#construct a feature class object
> adf<-read.table("jhu-usc.edu_TCGA_HumanMethylation450.adf.txt",header=T,row.names=1,sep="\t")
> dim(adf)
[1] 486427     11
> dim(laml)
[1] 486411    190
#Next need to match adf and laml. For some weird reason they have different CpG ID (?!). Final datasets are adfMeth and lamlT.
#Need to make sure that the rownames in lamlT and adfMeth are the same using all command.
>feature<-new("AnnotatedDataFrame",data=adfMeth) #create featureData class object: critical step
#Finally create the expression set class object:
> exprs<-new("ExpressionSet",exprs=lamlT, phenoData=phenoData,featureData=feature,annotation=annotation)
> exprs
ExpressionSet (storageMode: lockedEnvironment)
assayData: 483308 features, 190 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: TCGA-AB-2802 TCGA-AB-2803 ... TCGA-AB-3012 (190 total)
  varLabels: age_at_initial_pathologic_diagnosis
    cumulative_agent_total_dose ...
    year_of_initial_pathologic_diagnosis (23 total)
  varMetadata: labelDescription
featureData
  featureNames: cg00000029 cg00000108 ... Target.Removal.2 (483308
    total)
  fvarLabels: CHR GENESYMBOL ... CONTROL_PROBE (11 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: IlluminaHumanMethylation450k

Plotting density histograms for each patients in a matrix

Apparently if a matrix with DNA methylation or gene expression values is converted to an object of ExpressionSet class function density has methods for plotting density distribution for each patient (column) in the matrix. It is very convenient and I don't need to write a function to do so.

#Create a minimum Expression Set
>methExpr <- new("ExpressionSet",exprs=meth) # require(Biobase), require(lumi)
#To make a density plot and color according to a certain parameter
>plotColor<-seq(1,ncols(meth),1)
#x is the indices of columns that are tumor samples. Created using grep and relying on TCGA barcode
>plotColor[x]<-"red"
>plotColor[-x]<-"blue"
#Now plot the results, lines are colored according to tumor-normal. The same was also done for the line type
> png("Colon_Mvalue_dataDistrubitionNormalTumor.png")
> density(methExp,addLegend=F,col=plotColor,main="Colon M value no normalization",lty=lineType)
> legend("topright",legend=c("Tumor","Normal"),lty="solid",col=c("red","blue"),bty="n") #bty - no box around the legend
> dev.off()


#to show distribution by batch extract batch information then assign colors to each batch by converting it to a factor
> colFields<-strsplit(colnames(meth),split="-")

> batch<-sapply(colFields,"[",6)

> batchC<-factor(batchC,labels=brewer.pal(10,"Set3")) #require(RColorBrewer)

> png("Colon_Mvalue_dataDistributionByBatch.png")
> density(methExp,addLegend=F,col=batchC,main="Colon M value no normalization")
> legend("topright",legend=levels(as.factor(batch)),lty="solid",col=brewer.pal(10,"Set3"),bty="n")
> dev.off()