- Get the patient barcodes from the DNA methylation data (511 patients). Truncate the barcodes to their first 3 fields ("-" separated) and request data for download from TCGA: OV cancer, level 3, expression-gene, pasting in the 511 truncated barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes, Affymetrix data wasn't "applicable".
- Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011.
- Get clinical files for the same 511 patient barcodes. Download to the same folder.
- Data is stored as one file per patient with three columns: barcode, gene symbol, value. The first column holds the patient barcode in every row. The function below reads each file and collects the data in a single list:
Code Block |
exprProcess <- function() {
  patientIDs <- list()
  data <- list()
  # Match only file names ending in ".txt" (an unescaped "." matches any character)
  files <- list.files(pattern = "\\.txt$")
  # Gene symbols are taken from the first file
  y <- read.table(files[1], skip = 1, header = FALSE)
  geneSymbols <- y[, 2]
  for (i in seq(along = files)) {
    print(files[i])
    patExpr <- read.table(files[i], skip = 1, header = FALSE)
    # The barcode is the same in every row, so the first row is enough
    patientIDs[[length(patientIDs) + 1]] <- as.character(patExpr[1, 1])
    data[[length(data) + 1]] <- patExpr[, 3]
    print(i)  # progress counter
  }
  return(list(Data = data, patNames = patientIDs, genes = geneSymbols))
} |
After that I combined the list of datasets from each patient into a single matrix, with geneSymbols as row names and patientIDs as column names.
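The combining step itself is not shown on this page. A minimal sketch of how it could look, assuming the list structure returned by exprProcess() and that every file lists the same genes in the same order (the toy `res` object, its gene names and its values are made up for illustration):

```r
# Toy stand-in for the output of exprProcess(); gene names and values are illustrative only
res <- list(
  Data     = list(c(5.1, 7.3, 2.2), c(4.9, 7.8, 2.5)),
  patNames = list("TCGA-04-1331-01A-01R-0434-01", "TCGA-04-1332-01A-01R-0434-01"),
  genes    = c("GENE1", "GENE2", "GENE3")
)

# Bind the per-patient value vectors column-wise: genes in rows, patients in columns
expr <- do.call(cbind, res$Data)
rownames(expr) <- as.character(res$genes)
colnames(expr) <- unlist(res$patNames)

dim(expr)
```

Using do.call(cbind, ...) avoids growing the matrix inside a loop; it relies on all value vectors having the same length.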
The total number of files obtained is 522. Sample types represented in this set:
Code Block |
> colnames(expr)[1:4]
[1] "TCGA-04-1331-01A-01R-0434-01" "TCGA-04-1332-01A-01R-0434-01"
[3] "TCGA-04-1335-01A-01R-0434-01" "TCGA-04-1336-01A-01R-0434-01"
> sp<-strsplit(colnames(expr),split="-")
> samType<-sapply(sp,"[",4)
> table(samType)
01A 01B 01C 01D 02A
481  24   2   1  14 |
So all the patients are represented by tumor samples (codes 01-09 are tumor types: 01 is primary solid tumor, 02 is recurrent solid tumor; the letter A-D is the vial count as it pertains to an individual patient sample).
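As a small illustration of how the sample-type field breaks down, here is a sketch that splits it into the two-digit type code and the vial letter. The variable names typeCode, vial, isTumor and isRecurrent are my own, not from the original analysis:

```r
# Sample-type field values as they appear in the table above
samType <- c("01A", "01B", "01C", "01D", "02A")

# First two characters: sample type code; third character: vial letter
typeCode <- substr(samType, 1, 2)
vial     <- substr(samType, 3, 3)

# Codes 01-09 denote tumor samples
isTumor <- as.integer(typeCode) >= 1 & as.integer(typeCode) <= 9

# Code 02 marks a recurrent solid tumor
isRecurrent <- typeCode == "02"
```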
For the comparison, I also looked at the samples in the methylation dataset:
Code Block |
> mb<-read.table("methylated_batch.txt",header=T,row.names=1)
> length(colnames(mb))
[1] 511
> colnames(mb)[1:4]
[1] "TCGA.04.1331.01A.01D.0432.05" "TCGA.04.1332.01A.01D.0432.05"
[3] "TCGA.04.1335.01A.01D.0432.05" "TCGA.04.1336.01A.01D.0432.05"
> sp<-strsplit(colnames(mb),"[.]")
> samType<-sapply(sp,"[",4)
> table(samType)
01A 01B 01C 01D
483  25   2   1 |
So for the expression analysis I need to drop all samples from recurrent tumors (02). The expression dataset also has fewer 01A and 01B samples than the methylation dataset (481 vs. 483 and 24 vs. 25). The solution is to match the patient IDs using only the first 4 fields, because the remaining fields necessarily differ between the two datasets (for example, the analyte type is D for DNA in the methylation data and R for RNA in the expression data). As a result I got 507 patients with gene expression data (no recurrent tumors); 3 are missing from the 01A group and 1 from the 01B group. I think I will stick with this group for the subsequent normalization steps and analyses.
Code Block |
for (i in seq(along=files)){
} |
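The matching on the first four barcode fields can be sketched roughly as follows. This is not the original code: truncBarcode is a hypothetical helper, and the third ID in each toy vector is made up so that the example contains a recurrent sample and an unmatched one:

```r
# Keep the first n fields of a TCGA barcode; sep is the separator used in the input
truncBarcode <- function(x, n, sep = "-") {
  sp <- strsplit(x, split = sep, fixed = TRUE)
  sapply(sp, function(f) paste(f[1:n], collapse = "-"))
}

# Toy column names; the third entry of each vector is made up for illustration
exprIDs <- c("TCGA-04-1331-01A-01R-0434-01",
             "TCGA-04-1332-01A-01R-0434-01",
             "TCGA-00-0001-02A-01R-0000-01")   # hypothetical recurrent sample
methIDs <- c("TCGA.04.1331.01A.01D.0432.05",
             "TCGA.04.1332.01A.01D.0432.05",
             "TCGA.00.0002.01A.01D.0000.05")   # hypothetical sample without expression data

# Reduce both ID sets to patient + sample type + vial (first 4 fields)
exprKey <- truncBarcode(exprIDs, 4, sep = "-")
methKey <- truncBarcode(methIDs, 4, sep = ".")

# Drop recurrent tumor samples (sample type code 02) from the expression side
exprType <- sapply(strsplit(exprKey, "-", fixed = TRUE), "[", 4)
exprKey  <- exprKey[substr(exprType, 1, 2) != "02"]

# Samples present in both datasets
common <- intersect(exprKey, methKey)
```

The same helper with n = 3 gives the patient-level barcodes of the kind used for the download request.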
The next step is to run PCA, which I will use to identify variables that affect the data.