  1. Get the patient barcodes from the DNA methylation data (511 patients). Truncate each barcode to its first 3 fields ("-" separated), then request the data for download from TCGA: OV cancer, level 3, expression-gene, pasting in the 511 truncated barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes the Affymetrix data wasn't "applicable".
  2. Get the download link and download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. The total number of files obtained is 522. Sample types represented in this set:
    Code Block
    > colnames(expr)[1:4]
    [1] "TCGA-04-1331-01A-01R-0434-01" "TCGA-04-1332-01A-01R-0434-01"
    [3] "TCGA-04-1335-01A-01R-0434-01" "TCGA-04-1336-01A-01R-0434-01"
    > sp<-strsplit(colnames(expr),split="-")
    > samType<-sapply(sp,"[",4)
    > table(samType)
    samType
    01A 01B 01C 01D 02A
    481  24   2   1  14
    So all patients are represented by tumor samples (01-09: tumor types; 01: solid tumor; 02: recurrent solid tumor; A-D: vial, counted per individual patient sample).
  3. Get clinical files for the same 511 patient barcodes. Download to the same folder.
  4. The data are stored as one file per patient, with the following columns: barcode, gene symbol, value. The entire first column holds the patient barcode. The code below extracts the data into a single list, which is then combined into a single matrix:
    Code Block
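    exprProcess<-function(){
      # Collect one barcode and one expression vector per patient file
      patientIDs<-list()
      data<-list()
      # pattern is a regular expression: match file names ending in ".txt"
      files<-list.files(pattern="\\.txt$")
      # Gene symbols (column 2) are taken from the first file
      y<-read.table(files[1],skip=1,header=FALSE,stringsAsFactors=FALSE)
      geneSymbols<-y[,2]
      for (i in seq_along(files)){
        print(files[i])
        # Columns: barcode, gene symbol, value; skip=1 drops the header line
        patExpr<-read.table(files[i],skip=1,header=FALSE,stringsAsFactors=FALSE)
        patientIDs[[i]]<-patExpr[1,1]
        data[[i]]<-patExpr[,3]
        # Progress counter: number of files read so far
        print(i)}

      return(list(Data=data,patNames=patientIDs,genes=geneSymbols))
    }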
    After that I combined the list of datasets from each patient into a single matrix, with geneSymbols as row names and patientIDs as column names.
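
The combination step described above might look like the following minimal sketch. It assumes that every patient file lists the genes in the same order, which the page does not state explicitly. The toy `res` list stands in for the output of `exprProcess()`; its barcodes, gene symbols and values are made up for illustration.

```r
# Toy stand-in for the list returned by exprProcess(); all values are hypothetical.
res <- list(
  Data     = list(c(5.1, 7.3, 2.2), c(4.8, 7.9, 2.5)),
  patNames = list("TCGA-XX-0001-01A-01R-0000-01", "TCGA-XX-0002-01A-01R-0000-01"),
  genes    = c("GENE1", "GENE2", "GENE3")
)

# Assumption: every patient vector has one value per gene, in the same gene order.
stopifnot(all(sapply(res$Data, length) == length(res$genes)))

# Bind the per-patient vectors column-wise: rows are genes, columns are patients.
expr <- do.call(cbind, res$Data)
rownames(expr) <- as.character(res$genes)
colnames(expr) <- sapply(res$patNames, as.character)

dim(expr)
```

If the gene order could differ between files, matching each file's gene column against `geneSymbols` before binding would be the safer route.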

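The barcode truncation in step 1 can be sketched in the same style as the sample-type tabulation in step 2. This is an illustration only: the `barcodes` vector reuses the four column names printed in step 2 rather than the methylation barcodes, and `truncateBarcode` is a hypothetical helper name, not a function from the original analysis.

```r
# Example full barcodes (the four column names printed in step 2).
barcodes <- c("TCGA-04-1331-01A-01R-0434-01", "TCGA-04-1332-01A-01R-0434-01",
              "TCGA-04-1335-01A-01R-0434-01", "TCGA-04-1336-01A-01R-0434-01")

# Hypothetical helper: keep the first n "-"-separated fields of each barcode.
truncateBarcode <- function(x, n = 3) {
  fields <- strsplit(x, split = "-", fixed = TRUE)
  sapply(fields, function(f) paste(f[seq_len(min(n, length(f)))], collapse = "-"))
}

# One truncated barcode per patient, duplicates removed.
patients <- unique(truncateBarcode(barcodes))
patients
```

Taking `unique()` matters when a patient has several samples or vials, since those share the same first three fields.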