Phenotype Editor

Project Description

The phenotype editor is a project that enables curators to better track the Clinical Phenotype information in a dataset.

Use Cases (In order of importance)

  1. Make sure all values in a column exist in a user-defined enumeration
  2. Set units and a description for a column
  3. Make sure all values in a column match an existing ontology
  4. Standardize clinical variable names (columns) across studies
  5. Complete ontology for Sage use => EFO is a partial solution (Brig)
  6. Clean up misspellings, synonyms, capitalization => Google Refine
  7. Generate a script to curate data and apply the same transformations to a new/updated dataset => Google Refine
  8. Show the description of a term
  9. Keep some sort of record of what was changed, and from what to what
  10. Unit conversion
  11. Link across studies by id (cell lines, same patients in multiple studies)
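
Use cases 6 and 7 amount to recording curation steps as data so that they can be replayed on a new or updated dataset. The following is a minimal sketch of that idea only, not of how Google Refine implements it; the step format (`op`, `column`, `from`, `to`) and the function names are hypothetical.

```python
def apply_step(step, value):
    """Apply one recorded curation step to a single cell value."""
    if step["op"] == "lowercase":
        return value.lower()
    if step["op"] == "replace":
        return step["to"] if value == step["from"] else value
    raise ValueError(f"unknown operation: {step['op']}")


def apply_script(script, rows):
    """Replay every step of a script on rows (a list of dicts).

    The input rows are left untouched; a curated copy is returned.
    """
    result = [dict(row) for row in rows]
    for step in script:
        column = step["column"]
        for row in result:
            if column in row:
                row[column] = apply_step(step, row[column])
    return result
```

Because the script is plain data, it could be stored next to the dataset and re-run whenever new rows arrive.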

Server Design

It appears that the most important use cases can be satisfied almost entirely with existing features of the repository service. These include:

  • Persisting the starting matrix file, if there is one (Blob annotation of string, perhaps chunked into JSON object)
  • Persisting the transformed/annotated matrix file (Blob annotation of JSON string)
  • Persisting the ObjectSchema that describes the columns and the valid values in those columns (Blob annotations)
  • [Future] Persisting the Redo/Undo queues (provenance) (Blob annotation of JSON string)
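
As a rough illustration of the "chunked into JSON object" option, the sketch below serializes a matrix to a JSON string, splits it into fixed-size pieces, and reassembles it. The chunk size and the `chunkCount`/`chunks` field names are assumptions; the actual size limit of a Blob annotation is not stated here.

```python
import json

# Assumption: the largest string one annotation should hold.
CHUNK_SIZE = 1000


def matrix_to_chunks(rows, chunk_size=CHUNK_SIZE):
    """Serialize a matrix (list of rows) to JSON and split it into chunks."""
    text = json.dumps(rows)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Wrap the pieces in one object so they are stored and reassembled together.
    return {"chunkCount": len(chunks), "chunks": chunks}


def chunks_to_matrix(blob):
    """Reassemble the chunks and parse the matrix back out."""
    if blob["chunkCount"] != len(blob["chunks"]):
        raise ValueError("chunk count does not match the stored chunks")
    return json.loads("".join(blob["chunks"]))
```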

What we need and do not have is a validator that can take an ObjectSchema object and a filled JSONAdaptor (aka a JSON object) and validate that the Adaptor object conforms to the requirements of the ObjectSchema. (John)
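
A minimal sketch of what such a validator might check, written against plain dictionaries rather than the real ObjectSchema and JSONAdaptor classes. The schema layout (`properties`, `type`, `enum`, `required`) is an assumption made for illustration and may not match the actual ObjectSchema.

```python
# Hypothetical type names mapped to the checks they imply.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate(schema, obj):
    """Return a list of problems; an empty list means obj conforms."""
    problems = []
    properties = schema.get("properties", {})
    for name, rules in properties.items():
        if name not in obj:
            if rules.get("required", False):
                problems.append(f"{name}: missing required value")
            continue
        value = obj[name]
        check = TYPE_CHECKS.get(rules.get("type"))
        if check is not None and not check(value):
            problems.append(
                f"{name}: expected {rules['type']}, got {type(value).__name__}")
        allowed = rules.get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: {value!r} is not an allowed value")
    # Values for columns the schema does not describe are reported as well.
    for name in obj:
        if name not in properties:
            problems.append(f"{name}: column is not described by the schema")
    return problems
```

Returning every problem, rather than stopping at the first one, would let the editor highlight all offending cells of a row at once.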

Google Refine looks very good, provided we create an extension for reconciling data. Beyond reconciliation, the other problems are:

  • No support for adding new rows or new data
    • It is key, when supporting undo/redo history, that additions be just another modification operation. In Google Refine you would have to start over.
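
To make the point about additions concrete, here is a small sketch of an undo/redo history in which adding a row is an operation of the same kind as editing a cell. All class names are hypothetical, and the matrix is simply a list of rows.

```python
class SetCell:
    """Change one cell, remembering the old value so the edit can be reverted."""

    def __init__(self, row, col, new_value):
        self.row, self.col, self.new_value = row, col, new_value
        self.old_value = None

    def apply(self, matrix):
        self.old_value = matrix[self.row][self.col]
        matrix[self.row][self.col] = self.new_value

    def revert(self, matrix):
        matrix[self.row][self.col] = self.old_value


class AddRow:
    """Append a row; reverting removes it again."""

    def __init__(self, values):
        self.values = list(values)

    def apply(self, matrix):
        matrix.append(list(self.values))

    def revert(self, matrix):
        # Safe because operations are always undone in reverse order.
        matrix.pop()


class History:
    """Undo and redo stacks over a shared matrix."""

    def __init__(self, matrix):
        self.matrix = matrix
        self.undo_stack = []
        self.redo_stack = []

    def do(self, op):
        op.apply(self.matrix)
        self.undo_stack.append(op)
        self.redo_stack.clear()  # a new edit invalidates the redo queue

    def undo(self):
        if not self.undo_stack:
            return False
        op = self.undo_stack.pop()
        op.revert(self.matrix)
        self.redo_stack.append(op)
        return True

    def redo(self):
        if not self.redo_stack:
            return False
        op = self.redo_stack.pop()
        op.apply(self.matrix)
        self.undo_stack.append(op)
        return True
```

The undo stack also serves as a record of what was changed, and from what to what, since each `SetCell` keeps both the old and the new value.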

See the Data Model section below for details on the shape of the objects listed above.

NCBO Web Services

The plan is to use the NCBO services directly, without any routing through our service layers. This means that REST requests will be made directly from the GWT server side. In the future, if we find that the NCBO services become non-performant or are not amenable to batch requests, we can implement a local cache/batching skin.

We're planning to use the following NCBO services:

  • Annotator - for determining which ontology is best for a column. Initially this will be user guided: when the user wants to select an ontology for a column, the unique values of the column are sent to the Annotator, and a ranked list of suggested ontologies is displayed to the user for selection. (Alternatively, the user can just select from a list of all NCBO ontologies.)
  • Ontology Term Search - for mapping each unique value in the column to a controlled term within the column's specified ontology
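
The Annotator flow described above can be sketched as follows. The `annotate` argument is a hypothetical callable standing in for the NCBO Annotator call; the real request and response formats are not shown here. Ranking by the number of matched values is likewise an assumption, since the ranking criterion is not specified.

```python
from collections import Counter


def unique_values(column):
    """Distinct non-empty values of a column, in first-seen order."""
    seen = []
    for value in column:
        text = value.strip()
        if text and text not in seen:
            seen.append(text)
    return seen


def rank_ontologies(column, annotate):
    """Rank ontologies by how many unique column values they match.

    `annotate` takes one term and returns the ids of the ontologies that
    contain a match for it (hypothetical stand-in for the service).
    """
    hits = Counter()
    for term in unique_values(column):
        for ontology_id in set(annotate(term)):
            hits[ontology_id] += 1
    # Highest match count first; ties are broken by ontology id.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))
```

Sending only the unique values keeps the number of service requests proportional to the vocabulary of the column rather than to its length.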

Client

The web UI will hold a significant amount of the business logic for the Phenotype Editor. Whenever possible, persistence details will be handled on the server to keep the CRUD operations decoupled (e.g. a column's ObjectSchema will be created in the client as an ObjectSchema POJO, but it will be serialized and persisted into the Blob annotations on the server side behind a "saveColumnDefinition"-like method).
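
The split between a plain client-side object and server-side serialization might look roughly like this. `ColumnDefinition` and `AnnotationStore` are hypothetical stand-ins for the ObjectSchema POJO and the Blob annotations, and the in-memory dictionary only imitates persistence.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ColumnDefinition:
    """Client-side description of one column (stand-in for the POJO)."""
    name: str
    description: str = ""
    units: str = ""
    allowed_values: list = field(default_factory=list)


class AnnotationStore:
    """In-memory stand-in for the server's Blob annotations."""

    def __init__(self):
        self._blobs = {}

    def save_column_definition(self, dataset_id, column):
        # The client hands over a plain object; JSON appears only here.
        self._blobs[(dataset_id, column.name)] = json.dumps(asdict(column))

    def load_column_definition(self, dataset_id, name):
        return ColumnDefinition(**json.loads(self._blobs[(dataset_id, name)]))
```

With this shape, the client never needs to know that the definition ends up as a JSON string inside an annotation.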

Data Model