Administration of enclave challenges

Overview

External data enclaves are data analysis platforms where research data is stored, curated, and analyzed on a self-contained platform that can’t be reached through an API or other external access protocol. Historically, Sage has run challenges through the Synapse platform using either the Data-to-Model approach or the Model-to-Data approach. In those approaches, the data is stored in Synapse or a Sage-controlled cloud platform, or on a Docker-accessible server controlled by the data owners (see EHR DREAM Challenges with UW).

To date, there have been two external data enclave challenges: the Pediatric COVID-19 Data Challenge and the NIH Long COVID Computational Challenge. Both were run in the National COVID Cohort Collaborative (N3C) data enclave. This article goes through what was necessary for those challenges and highlights key lessons that will most likely generalize to future enclave challenges. Each enclave is unique, so future enclave challenges will need to adapt to the enclave environment and capabilities, but there will be commonalities between all enclaves. The article covers only the aspects that are specific to an enclave challenge. It does not cover typical challenge procedures such as working with governance (which should always be contacted early in the challenge planning stages), communications and marketing, winner announcements, model evaluations, etc.

Enclave Challenge Preparation

For the N3C challenges, it was necessary to collaborate with the IT support team from Palantir to create special permissions and project folder structures to properly carry out the challenges. If an enclave gives a project owner (i.e. challenge administrator) the necessary permissions to properly structure the challenge, then collaboration with the IT team won’t be necessary. However, as you plan the challenge, make sure you are able to carry out the following configuration steps.

Setting up the enclave challenge

Identify testing data isolation method

Depending on the data in the enclave and how often it is updated, there may be a variety of possible methods for collecting the held-out test set used to evaluate the methods submitted by participants. The main issue to overcome is that most enclaves make all of their data available to all participants, so creating a “hidden” gold standard becomes difficult. Here are a few options to consider when designing the evaluation test set in an enclave challenge. For each option, the sections below lay out the procedure, describe how it was used in a challenge, and specify what the enclave needs to support.

Prospective Data Collection

In the Pediatric COVID-19 Data Challenge, we “froze” a set of data on a chosen version of the N3C data. Participants used the “frozen” version of the data to train and internally test their models before submitting their final model. After this phase, we collected all the data that had accumulated over the course of the training phase and evaluated the submitted models on this prospectively collected data.

Enclave requirements

  • Regularly updated data. The N3C enclave received regular data updates every two weeks which included newly onboarded data partners and new data from existing data partners. The newly accumulated data was necessary to test the submissions.

  • Data versioning. The N3C enclave has the ability to select specific versions of the data tied to the date each was released. This is important for cross-dataset version tracking, especially for datasets built from multiple tables and sources. It may be possible to use this method without data versioning, but tracking and managing the data may be cumbersome.

  • Isolating data. In the N3C enclave, we set up two projects, one for the organizers and one for the challenge participants. The organizers’ project space had access to the full enclave dataset. From there we selected a “version” of the data to use, created the challenge cohort, and then transferred that data to the challenge participant project. The challenge participant project was specifically set up so that the enclave data could not be imported and only the organizer-created challenge data could be used. This was done in collaboration with the Palantir support team, who created specialized permission structures so we could control access to the data.
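The freeze-then-collect procedure above amounts to a split on release date. The following is a minimal Python sketch of that idea, not the code used in the challenge. The record layout, the `release_date` field, the function name, and the dates are all hypothetical, and an actual enclave would apply the same logic through its own data versioning tools.

```python
from datetime import date


def split_by_release(records, freeze_date, evaluation_date):
    """Partition records into a frozen training set and a prospective test set.

    records: iterable of dicts with a hypothetical "release_date" field.
    Records released on or before freeze_date form the frozen set.
    Records released after freeze_date, up to evaluation_date, form the
    prospectively collected set. Later records are ignored.
    """
    frozen, prospective = [], []
    for record in records:
        released = record["release_date"]
        if released <= freeze_date:
            frozen.append(record)
        elif released <= evaluation_date:
            prospective.append(record)
    return frozen, prospective


# Illustrative data only; the IDs and dates are made up.
records = [
    {"patient_id": "p1", "release_date": date(2021, 8, 1)},
    {"patient_id": "p2", "release_date": date(2021, 8, 15)},
    {"patient_id": "p3", "release_date": date(2021, 9, 12)},
    {"patient_id": "p4", "release_date": date(2021, 12, 1)},
]
frozen, prospective = split_by_release(
    records, freeze_date=date(2021, 8, 15), evaluation_date=date(2021, 10, 1)
)
```

The sketch does not decide how to treat patients who appear in both periods. That choice belongs to the challenge cohort definition.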

Hold Out Test Set

In the Long COVID Computational Challenge, we set aside a subset of the most up-to-date data in the enclave. We eventually brought in a prospectively collected dataset, but the main challenge was conducted with the held-out test set.

Enclave requirements

  • Isolating data. Same as above, except that the test set was set aside and not imported until after the training phase was complete.
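One way to set aside a test set so that the assignment is reproducible is to assign patients by a salted hash of their identifier. The sketch below illustrates that technique under stated assumptions. It is not the method used in the Long COVID Computational Challenge, where the held-out set was a subset of the most recent data. The function names, the salt, and the 20% fraction are hypothetical.

```python
import hashlib


def is_held_out(patient_id, holdout_fraction=0.2, salt="challenge-holdout"):
    """Return True if the patient belongs to the held-out test set.

    The decision depends only on the salt and the patient ID, so rerunning
    the split on refreshed data keeps every patient on the same side.
    """
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < holdout_fraction


def split_patients(patient_ids, holdout_fraction=0.2):
    """Split patient IDs into (training, held_out) lists at the patient level."""
    training, held_out = [], []
    for patient_id in patient_ids:
        if is_held_out(patient_id, holdout_fraction):
            held_out.append(patient_id)
        else:
            training.append(patient_id)
    return training, held_out


# Illustrative IDs only.
patient_ids = [f"patient-{i}" for i in range(1000)]
training, held_out = split_patients(patient_ids)
```

Splitting at the patient level, rather than the record level, keeps all of a patient’s records on one side of the split. Only the training list would be transferred to the participant project during the training phase.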

Caveats

With both of these methods, there is still a chance that participants can access the hidden test set by opening another project in the enclave and running experiments on the full enclave dataset. If possible, work with the enclave managers or directors to withhold a set of data as the hidden testing dataset and postpone its release until the challenge is over.

Identify team folder structure method

Within the enclave constraints, you’ll need to figure out how to set up team-specific working environments. These environments must be protected from other teams and must not have access to the full enclave dataset, but they must give teams enough resources to develop their methods.

In both the Pediatric and L3C challenges, we had separate team folders with special permission structures set up so that individual users could be added to their team folder with access, members of other teams weren’t able to view or edit files in that folder, and the challenge data could be imported into the folders to be worked with. This setup was possible because we worked with Palantir tech support to set up the proper permission tags.
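The folder rules described above can be written down as a small model and checked before the challenge opens. The Python sketch below is illustrative only. It does not use any Palantir or enclave API, and the `TeamFolder` class, the `audit` function, and the dataset names are hypothetical. It also assumes that each user belongs to exactly one team, which the enclave itself may not enforce.

```python
from dataclasses import dataclass, field


@dataclass
class TeamFolder:
    name: str
    members: set = field(default_factory=set)
    imported_datasets: set = field(default_factory=set)


def can_access(user, folder):
    """A user may view or edit a folder only if they are a member of it."""
    return user in folder.members


def audit(folders, allowed_datasets):
    """Return a list of human-readable problems found in the folder setup."""
    problems = []
    first_team = {}
    for folder in folders:
        # Only organizer-created challenge data may be imported.
        for dataset in sorted(folder.imported_datasets - allowed_datasets):
            problems.append(f"{folder.name}: imports non-challenge dataset {dataset}")
        # A user in two folders could see another team's files.
        for user in sorted(folder.members):
            if user in first_team:
                problems.append(
                    f"{user}: member of both {first_team[user]} and {folder.name}"
                )
            else:
                first_team[user] = folder.name
    return problems


# Illustrative setup with two deliberate mistakes.
folders = [
    TeamFolder("team-a", {"alice", "bob"}, {"challenge_training"}),
    TeamFolder("team-b", {"carol", "alice"}, {"challenge_training", "full_enclave"}),
]
problems = audit(folders, allowed_datasets={"challenge_training"})
```

An empty list from `audit` means the modeled setup matches the rules. It says nothing about whether the enclave enforces them, which still has to be confirmed with the enclave support team.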

Synapse wiki and onboarding instructions

For both the Pediatric and L3C challenges, we set up a Synapse wiki page with instructions for how to register and onboard for the challenge and the N3C enclave. The high-level steps were the same in both challenges. Challenge registration was handled through Synapse. We attached a governance clickwrap agreement to the register button so people could agree to the terms and conditions.

See the How to Participate page for an example of what information was made available.

Setting up the Challenge Onboarding Process

In both challenges, the onboarding process was split into two processes: (1) challenge registration on Synapse and (2) onboarding into the enclave. The challenge registration process was the same as the normal process, except that teams were required even if only one person was on the team. This was so the Enclave Challenge Management script would work with the Synapse API.

Once the wiki is set up, use the Enclave Challenge Toolkit to link the Synapse registration process with the Enclave onboarding process. The Enclave onboarding process will be unique to each enclave, but you can find examples from the N3C challenges on the How to Participate pages of the Pediatric and L3C challenges.
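The link between the two processes comes down to comparing who has registered on Synapse with who has finished enclave onboarding. The sketch below shows that comparison on plain Python data. It does not call the Synapse API or the Enclave Challenge Toolkit, and the function name, input shapes, and usernames are hypothetical.

```python
def pending_onboarding(teams, enclave_accounts):
    """List registered participants who have not finished enclave onboarding.

    teams: dict mapping a Synapse team name to a list of Synapse usernames.
    enclave_accounts: dict mapping a Synapse username to an enclave account.
    Returns a dict mapping each team name to its sorted pending usernames.
    """
    pending = {}
    for team, members in teams.items():
        if not members:
            # Registration requires a team even for a single participant,
            # so an empty team points to a registration problem.
            raise ValueError(f"team {team!r} has no members")
        missing = sorted(m for m in members if m not in enclave_accounts)
        if missing:
            pending[team] = missing
    return pending


# Illustrative data only.
teams = {"team-a": ["alice", "bob"], "solo-team": ["carol"]}
enclave_accounts = {"alice": "alice_enclave"}
pending = pending_onboarding(teams, enclave_accounts)
```

Keying everything by team is what makes the single-person team requirement useful: every participant can be handled through the same team-based lookup.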

Enclave Challenge Examples: Pediatric COVID-19 Data Challenge and the L3C Challenge

N3C Data Enclave. Both the Pediatric COVID-19 Data Challenge and the NIH Long COVID Computational Challenge (L3C) were run in the N3C data enclave, which represented over 75 healthcare centers and hospitals from across the US that deposited their harmonized electronic health record data into the centralized enclave during the COVID epidemic. The enclave contained over 19 million patients, including ~7.5 million COVID-positive patients.

Pediatric COVID-19 Data Challenge. We asked participants two challenge questions:

  1. Of pediatric patients (<18 years of age at the time of COVID positivity) who test positive for COVID-19 in an outpatient setting, who is at risk for hospitalization?

  2. Of pediatric patients who test positive for COVID-19 and are hospitalized, who is at risk for needing ventilation or cardiovascular interventions?

NIH L3C Challenge. We asked participants one challenge question:

  1. Of patients who have tested positive for SARS-CoV-2 in an outpatient or inpatient (ICU or non-ICU) setting, what is the probability of developing Long COVID?