Search Test Data Submission Guideline

Objective

To enhance the search and discovery experience on Synapse, we need a well-defined test dataset that supports measuring, benchmarking, and optimizing search performance. This dataset will enable consistent evaluation of search engines and establish a baseline for future improvements, which in turn should improve user engagement.

 

This test data submission is currently limited to Synapse datasets, that is, collections of files, folders, and projects in Synapse, each identified by a synID (e.g., syn12345678). Table rows are out of scope for this schematized test data submission, although the combination of row_id and row_version would serve as a stable and unique identifier for them.

Acceptance Criteria

  • Diverse Query Set

    • The dataset must include queries representing real-world user searches.

      • Examples:

        • Single keywords

        • Multi-term searches 

        • Phrase-based searches

        • Acronyms & Abbreviations (e.g., "ALS" for "Amyotrophic Lateral Sclerosis")

        • Synonyms & Variations (e.g., "Alzheimer’s" vs. "AD")

        • Misspellings & Typos (e.g., "diabetes" vs. "diabtes")

        • Ambiguous Terms (e.g., "MHC" could mean "Major Histocompatibility Complex" or "Molecular Hybridization Capture")

        • Multi-language Queries (if applicable, e.g., Latin medical terms)

  • Gold Standard Results

    • Each query must be paired with a set of expected relevant results, determined by expert assessment.

  • Structured and Unstructured Queries

    • Queries should reflect different user intent types, including:

      • Structured queries (e.g., metadata-driven searches)

      • Unstructured queries (e.g., free-text searches)

  • Version-Controlled Storage

    • The test dataset must be stored in a version-controlled repository (e.g., GitHub, Synapse) to enable:

      • Repeatability in evaluations

      • Future benchmarking and optimizations
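
As an illustration of how gold standard results support measurement, the sketch below scores one query's ranked results against its expected synIDs. The choice of precision@k and recall@k, the function name `precision_recall_at_k`, and the sample IDs are assumptions made for this example; the guideline itself does not prescribe a metric.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    returned_ids: Sequence[str], gold_ids: Set[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single query.

    Assumes returned_ids is ranked best-first and contains no duplicates.
    """
    top_k = list(returned_ids)[:k]
    hits = sum(1 for syn_id in top_k if syn_id in gold_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_ids) if gold_ids else 0.0
    return precision, recall


# Hypothetical engine output for one query, ranked best-first.
returned = ["syn123", "syn999", "syn456", "syn111"]
# Expert-assessed gold standard for the same query.
gold = {"syn123", "syn456", "syn789"}

precision, recall = precision_recall_at_k(returned, gold, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

Because the gold standard is stored under version control, the same scores can be recomputed later against a different engine or configuration and compared directly.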

Submission Process

  1. Submit the test cases as a JSON file that conforms to the JSON Schema shown below.

  2. Commit updates to the chosen version-controlled repository.

  3. Provide the file URL and the file version number or tag name as part of the submission.

 

Here is the JSON schema for the test cases:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "queryString": {
        "type": "string",
        "description": "The query string for the test case."
      },
      "ids": {
        "type": "array",
        "items": {
          "type": "string",
          "description": "A synID."
        },
        "description": "A list of IDs associated with the test case."
      }
    },
    "required": [
      "queryString",
      "ids"
    ],
    "additionalProperties": false,
    "description": "A test case with a query string and a list of IDs."
  },
  "description": "An array of test cases, each with a query string and a list of IDs."
}
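
Before committing a submission, it may help to check it locally. The following is a minimal standard-library sketch that mirrors the constraints in the schema above (array of objects, required `queryString` and `ids`, no additional properties). The function name `find_problems` is hypothetical, and the `syn` plus digits pattern is an assumption based on the example syn12345678; it is stricter than the schema, which only requires strings. A full JSON Schema validator, such as the third-party `jsonschema` package, would check the schema itself rather than a hand-written approximation.

```python
import json
import re
from typing import Any, List

# Assumed synID shape ("syn" followed by digits); not enforced by the schema itself.
SYN_ID = re.compile(r"^syn\d+$")


def find_problems(test_cases: Any) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    if not isinstance(test_cases, list):
        return ["top level must be an array"]
    problems: List[str] = []
    for i, case in enumerate(test_cases):
        if not isinstance(case, dict):
            problems.append(f"case {i}: must be an object")
            continue
        # Mirrors "additionalProperties": false.
        extra = set(case) - {"queryString", "ids"}
        if extra:
            problems.append(f"case {i}: unexpected properties {sorted(extra)}")
        # Mirrors "required" and the "string" type of queryString.
        if not isinstance(case.get("queryString"), str):
            problems.append(f"case {i}: queryString must be a string")
        ids = case.get("ids")
        if not isinstance(ids, list):
            problems.append(f"case {i}: ids must be an array")
            continue
        for syn_id in ids:
            if not isinstance(syn_id, str):
                problems.append(f"case {i}: id {syn_id!r} must be a string")
            elif not SYN_ID.match(syn_id):
                problems.append(f"case {i}: id {syn_id!r} does not look like a synID")
    return problems


submission = json.loads(
    '[{"queryString": "metabolic dynamics", "ids": ["syn123", "syn456"]}]'
)
print(find_problems(submission))
```

Note that `json.loads` itself rejects malformed JSON, such as a stray quote after an ID, before any of these checks run.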

 

Here is an example of what the test cases might look like:

[
  {
    "queryString": "metabolic dynamics",
    "ids": ["syn123", "syn456", "syn789"]
  },
  {
    "queryString": "mega data",
    "ids": ["syn1001", "syn1002"]
  }
]
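
To show how a file of test cases might be consumed, here is a small sketch that runs every query through a search function and reports the mean reciprocal rank (MRR) of the first expected synID. MRR is one possible summary metric, chosen here as an assumption; `fake_search` and its canned rankings are invented stand-ins for a real search client.

```python
from typing import Callable, Dict, List


def mean_reciprocal_rank(
    test_cases: List[dict], search: Callable[[str], List[str]]
) -> float:
    """Average over queries of 1/rank of the first expected ID (0 if none is returned)."""
    if not test_cases:
        return 0.0
    total = 0.0
    for case in test_cases:
        expected = set(case["ids"])
        for rank, syn_id in enumerate(search(case["queryString"]), start=1):
            if syn_id in expected:
                total += 1.0 / rank
                break
    return total / len(test_cases)


test_cases = [
    {"queryString": "metabolic dynamics", "ids": ["syn123", "syn456", "syn789"]},
    {"queryString": "mega data", "ids": ["syn1001", "syn1002"]},
]

# Made-up rankings for illustration only; a real run would call the search engine.
canned: Dict[str, List[str]] = {
    "metabolic dynamics": ["syn123", "syn555"],
    "mega data": ["syn777", "syn1002"],
}


def fake_search(query: str) -> List[str]:
    return canned.get(query, [])


mrr = mean_reciprocal_rank(test_cases, fake_search)
print(f"MRR = {mrr:.2f}")
```

With these canned rankings the first query finds an expected ID at rank 1 and the second at rank 2, so the script prints `MRR = 0.75`.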