Search Test Data Submission Guideline

Objective

To enhance the search and discovery experience on Synapse, we need a well-defined test dataset that supports measuring, benchmarking, and optimizing search performance. This dataset will ensure consistent evaluation of search engines, improve user engagement, and establish a baseline for future improvements.

The scope of this test data submission is currently limited to Synapse datasets, that is, collections of files, folders, and projects in Synapse identified by a synID (e.g., syn12345678). For tables, the combination of row_id and row_version serves as a stable and unique identifier, but tables fall outside the scope of this schematized test data submission.

Acceptance Criteria

Diverse Query Set
- The dataset must include queries representing real-world user searches.
  - Examples:
    - Single keywords
    - Multi-term searches
    - Phrase-based searches
    - Acronyms & Abbreviations (e.g., "ALS" for "Amyotrophic Lateral Sclerosis")
    - Synonyms & Variations (e.g., "Alzheimer’s" vs. "AD")
    - Misspellings & Typos (e.g., "diabetes" vs. "diabtes")
    - Ambiguous Terms (e.g., "MHC" could mean "Major Histocompatibility Complex" or "Molecular Hybridization Capture")
    - Multi-language Queries (if applicable, e.g., Latin medical terms)
Gold Standard Results
- Each query must be paired with a set of expected relevant results, determined by expert assessment.
Structured and Unstructured Queries
- Queries should reflect different user intent types, including:
  - Structured queries (e.g., metadata-driven searches)
  - Unstructured queries (e.g., free-text searches)
Version-Controlled Storage
- The test dataset must be stored in a version-controlled repository (e.g., GitHub, Synapse) to enable:
  - Repeatability in evaluations
  - Future benchmarking and optimizations

Submission Process

Submit datasets as JSON files that conform to the JSON Schema shown below.
Commit updates to the chosen version-controlled repository.
Provide the file URL and file version number or tag name as part of the submission.

Here is the JSON schema for the test cases:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "queryString": {
        "type": "string",
        "description": "The query string for the test case."
      },
      "ids": {
        "type": "array",
        "items": {
          "type": "string",
          "description": "A Synapse identifier (synID), e.g., syn12345678."
        },
        "description": "A list of IDs associated with the test case."
      }
    },
    "required": [
      "queryString",
      "ids"
    ],
    "additionalProperties": false,
    "description": "A test case with a query string and a list of IDs."
  },
  "description": "An array of test cases, each with a query string and a list of IDs."
}
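
Before committing a submission, it may help to check it against the constraints the schema expresses. The following is a minimal sketch in Python using only the standard library; it is a hand-written check that mirrors this particular schema (array of objects, required queryString and ids, no additional properties), not a general JSON Schema validator. The function name validate_test_cases is hypothetical. Like the schema itself, it does not check that each ID actually looks like a synID.

```python
import json

ALLOWED_KEYS = ("queryString", "ids")


def validate_test_cases(data):
    """Return a list of problems found; an empty list means the data conforms."""
    if not isinstance(data, list):
        return ["top level must be an array of test cases"]

    errors = []
    for i, case in enumerate(data):
        if not isinstance(case, dict):
            errors.append(f"item {i}: must be an object")
            continue

        # "required": both properties must be present.
        for key in ALLOWED_KEYS:
            if key not in case:
                errors.append(f"item {i}: missing required property '{key}'")

        # "additionalProperties": false
        for key in case:
            if key not in ALLOWED_KEYS:
                errors.append(f"item {i}: unexpected property '{key}'")

        # "queryString" must be a string.
        if "queryString" in case and not isinstance(case["queryString"], str):
            errors.append(f"item {i}: 'queryString' must be a string")

        # "ids" must be an array of strings.
        if "ids" in case:
            ids = case["ids"]
            if not isinstance(ids, list):
                errors.append(f"item {i}: 'ids' must be an array")
            else:
                for j, value in enumerate(ids):
                    if not isinstance(value, str):
                        errors.append(f"item {i}: ids[{j}] must be a string")
    return errors


if __name__ == "__main__":
    # Hypothetical inline submission used only to exercise the check.
    sample = json.loads('[{"queryString": "metabolic dynamics", "ids": ["syn123"]}]')
    problems = validate_test_cases(sample)
    print("valid" if not problems else "\n".join(problems))
```

Malformed JSON is rejected earlier, by json.loads, so syntax errors in a submission surface before the structural checks run.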

Here is an example of what the test cases might look like:

[
  {
    "queryString": "metabolic dynamics",
    "ids": ["syn123", "syn456", "syn789"]
  },
  {
    "queryString": "mega data",
    "ids": ["syn1001", "syn1002"]
  }
]
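
One way test cases in this shape might be used for benchmarking is to compare the IDs a search engine returns against the expected ids for each query. The sketch below computes precision and recall over the top k results. The choice of these two metrics, the cutoff k, and the function names are assumptions made for illustration and are not prescribed by this guideline; fake_search is a hypothetical stand-in for whatever search backend is under evaluation.

```python
def precision_recall(returned_ids, expected_ids, k=10):
    """Precision and recall of the top-k returned IDs against the expected IDs."""
    # Drop duplicate results while preserving rank order, then cut off at k.
    top_k = list(dict.fromkeys(returned_ids))[:k]
    expected = set(expected_ids)
    hits = sum(1 for result_id in top_k if result_id in expected)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


def evaluate(test_cases, search, k=10):
    """Run each query through `search` and score it against its expected ids."""
    scores = {}
    for case in test_cases:
        returned = search(case["queryString"])
        scores[case["queryString"]] = precision_recall(returned, case["ids"], k)
    return scores


if __name__ == "__main__":
    test_cases = [
        {"queryString": "metabolic dynamics", "ids": ["syn123", "syn456", "syn789"]},
    ]

    def fake_search(query):
        # Hypothetical backend: returns a fixed ranked list of synIDs.
        return ["syn123", "syn999", "syn456"]

    for query, (precision, recall) in evaluate(test_cases, fake_search, k=3).items():
        print(f"{query}: precision={precision:.2f} recall={recall:.2f}")
```

Because the scores depend only on the query string and the expected ids, rerunning the same versioned test file against a changed search engine gives directly comparable numbers.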