Objective
To enhance the search and discovery experience on Synapse, we need a well-defined test dataset that supports measuring, benchmarking, and optimizing search performance. This dataset will ensure consistent evaluation of search engines, improve user engagement, and establish a baseline for future improvements.
Acceptance Criteria
Diverse Query Set
The dataset must include queries representing real-world user searches.
Examples:
Single keywords
Multi-term searches
Phrase-based searches
Acronyms & Abbreviations (e.g., "ALS" for "Amyotrophic Lateral Sclerosis")
Synonyms & Variations (e.g., "Alzheimer’s" vs. "AD")
Misspellings & Typos (e.g., "diabetes" vs. "diabtes")
Ambiguous Terms (e.g., "MHC" could mean "Major Histocompatibility Complex" or "Molecular Hybridization Capture")
Multi-language Queries (if applicable, e.g., Latin medical terms)
Gold Standard Results
Each query must be paired with a set of expected relevant results, determined by expert assessment.
Structured and Unstructured Queries
Queries should reflect different user intent types, including:
Structured queries (e.g., metadata-driven searches)
Unstructured queries (e.g., free-text searches)
Version-Controlled Storage
The test dataset must be stored in a version-controlled repository (e.g., GitHub, Synapse) to enable:
Repeatability in evaluations
Future benchmarking and optimizations
Submission Process
Submit datasets in JSON Schema format as shown below.
Commit updates in the chosen version-controlled repository.
Provide the file URL and file version number or tag name as part of the submission.
Here is the JSON schema for the test cases:
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "array", "items": { "type": "object", "properties": { "queryString": { "type": "string", "description": "The query string for the test case." }, "ids": { "type": "array", "items": { "type": "string", "description": "An syn identifier." }, "description": "A list of IDs associated with the test case." } }, "required": [ "queryString", "ids" ], "additionalProperties": false, "description": "A test case with a query string and a list of IDs." }, "description": "An array of test cases, each with a query string and a list of IDs." }
Here is an example of what the test cases might look like:
[ { "queryString": "metabolic dynamics", "ids": ["syn123", "syn456", "syn789""] }, { "queryString": "mega data", "ids": ["syn1001", "syn1002""] } ]