...
To enhance the search and discovery experience on Synapse, we need a well-defined test dataset that supports measuring, benchmarking, and optimizing search performance. This dataset will ensure consistent evaluation of search engines, improve user engagement, and establish a baseline for future improvements.
Note: The scope of this test data submission is currently limited to Synapse datasets, that is, collections of files, folders, and projects in Synapse identified by a synID (e.g., …).
Acceptance Criteria
Diverse Query Set
The dataset must include queries representing real-world user searches.
Examples:
- Single keywords
- Multi-term searches
- Phrase-based searches
- Acronyms & abbreviations (e.g., "ALS" for "Amyotrophic Lateral Sclerosis")
- Synonyms & variations (e.g., "Alzheimer's" vs. "AD")
- Misspellings & typos (e.g., "diabetes" vs. "diabtes")
- Ambiguous terms (e.g., "MHC" could mean "Major Histocompatibility Complex" or "Molecular Hybridization Capture")
- Multi-language queries (if applicable, e.g., Latin medical terms)
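One way to capture a query set like the above is as tagged records serialized to JSON Lines, which diffs cleanly under version control. This is a minimal sketch; the field names ("query", "category", "notes") are illustrative, not an existing Synapse schema:

```python
import json

# Illustrative query records covering the categories above.
# Field names are hypothetical, not an existing Synapse schema.
TEST_QUERIES = [
    {"query": "glioblastoma", "category": "single_keyword"},
    {"query": "pediatric brain tumor RNA-seq", "category": "multi_term"},
    {"query": '"single cell sequencing"', "category": "phrase"},
    {"query": "ALS", "category": "acronym",
     "notes": "Amyotrophic Lateral Sclerosis"},
    {"query": "AD", "category": "synonym", "notes": "Alzheimer's disease"},
    {"query": "diabtes", "category": "misspelling", "notes": "diabetes"},
    {"query": "MHC", "category": "ambiguous"},
]

def to_jsonl(queries):
    """Serialize query records to JSON Lines, one record per line."""
    return "\n".join(json.dumps(q, sort_keys=True) for q in queries)

print(to_jsonl(TEST_QUERIES))
```

Tagging each query with its category lets later analysis break down search performance per category (e.g., how well the engine handles misspellings vs. exact keywords).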
Gold Standard Results
Each query must be paired with a set of expected relevant results, determined by expert assessment.
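Given expert-curated gold results per query, ranked output from the search engine can be scored with standard retrieval metrics. A minimal sketch using precision@k; the synID-like strings are placeholders, not real Synapse entities:

```python
# Hypothetical gold standard: query -> set of relevant entity IDs
# (placeholder IDs, not real Synapse synIDs).
GOLD = {
    "ALS": {"synA", "synB", "synC"},
}

def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k returned results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for r in top_k if r in relevant) / len(top_k)

# Example: engine returned five results, three of which are relevant.
ranked = ["synA", "synX", "synB", "synY", "synC"]
score = precision_at_k(ranked, GOLD["ALS"], k=5)
print(score)  # 3 of the top 5 are relevant -> 0.6
```

Rank-sensitive metrics such as nDCG or MRR could be added on the same gold-standard structure; the key requirement is that the expert judgments are stored alongside the queries.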
Structured and Unstructured Queries
Queries should reflect different user-intent types, including:
- Structured queries (e.g., metadata-driven searches)
- Unstructured queries (e.g., free-text searches)
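The two intent types can be represented in the same test dataset with a type discriminator, so an evaluation harness can route each to the appropriate search path. A hypothetical sketch; the filter keys are illustrative, not actual Synapse annotation names:

```python
# Hypothetical records distinguishing the two intent types.
structured_query = {
    "type": "structured",
    # Metadata-driven search: field filters rather than free text.
    "filters": {"assay": "rnaSeq", "species": "Homo sapiens"},
}
unstructured_query = {
    "type": "unstructured",
    # Free-text search: a single natural-language string.
    "text": "rna sequencing data from human brain tissue",
}

def is_structured(q):
    """True when the record declares metadata filters rather than free text."""
    return q.get("type") == "structured" and bool(q.get("filters"))

print(is_structured(structured_query), is_structured(unstructured_query))
```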
Version-Controlled Storage
The test dataset must be stored in a version-controlled repository (e.g., GitHub, Synapse) to enable:
- Repeatability in evaluations
- Future benchmarking and optimizations
...