Search: Key Observations


These observations are based on several weeks of hands-on usage (approached with fresh eyes), analysis of issues reported in Jira, and feedback gathered from product support through service desk triage. Informal internal feedback from users and stakeholders has also been considered to capture recurring pain points and areas for improvement. These findings reflect a Product/Program Manager’s user scenarios, with the understanding that modern search features (e.g., semantic search, fuzzy matching, contextual ranking) are not yet fully supported; this may have influenced the perception of search effectiveness and surfaced pain points that future enhancements could mitigate.

While this evaluation provides valuable insights, perspectives from other key personas, such as biomedical researchers and data scientists, are essential for making comprehensive product decisions. Incorporating domain-specific needs and workflows will be crucial in shaping the next phase of search improvements.

 

  1. Search Functionality and Scope Issues

    • The Synapse homepage search field does not handle people/team searches, which may be unexpected for users who assume a global search experience.

    • Lack of clarity on whether different search entry points (homepage search vs. magnifying glass search on the left pane) use the same code path, creating potential inconsistencies in user expectations.

    • Users may expect dynamic search results that update in real-time as they type. The absence of this feature forces users to commit to a full query before seeing results, reducing efficiency and increasing frustration.

    • Lack of acronym expansion and synonym recognition causes inconsistent search results.

      • Example: Searching for “ROSMAP” returns 211 results, while searching for “The Religious Orders Study and Memory and Aging Project” returns 61 results, despite them referring to the same entity.

      • Users expect acronyms like ROSMAP → Religious Orders Study and Memory and Aging Project or UKB → UK Biobank to be recognized as equivalent.

      • Open question: Should all acronyms be expanded, or should a predefined subset be supported based on specific criteria?
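One lightweight approach is a curated acronym map consulted at query time, so the short and long forms of a study name hit the same documents. The sketch below is hypothetical (the map entries and the `expand_query` helper are assumptions, not Synapse's implementation), and it sidesteps the open question above by supporting only a predefined subset of acronyms:

```python
# Hypothetical acronym map; entries are illustrative, not Synapse's vocabulary.
ACRONYMS = {
    "rosmap": "religious orders study and memory and aging project",
    "ukb": "uk biobank",
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus any acronym/expansion variants of it."""
    variants = [query]
    lowered = query.lower()
    for acronym, expansion in ACRONYMS.items():
        if acronym in lowered.split():
            # Query uses the short form: add the expanded variant.
            variants.append(lowered.replace(acronym, expansion))
        elif expansion in lowered:
            # Query uses the long form: add the acronym variant.
            variants.append(lowered.replace(expansion, acronym))
    return variants
```

Searching all returned variants (rather than the literal query alone) would make “ROSMAP” and the full project name return the same result set.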

    • Team search does not support multi-token queries when the tokens are out of sequence, making it difficult to find relevant teams unless users enter tokens in the exact order they appear in the team name.

      • Example: Searching for “DREAM Challenge” returns expected results.

      • Searching for “Challenge DREAM” does not return expected matches, despite both words being present in team names.

      • Users expect search to work with out-of-order tokens, recognizing that multiple terms should match collectively rather than strictly in sequence.
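Order-independent matching can be as simple as requiring every query token to appear somewhere in the team name, regardless of sequence. A minimal sketch of that expectation (assumed logic, not the current Team Search implementation):

```python
def matches_all_tokens(query: str, name: str) -> bool:
    """True when every query token appears in the name, in any order."""
    name_tokens = set(name.lower().split())
    return all(tok in name_tokens for tok in query.lower().split())
```

Under this rule, “DREAM Challenge” and “Challenge DREAM” match the same teams.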

    • Search does not support auto-correction or fuzzy matching for minor spelling errors, leading to failed queries when users mistype a word.

      • Example: A user searched for “BTATS 2023” but failed to find the intended dataset “BRATS 2023” because of a single incorrect character.

      • Other examples of common typos that search should handle:

        • azheimer → alzheimer (missing one letter)

        • UKBiobank → UK Biobank (omission of space)
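The typo cases above can be handled by edit-distance matching. The sketch below uses Python's standard-library difflib as a lightweight stand-in; production search engines typically use Levenshtein automata or n-gram indexes instead:

```python
import difflib

def suggest(query: str, vocabulary: list[str], cutoff: float = 0.8) -> list[str]:
    """Return up to three close matches for a possibly misspelled query."""
    return difflib.get_close_matches(
        query.lower(), [v.lower() for v in vocabulary], n=3, cutoff=cutoff
    )
```

With a vocabulary drawn from indexed dataset names, a query like “BTATS 2023” would surface “BRATS 2023” as a suggestion instead of failing outright.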

    • Stopwords are unexpectedly included in search results, causing irrelevant matches.

      • Example: Searching for “EL” returned results containing “EL” in Spanish, even though “EL” is typically a stopword in Spanish-based text processing.

      • This behavior may indicate inconsistent stopword handling or a lack of proper language-aware filtering in search indexing.
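Language-aware filtering would drop a token only when it is a stopword in the language the content is indexed under, so “EL” can be ignored in Spanish text while remaining a meaningful token elsewhere. A minimal sketch (the stopword lists are illustrative):

```python
# Illustrative per-language stopword lists, not a production lexicon.
STOPWORDS = {
    "es": {"el", "la", "de", "y"},
    "en": {"the", "a", "of", "and"},
}

def filter_tokens(tokens: list[str], language: str) -> list[str]:
    """Drop tokens that are stopwords in the given content language."""
    stops = STOPWORDS.get(language, set())
    return [t for t in tokens if t.lower() not in stops]
```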

    • Tokenized search results are ranked higher than exact matches, leading to unexpected ranking behavior.

      • Example: Searching for “PK/PD” returned higher-ranked results for “PK” and “PD” separately, while exact matches for “PK/PD” (Raw Data) were lower in the ranking.

      • Using quotation marks (“PK/PD”) results in exact search, but users did not expect to need this workaround.

      • Users expect exact matches to be prioritized over tokenized variations, particularly when searching for technical terms, abbreviations, or domain-specific phrases.
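The expected behavior can be sketched as a scoring rule in which any exact-phrase hit outranks a purely tokenized hit. The weights below are illustrative assumptions, not the actual ranking formula:

```python
def score(query: str, text: str) -> float:
    """Score a document: exact-phrase hits beat tokenized hits."""
    q, t = query.lower(), text.lower()
    if q in t:                       # exact phrase present, e.g. "pk/pd"
        return 2.0
    tokens = q.replace("/", " ").split()
    hits = sum(tok in t for tok in tokens)
    return hits / len(tokens)        # partial credit for token-only matches
```

Under this rule a “PK/PD” document scores 2.0 while a document matching only “PK” and “PD” separately scores at most 1.0, matching the ranking users expect without quotation marks.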

    • Lack of project and access request-related annotations causes significant inefficiencies, especially for the Access Control Team.

      • Example: No structured metadata exists for:

        • Contact person for a project – leading to excessive manual effort to track down responsible individuals.

        • Access request status – requiring manual searches and follow-ups to determine approval status.

      • Without these annotations, product support teams struggle to efficiently find project-related details, increasing response time and administrative burden.

 

  2. Usability and Accessibility Gaps

    • Users may assume there is no search pre-filter on the homepage because the search box simply states “Search Synapse.” This could lead to confusion when expected entities (e.g., people, teams) do not appear in results.

    • A clearer distinction between global search (spanning multiple data sources) and full-text search (focused on specific entity content) could help set expectations.

    • Inconsistent search ranking when using acronyms vs. expanded terms results in mismatched expectations. Users assume that searching for “ROSMAP” should yield the same ranking and coverage as the full project name.

    • Strict token sequence dependency in Team Search creates unnecessary friction when users are trying to find teams with multiple words in the name.

    • Misspellings and minor typos lead to complete search failures, rather than suggesting corrections or alternative matches.

    • Unexpected stopword behavior leads to noisy search results, potentially causing false positives when filtering search queries.

    • Tokenization prioritization results in lower rankings for exact matches, requiring users to modify queries (e.g., using quotation marks) to achieve expected behavior.

    • Lack of key project metadata (contact person, access request status, etc.) results in unnecessary manual searches, slowing down workflows for access control teams.

 

  3. Data Annotation and Metadata Gaps

    • Some search failures stem from incomplete or missing metadata, such as the inability to retrieve a specific version of an entity by its synID because precise version tracking is lacking.

      • Example: Without structured annotations, users searching for detailed identifiers (e.g., “51111084.6”) fail to locate the correct dataset version.

    • Absence of project-specific metadata fields (contact person, access request status) increases administrative burden.

 

  4. Pain Points Identified

    • Inconsistent search behaviors across different UI elements.

    • Users experiencing dead-ends when expected search results (people, teams, specific dataset versions) are not returned.

    • No clear explanation of search scope (what is and isn’t included in a given query).

    • Lack of dynamic search updates and spell correction forces users to manually refine their queries, making search inefficient.

    • Acronym mismatches and lack of synonym support result in incomplete or misleading search results.

    • Team Search requires exact token sequences, reducing discoverability when users enter search terms in a different order.

    • Minor typos lead to total search failures instead of returning relevant matches, causing frustration and missed discoveries.

    • Stopword mismanagement results in irrelevant matches, increasing noise in search results.

    • Tokenized search terms rank higher than exact matches, requiring additional query modifications to retrieve expected results.

    • Lack of structured project metadata forces manual effort for access control teams.

Potential Areas for Improvement

1. Refining Query Interpretation

  • Improve handling of different entity types (people, teams, datasets) to make search behavior more intuitive.

  • Implement dynamic search results that update in real-time as users type, allowing them to adjust queries efficiently.

  • Incorporate spell correction and fuzzy matching to reduce failed searches due to typos.

  • Implement acronym expansion and synonym recognition to ensure consistency in search results.

  • Support multi-token search without strict sequence dependency so users can find results even when search terms are entered in a different order.

  • Prioritize exact matches over tokenized versions to align with user expectations.

    • Example: PK/PD should rank higher than separate matches for “PK” and “PD” unless explicitly searched separately.

    • Adjust ranking algorithms to favor exact term matching, especially for technical phrases, abbreviations, or domain-specific terms.

    • Ensure users do not have to rely on quotation marks to retrieve expected results unless explicitly searching for an exact phrase.

  • Improve stopword handling to reduce irrelevant matches.

    • Ensure stopwords (e.g., “EL” in Spanish) do not interfere with meaningful search queries.

    • Review AWS CloudSearch stopword processing and align search settings to exclude or properly weight common stopwords.

  • Introduce structured annotations for project-related metadata:

    • Add “Contact Person” as a searchable field to reduce manual effort for access control teams.

    • Include “Access Request Status” in search indexing to help streamline administrative workflows.

 

2. Enhancing Search Filters and Faceting

  • Provide clear search scoping to help users understand what is being searched (e.g., global search vs. full-text search).

  • Introduce better filtering for versioned data to prevent users from missing specific dataset revisions.

 

3. Improving Search UI Clarity

  • Clarify search field labels to indicate scope and limitations (e.g., does “Search Synapse” mean all entities or just datasets?).

  • Differentiate homepage search from magnifying glass search if they serve different purposes.

  • Display real-time result previews so users can see relevant matches without committing to a full search.

 

4. Strengthening Data Annotation and Metadata Tracking

  • Ensure better metadata and version tracking for datasets to support precision search queries.

  • Address search failures related to missing structured annotations (e.g., allowing retrieval of specific dataset versions).

 

5. Leveraging Industry Best Practices for Search Optimization

  1. Implementing Semantic Search & Context Awareness

  • Enable meaning-based search rather than relying purely on keyword matching.

  • Use word embeddings and contextual similarity to improve results for related concepts.

    • Example: Searching for “Alzheimer’s biomarkers” should return datasets tagged with “Neurodegeneration markers”, even if the exact phrase is not in the metadata.

  • Incorporate Natural Language Understanding (NLU) to better interpret complex queries.
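As a toy illustration of meaning-based matching, phrases can be compared through shared concept tags — a crude stand-in for embedding-based cosine similarity, which a real implementation would compute with a trained model. The tag assignments below are invented for the example:

```python
# Illustrative concept tags standing in for learned embeddings.
CONCEPTS = {
    "alzheimer's biomarkers": {"neurodegeneration", "biomarker", "brain"},
    "neurodegeneration markers": {"neurodegeneration", "biomarker"},
    "uk biobank imaging": {"cohort", "imaging"},
}

def semantic_similarity(a: str, b: str) -> float:
    """Jaccard overlap of concept tags as a crude similarity proxy."""
    ta, tb = CONCEPTS[a], CONCEPTS[b]
    return len(ta & tb) / len(ta | tb)
```

Even this crude proxy ranks “Neurodegeneration markers” as relevant to an “Alzheimer’s biomarkers” query despite zero keyword overlap, which is the behavior semantic search aims for.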

  2. Enhancing Query Processing & User Input Handling

  • Implement live query feedback with real-time suggestions to help users refine searches before submission.

  • Improve spell correction and typo tolerance to automatically suggest corrected queries.

    • Example: “BTATS 2023” should suggest “BRATS 2023” if the typo is minor.

  • Expand acronym and abbreviation support to resolve queries with domain-specific shorthand.

    • Example: “ROSMAP” → “Religious Orders Study and Memory and Aging Project”.

  • Support multi-token and phrase search flexibility (e.g., allow out-of-order keyword matching).

  3. Improving Search Result Ranking & Relevance

  • Optimize ranking algorithms to favor exact matches, semantic similarity, and user intent-based relevance.

  • Implement hybrid search (combining keyword and vector-based search) for more accurate results.

  • Allow personalized ranking adjustments based on user behavior and past searches.

  • Introduce dynamic weighting to prioritize metadata-rich entries over less-informative records.
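Hybrid ranking can be sketched as a weighted blend of a keyword score and a vector-similarity score. The 0.6/0.4 split below is an illustrative default, not a recommendation; the weights would need tuning against real relevance judgments:

```python
def hybrid_score(keyword_score: float, vector_score: float,
                 kw_weight: float = 0.6) -> float:
    """Blend keyword and vector scores (both assumed normalized to [0, 1])."""
    return kw_weight * keyword_score + (1 - kw_weight) * vector_score
```

Documents are then ranked by the blended score, so an exact keyword hit with weak semantic similarity and a strong semantic match with no keyword overlap can both surface.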

  4. Advanced Filtering, Faceting, and Data Organization

  • Improve structured filtering options (e.g., versioned data, access controls, entity types).

  • Enable faceted search improvements to allow drill-down exploration without requiring exact query matches.

  • Introduce intelligent grouping and clustering of related search results.

  5. Strengthening Metadata & Data Annotation Strategies

  • Improve data labeling and annotations for better search retrieval.

  • Include structured fields such as “Contact Person” and “Access Request Status” in the search index.

  • Ensure better version tracking for datasets to support historical search accuracy.

  6. Enhancing Multilingual & Stopword Handling

  • Improve multilingual search capabilities to avoid stopword interference.

    • Example: Prevent “EL” (common in Spanish) from skewing results.

  • Implement context-aware stopword removal rather than applying static lists.

  7. Integrating Explainability & Transparency Features

  • Provide search result explanations (why was this result ranked higher?).

  • Display confidence scores for semantic matches.

  • Allow user feedback mechanisms to improve ranking models over time.