Improve syncFromSynapse performance for large folder structures synced to external paths

Description

When syncing to paths outside of the cache, syncFromSynapse generates manifest files describing the files it synced. It writes a "SYNAPSE_METADATA_MANIFEST.tsv" manifest file in the root synced directory, as well as a manifest file in each subdirectory.

Two issues combine to make this very slow when the directory structure is large and includes many subfolders and files:
1. The manifest in each subdirectory includes all the files that have been synced so far, regardless of whether they are actually rooted in that directory. This appears to be a byproduct of how the recursive sync is implemented rather than an intentional behavior decision.
2. We fetch the provenance of each file every time we write a row in the manifest file, which requires an HTTP call to Synapse.

For example, if we sync a project as follows:

We will end up with 3 manifest files: one in the root synced directory for the Project, and one each for Folder1 and Folder2. The manifest for Folder1 will include File1; the manifest for Folder2 will include both files (even though one of them is in a sibling directory), as will the manifest for the Project. There will be 3 total HTTP calls to fetch the provenance despite there being only two files. The overhead can become severe with deep directory structures.

It's unclear whether we even want to generate multiple manifest files rather than just one at the root of the synchronized structure. At a minimum, each subdirectory's manifest should include only the files rooted in that subdirectory, and we should cache the provenance fetches so we retrieve the provenance only once per file per sync.
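
A rough sketch of the two proposed changes, for discussion only. This is not the client's actual implementation: ProvenanceCache, files_rooted_in, and manifest_rows are hypothetical names, the fetch callable stands in for whatever HTTP call retrieves provenance from Synapse, and the dict shape used for synced files (keys "id" and "path") is an assumption.

```python
import os
from typing import Callable, Dict, Iterable, List, Optional


class ProvenanceCache:
    """Memoize provenance lookups so each file is fetched at most once per sync."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        # `fetch` is a hypothetical callable that performs the HTTP call.
        self._fetch = fetch
        self._cache: Dict[str, Optional[dict]] = {}
        self.fetch_count = 0  # number of real fetches, useful for verification

    def get(self, entity_id: str) -> Optional[dict]:
        # Membership test (not a truthiness test) so that files without
        # provenance (None) are cached as well.
        if entity_id not in self._cache:
            self.fetch_count += 1
            self._cache[entity_id] = self._fetch(entity_id)
        return self._cache[entity_id]


def files_rooted_in(directory: str, synced_files: Iterable[dict]) -> List[dict]:
    """Return only the files whose local path lies under `directory`."""
    root = os.path.abspath(directory)
    rooted = []
    for f in synced_files:
        path = os.path.abspath(f["path"])
        # commonpath compares whole path components, so "Folder1" does not
        # accidentally match "Folder10".
        if os.path.commonpath([root, path]) == root:
            rooted.append(f)
    return rooted


def manifest_rows(directory: str, synced_files: Iterable[dict],
                  provenance: ProvenanceCache) -> List[dict]:
    """Build manifest rows for one directory, reusing cached provenance."""
    return [
        {"path": f["path"], "id": f["id"], "provenance": provenance.get(f["id"])}
        for f in files_rooted_in(directory, synced_files)
    ]
```

With one ProvenanceCache shared across the whole sync, writing the manifests for the Project, Folder1, and Folder2 would trigger one provenance fetch per file, and each subdirectory manifest would list only its own files.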

Environment

None

Activity

Jordan Kiang
July 1, 2020, 4:31 PM

Sent this to the reporter of the issue to see if it resolves their issue.

Jordan Kiang
July 9, 2020, 6:44 PM

The external reporter of this issue validated that this fix solves their issue.

Assignee

Jordan Kiang

Reporter

Jordan Kiang

Labels

None

Validator

Bruce Hoff

Development Area

Synapse Core Infrastructure

Release Version History

None

Components

Fix versions

Priority

Major