When syncing to paths outside of the cache, syncFromSynapse generates manifest files describing the files it synced. It writes a "SYNAPSE_METADATA_MANIFEST.tsv" manifest file in the root synced directory, as well as a manifest file in each subdirectory.
Two issues combine to make this very slow when the directory structure is large and includes many subfolders and files:
1. The manifest in each subdirectory includes all the files that have been synced so far, regardless of whether they are actually rooted in that directory. This appears to be a byproduct of how the recursive sync is implemented rather than an intentional behavior decision.
2. We fetch the provenance for each file every time we write a row in a manifest file, making an HTTP call to Synapse each time.
For example, if we sync a project as follows:
We will end up with 3 manifest files: one in the root synced directory for the Project, and one each for Folder1 and Folder2. The manifest for Folder1 will include File1, while the manifest for Folder2 will include both files (even though one of them is in a sibling directory), as will the manifest for the Project. There will be 3 total HTTP calls to fetch the provenance despite there only being two files. This behavior can become severe with deep directory structures.
It's unclear whether we even want to generate multiple manifest files rather than just one at the root of the synchronized structure, but we should at least only include the files rooted in any particular subdirectory in its manifest, and we should cache the provenance fetches so we only retrieve the provenance once per file per sync.
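A minimal sketch of the two changes proposed above, not the actual synapseclient implementation. `ProvenanceCache`, `rows_by_manifest_dir`, and the `fetch` callable are hypothetical names introduced for illustration; `fetch` stands in for whatever makes the HTTP call to Synapse. The sketch also assumes that a directory's manifest lists the files at or below that directory.

```python
from collections import defaultdict
from pathlib import PurePosixPath


class ProvenanceCache:
    """Memoize provenance lookups so each file is fetched at most once per sync."""

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable: file_id -> provenance.
        self._fetch = fetch
        self._cache = {}

    def get(self, file_id):
        if file_id not in self._cache:
            self._cache[file_id] = self._fetch(file_id)
        return self._cache[file_id]


def rows_by_manifest_dir(synced_files, root):
    """Map each directory to the files rooted at or below it.

    `synced_files` is an iterable of (file_id, path) pairs. A path that is
    not under `root` raises ValueError (from PurePosixPath.relative_to).
    """
    root = PurePosixPath(root)
    rows = defaultdict(list)
    for file_id, path in synced_files:
        relative_dir = PurePosixPath(path).parent.relative_to(root)
        current = root
        rows[str(current)].append(file_id)
        # Walk down from the root to the file's own directory, so that
        # sibling directories never see this file.
        for part in relative_dir.parts:
            current = current / part
            rows[str(current)].append(file_id)
    return dict(rows)
```

Writing every manifest through a single `ProvenanceCache` instance created per sync means a file that appears in several manifests (its own directory and each ancestor) still triggers only one provenance fetch.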
I sent this fix to the reporter of the issue to see whether it resolves their problem.
The external reporter of this issue validated that this fix solves their problem: