
...

| # | Level | Actors | Description |
|---|-------|--------|-------------|
| 1 | Project | Project Owner, Funder | A user would like to see an overview of the total number of files downloaded within a project |
| 2 | Project | Project Owner, Funder | A user would like to see the trend over time of the total number of files downloaded within a project |
| 3 | Project | Project Owner, Funder | A user would like to see the number of unique users that performed file downloads within a project |
| 4 | Project | Project Owner, Funder | In reference to use cases 1, 2 and 3, a user would like to have the option to see this information within a specific time range |
| 5 | Project | Project Owner, Funder | In reference to use cases 1, 2 and 3, a user would like to have the option to filter this information by a specific team |
| 6 | Project | Project Owner | A user would like to get the list of the most downloaded files in the project along with their download and user counts |
| 7 | File | Project Owner, Synapse User, Contributor | A user would like to see the total number of downloads for a specific file |
| 8 | File | Project Owner, Synapse User, Contributor | A user would like to see the trend over time of the total number of downloads for a specific file |
| 9 | File | Project Owner, Synapse User, Contributor | A user would like to see the number of unique users that downloaded a specific file |
| 10 | File | Project Owner, Synapse User, Contributor | In reference to use cases 7, 8 and 9, a user would like to have the option to see this information within a specific time range |
| 11 | File | Project Owner, Synapse User, Contributor | In reference to use cases 7, 8 and 9, a user would like to have the option to filter this information by a specific team |
| 12 | File | Project Owner | In the context of a challenge, the user would like to see the trend of downloads of a specific file (dataset) among the teams participating in the challenge. In particular, they would like to have the fraction of teams, out of all the teams in the challenge, that downloaded a file at least once, in order to understand how many teams did not download the dataset. |


  • Use cases 5 and 11 are currently removed, as it seems they were not actually requested.
  • Use case 6 is a nice-to-have; we might want to skip it in the first implementation.
  • Use case 12 is a unique, nice-to-have request, not a priority.


Note

Use cases for download counts on folders, tables, views, etc. are not included, as they were not specifically mentioned by the users.

...

  • Aside from the count of downloads, the data also needs to be aggregated in a way that captures the trend of file downloads over time. For example, a user may want to see the number of downloads within a project or for a specific file grouped daily, weekly, monthly or yearly. There was no specific request for averages, but we could consider including them.
  • Note that the number of downloads might also appear for the file entities in a view, while no specific request was made for files (handles) linked in a table. In general the use cases focus on download statistics for FileEntity.
  • We might consider an additional dedicated permission to control access to statistics (e.g. restricting access by default). On the other hand, the use case for this is not defined, and it should not be a problem to use the (computed) READ permission on the entity for statistics that do not contain identifiable data and expose aggregations only. We can start simple and use the canRead permission computed for the entity (see Entity Permissions) and, if needed in the future, add a new type of permission specific to statistics (e.g. a VIEW_STATISTICS ACCESS_TYPE).
    Note that if a file is not accessible to the user, its statistics should not be accessible either; if a folder is not accessible to the user, the statistics about the files in the folder should not be accessible (unless the files have local permissions that allow access to them).
  • Also note that the statistics about a parent element (e.g. the project) should remain consistent and include the counts of the child objects no matter the permissions of the child objects (in other words, if I change the permissions of a child entity, the aggregate information about the parent should not change).
  • File entities are versioned, so we would need a way to gather statistics for a specific version of the entity as well as the total across all versions.
  • File entities might have previews; downloads of the previews should not be included in the count.
  • Note from John: Is the team filtering really needed? It appears so from the doc (https://docs.google.com/document/d/1t3OL3TnORZwpHOAgLuUt8NNUTLGjDuGj1UUrpggj54k/edit), but the actual statistics reported for downloads do not seem to filter by team (see screenshot below).
  • Filtering the information might be contextual. For example, filtering by a specific team might mean filtering by the current team members or by the team members at the time of the download. It seems that the users are oriented towards the first option, where a team filter is used to limit the statistics to a limited set of users. This poses an interesting question: should we limit the team filtering to teams with at least X users (e.g. to avoid people misusing the statistics to identify specific user behavior)? Adding the option to filter the statistics by team also adds a great deal of complexity in computing and storing the statistics.
  • Use case 12 is from Andrew L. for the DREAM challenges; it is not a strong use case but a nice-to-have.
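The trend aggregation mentioned in the first point can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the proposed implementation: the `DownloadRecord` shape and the `bucket_start` and `aggregate_downloads` helpers are hypothetical names, weeks are assumed to start on Monday, and the records are assumed to be already restricted to file entity downloads (no previews).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DownloadRecord:
    # Hypothetical record shape: one row per file download.
    file_id: str
    user_id: int
    day: date


def bucket_start(day: date, granularity: str) -> date:
    """Return the first day of the bucket that contains `day`."""
    if granularity == "daily":
        return day
    if granularity == "weekly":
        # Assumption: weeks start on Monday (weekday() == 0).
        return day - timedelta(days=day.weekday())
    if granularity == "monthly":
        return day.replace(day=1)
    if granularity == "yearly":
        return day.replace(month=1, day=1)
    raise ValueError(f"Unsupported granularity: {granularity}")


def aggregate_downloads(records, granularity):
    """Group records into time buckets, counting downloads and unique users."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for record in records:
        start = bucket_start(record.day, granularity)
        counts[start] += 1
        users[start].add(record.user_id)
    # Buckets are returned in chronological order.
    return [
        {
            "bucket": start,
            "downloads": counts[start],
            "unique_users": len(users[start]),
        }
        for start in sorted(counts)
    ]
```

Keeping a set of user ids per bucket is the simplest way to get unique user counts, but note that unique counts cannot be summed across buckets: the same user may appear in several of them.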

...


The current solution poses some challenges:

  • The process is relatively lengthy and requires non-trivial technical skills
  • The data is limited to a 6-month window; the current workaround is to store incremental updates in an external source (a CSV file on S3)
  • The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads)
  • The system has an all-or-nothing access policy: the data is (for good reason) accessible only to a specific subset of Synapse employees, so users of the Synapse platform cannot access this kind of data without asking a Synapse engineer

An example of a usage report generated from the data warehouse using the Synapse Usage Report written by Kenny Daily:

Files and Downloads


Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform generates a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.
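Because every pre-signed URL request passes through a single method, that method is a natural place to observe download activity. The following Python sketch illustrates the idea only; it is not the Synapse code. `RecordingUrlProvider`, `UrlRequestEvent` and `file_entity_events` are hypothetical names, and the association type strings simply reuse the examples named above. A recorded event means that a URL was requested, not that the file was actually downloaded.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class UrlRequestEvent:
    # Hypothetical event shape; the field names are illustrative only.
    user_id: int
    file_handle_id: str
    associate_type: str  # e.g. "FileEntity" or "WikiPage"


@dataclass
class RecordingUrlProvider:
    """Wraps a pre-signed URL generator and records each request."""

    generate_url: Callable[[str], str]
    events: List[UrlRequestEvent] = field(default_factory=list)

    def get_url_for_file_handle(self, user_id, file_handle_id, associate_type):
        url = self.generate_url(file_handle_id)
        # Record only after the URL was generated successfully.
        self.events.append(
            UrlRequestEvent(user_id, file_handle_id, associate_type)
        )
        return url


def file_entity_events(events):
    """Keep only the requests made through a FileEntity association."""
    return [e for e in events if e.associate_type == "FileEntity"]
```

Filtering by association type reflects the focus of the use cases on FileEntity downloads; requests made through other associations would be recorded but left out of these statistics.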

...

| Method | Endpoint | Request Body | Response Body | Description | Restrictions |
|--------|----------|--------------|---------------|-------------|--------------|
| POST | /asynchronous/job | DownloadStatisticsRequest or DownloadBucketsStatisticsRequest | AsynchronousJobStatus: the responseBody property in the status will contain either the DownloadStatisticsResponse or the DownloadBucketsStatisticsResponse | Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status. | The entity specified in the request must be a project or a file entity (if not, 400). The user should have view permission on the entity specified in the body (if not, 403). The DownloadBucketsStatisticsRequest computes paginated results; the value of the "limit" parameter is restricted to a maximum of 100. |
| GET | /asynchronous/job/{id} | N/A | AsynchronousJobStatus | Gets the current status of the statistics job with the given id. | |
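From a client's point of view, the two endpoints form a submit-then-poll flow. The sketch below is a hedged illustration in Python rather than an official client: `post` and `get` are caller-supplied functions that perform the HTTP calls and return decoded JSON, and the `jobId` and `jobState` property names (with the `COMPLETE` and `FAILED` values) are assumptions about AsynchronousJobStatus. Only the `responseBody` property and the maximum `limit` of 100 come from the table above.

```python
import time

# Maximum "limit" allowed for a DownloadBucketsStatisticsRequest.
MAX_BUCKETS_LIMIT = 100


def validate_limit(limit):
    """Reject page sizes above the documented maximum before submitting."""
    if limit > MAX_BUCKETS_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_BUCKETS_LIMIT}")
    return limit


def run_statistics_job(post, get, request_body, poll_seconds=1.0, max_polls=60):
    """Submit a statistics job and poll its status until it finishes.

    `post(path, body)` and `get(path)` are hypothetical transport hooks
    that return the decoded JSON response as a dict.
    """
    submitted = post("/asynchronous/job", request_body)
    job_id = submitted["jobId"]  # Assumed property name.
    for _ in range(max_polls):
        status = get(f"/asynchronous/job/{job_id}")
        state = status.get("jobState")  # Assumed property name and values.
        if state == "COMPLETE":
            return status["responseBody"]
        if state == "FAILED":
            raise RuntimeError(f"Statistics job {job_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statistics job {job_id} did not finish in time")
```

Injecting the transport functions keeps the sketch independent of any particular HTTP library and makes the polling logic easy to exercise without a running server.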

...

Additionally, to be consistent with the current API design, we propose dedicated statistics endpoints that accept the above requests:


| Method | Endpoint | Request Body | Response Body | Description |
|--------|----------|--------------|---------------|-------------|
| POST | /statistics/entity/download/async/start | DownloadStatisticsRequest or DownloadBucketsStatisticsRequest | AsynchronousJobStatus | Submits the job to gather the download statistics for an entity. The id returned by the request is used to get the job status. |
| GET | /statistics/entity/download/async/get/{asyncToken} | N/A | AsynchronousJobStatus | Gets the current status of the statistics job with the given id. |

Web Client Integration


The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:

...