
Synapse Usage Metrics Dashboard

Description

Many modern web applications aggregate and present metrics on how the application is actually being used to the people developing it.  For example, sites like Amazon maintain rich, detailed information on which features get used and which translate into greater levels of purchasing on the site.  Synapse currently has a small user base, so we can get a lot of information by talking directly to users.  However, real metrics on actual user behavior provide another window into how the system is (or isn't) being used and can answer a variety of questions of interest to the business.

At the beginning of the year we defined two measurements of Synapse use that we wanted to track, and that fed into our yearly objectives.  These form the basis for initial dashboard design.

User Activity

The highest-level measure of our user community is the number of people who regularly return to the site.  Let's define the following:

  1. New User - someone who has created a Synapse account, has not yet logged in 3 times, and is less than 30 days from their account creation date.
  2. Aborted User - someone who created a Synapse account more than 30 days ago but has never become an Active User.
  3. Active User - someone who has logged into Synapse at least 3 times in the previous 30-day window.
  4. Inactive User - someone who was previously Active, but is no longer.

We'd like to track the number of users in each category over time, at the granularity of a day.  We'd like to know who they are, and especially to highlight changes in a user's categorization.  We'd like to know the date and time of their last login, and possibly drill in to get complete login history (at least over some recent window of time).

We'd like to be able to break these out by the user's organization.  We can either use email domain as a proxy for organization when possible, or collect organization as part of the user's profile if we can get it there.

In our company objectives, we define a use metric N based on the number of active users.  In this calculation we exclude platform-team and administrative users, and we add one extra point for each unique organization represented among active users.  So, if we have 5 Sage scientists, 3 people from U of X, and 1 person from U of Y as active users, N = 9 users + 3 organizations = 12.  Our September "Achieved" and "Stretch" goals are N = 30 and N = 50 respectively.  We want to plot N over time on the dashboard.
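The N calculation above can be sketched in a few lines.  This is a hypothetical illustration, not production code: it assumes active users arrive as (user, organization) pairs with platform-team and administrative users already excluded, and all names are made up.

```python
# Hypothetical sketch of the N metric described above. Assumes a list of
# (user, organization) pairs for active users, with platform-team and
# administrative users already filtered out; names are illustrative.
def usage_metric_n(active_users):
    """N = number of active users plus one extra per unique organization."""
    orgs = {org for _user, org in active_users}
    return len(active_users) + len(orgs)

# The example from the text: 5 Sage scientists, 3 from U of X, 1 from U of Y.
active = ([("sage%d" % i, "sagebase.org") for i in range(5)]
          + [("x%d" % i, "x.edu") for i in range(3)]
          + [("y1", "y.edu")])
print(usage_metric_n(active))  # → 12
```

Note that this formulation (one bonus point per organization) reproduces the worked example in the text: 9 active users plus 3 organizations gives N = 12.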

Project Activity

The objective of Synapse is to be a place where people actively work together, not just a place where people go to get interesting data.  One measure of active use is whether people have created a project to hold their own work.  Similar to users, let's define the following:

  1. New Project - created less than 30 days ago and not yet Active.
  2. Aborted Project - created more than 30 days ago and not yet Active.
  3. Active Project - someone has posted content to the project in two different sessions in the past 30 days.
  4. Collaborative Project - at least two different users have posted content to the project in the past 30 days.  All Collaborative Projects are also Active Projects.
  5. Networked Project - at least two different users from two different organizations have posted content to the project in the past 30 days.  All Networked Projects are also Collaborative and Active Projects.
  6. Inactive Project - a project that was once Active but no longer is.

In our objectives, we define P = the number of Active Projects (new content posted to the project in the last month by 2 different users; if the users span multiple organizations, count the project double; if the project is linked to a publication, count it double).  Our September goals are Achieved >= 20 and Stretch >= 30.  We think we can exclude the publication-link doubling for now, and just plot P over time.
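A minimal sketch of the P calculation, under the simplification suggested above (publication-link doubling excluded).  It assumes each active project is represented by the set of contributor organizations from the last month, and that the caller has already filtered to projects with content from at least 2 different users; all names are illustrative.

```python
# Hypothetical sketch of the P metric described above. Each active project
# is the set of organizations whose users posted new content in the last
# month; the publication-link doubling is excluded, as suggested.
def usage_metric_p(active_projects):
    """P = 1 per active project, 2 if its contributors span organizations."""
    return sum(2 if len(orgs) > 1 else 1 for orgs in active_projects)

# 3 single-organization projects and 2 cross-organization projects:
projects = [{"sagebase.org"}, {"x.edu"}, {"y.edu"},
            {"sagebase.org", "x.edu"}, {"x.edu", "y.edu"}]
print(usage_metric_p(projects))  # → 7
```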

We will want to drill into active projects and see what users are working on the projects.

Requirements

  1. The Dashboard will surface information about how users are actually behaving on Synapse so that Sage's Engineering and Management teams can drill into this data to make a variety of decisions related to the product roadmap and development of partnerships.  The dashboard must serve both technical and non-technical users (e.g. it must be usable by the CEO).
  2. In the short term, we want to limit visibility of the dashboard to Sage employees, but make it very easy for all Sage employees to get at the dashboard.  Embedding the dashboard in either the Sage intranet or a Confluence wiki page would be a great way to accomplish this.
  3. In the long term we might want to also surface some metrics about Synapse on Synapse itself, or expose a public API for others to access and mine our metrics for a variety of purposes.  For example, a large data generator might want to be able to find out who is using their data.  Journals or funding agencies might want to assess the impact of work performed on Synapse.  However, we probably will always want a separate Sage-only dashboard that may be more specific or tailored to our needs than what we'd put on Synapse itself for public consumption.
  4. We are interested in observing long-term trends in user behavior over the course of months or years.  We will want to demonstrate uptake of the technology for purposes like raising grant money to continue Synapse development.  We are also interested in short term snapshots, e.g. what users have recently become active / inactive in last 30 days that might require someone making contact with the user and understanding what has happened. 
  5. It's not necessary that this be an operational dashboard for technical people to monitor and troubleshoot the performance of Synapse or its components.  CloudWatch-type metrics on things like load on different infrastructure components are a different category of metric, and can be managed separately.
  6. We expect the specific metrics gathered to start off high level and general, and to continuously evolve and become more granular as we generate more questions to ask of Synapse about its users.  We want to make it easy for new developers to incrementally add to the dashboard.  For example, a new developer might develop a new feature and add new custom metrics to measure how the feature is actually used in production by live users.
  7. We want to capture activity from both the web application and the analytical client tools.  Note that we have turned on Google Analytics for the Synapse web application at https://www.google.com/analytics/ (log in with account infrastructure@sagebase.org; password in the usual place).  We don't want to duplicate in our metrics system things we get for free out of this.

Design Options

There are several components to a tracking system such as this:  

  1. The actual dashboard component, or the UI (front end)
  2. The data storage/collection mechanism (back end)

The dashboard is a pretty straightforward piece of software, simply taking information from the storage mechanism (whatever that might be) and displaying it to the end user.  For this end of the project it seems natural to continue using technologies that are already in use, namely GWT.  Specifically, GXT 3 seems like a strong candidate for the main UI work, both to facilitate a user-friendly experience and for the support it provides for graphs and charts.

 

For the back end, the choices are less clear cut, so let's start by listing some requirements.

Back End Requirements

  • Round-robin type data store - basically any kind of storage that is fixed in size and does semi-automatic data aggregation.  The basic idea is that for a period X you have the full details of whatever data you log.  After X has elapsed, the data is aggregated in one or several ways (average, minimum, maximum).  This aggregate data is then stored for another period Y.  Repeat until the data is no longer relevant and can be dropped from the store.
  • Data Collection - Since Synapse is on Amazon's Elastic Beanstalk, there is the possibility that data usage must be aggregated from several different Synapse instances.  In addition, certain data (like user activity) is most easily collected from other sources than the services (like the Crowd servers).  Thus some kind of data collection mechanism is needed.
  • Data Interpretation - Since both metrics proposed so far (user and project activity levels) are somewhat expensive to compute (if computing them on the fly is even possible), ideally the front-end GUI will never request that this data be recomputed.  Some background process - whether hosted in the metrics web server or run independently - is needed to do any pre-processing of the data before it is entered into the data store.
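The round-robin storage idea in the requirements above can be sketched with two fixed-size buffers: full detail for the most recent samples, and (min, average, max) rollups for older ones, with the oldest rollups silently dropped.  This is only an illustration of the concept; the class name, buffer sizes, and grouping factor are all made up.

```python
from collections import deque

# Illustrative sketch of a round-robin (fixed-size) data store: keep full
# detail for the last `detail_size` samples, roll every `group` samples up
# into a (min, average, max) triple, and keep only the last
# `aggregate_size` triples. All sizes here are arbitrary examples.
class RoundRobinStore:
    def __init__(self, detail_size=24, group=6, aggregate_size=28):
        self.detail = deque(maxlen=detail_size)         # period X: full detail
        self.aggregates = deque(maxlen=aggregate_size)  # period Y: rollups
        self.group = group
        self._pending = []

    def add(self, value):
        self.detail.append(value)
        self._pending.append(value)
        if len(self._pending) == self.group:
            p = self._pending
            self.aggregates.append((min(p), sum(p) / len(p), max(p)))
            self._pending = []

store = RoundRobinStore(detail_size=4, group=2, aggregate_size=3)
for v in [1, 5, 2, 8, 3, 7]:
    store.add(v)
print(list(store.detail))      # → [2, 8, 3, 7] (last 4 raw values)
print(list(store.aggregates))  # → [(1, 3.0, 5), (2, 5.0, 8), (3, 5.0, 7)]
```

RRDtool works on the same principle, with multiple aggregation tiers instead of one.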
Proposed Solutions
Amazon Cloudwatch

CloudWatch stores data for exactly two weeks.  To extend the life span, the data must be retrieved via the CloudWatch API and then stored to S3, DynamoDB, or Redshift.

Custom EC2 Instance

Pros:

  • No storage limitations except cost
  • No data expirations to work around
  • This EC2 instance would be able to act as both the data collection mechanism and the data store, allowing it to keep its back-end storage mechanism in a consistent state.

Cons:

  • Another custom application/library to build and maintain
  • It's not clear what the best way to actually implement the collecting of the data would be.
Google BigQuery

Pros:

  • No storage limitations except cost
  • Can store the data in full detail
  • Fast search times, with no pre-calculated search parameters
    • Thus exposing a new metric is as simple as thinking of it, implementing the fetch from BigQuery, and revealing it in the UI
  • The infrastructure is built for us.  Amazon provides all the pieces to make a system like BigQuery, or at least that solves the same problems, but they're pieces, not a product.

Cons:

  • It's a Google technology, not Amazon, thus doubling the number of accounts, maintenance, etc.
  • I feel like there may be others, but I can't think of them right now

Data Requirements

User Activity Data

Given data set: A time-ordered list (or set of lists) of user auth-events from the crowd servers

Computed data points:

  • For each user, a list of recent logins (either a fixed number or for a window)
  • Activity status - New, Aborted, Active, Inactive.  This could be computed on a daily basis from the current activity status and the login record for that day.  This method of continuous calculation would also make it easier to detect changes in someone's status (e.g. putting them on a list of users that transitioned to a new status).

Persistent daily data needed:

  • Current activity status
  • Creation date
  • Last login date
  • Number of total logins
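The daily recomputation described above can be sketched as a small state transition over the persistent fields just listed.  This is a sketch under assumptions, not a specification: the function name and signature are made up, and it applies the 3-logins-in-30-days and 30-days-since-creation thresholds from the "User Activity" definitions.

```python
from datetime import date, timedelta

# Sketch of the daily activity-status recomputation described above.
# Inputs mirror the persistent daily data listed: current status,
# creation date, and login history. Thresholds come from the User
# Activity definitions (3 logins in a 30-day window; 30 days to abort).
def next_status(current_status, creation_date, login_dates, today):
    recent = [d for d in login_dates if (today - d).days <= 30]
    if len(recent) >= 3:
        return "Active"
    if current_status in ("Active", "Inactive"):
        return "Inactive"  # was Active at some point, no longer qualifies
    if (today - creation_date).days > 30:
        return "Aborted"
    return "New"

today = date(2012, 9, 1)
logins = [date(2012, 8, 5), date(2012, 8, 15), date(2012, 8, 30)]
print(next_status("New", date(2012, 8, 1), logins, today))  # → Active
```

Running this once per day per user, and recording any status change, would directly produce the transition lists mentioned above.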
Project Activity Data

Basically the same source data as for user activity is available (or could be), and the method for calculating the metric is essentially the same: the same categories, just for projects instead of users.  There may be additional information on what type of usage it was (data access, data modification, data addition, etc.), but otherwise it is the same.

There is one catch to the assumption that the project metric is identical to the user metric.  The project metric relies on identifying specific events that can only ever occur once, like the first time a user from a new organization contributes to a project.  This makes it a much more interesting metric to try to compute.

Given data set: A set of individually time-ordered logs (one from each Synapse instance) of all web-service calls logging:

  • user id
  • entity id / project path of ids (up to the root project)
  • time stamp
  • profiling data (elapsed time)
  • what call was made (what kind of a change)
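Since each Synapse instance produces its own individually time-ordered log, combining them into one stream is a k-way merge on the timestamp.  A minimal sketch, assuming each record is a dict with the fields listed above (function name and sample records are illustrative):

```python
import heapq

# Sketch of combining the per-instance web-service logs described above
# into a single time-ordered stream. Each input log is individually
# time-ordered, so heapq.merge can interleave them lazily by timestamp.
def merge_instance_logs(*instance_logs):
    return list(heapq.merge(*instance_logs, key=lambda r: r["timestamp"]))

log_a = [{"timestamp": 1, "user_id": "u1"}, {"timestamp": 4, "user_id": "u2"}]
log_b = [{"timestamp": 2, "user_id": "u3"}, {"timestamp": 3, "user_id": "u1"}]
merged = merge_instance_logs(log_a, log_b)
print([r["timestamp"] for r in merged])  # → [1, 2, 3, 4]
```

Because the merge only ever holds one record per input log in memory, the same approach scales to large log files streamed from disk.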

Summary