Description

Many modern web applications aggregate and present metrics on how the application is being used in practice to those developing it.  For example, sites like Amazon have rich, detailed information on which features get used and how they translate into greater levels of purchasing on the site.  Synapse currently has a small user base, and we can get a lot of information by talking directly to the users.  However, real metrics on actual user behavior provide another window into how the system is (or isn't) being used and can answer a variety of questions of interest to the business.

At the beginning of the year we defined two measurements of Synapse use that we wanted to track, and that fed into our yearly objectives.  These form the basis for initial dashboard design.

User Activity

The highest-level measure of our user community is the number of people who regularly return to the site.  Let's define the following:

  1. New user - someone who has created a Synapse account, has not yet logged in 3 times, and is less than 30 days from their account creation date.
  2. Aborted user - someone who created a Synapse account more than 30 days ago but has never become an Active user.
  3. Active user - someone who has logged into Synapse at least 3 times in the previous 30-day window.
  4. Inactive user - someone who was previously Active, but is no longer.

We'd like to track the number of users in each category over time, at the granularity of a day.  We'd like to know who they are, and especially to highlight changes in a user's categorization.  We'd like to know the date and time of their last login, and possibly drill in to get their complete login history (at least over some recent window of time).
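The categorization above can be sketched as a pure function over a user's login history.  This is only an illustrative sketch; the function name, arguments, and the idea of passing in a precomputed `was_active` flag are assumptions, not part of any existing system:

```python
from datetime import datetime, timedelta

def classify_user(created, logins, now, was_active=False):
    """Classify a user as New, Aborted, Active, or Inactive per the
    definitions above.

    created    -- account creation datetime
    logins     -- list of login datetimes (complete history)
    now        -- evaluation time
    was_active -- whether the user was Active in any earlier window
    """
    window_start = now - timedelta(days=30)
    recent_logins = [t for t in logins if t >= window_start]
    if len(recent_logins) >= 3:
        return "Active"          # 3+ logins in the previous 30-day window
    if was_active:
        return "Inactive"        # previously Active, but no longer
    if now - created < timedelta(days=30):
        return "New"             # account under 30 days old, not yet Active
    return "Aborted"             # over 30 days old, never became Active
```

Running this daily over all users would yield the per-category counts to plot, and diffing consecutive days would surface the categorization changes mentioned above.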

We'd like to be able to break these out by the user's organization.  We can either use email domain as a proxy for organization where possible, or collect organization as part of the user's profile if we can get it there.

In our company objectives, we define a use metric N based on the number of active users.  In this calculation we exclude platform-team and administrative users, and count a user double if they represent a unique organization (i.e., each organization represented adds one extra count).  So, if we have 5 Sage scientists, 3 people from U of X, and 1 person from U of Y as active users, N = 12.  Our September "Achieved" and "Stretch" goals are N = 30 and N = 50, respectively.  We want to plot N over time on the dashboard.
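The N calculation can be made concrete with a short sketch; the function name and the (user, organization) input shape are assumptions for illustration, and the caller is assumed to have already excluded platform and administrative users:

```python
def use_metric_n(active_users):
    """Compute the use metric N from a list of (user_id, organization)
    pairs for active users.  Each user counts once, plus one extra
    count per unique organization represented ("count double" for the
    organizations' representatives)."""
    orgs = {org for _, org in active_users}
    return len(active_users) + len(orgs)

# The worked example from the text: 5 Sage scientists, 3 from U of X,
# 1 from U of Y (organization labels here are made up).
users = ([("sage%d" % i, "sagebase.org") for i in range(5)]
         + [("ux%d" % i, "x.edu") for i in range(3)]
         + [("uy1", "y.edu")])
print(use_metric_n(users))  # 9 users + 3 organizations = 12
```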

Project Activity

The objective of Synapse is to be a place where people actively work together, not just a place where people go to get interesting data.  One measure of active use is whether people have created a project to hold their own work.  As with users, let's define the following:

  1. New project - created less than 30 days ago and not yet active.
  2. Aborted project - created more than 30 days ago and never active.
  3. Active project - someone has posted content to the project in two different sessions in the past 30 days.
  4. Collaborative project - at least two different users have posted content to the project in the past 30 days.  All collaborative projects are also active projects.
  5. Networked project - at least two different users from two different organizations have posted content to the project in the past 30 days.  All networked projects are also collaborative and active projects.
  6. Inactive project - a project that was once active but no longer is.

In our objectives, we define P = Number of Active Projects, counted as follows: new content posted to the project in the last month by 2 different users; count double if the users span multiple organizations; count double if the project is linked to a publication.  Our September Achieved and Stretch goals are P >= 20 and P >= 30, respectively.  I think we can exclude the publication link for now, and just plot P over time.
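A minimal sketch of the P calculation, under the same assumptions as the N sketch above: posts are represented as (user_id, organization) pairs within the last 30 days, and the publication-link doubling is omitted for now.  All names here are illustrative:

```python
def project_score(posts):
    """Score one project from its posts in the last 30 days, where
    posts is a list of (user_id, organization) pairs.  Returns 0 if
    fewer than 2 distinct users posted, 1 if the project counts as
    active (2+ users), and 2 if those users span multiple
    organizations.  Publication-link doubling is intentionally
    omitted for now."""
    users = {u for u, _ in posts}
    if len(users) < 2:
        return 0
    orgs = {o for _, o in posts}
    return 2 if len(orgs) > 1 else 1

def use_metric_p(projects):
    """P = sum of per-project scores over a dict of project -> posts."""
    return sum(project_score(posts) for posts in projects.values())
```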

We will want to drill into active projects and see which users are working on them.

Requirements

  1. The dashboard will surface information about how users are actually behaving on Synapse so that Sage's Engineering and Management teams can drill into this data to make a variety of decisions related to the product roadmap and the development of partnerships.  The dashboard must serve both technical and non-technical users (e.g. it must be usable by the CEO).
  2. In the short term, we want to limit visibility of the dashboard to Sage employees, but make it very easy for all Sage employees to get at the dashboard.  Embedding the dashboard in either the Sage intranet or a Confluence wiki page would be a great way to accomplish this.
  3. In the long term we might want to also surface some metrics about Synapse on Synapse itself, or expose a public API for others to access and mine our metrics for a variety of purposes.  For example, a large data generator might want to be able to find out who is using their data.  Journals or funding agencies might want to assess the impact of work performed on Synapse.  However, we probably will always want a separate Sage-only dashboard that may be more specific or tailored to our needs than what we'd put on Synapse itself for public consumption.
  4. We are interested in observing long-term trends in user behavior over the course of months or years.  We will want to demonstrate uptake of the technology for purposes like raising grant money to continue Synapse development.  We are also interested in short-term snapshots, e.g. which users have become active or inactive in the last 30 days, which might prompt someone to contact the user and understand what has happened.
  5. It's not necessary that this be an operational dashboard for technical people to monitor and troubleshoot the performance of Synapse or its components.  CloudWatch-type metrics on things like load on different infrastructure components are a different category of metric, and can be managed separately.
  6. We expect the specific metrics gathered to start off high level and general, and to continuously evolve and become more granular as we generate more questions to ask of Synapse about its users.  We want to make it easy for new developers to incrementally add to the dashboard.  For example, a new developer might develop a new feature and add new custom metrics to measure how the feature is actually used in production by live users.
  7. We want to capture activity from both the web application and the analytical client tools.  Note that we have turned on Google Analytics for the Synapse web application at https://www.google.com/analytics/ (log in with account infrastructure@sagebase.org; password in the usual place).  We don't want to duplicate things in our metrics system that we get for free out of Google Analytics.

Design Options

There are several components to a tracking system such as this:  

  1. The actual dashboard component, or the UI (front end)
  2. The data storage/collection mechanism (back end)

The dashboard is a fairly straightforward piece of software: it simply takes information from the storage mechanism (whatever that might be) and displays it to the end user.  For this end of the project it seems natural to continue using technologies that are already in use, namely GWT.  Specifically, GXT 3 seems like a strong candidate for the main UI work, both to facilitate a user-friendly experience and for the support it provides for graphs and charts.


For the back end, the choices are less clear cut, so let's start by listing some requirements.

Back End Requirements

Proposed Solutions
Amazon Cloudwatch

CloudWatch stores data for exactly two weeks.  To extend its life span, the data must be retrieved via the CloudWatch API and then stored in S3, DynamoDB, or Redshift.
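A sketch of the retrieve-and-archive step, using the boto3 AWS SDK for Python.  The bucket name, namespace, key layout, and hourly period are placeholder assumptions; only the CloudWatch `get_metric_statistics` and S3 `put_object` calls are real API:

```python
import json
from datetime import datetime, timedelta

def archive_key(metric_name, day):
    """Assumed S3 key layout: one JSON object per metric per day."""
    return "cloudwatch/%s/%s.json" % (metric_name, day.strftime("%Y-%m-%d"))

def archive_metric(metric_name, namespace, bucket, day):
    """Pull one day of CloudWatch datapoints before the retention
    window expires and archive them to S3 as JSON."""
    import boto3  # AWS SDK; imported here so the sketch loads without it
    cw = boto3.client("cloudwatch")
    stats = cw.get_metric_statistics(
        Namespace=namespace,          # e.g. a custom Synapse namespace
        MetricName=metric_name,
        StartTime=day,
        EndTime=day + timedelta(days=1),
        Period=3600,                  # hourly datapoints
        Statistics=["Sum", "Average"],
    )
    body = json.dumps(stats["Datapoints"], default=str)
    boto3.client("s3").put_object(
        Bucket=bucket, Key=archive_key(metric_name, day), Body=body)
```

A daily scheduled job calling `archive_metric` for each metric of interest would be enough to sidestep the two-week retention limit.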

Custom EC2 Instance

Pros:

Cons:

Google BigQuery

Pros:

Cons:

Data Requirements

User Activity Data

Given data set: a time-ordered list (or set of lists) of user auth events from the Crowd servers

Computed data points:

Persistent daily data needed:

Project Activity Data

Essentially the same source data as for user activity is available (or could be), and the method for calculating it is much the same: it is the same metric, computed for projects rather than users.  There may be additional information on what type of usage occurred (data access, data modification, data addition, etc.), but otherwise it is the same.

There is one catch to the assumption that the project metric is identical to the user metric: the project metric relies on identifying specific events that can only ever occur once, like a user from a new organization contributing to a project for the first time.  Because such events never repeat, this is a more interesting metric to try and compute.

Given data set: a set of individually time-ordered logs (one from each Synapse instance) of all web-service calls, logging:

Summary