Description
Many modern web applications have ways to aggregate metrics on how the application is being used in practice and present them to those developing the application. For example, sites like Amazon have rich and detailed information on which features get used and which of them translate into greater levels of purchasing on the site. Synapse currently has a small user base, and we can currently get a lot of information by talking directly to the users. However, real metrics on actual user behavior provide another window into how the system is (or isn't) being used, and can answer a variety of questions of interest to the business.
At the beginning of the year we defined two measurements of Synapse use that we wanted to track, and that fed into our yearly objectives. These form the basis for initial dashboard design.
User Activity
The highest-level measure of our user community is the number of people who regularly return to the site. Let's define the following:
- New User - someone who has created a Synapse account, has not yet logged in 3 times, and is less than 30 days from their account creation date.
- Aborted User - someone who created a Synapse account more than 30 days ago, but has never become an Active User.
- Active User - someone who has logged into Synapse at least 3 times in the previous 30-day window.
- Inactive User - someone who was previously Active, but is no longer.
We'd like to track the number of users in each category over time, at the granularity of a day. We'd like to know who they are, and especially to highlight changes in a user's categorization. We'd like to know the date and time of their last login, and possibly drill in to get a complete login history (at least over some recent window of time).
We'd like to be able to break these out by the user's organization. We can either use email domain as a proxy for organization where possible, or collect organization as part of the user's profile if we can get it there.
In our company objectives, we define a usage metric N based on the number of Active Users. In this calculation we exclude platform-team and administrative users, and count the first user from each unique organization double. So, if our Active Users are 5 Sage scientists, 3 people from U of X, and 1 person from U of Y, then N = 12. Our September "Achieved" and "Stretch" goals are N = 30 and N = 50 respectively. We want to plot N over time on the dashboard.
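As a sanity check, the N calculation can be sketched in Python. The data shapes (a list of (user, organization) pairs for Active Users, and a set of excluded usernames) are assumptions for illustration:

```python
def compute_n(active_users, excluded_users=frozenset()):
    """active_users: iterable of (username, organization) tuples for
    users currently classified as Active."""
    seen_orgs = set()
    n = 0
    for user, org in active_users:
        if user in excluded_users:   # platform team / administrative users
            continue
        n += 1
        if org not in seen_orgs:     # first Active user from a new organization
            n += 1                   # counts double
            seen_orgs.add(org)
    return n

# Example from the text: 5 Sage scientists, 3 from U of X, 1 from U of Y.
active = ([("sage%d" % i, "sagebase.org") for i in range(5)]
          + [("x%d" % i, "x.edu") for i in range(3)]
          + [("y0", "y.edu")])
print(compute_n(active))  # -> 12
```

This reproduces the N = 12 example above: 9 eligible users, plus one extra count for each of the 3 organizations represented.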
Project Activity
The objective of Synapse is to be a place where people actively work together, not just a place where people go to get interesting data. One measure of active use is whether people have created a project to hold their own work. Similar to users, let's define the following:
- New Project - created less than 30 days ago and not yet Active.
- Aborted Project - created more than 30 days ago and never Active.
- Active Project - someone has posted content to the project in two different sessions in the past 30 days.
- Collaborative Project - at least two different users have posted content to the project in the past 30 days. All Collaborative Projects are also Active Projects.
- Networked Project - at least two different users from two different organizations have posted content to the project in the past 30 days. All Networked Projects are also Collaborative and Active Projects.
- Inactive Project - a project that was once Active but no longer is.
In our objectives, we define P = the number of Active Projects (new content posted to the project in the last month by 2 different users; count a project double if its users span multiple organizations; count a project double if it is linked to a publication). Our September "Achieved" and "Stretch" goals are P = 20 and P = 30 respectively. We think we can exclude the publication-linked bonus for now, and just plot P over time.
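The P calculation can be sketched similarly, with the publication-linked bonus excluded as suggested. The input shape (one set of contributing organizations per Active Project) is an assumption for illustration:

```python
def compute_p(active_project_orgs):
    """active_project_orgs: one set of organizations per Active Project,
    covering the users who posted content to it in the last month."""
    # A project spanning multiple organizations counts double.
    return sum(2 if len(orgs) > 1 else 1 for orgs in active_project_orgs)

# Three Active Projects: two single-organization, one spanning two orgs.
print(compute_p([{"sagebase.org"}, {"x.edu"}, {"sagebase.org", "y.edu"}]))  # -> 4
```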
We will want to drill into Active Projects and see which users are working on them.
Requirements
- The dashboard will surface information about how users are actually behaving on Synapse, so that Sage's engineering and management teams can drill into this data to make a variety of decisions related to the product roadmap and the development of partnerships. The dashboard must serve both technical and non-technical users (e.g. it must be usable by the CEO).
- In the short term, we want to limit visibility of the dashboard to Sage employees, but make it very easy for all Sage employees to get at the dashboard. Embedding the dashboard in either the Sage intranet or a Confluence wiki page would be a great way to accomplish this.
- In the long term we might want to also surface some metrics about Synapse on Synapse itself, or expose a public API for others to access and mine our metrics for a variety of purposes. For example, a large data generator might want to be able to find out who is using their data. Journals or funding agencies might want to assess the impact of work performed on Synapse. However, we probably will always want a separate Sage-only dashboard that may be more specific or tailored to our needs than what we'd put on Synapse itself for public consumption.
- We are interested in observing long-term trends in user behavior over the course of months or years. We will want to demonstrate uptake of the technology for purposes like raising grant money to continue Synapse development. We are also interested in short term snapshots, e.g. what users have recently become active / inactive in last 30 days that might require someone making contact with the user and understanding what has happened.
- It's not necessary that this be an operational dashboard for technical people to monitor and troubleshoot the performance of Synapse or its components. Cloudwatch-type metrics on things like load on different infrastructure components are a different category of metric, and can be managed separately.
- We expect the specific metrics gathered to start off high level and general, and to continuously evolve and become more granular as we generate more questions to ask of Synapse about its users. We want to make it easy for new developers to incrementally add to the dashboard. For example, a new developer might develop a new feature and add new custom metrics to measure how the feature is actually used in production by live users.
- We want to capture activity from both the web application and the analytical client tools. Note that we have turned on Google Analytics for the Synapse web application at https://www.google.com/analytics/ (log in with account infrastructure@sagebase.org; password in the usual place). We don't want to duplicate things in our metrics system that we get for free from Google Analytics.
Design Options
There are several components to a tracking system such as this:
- The actual dashboard component, or the UI (front end)
- The data storage/collection mechanism (back end)
The dashboard is a fairly straightforward piece of software: it simply takes information from the storage mechanism (whatever that might be) and displays it to the end user. For this end of the project it seems natural to continue using technologies that are already in use, namely GWT. Specifically, GXT 3 seems like a strong candidate for the main UI work, both to facilitate a user-friendly experience and for the support it provides for graphs and charts.
For the back end, the choices are less clear cut, so let's start by listing some requirements.
Back End Requirements
- Round-robin-type data store - basically any kind of storage that is fixed in size and does semi-automatic data aggregation. The basic idea is that for a period X you keep the full details of whatever data you log. After X has elapsed, the data is aggregated in one or several ways (average, minimum, maximum), and the aggregate data is stored for another period Y. Repeat until the data is no longer relevant and can be dropped from the store.
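The round-robin idea can be sketched as a pair of fixed-size buffers: raw samples for period X, then (min, avg, max) summaries for period Y. The class name and capacities below are illustrative, not a proposed implementation:

```python
from collections import deque

class RoundRobinArchive:
    """Minimal sketch of a round-robin store: keep raw samples for a fixed
    window, then fold them into a coarser, fixed-size archive of
    (min, avg, max) summaries, dropping the oldest data automatically."""

    def __init__(self, raw_capacity, summary_capacity):
        self.raw = deque(maxlen=raw_capacity)            # period X: full detail
        self.summaries = deque(maxlen=summary_capacity)  # period Y: aggregates

    def record(self, value):
        self.raw.append(value)
        if len(self.raw) == self.raw.maxlen:
            self._aggregate()

    def _aggregate(self):
        samples = list(self.raw)
        self.summaries.append(
            (min(samples), sum(samples) / len(samples), max(samples)))
        self.raw.clear()  # raw detail is dropped once summarized

rra = RoundRobinArchive(raw_capacity=4, summary_capacity=2)
for v in [1, 2, 3, 4, 10, 20, 30, 40]:
    rra.record(v)
print(list(rra.summaries))  # -> [(1, 2.5, 4), (10, 25.0, 40)]
```

Tools like RRDtool implement the same pattern with multiple aggregation tiers and time-based (rather than count-based) windows.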
- Data Collection - Since Synapse runs on Amazon's Elastic Beanstalk, usage data may need to be aggregated from several different Synapse instances. In addition, certain data (like user activity) is most easily collected from sources other than the services themselves (like the Crowd servers). Thus some kind of data collection mechanism is needed.
- Data Interpretation - Since both metrics proposed so far (user and project activity levels) are somewhat expensive to compute (if computing them on demand is even possible), ideally the front-end GUI will never request that this data be recomputed. Some background process, whether hosted in the metrics web server or run independently, is needed to do any pre-processing of the data before it is entered into the data store.
Proposed Solutions
Amazon Cloudwatch
An attractive system for several reasons:
- Ease of use - no setup/administration costs other than monetary
- Integrated - since we're already heavily using Amazon's Cloud services, using Cloudwatch is a pretty natural extension
- Scalability - it is fully a push mechanism, so as long as one Synapse instance is set up correctly, they all are.
However there are also some problems:
- Length of storage - it's not completely clear whether Cloudwatch does any data aggregation. However, Amazon is very clear that it only keeps the original data around for two weeks. Since this is far less than the period of time for which we would like at least some of our metrics to be stored, a supplementary data store may be needed if Cloudwatch is used.
- Data expectations - According to Amazon's promotional materials, they expect Cloudwatch to be used for things like "CPU utilization, latency, and request counts". These metrics are all time series: it is natural to measure them at consistent intervals and relate the data across time. This is not entirely suited to the way we need to gather data for user and project activity. While Cloudwatch could still be useful as the data collection tool, that seems like a waste of resources if it's just going to be an auxiliary component to the real data store.
Custom EC2 Instance with RDS backing
Pros:
- No storage limitations except cost
- No data expectations to work around
- This EC2 instance could act as both the data collection mechanism and the data store, allowing it to keep its back-end storage mechanism in a consistent state.
Cons:
- Another custom application/library to build and maintain
- It's not clear what the best way to actually implement the data collection would be.
Data Requirements
User Activity
Given data: a time-ordered list (or set of lists) of user auth events from the Crowd servers.
Computed data points:
- For each user, a list of recent logins (either a fixed number or for a window)
- Activity status - New, Aborted, Active, or Inactive. This could be computed on a daily basis from the current activity status and the login record for the day. This method of continuous calculation would also make it easier to detect changes to someone's status (e.g. to put them on a list of users that transitioned to a new status).
Persistent daily data needed:
- Current activity status
- Creation date
- Last login date
- Number of total logins
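The daily status computation described above can be sketched as follows. The function signature and data shapes are assumptions; the thresholds (3 logins, 30 days) come from the User Activity definitions:

```python
from datetime import date, timedelta

def user_status(today, created, logins, was_active):
    """Classify a user as New, Aborted, Active, or Inactive.
    logins: list of login dates; was_active: whether the user has ever
    previously reached Active status."""
    window_start = today - timedelta(days=30)
    recent_logins = sum(1 for d in logins if d >= window_start)
    if recent_logins >= 3:
        return "Active"                      # 3+ logins in the 30-day window
    if (today - created).days < 30:
        return "New"                         # young account, not yet Active
    return "Inactive" if was_active else "Aborted"

today = date(2012, 6, 1)
print(user_status(today, date(2012, 5, 20), [date(2012, 5, 21)], False))  # -> New
print(user_status(today, date(2012, 3, 1), [], False))                    # -> Aborted
print(user_status(today, date(2012, 3, 1),
                  [date(2012, 5, 10), date(2012, 5, 15), date(2012, 5, 20)],
                  True))                                                  # -> Active
```

Running this once per day against the persisted fields listed above (current status, creation date, login record) is enough to both update each user's status and emit transitions.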
Project Activity
Essentially the same source data is available (or could be made available) as for user activity, and the method for calculating the metric is much the same, just applied to projects instead of users. There may be additional information on what type of usage occurred (data access, data modification, data addition, etc.), but otherwise it is the same.
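The project classification can be sketched in the same style as the user classification. The tuple shape for recent posts is an assumption; the category logic follows the Project Activity definitions (Networked implies Collaborative, which implies Active):

```python
def project_status(recent_posts, created_days_ago, was_active):
    """recent_posts: list of (user, organization, session_id) tuples for
    content posted to the project in the past 30 days."""
    users = {u for u, _, _ in recent_posts}
    orgs = {o for _, o, _ in recent_posts}
    sessions = {s for _, _, s in recent_posts}
    if len(orgs) >= 2:
        return "Networked"       # 2+ users from 2+ organizations
    if len(users) >= 2:
        return "Collaborative"   # 2+ users, single organization
    if len(sessions) >= 2:
        return "Active"          # one user, two different sessions
    if created_days_ago < 30:
        return "New"
    return "Inactive" if was_active else "Aborted"

print(project_status([("a", "x.edu", 1), ("b", "y.edu", 2)], 10, False))  # -> Networked
print(project_status([("a", "x.edu", 1), ("a", "x.edu", 2)], 10, False))  # -> Active
print(project_status([], 40, False))                                      # -> Aborted
```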
Summary
It seems like the best option for the back end would be a combination of a custom data storage system and Amazon Cloudwatch. The custom data store would hold "windowed" data, that is, high-resolution raw data (like per-user login events with timestamps, or project activity events) for a specifically limited period of time. This data would then be aggregated periodically (daily) and pushed to Cloudwatch. Then, if it turns out that we want to store data at a higher resolution or for a longer period than Cloudwatch supports, we can export it to our own round-robin database tool (possibly RRDtool).
Below is a diagram of the basic data flow for the application. The main tool that needs to be implemented is the "Collector" aspect of the diagram below. This can probably be implemented as an Amazon Simple Workflow worker.
Open Questions
There are a number of other implementation questions which are unresolved.