...

Proposed Solutions
Amazon CloudWatch

An attractive system for several reasons:

  • Ease of use - no setup or administration costs other than monetary
  • Integrated - since we're already heavily using Amazon's cloud services, using CloudWatch is a natural extension
  • Scalability - it's fully a push mechanism, so as long as one Synapse instance is set up correctly, they all are

However there are also some problems:

  • Length of storage - it's not completely clear whether CloudWatch does any data aggregation.  However, they are very clear that the original data is only kept for two weeks.  Since this is far less than the period we would like at least some of our metrics stored for, a supplementary data store may be needed if CloudWatch is used.
  • Data expectations - according to Amazon's promotional materials, CloudWatch is expected to be used for things like "CPU utilization, latency, and request counts".  These metrics are all time series: it is natural to measure them at consistent intervals and relate the data across time.  That is not well suited to the way we need to gather the data for User and Project Activity.  CloudWatch could still be useful as the data collection tool, but that seems like a waste of resources if it's only an auxiliary component to the real data store.

Overall: probably not suitable as the primary store.  However, real time-series data, like the number of active projects or users at any given time (or the number of inactive, aborted, etc.), would be perfect for storage in CloudWatch.  Depending on how long the data actually remains available, this might be a viable solution for that subset.
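
As a sketch of what pushing those daily activity counts to CloudWatch could look like: the payload-building part is shown as a pure function, and the actual API call is commented out since it needs credentials.  The namespace, metric names, and counts here are invented for illustration, not decided.

```python
import datetime

def build_metric_data(active_projects, active_users, timestamp):
    """Build a CloudWatch PutMetricData payload for the activity counts.

    Metric names are hypothetical examples, not agreed-upon names.
    """
    return [
        {
            "MetricName": "ActiveProjects",
            "Timestamp": timestamp,
            "Value": float(active_projects),
            "Unit": "Count",
        },
        {
            "MetricName": "ActiveUsers",
            "Timestamp": timestamp,
            "Value": float(active_users),
            "Unit": "Count",
        },
    ]

# Pushing would then be a single API call (requires boto3 and AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_data(
#     Namespace="Synapse/Activity",  # hypothetical namespace
#     MetricData=build_metric_data(42, 317, datetime.datetime.utcnow()),
# )
```

Since this is purely push, each Synapse instance could emit its own datapoints and CloudWatch would do the cross-instance aggregation.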
Custom EC2 Instance with RDS backing

CloudWatch stores data for exactly two weeks.  To extend the life span, the data must be retrieved via the CloudWatch API and then stored in S3, DynamoDB, or Redshift.

Custom EC2 Instance

Pros:

  • No storage limitations except cost
  • No data expectations to work around
  • The EC2 instance would act as both the data collection mechanism and the data store, allowing it to keep its back-end storage in a consistent state.

Cons:

  • Another custom application/library to build and maintain
  • It's not clear what the best way to implement the data collection would be.
Google BigQuery

Pros:

  • No storage limitations except cost
  • Can store the data in full detail
  • Fast search times, with no pre-calculated search parameters
    • Exposing a new metric is as simple as thinking of it, implementing the fetch from BigQuery, and revealing it in the UI
  • The infrastructure is built for us.  Amazon provides all the pieces to build a system like BigQuery, or at least one that solves the same problems, but they're pieces, not a product.

Cons:

  • It's a Google technology, not Amazon's, thus doubling the number of accounts, maintenance burdens, etc.
  • I feel like there may be others, but I can't think of them right now
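
To make the "new metric is just a query" point concrete: with full-detail events stored in BigQuery, adding a metric is mostly writing SQL.  A sketch that only builds the query string (the table and column names are invented; actually running it would need a BigQuery client and account):

```python
def new_metric_query(table, days=30):
    """Build a BigQuery standard-SQL query for a hypothetical new
    metric: distinct active users per project over the last `days` days.

    `table` and the column names (project_id, user_id, event_time)
    are illustrative assumptions about the eventual schema.
    """
    return (
        "SELECT project_id, COUNT(DISTINCT user_id) AS active_users "
        "FROM {table} "
        "WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), "
        "INTERVAL {days} DAY) "
        "GROUP BY project_id"
    ).format(table=table, days=days)
```

No pre-aggregation or schema migration is needed; the raw events already support the question.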

Data Requirements

User Activity Data

Given data set: A time-ordered list (or set of lists) of user auth-events from the crowd servers

...

  • Current activity status
  • Creation date
  • Last login date
  • Number of total logins
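
The metrics above can all be derived in a single pass over the time-ordered event list.  A sketch, where the event record shape and the 30-day activity window are assumptions:

```python
from datetime import datetime, timedelta

def user_activity(auth_events, now, active_window_days=30):
    """Derive per-user metrics from a time-ordered list of auth events
    (dicts with 'user', 'timestamp', 'event').

    Assumptions: creation date is the timestamp of the user's first
    event, and "active" means a login within the last 30 days.
    """
    stats = {}
    for ev in auth_events:
        s = stats.setdefault(ev["user"], {
            "creation_date": ev["timestamp"],  # first event seen
            "last_login": None,
            "total_logins": 0,
        })
        if ev["event"] == "login":
            s["last_login"] = ev["timestamp"]
            s["total_logins"] += 1
    window = timedelta(days=active_window_days)
    for s in stats.values():
        s["active"] = (s["last_login"] is not None
                       and now - s["last_login"] <= window)
    return stats
```
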
Project Activity Data

Essentially the same source data as for User Activity is available (or could be), and the method for calculating it is essentially the same - it's the same metric, just for projects rather than users.  There may be additional information about what type of usage occurred (data access, data modification, data addition, etc.), but otherwise it is the same.

Summary

It seems like the best option for the back end would be a combination of a custom data storage system and Amazon CloudWatch.  The custom data store would hold "windowed" data, that is, high-resolution raw data (like per-user login events with timestamps, or project activity events) for a specifically limited period of time.  This data would then be periodically (daily) aggregated and pushed to CloudWatch.  Then, if it turns out that we want to store data at a higher resolution or for a longer period than CloudWatch does, we can export it to our own round-robin database tool (possibly RRDtool).

There is one catch to the assumption that the project metric is identical to the user metric.  The project metric relies on identifying specific events that can only ever occur once, like a user from a new company contributing to a project.  This only happens once, making it a much more interesting metric to try to compute.
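
The once-only events described in that catch can be picked out of a time-ordered activity log with a simple seen-set.  A sketch, where the record shape (project, company) is an invented example of such an event:

```python
def first_occurrence_events(activity_log):
    """Scan a time-ordered activity log and keep only first
    occurrences - here, the first time any user from a given
    company contributes to a given project.

    The (project, company) key is a hypothetical example of a
    once-only event, not a fixed schema.
    """
    seen = set()
    firsts = []
    for rec in activity_log:
        key = (rec["project"], rec["company"])
        if key not in seen:
            seen.add(key)
            firsts.append(rec)
    return firsts
```

Note this requires scanning the log in order (or keeping the seen-set persistent), which is what makes the project metric harder than a simple per-period count.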

Given data set: A set of individually time-ordered logs (one from each Synapse instance) of all web-service calls, logging:

  • user id
  • entity id / project path of ids (up to the root project)
  • time stamp
  • profiling data (elapsed time)
  • what call was made (what kind of a change)
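
Given records with the fields above, a per-project rollup is a straightforward fold over the merged logs.  A sketch, where the field names, units, and the assumption that the root project id is the last element of the entity path are all guesses about the eventual log format:

```python
from collections import defaultdict

def rollup_by_project(records):
    """Aggregate web-service call records (dicts with the fields
    listed above) into per-project call counts, total elapsed time,
    and distinct users.
    """
    rollup = defaultdict(lambda: {"calls": 0, "elapsed_ms": 0, "users": set()})
    for rec in records:
        # Assumption: the path runs child-first, root project last.
        project = rec["entity_path"][-1]
        agg = rollup[project]
        agg["calls"] += 1
        agg["elapsed_ms"] += rec["elapsed_ms"]
        agg["users"].add(rec["user_id"])
    return dict(rollup)
```

Merging the per-instance logs first (they are each individually time-ordered) would make this the natural daily-aggregation step feeding CloudWatch.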

Summary