...
- The actual dashboard component, or the UI (front end)
- The data storage/collection mechanism (back end)
The dashboard is a fairly straightforward piece of software: it simply takes information from the storage mechanism (whatever that might be) and displays it to the end user. For this end of the project it seems natural to continue using technologies that are already in use, namely GWT. Specifically, GXT 3 seems like a strong candidate for the main UI work, both because it facilitates a user-friendly experience and because of the support it provides for graphs and charts.
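As a rough sketch of how thin the dashboard layer could be, the GWT RPC contract between the UI and the back end might look something like the interface below. The service name, path, and method signature are hypothetical, not an existing Synapse API.

```java
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;
import java.util.Date;
import java.util.List;

// Hypothetical GWT RPC contract between the GXT dashboard and the back end.
@RemoteServiceRelativePath("metrics")
public interface MetricsService extends RemoteService {
  // Fetch the pre-computed values of a named metric over a time window.
  // The dashboard only reads; it never triggers recomputation.
  List<Double> getMetric(String metricName, Date from, Date to);
}
```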
...
- Round Robin type data store - Basically any kind of storage that is fixed in size and does semi-automatic data aggregation. The basic idea is that for a period X you keep the full details of whatever data you log. After X has elapsed, the data is aggregated in one or several ways (average, minimum, maximum), and this aggregate data is stored for another period Y. Repeat until the data is no longer relevant and can be dropped from the store (a minimal sketch of this scheme follows this list).
- Data Collection - Since Synapse is on Amazon's Elastic Beanstalk, usage data may need to be aggregated from several different Synapse instances. In addition, certain data (like user activity) is most easily collected from sources other than the services themselves (like the Crowd servers). Thus some kind of data collection mechanism is needed.
- Data Interpretation - Since both metrics proposed so far (user and project activity levels) are somewhat expensive to compute (if computing them on demand is even possible), ideally the front-end GUI will never request that this data be recomputed. Some background process - whether hosted in the metrics web server or run independently - is needed to do any pre-processing of the data before it is entered into the data store.
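A minimal sketch of the round-robin scheme described above, assuming in-memory storage, fixed capacities, and average-only aggregation (a real store would also keep minimum/maximum and persist to disk):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a round-robin store: raw samples are kept for a fixed-size
// window; once the window fills, it is collapsed into a single aggregate
// (here: the average) kept in a second, coarser fixed-size ring.
public class RoundRobinStore {
  private final int rawCapacity;
  private final int aggregateCapacity;
  private final Deque<Double> raw = new ArrayDeque<>();
  private final Deque<Double> aggregates = new ArrayDeque<>();

  public RoundRobinStore(int rawCapacity, int aggregateCapacity) {
    this.rawCapacity = rawCapacity;
    this.aggregateCapacity = aggregateCapacity;
  }

  public void add(double sample) {
    raw.addLast(sample);
    if (raw.size() < rawCapacity) {
      return;
    }
    // Period X has elapsed for this window: collapse it to one aggregate.
    double sum = 0;
    for (double v : raw) {
      sum += v;
    }
    aggregates.addLast(sum / raw.size());
    raw.clear();
    // Period Y has elapsed for the oldest aggregate: drop it for good.
    if (aggregates.size() > aggregateCapacity) {
      aggregates.removeFirst();
    }
  }
}
```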
Proposed Solutions
Amazon CloudWatch
CloudWatch retains data for only two weeks. To extend the data's life span, it must be retrieved via the CloudWatch API and then stored elsewhere, such as S3, DynamoDB, or Redshift. A sketch of this retrieval is shown below.
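A sketch of what that retrieval might look like with the AWS SDK for Java; the namespace and metric name are assumptions, and the datapoints would then be written to whichever long-term store is chosen:

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.Datapoint;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsResult;
import java.util.Date;

// Pull a metric out of CloudWatch before the two-week retention expires.
public class CloudWatchExporter {
  public static void main(String[] args) {
    AmazonCloudWatchClient client = new AmazonCloudWatchClient();
    GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
        .withNamespace("Synapse")           // assumed namespace
        .withMetricName("WebServiceCalls")  // assumed metric
        .withStartTime(new Date(System.currentTimeMillis() - 86400000L))
        .withEndTime(new Date())
        .withPeriod(3600)                   // 1-hour buckets
        .withStatistics("Average", "Maximum");
    GetMetricStatisticsResult result = client.getMetricStatistics(request);
    for (Datapoint dp : result.getDatapoints()) {
      // Persist dp.getTimestamp(), dp.getAverage(), dp.getMaximum() ...
      System.out.println(dp.getTimestamp() + " avg=" + dp.getAverage());
    }
  }
}
```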
Custom EC2 Instance
Pros:
- No storage limitations except cost
- No data expiration to work around
- This EC2 instance would be able to act as both the data collection mechanism and the data store, allowing it to keep its back-end storage mechanism in a consistent state.
Cons:
- Another custom application/library to build and maintain
- It is not clear what the best way to implement the actual data collection would be (one possible shape is sketched below).
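For illustration only, one plausible shape for the collection side is a scheduled poller; every name here is a placeholder rather than an existing Synapse or Crowd API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A periodic poller that pulls from each data source (Synapse instances,
// Crowd servers, ...) and writes into the local store.
public class MetricsCollector {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    scheduler.scheduleAtFixedRate(this::collectOnce, 0, 5, TimeUnit.MINUTES);
  }

  private void collectOnce() {
    // For each configured source, fetch the latest raw events and append
    // them to the store in one transaction, so the back-end storage
    // mechanism stays in a consistent state.
  }
}
```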
Google BigQuery
Pros:
- No storage limitations except cost
- Can store the data in full detail
- Fast search times, with no pre-calculated search parameters
- Thus exposing a new metric is as simple as thinking of it, implementing the fetch from BigQuery, and revealing it in the UI (see the query sketch after this list)
- The infrastructure is built for us. Amazon provides all the pieces to make a system like BigQuery, or at least one that solves the same problems, but they are pieces, not a product.
Cons:
- It is a Google technology, not an Amazon one, thus doubling the number of accounts, maintenance tasks, etc.
- I feel like there may be others, but I can't think of them right now
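To make the "new metric is just a query" point concrete, here is a sketch using Google's BigQuery Java client; the dataset, table, and column names are assumptions about how the logs might be loaded, not an existing schema:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// Compute a hypothetical "calls per user over the last 30 days" metric
// directly from the full-detail logs, with no pre-aggregation.
public class BigQueryMetric {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
        "SELECT user_id, COUNT(*) AS calls "
            + "FROM `metrics.web_service_log` "
            + "WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) "
            + "GROUP BY user_id")
        .build();
    TableResult result = bigquery.query(query);
    for (FieldValueList row : result.iterateAll()) {
      System.out.println(row.get("user_id").getStringValue()
          + " -> " + row.get("calls").getLongValue());
    }
  }
}
```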
Data Requirements
User Activity Data
Given data set: A time-ordered list (or set of lists) of user auth-events from the Crowd servers
Computed data points:
- For each user, a list of recent logins (either a fixed number or a time window)
- Activity status - New, Aborted, Active, Inactive. This could be computed on a daily basis from the current activity status and the login record for that day. This method of continuous calculation would also make it easier to detect changes to someone's status (e.g. to put them on a list of users that transitioned to a new status). A sketch of this computation follows the lists below.
Persistent daily data needed:
- Current activity status
- Creation date
- Last login date
- Number of total logins
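A sketch of the daily status computation described above, using only the persistent daily data just listed; the thresholds (7 days for New, 30 idle days for Inactive) and the exact status rules are illustrative assumptions, not agreed definitions:

```java
import java.util.Date;

// Daily activity-status classification from the persistent per-user data:
// creation date, last login date, and total login count.
public class ActivityStatus {
  enum Status { NEW, ABORTED, ACTIVE, INACTIVE }

  static final long DAY_MS = 24L * 60 * 60 * 1000;

  static Status compute(Date created, Date lastLogin, int totalLogins, Date today) {
    long ageDays = (today.getTime() - created.getTime()) / DAY_MS;
    if (ageDays <= 7) {
      return Status.NEW;      // account is brand new
    }
    if (totalLogins <= 1) {
      return Status.ABORTED;  // signed up but never came back
    }
    long idleDays = (today.getTime() - lastLogin.getTime()) / DAY_MS;
    return idleDays <= 30 ? Status.ACTIVE : Status.INACTIVE;
  }
}
```

Detecting status transitions then reduces to comparing yesterday's stored status with today's computed one.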
Project Activity Data
Essentially the same source data as for user activity is available (or could be made available), and the method for calculating the metric is essentially the same; it is the same metric, just for projects instead of users. There may be additional information about what type of usage occurred (data access, data modification, data addition, etc.), but otherwise it is the same.
There is one catch to the assumption that the project metric is identical to the user metric: the project metric also relies on identifying specific events that can only ever occur once, like a user from a new company contributing to a project. Because such an event happens only once, this is a much more interesting metric to try to compute.
Given data set: A set of individually time-ordered logs (one from each Synapse instance) of all web-service calls, each logging (a sketch of such a record follows this list):
- user id
- entity id / project path of ids (up to the root project)
- time stamp
- profiling data (elapsed time)
- what call was made (what kind of a change)
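A sketch of one such log record, together with the kind of first-occurrence check the project metric would need for events that can only happen once; all field and method names are illustrative:

```java
import java.util.Date;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// One web-service call record, matching the fields listed above.
public class WebServiceCall {
  String userId;
  List<String> entityPath;  // entity id / project path up to the root project
  Date timestamp;
  long elapsedMillis;       // profiling data
  String callType;          // what kind of a change was made

  // Detect "first ever" events, e.g. the first contribution to a project
  // from a given company; Set.add returns true only the first time the
  // (project, company) pair is seen.
  static boolean isFirstContribution(Set<String> seen, String projectId, String company) {
    return seen.add(projectId + "|" + company);
  }
}
```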
Summary