Introduction

Auditing data for the Synapse REST API is captured by a Spring interceptor, AccessInterceptor, which is similar to a web filter. This interceptor is configured to listen to all web service calls made to the repository services. For each call, the AccessInterceptor gathers data to fill out an AccessRecord model object. The AccessRecord data is then written as zipped CSV files directly to the prod.access.record.sagebase.org S3 bucket. These CSV files are initially too small to process efficiently, so a worker process merges the files by hour.
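The hourly merge can be pictured as bucketing records by the hour of their timestamp. The following is a minimal sketch of that idea, not the actual worker: the function names hour_bucket and merge_by_hour are hypothetical, records are modeled as plain dictionaries, and the only field assumed is an epoch-millisecond "timestamp".

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(timestamp_ms):
    """Return a UTC 'YYYY-MM-DD-HH' label for an epoch-millisecond timestamp."""
    dt = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d-%H")


def merge_by_hour(records):
    """Group records (dicts with a 'timestamp' key) into one list per hour.

    Hypothetical helper: the real worker merges files in S3, which is
    not modeled here.
    """
    merged = defaultdict(list)
    for record in records:
        merged[hour_bucket(record["timestamp"])].append(record)
    # Keep each hourly batch in chronological order.
    for rows in merged.values():
        rows.sort(key=lambda r: r["timestamp"])
    return dict(merged)
```

Grouping on a UTC hour label keeps the merge independent of the time zone of the machine that runs it.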

...

  • returnedObjectId - For any method that returns an object with an ID, this column will contain the returned ID.  This is the only way to determine the ID of a newly created object from a POST.
  • elapseMS - The elapsed time of the call in milliseconds.
  • timestamp - The exact time the call was made in epoch time (milliseconds since 1/1/1970).
  • via - The value of the "via" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
  • host - The value of the "host" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
  • threadId - The ID of the thread used to process the request.
  • userAgent - The value of the "User-Agent" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
  • queryString - The value of the request queryString.
  • sessionId - For each call a new UUID is generated for the sessionId.  The sessionId is also bound to the logging thread context and written in all log entries.  This ties access records to log entries.
  • xForwardedFor - The value of the "X-Forwarded-For" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
  • requestURL - The URL of the request.
  • userID - For calls where the user is authenticated via a sessionToken or an API key, this column will contain the numeric ID of the user.
  • origin - The value of the "Origin" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
  • date - The year-month-day date string.
  • method - The HTTP method: GET, POST, PUT, or DELETE.
  • vmId - When each EC2 instance of a stack starts, a new unique identifier for the JVM is issued.  This is captured in the access log so calls from a single machine can be grouped together.
  • instance - The instance number of the stack.
  • stack - The stack identifier.  This will always be "prod" for production stacks.
  • success - Set to true when a call completes without an exception, otherwise set to false.  The stack trace of an exception can be found by searching the logs for the sessionId of the failed access record.
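As an illustration of how the success and sessionId columns work together, the sketch below reads a compressed CSV and collects the sessionIds of failed calls, which could then be searched for in the logs. It rests on several assumptions that the text above does not confirm: that the files are gzip-compressed, that they carry no header row, and that the columns appear in the order given by the hypothetical COLUMNS list (a small subset chosen for the example). The function names are also hypothetical.

```python
import csv
import gzip
import io

# Hypothetical column subset and order; the real file layout may differ.
COLUMNS = ["sessionId", "timestamp", "method", "success"]


def read_access_records(gz_bytes, columns=COLUMNS):
    """Decompress gzip bytes and parse each CSV row into a dict (assumes no header row)."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text, newline=""), fieldnames=columns)
    return list(reader)


def failed_session_ids(records):
    """Return the sessionId of every record whose success column is not 'true'."""
    return [r["sessionId"] for r in records if r["success"].strip().lower() != "true"]
```

A possible use: pass the bytes of one downloaded file to read_access_records, then search the service logs for each ID returned by failed_session_ids.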

Log Files

...

Audit Analysis

Analysis of the audit data can be done using EMR-Hive.  The AWS extensions to Hive include support for reading zipped CSV data directly from S3.  This means we can launch an Elastic MapReduce (EMR) Hive cluster and copy all access record data from S3 to the Hive cluster.  Once the data is loaded on the cluster, ad hoc and canned queries can be executed to generate reports or discover new trends.  The following outlines how to get started with the analysis of the Synapse audit data.
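Before launching a cluster, it can help to picture the kind of aggregate a canned query might compute. The sketch below runs locally in Python rather than in Hive and is only an illustration: the function names are hypothetical, and records are assumed to be dictionaries keyed by the column names described above (date, method, userID).

```python
from collections import Counter, defaultdict


def calls_per_day_and_method(records):
    """Count calls for each (date, method) pair."""
    return Counter((r["date"], r["method"]) for r in records)


def distinct_users_per_day(records):
    """Count distinct authenticated users per date; an empty userID is treated as anonymous."""
    users = defaultdict(set)
    for r in records:
        if r.get("userID"):
            users[r["date"]].add(r["userID"])
    return {day: len(ids) for day, ids in users.items()}
```

On a Hive cluster the same questions would be expressed as grouped queries over the loaded access record table.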

Launch an EMR Hive Cluster

Launch a new EMR Hive cluster (see Launch a Hive Cluster for more information).

Log into the Production AWS account and navigate to the Elastic MapReduce Service page and select "Create New Job Flow":


Select a Hive Program:


Choose "Start an Interactive Hive Session"


Choose the number and types of the EC2 instances for the cluster. There must be at least one "Master Instance Group" instance and two "Core Instance Group" instances.


Set the Amazon EC2 Key Pair to "prod-key-pair".  You will need the prod-key-pair later to SSH into the cluster.


No bootstrap is needed:


Final check for the cluster configuration.  Select "Create Job Flow" to launch the cluster:


Once the cluster is launched and active you can find the "Master Public DNS Name" needed to SSH into the cluster:


SSH into Hive Cluster

Once the cluster is active, SSH into the master node using the prod-key-pair and the cluster's "Master Public DNS Name":

Code Block
ssh -i prod-key-pair.pem hadoop@ec2-54-242-184-13.compute-1.amazonaws.com

Once connected to the Hive master node, start a new interactive Hive session by typing "hive":

Code Block
hadoop@ip-10-28-72-37:~$ hive