Content Comparison

Introduction

Auditing data for the Synapse REST API is captured by a Spring Interceptor: AccessInterceptor that is similar to web filter. This interceptor is configured to listen to all web services calls made to the repository services. For each call, the AccessInterceptor will gather data to fill out an AccessRecord model object. The AccessRecord data is then written as zipped CSV files directly to the prod.access.record.sagebase.org S3 bucket. These CSV files are initially too small to process efficiently so a worker process merges the files by hour.

AccessRecord S3 Files

All AccessRecord CSV data for a single hour from all EC2 instance of a stack are into a single file. The following is an example of the resulting path:

...

returnObjectId	elapseMS	timestamp	host	threadId	userAgent	queryString	sessionId	requestURL	userId	date	method	vmId	instance	stack	success
	43	1379430033942	repo-prod-13.prod.sagebase.org	659	Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0		b4331f55-6c65-4f6b-a2c5-ee6830cf7641	/repo/v1/entity/header	273978	2013-09-17	POST	eca4eb39c13ac98c:7461944:1412973c3ee:-7ffd	13	prod	true
	30	1379430033943	repo-prod-13.prod.sagebase.org	656	Jakarta Commons-HttpClient/3.1	query=select+id,name,nodeType+from+entity+where+parentId+==+%22syn2228808%22+limit+500+offset+1	b1b9c385-4dba-4e3a-b49c-0f40f7c99ac5	/repo/v1/query	273978	2013-09-17	GET	eca4eb39c13ac98c:7461944:1412973c3ee:-7ffd	13	prod	true
	14	1379430034027	repo-prod-13.prod.sagebase.org	1177	Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0	mask=64	597767ef-8ff2-40d0-a65d-b519f5b2f937	/repo/v1/entity/syn2228808/bundle	273978	2013-09-17	GET	9b5a47b65e8703f0:229cd7a3:1412973c18a:-7ffd	13	prod	true
syn2228807	35	1379430034057	repo-prod-13.prod.sagebase.org	159	Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0		e9b15054-dbc6-454b-a1bf-8bef3d5f0fbc	/repo/v1/entity/syn2228808/benefactor	273978	2013-09-17	GET	9b5a47b65e8703f0:229cd7a3:1412973c18a:-7ffd	13	prod	true
syn2228807	19	1379430034107	repo-prod-13.prod.sagebase.org	153	Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0		23216ee3-dade-43ac-8efe-fa1e6dc9877d	/repo/v1/entity/syn2228807/acl	273978	2013-09-17	GET	9b5a47b65e8703f0:229cd7a3:1412973c18a:-7ffd	13	prod	true
59638	39	1379430034123	repo-prod-13.prod.sagebase.org	656	Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0		d7ed19dd-2ed9-47d1-b345-be1aaca0d688	/repo/v1/entity/syn2228808/wiki	273978	2013-09-17	GET	eca4eb39c13ac98c:7461944:1412973c3ee:-7ffd	13	prod	true

Column Description

returnedObjectId - For any method that returns an object with an ID, this column will contain the returned ID. This is the only way to determine the ID of a newly created object from a POST.
elaseMS - The elapse time of the call in milliseconds.
timestamp - The exact time the call was made in epoch time (milliseconds since 1/1/1970).
via - The value of the "via" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
host - The value of the "host" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
threadId- The ID of the thread used to process the request.
userAgent - The value of the "User-Agent" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
queryString - The value of the request queryString.
sessionId - For each call a new UUID is generated for the sessionId. The sessionId is also bound to the logging thread context and written in all log entries. This ties access records to log entries.
xForwardedFor - The value of the "X-Forwarded-For" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
resquestURL - The URL of the request.
userID - For calls where the users is authenticated via a sessionToken or an API key, this column will contain the numeric ID of the user.
origin - The value of the "Origin" header (see: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
date - The year-month-day date string.
method - The HTTP method: GET, POST, PUT, DELETE
vmId - When each EC2 instances of a stack starts, a new unique identifier for the JVM is issued. This is captured in the access log so calls form a single machine can be grouped together.
instance - The instance number of the stack.
stack - The stack identifier. This will always be "prod" for production stacks.
success - Set to true when a call complete without an exception, otherwise set to false. The stack trace of exceptions can be found by searching the logs for the the sessionId of any failed access records.

Audit Analysis

Analysis of the audit data can be done using EMR-Hive. The AWS extensions to Hive include support for reading zipped CSV data directly from S3. This means we can launch an Elastic Map Reduce (EMR) Hive cluster and copy all access record data from S3 to the Hive cluster. Once the data is loaded on the cluster add-hock and canned quires can be executed to generate reports or discover new trends. The following outlines how to get started with the analysis of the Synapse Audit data.

Launch a EMR Hive Cluster

Launch a new EMR Hive cluster (see Launch a Hive Cluster for more information).

...

Once the cluster is launched and active you can find the "Master Public DNS Name" needed to SSH into the cluster:

SSH into Hive Cluster

Once the cluster is active SSH into the master node using the prod-key-pair (see: AccessingInstancesLinux for more information). Note: the user name must be "hadoop":

...

Code Block
hadoop@ip-10-28-72-37:~$ hive

Creating Hive Tables

Once you have an interactive Hive session with the master node of the cluster you are ready to setup the tables that will be used for analysis. First we must create the external table for access record CSV data in S3:

...

For this example, the count query against the local table ran twice as fast as the same query against the external table (65 secs vs 175 secs).

Running Analysis

Once the tables are setup an populated with data, the real analysis can start. Here is an example query used to the find the distinct "userAgent" strings used to make calls and the counts for each:

...

Code Block

MapReduce Total cumulative CPU time: 31 seconds 120 msec
Ended Job = job_201309172347_0007
Counters:
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 31.12 sec   HDFS Read: 521592826 HDFS Write: 2563 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 120 msec
OK
47319
8527    "Jakarta Commons-HttpClient/3.1"
285949  "Synpase-Java-Client/12.0-2-ge17e722  Synapse-Web-Client/12.0-4-gafe76ad "
6615    "Synpase-Java-Client/12.0-2-ge17e722"
6767    "Synpase-Java-Client/13.0"
508     "Synpase-Java-Client/develop-SNAPSHOT  JayClearingPreviews"
1498    "Synpase-Java-Client/develop-SNAPSHOT  Synapse-Web-Client/develop-SNAPSHOT"
145     "Synpase-Java-Client/develop-SNAPSHOT"
1       "python-requests/1.2.0 CPython/2.7.2 Darwin/11.4.2"
1190    "python-requests/1.2.0 CPython/2.7.4 Darwin/11.4.2"
76      "python-requests/1.2.3 CPython/2.7.2 Darwin/12.4.0"
9       "python-requests/1.2.3 CPython/2.7.3 Linux/3.2.0-36-virtual"
16      "python-requests/1.2.3 CPython/2.7.3 Linux/3.5.0-34-generic"
176     "python-requests/1.2.3 CPython/2.7.3 Linux/3.9.10-100.fc17.x86_64"
32      "python-requests/1.2.3 CPython/2.7.4 Darwin/12.4.0"
1223263 "python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-19-generic"
96      "python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-26-generic"
8464    "python-requests/1.2.3 CPython/2.7.5 Linux/3.10.10-200.fc19.x86_64"
2       "python-requests/1.2.3 CPython/2.7.5 Windows/7"
4       "synapseRClient/0.26"
494     "synapseRClient/0.27"
691     "synapseRClient/0.28"
126     "synapseRClient/0.29-1"
238765  "synapseRClient/0.30-1"
1458    "synapseRClient/0.31-1"
2       "synapseRClient/0.31-2"
1131    "synapseRClient/0.31-3"
Time taken: 80.385 seconds, Fetched: 43 row(s)
hive>

Cleanup

Do not forget to terminate the cluster you created when you are finished with it.

...

Version	Old Version 9	New Version 10
Changes made by	John Hill	John Hill
Saved on	Sept 18, 2013	Sept 18, 2013

Versions Compared

Key

Table of Contents