Auditing data for the Synapse REST API is captured by a Spring interceptor, AccessInterceptor, which works much like a web filter. The interceptor is configured to listen to all web service calls made to the repository services. For each call, the AccessInterceptor gathers data to fill out an AccessRecord model object. The AccessRecord data is then written as gzipped CSV files directly to the S3 bucket. These CSV files are initially too small to process efficiently, so a worker process merges them by hour.
All AccessRecord CSV data for a single hour, from all EC2 instances of a stack, is merged into a single file. The following is an example of the resulting path:
```
https://s3.amazonaws.com/prod.access.record.sagebase.org/000000013/2013-09-17/14-39-03-055-75ed6416-438d-4688-8789-3df56f4e4670.csv.gz
```
The above path is composed of the following parts:
```
https://s3.amazonaws.com/prod.access.record.sagebase.org/<instance_number>/<year_month_day>/<hour_minutes_seconds_milliseconds>-<UUID>.csv.gz
```
Here is some sample data from one of the access record files:
| returnObjectId | elapseMS | timestamp | via | host | threadId | userAgent | queryString | sessionId | xForwardedFor | requestURL | userId | origin | date | method | vmId | instance | stack | success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 43 | 1379430033942 | | repo-prod-13.prod.sagebase.org | 659 | Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0 | | b4331f55-6c65-4f6b-a2c5-ee6830cf7641 | | /repo/v1/entity/header | 273978 | | 2013-09-17 | POST | eca4eb39c13ac98c:7461944:1412973c3ee:-7ffd | 13 | prod | true |
| | 30 | 1379430033943 | | repo-prod-13.prod.sagebase.org | 656 | Jakarta Commons-HttpClient/3.1 | query=select+id,name,nodeType+from+entity+where+parentId+==+%22syn2228808%22+limit+500+offset+1 | b1b9c385-4dba-4e3a-b49c-0f40f7c99ac5 | | /repo/v1/query | 273978 | | 2013-09-17 | GET | eca4eb39c13ac98c:7461944:1412973c3ee:-7ffd | 13 | prod | true |
| | 14 | 1379430034027 | | repo-prod-13.prod.sagebase.org | 1177 | Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0 | mask=64 | 597767ef-8ff2-40d0-a65d-b519f5b2f937 | | /repo/v1/entity/syn2228808/bundle | 273978 | | 2013-09-17 | GET | 9b5a47b65e8703f0:229cd7a3:1412973c18a:-7ffd | 13 | prod | true |
| syn2228807 | 35 | 1379430034057 | | repo-prod-13.prod.sagebase.org | 159 | Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0 | | e9b15054-dbc6-454b-a1bf-8bef3d5f0fbc | | /repo/v1/entity/syn2228808/benefactor | 273978 | | 2013-09-17 | GET | 9b5a47b65e8703f0:229cd7a3:1412973c18a:-7ffd | 13 | prod | true |
| syn2228807 | 19 | 1379430034107 | | repo-prod-13.prod.sagebase.org | 153 | Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0 | | 23216ee3-dade-43ac-8efe-fa1e6dc9877d | | /repo/v1/entity/syn2228807/acl | 273978 | | 2013-09-17 | GET | 9b5a47b65e8703f0:229cd7a3:1412973c18a:-7ffd | 13 | prod | true |
| 59638 | 39 | 1379430034123 | | repo-prod-13.prod.sagebase.org | 656 | Synpase-Java-Client/2013-09-13-e70558e-662 Synapse-Web-Client/13.0 | | d7ed19dd-2ed9-47d1-b345-be1aaca0d688 | | /repo/v1/entity/syn2228808/wiki | 273978 | | 2013-09-17 | GET | eca4eb39c13ac98c:7461944:1412973c3ee:-7ffd | 13 | prod | true |
Analysis of the audit data can be done using EMR Hive. The AWS extensions to Hive include support for reading gzipped CSV data directly from S3. This means we can launch an Elastic MapReduce (EMR) Hive cluster and copy all access record data from S3 to the Hive cluster. Once the data is loaded on the cluster, ad hoc and canned queries can be executed to generate reports or discover new trends. The following outlines how to get started with the analysis of the Synapse audit data.
Launch a new EMR Hive cluster (see Launch a Hive Cluster for more information).
Log into the Production AWS account, navigate to the Elastic MapReduce service page, and select "Create New Job Flow":
Select a Hive Program:
Choose "Start an Interactive Hive Session"
Choose the number and types of EC2 instances for the cluster. There must be at least one "Master Instance Group" instance and two "Core Instance Group" instances. Request spot instances.
Set the Amazon EC2 Key Pair to "prod-key-pair". You will later need the prod-key-pair to SSH into the cluster.
No bootstrap is needed:
Do a final check of the cluster configuration, then select "Create Job Flow" to launch the cluster:
Once the cluster is launched and active you can find the "Master Public DNS Name" needed to SSH into the cluster:
Once the cluster is active, SSH into the master node using the prod-key-pair (see AccessingInstancesLinux for more information). Note: the user name must be "hadoop":
```
ssh -i prod-key-pair.pem hadoop@ec2-54-242-184-13.compute-1.amazonaws.com
```
Once connected to the Hive master node, start a new interactive Hive session by typing "hive":
```
hadoop@ip-10-28-72-37:~$ hive
```
Once you have an interactive Hive session with the master node of the cluster, you are ready to set up the tables that will be used for analysis. First we must create the external table for the access record CSV data in S3:
```sql
CREATE EXTERNAL TABLE access_record_s3 (
  returnObjectId string,
  elapseMS int,
  timestamp int,
  via string,
  host string,
  threadId int,
  userAgent string,
  queryString string,
  sessionId string,
  xForwardedFor string,
  requestURL string,
  userId int,
  origin string,
  date string,
  method string,
  vmId string,
  instance string,
  stack string,
  success string
)
PARTITIONED BY ( datep string )
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
LOCATION 's3://prod.access.record.sagebase.org/000000012';
```
In this example, we created a table using prod-12 data as the source (see the LOCATION clause: 's3://prod.access.record.sagebase.org/000000012'). Make sure to set the location to the data of the stack you want to analyze.
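To double-check the schema and S3 location that Hive recorded for the table, a plain DESCRIBE works. This is just an optional sanity check, not part of the original walkthrough:

```sql
-- Optional check: show the columns, the datep partition column,
-- and the S3 location backing the external table.
DESCRIBE FORMATTED access_record_s3;
```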
At this point our table 'access_record_s3' exists, but it does not contain any data. This is because we must manually add a partition for each date we want to load into the table; by default, a partitioned table has no data. Let's add some partitions to this external table:
```sql
ALTER TABLE access_record_s3 ADD PARTITION(datep='2013-09-15') LOCATION 's3://prod.access.record.sagebase.org/000000012/2013-09-15';
ALTER TABLE access_record_s3 ADD PARTITION(datep='2013-09-16') LOCATION 's3://prod.access.record.sagebase.org/000000012/2013-09-16';
ALTER TABLE access_record_s3 ADD PARTITION(datep='2013-09-17') LOCATION 's3://prod.access.record.sagebase.org/000000012/2013-09-17';
```
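Before querying, it can be worth confirming which partitions were actually registered. This is an optional check using standard HiveQL against the table defined above:

```sql
-- Expect one row per partition added above, e.g. datep=2013-09-15.
SHOW PARTITIONS access_record_s3;
```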
Note: Adding the current day's data as a partition is problematic. The files for the current day are in flux because the worker constantly merges the small files into larger ones, and this will cause Hive queries to fail.
The ALTER TABLE statements above add all access record data from 2013-09-15 through 2013-09-17 to our external table 'access_record_s3'. We can now run a query to validate that there is data in the external table:
```sql
select count(*) from access_record_s3;
```
The above query produced the following:
```
MapReduce Total cumulative CPU time: 1 minutes 18 seconds 600 msec
Ended Job = job_201309172347_0003
Counters:
MapReduce Jobs Launched:
Job 0: Map: 72  Reduce: 1   Cumulative CPU: 78.6 sec   HDFS Read: 19440 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 18 seconds 600 msec
OK
1833436
Time taken: 175.089 seconds, Fetched: 1 row(s)
hive>
```
We now have an external table configured with three days' worth of access data. However, any query we run against this external table will be slow, because Hive must download all of the data from S3 each time a query runs. To speed things up we can copy all of the data from our external table into a table stored directly on the EMR Hive cluster. First create the local table:
```sql
CREATE TABLE access_record_local (
  returnObjectId string,
  elapseMS int,
  timestamp int,
  via string,
  host string,
  threadId int,
  userAgent string,
  queryString string,
  sessionId string,
  xForwardedFor string,
  requestURL string,
  userId int,
  origin string,
  date string,
  method string,
  vmId string,
  instance string,
  stack string,
  success string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';
```
Now copy all data from the external table into the local table:
```sql
insert overwrite table access_record_local
select returnObjectId, elapseMS, timestamp, via, host, threadId, userAgent,
       queryString, sessionId, xForwardedFor, requestURL, userId, origin,
       date, method, vmId, instance, stack, success
from access_record_s3
where datep > '2013-09-14';
```
The above statement produced the following:
```
MapReduce Total cumulative CPU time: 59 seconds 600 msec
Ended Job = job_201309172347_0005
Counters:
Loading data to table default.access_record_local
Deleted hdfs://10.28.72.37:9000/mnt/hive_0110/warehouse/access_record_local
Table default.access_record_local stats: [num_partitions: 0, num_files: 3, num_rows: 0, total_size: 521583756, raw_data_size: 0]
1833436 Rows loaded to access_record_local
MapReduce Jobs Launched:
Job 0: Map: 72   Cumulative CPU: 219.31 sec   HDFS Read: 19440 HDFS Write: 521583756 SUCCESS
Job 1: Map: 3   Cumulative CPU: 59.6 sec   HDFS Read: 521594463 HDFS Write: 521583756 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 38 seconds 910 msec
OK
Time taken: 329.334 seconds
hive>
```
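Before moving on, a quick spot check of a few rows is a cheap way to confirm the copy mapped the columns as expected. This is an optional step, not part of the original walkthrough:

```sql
-- Pull a handful of rows to eyeball the column values.
select * from access_record_local limit 5;
```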
Running a similar count query against the local table:
```sql
select count(*) from access_record_local;
```
Produced:
```
MapReduce Total cumulative CPU time: 24 seconds 180 msec
Ended Job = job_201309172347_0006
Counters:
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 24.18 sec   HDFS Read: 521592826 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 180 msec
OK
1833436
Time taken: 65.007 seconds, Fetched: 1 row(s)
hive>
```
For this example, the count query against the local table ran roughly 2.7 times faster than the same query against the external table (65 seconds vs. 175 seconds).
Once the tables are set up and populated with data, the real analysis can start. Here is an example query used to find the distinct "userAgent" strings used to make calls and the count for each:
```sql
select count(userAgent), userAgent from access_record_local group by userAgent;
```
Here are the results from the above query:
```
MapReduce Total cumulative CPU time: 31 seconds 120 msec
Ended Job = job_201309172347_0007
Counters:
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 31.12 sec   HDFS Read: 521592826 HDFS Write: 2563 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 120 msec
OK
47319
8527    "Jakarta Commons-HttpClient/3.1"
285949  "Synpase-Java-Client/12.0-2-ge17e722 Synapse-Web-Client/12.0-4-gafe76ad "
6615    "Synpase-Java-Client/12.0-2-ge17e722"
6767    "Synpase-Java-Client/13.0"
508     "Synpase-Java-Client/develop-SNAPSHOT JayClearingPreviews"
1498    "Synpase-Java-Client/develop-SNAPSHOT Synapse-Web-Client/develop-SNAPSHOT"
145     "Synpase-Java-Client/develop-SNAPSHOT"
1       "python-requests/1.2.0 CPython/2.7.2 Darwin/11.4.2"
1190    "python-requests/1.2.0 CPython/2.7.4 Darwin/11.4.2"
76      "python-requests/1.2.3 CPython/2.7.2 Darwin/12.4.0"
9       "python-requests/1.2.3 CPython/2.7.3 Linux/3.2.0-36-virtual"
16      "python-requests/1.2.3 CPython/2.7.3 Linux/3.5.0-34-generic"
176     "python-requests/1.2.3 CPython/2.7.3 Linux/3.9.10-100.fc17.x86_64"
32      "python-requests/1.2.3 CPython/2.7.4 Darwin/12.4.0"
1223263 "python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-19-generic"
96      "python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-26-generic"
8464    "python-requests/1.2.3 CPython/2.7.5 Linux/3.10.10-200.fc19.x86_64"
2       "python-requests/1.2.3 CPython/2.7.5 Windows/7"
4       "synapseRClient/0.26"
494     "synapseRClient/0.27"
691     "synapseRClient/0.28"
126     "synapseRClient/0.29-1"
238765  "synapseRClient/0.30-1"
1458    "synapseRClient/0.31-1"
2       "synapseRClient/0.31-2"
1131    "synapseRClient/0.31-3"
Time taken: 80.385 seconds, Fetched: 43 row(s)
hive>
```
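Other reports follow the same pattern. As one more illustration, here is a sketch of a query that breaks call volume down by day and HTTP method; it uses only columns defined on access_record_local above, but the exact report you need may differ:

```sql
-- Sketch: daily call volume per HTTP method.
select date, method, count(*) as calls
from access_record_local
group by date, method
order by date, method;
```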
Do not forget to terminate the cluster you created when you are finished with it.