...

Code Block
insert overwrite table access_record_local select 
returnObjectId,
elapseMS,
timestamp,
via,
host,
threadId,
userAgent,
queryString,
sessionId,
xForwardedFor,
requestURL,
userId,
origin,
date,
method,
vmId,
instance,
stack,
success from access_record_s3 where datep > '2013-09-14';

 

The above statement produced the following output:

Code Block
MapReduce Total cumulative CPU time: 59 seconds 600 msec
Ended Job = job_201309172347_0005
Counters:
Loading data to table default.access_record_local
Deleted hdfs://10.28.72.37:9000/mnt/hive_0110/warehouse/access_record_local
Table default.access_record_local stats: [num_partitions: 0, num_files: 3, num_rows: 0, total_size: 521583756, raw_data_size: 0]
1833436 Rows loaded to access_record_local
MapReduce Jobs Launched:
Job 0: Map: 72   Cumulative CPU: 219.31 sec   HDFS Read: 19440 HDFS Write: 521583756 SUCCESS
Job 1: Map: 3   Cumulative CPU: 59.6 sec   HDFS Read: 521594463 HDFS Write: 521583756 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 38 seconds 910 msec
OK
Time taken: 329.334 seconds
hive>

Running a similar count query against the local table:

Code Block
select count(*) from access_record_local;

Produced:

Code Block
MapReduce Total cumulative CPU time: 24 seconds 180 msec
Ended Job = job_201309172347_0006
Counters:
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 24.18 sec   HDFS Read: 521592826 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 180 msec
OK
1833436
Time taken: 65.007 seconds, Fetched: 1 row(s)
hive>

For this example, the count query against the local table ran roughly 2.7 times as fast as the same query against the external table (65 seconds vs. 175 seconds).

Running Analysis

Once the tables are set up and populated with data, the real analysis can start. Here is an example query that finds the distinct "userAgent" strings used to make calls, along with the number of calls for each:

Code Block
select count(userAgent), userAgent from access_record_local group by userAgent;

Here are the results from the above query:

Code Block
MapReduce Total cumulative CPU time: 31 seconds 120 msec
Ended Job = job_201309172347_0007
Counters:
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 31.12 sec   HDFS Read: 521592826 HDFS Write: 2563 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 120 msec
OK
47319
8527    "Jakarta Commons-HttpClient/3.1"
285949  "Synpase-Java-Client/12.0-2-ge17e722  Synapse-Web-Client/12.0-4-gafe76ad "
6615    "Synpase-Java-Client/12.0-2-ge17e722"
6767    "Synpase-Java-Client/13.0"
508     "Synpase-Java-Client/develop-SNAPSHOT  JayClearingPreviews"
1498    "Synpase-Java-Client/develop-SNAPSHOT  Synapse-Web-Client/develop-SNAPSHOT"
145     "Synpase-Java-Client/develop-SNAPSHOT"
1       "python-requests/1.2.0 CPython/2.7.2 Darwin/11.4.2"
1190    "python-requests/1.2.0 CPython/2.7.4 Darwin/11.4.2"
76      "python-requests/1.2.3 CPython/2.7.2 Darwin/12.4.0"
9       "python-requests/1.2.3 CPython/2.7.3 Linux/3.2.0-36-virtual"
16      "python-requests/1.2.3 CPython/2.7.3 Linux/3.5.0-34-generic"
176     "python-requests/1.2.3 CPython/2.7.3 Linux/3.9.10-100.fc17.x86_64"
32      "python-requests/1.2.3 CPython/2.7.4 Darwin/12.4.0"
1223263 "python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-19-generic"
96      "python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-26-generic"
8464    "python-requests/1.2.3 CPython/2.7.5 Linux/3.10.10-200.fc19.x86_64"
2       "python-requests/1.2.3 CPython/2.7.5 Windows/7"
4       "synapseRClient/0.26"
494     "synapseRClient/0.27"
691     "synapseRClient/0.28"
126     "synapseRClient/0.29-1"
238765  "synapseRClient/0.30-1"
1458    "synapseRClient/0.31-1"
2       "synapseRClient/0.31-2"
1131    "synapseRClient/0.31-3"
Time taken: 80.385 seconds, Fetched: 43 row(s)
hive>
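
The results above come back in userAgent order, so the heaviest clients are scattered through the list. As a minimal sketch, the same aggregation could be sorted by call count so the busiest clients appear first. The alias "calls" and the limit of 10 are illustrative choices, not part of the original query:

```sql
-- Same grouping as the query above, sorted so the most active clients come first.
-- "calls" is an illustrative alias; the limit of 10 is arbitrary.
select count(userAgent) as calls, userAgent
from access_record_local
group by userAgent
order by calls desc
limit 10;
```

Note that a global "order by" in Hive runs through a single reducer. That should be acceptable here because the grouped result is only a few dozen rows.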

Cleanup

Do not forget to terminate the cluster you created when you are finished with it.
