...

In this example, we created a table using prod-12 data as the source for the table (see line 24: LOCATION 's3://prod.access.record.sagebase.org/000000009'). Make sure to set the location to the stack data you want to analyze.

At this point our table 'access_record_s3' exists, but it does not contain any data. This is because we must manually add a partition for each date we want to load into the table; by default, a partitioned table contains no data. Let's add some partitions to this external table:

Code Block
ALTER TABLE access_record_s3 ADD PARTITION(datep='2013-09-15') location 's3://prod.access.record.sagebase.org/000000012/2013-09-15';
ALTER TABLE access_record_s3 ADD PARTITION(datep='2013-09-16') location 's3://prod.access.record.sagebase.org/000000012/2013-09-16';
ALTER TABLE access_record_s3 ADD PARTITION(datep='2013-09-17') location 's3://prod.access.record.sagebase.org/000000012/2013-09-17';

Note: Adding the current day's data as a partition will be problematic. The files for the current day are in flux, as the worker constantly merges the small files into larger files. This will cause any Hive queries against that partition to fail.
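
If the current day's partition was added by mistake, it can be removed with Hive's standard DROP PARTITION statement (the date below is illustrative):

Code Block
ALTER TABLE access_record_s3 DROP PARTITION(datep='2013-09-18');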

The ADD PARTITION statements above add all access record data from 2013-09-15 through 2013-09-17 to our external table 'access_record_s3'. We can now run a query to validate that the table contains data:

Code Block
select count(*) from access_record_s3;

The above query produced the following:

Code Block
MapReduce Total cumulative CPU time: 1 minutes 18 seconds 600 msec
Ended Job = job_201309172347_0003
Counters:
MapReduce Jobs Launched:
Job 0: Map: 72  Reduce: 1   Cumulative CPU: 78.6 sec   HDFS Read: 19440 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 18 seconds 600 msec
OK
1833436
Time taken: 175.089 seconds, Fetched: 1 row(s)
hive>
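
To confirm that each partition loaded, we can also count rows per date (a standard Hive GROUP BY; the per-day counts will vary with your stack's traffic):

Code Block
select datep, count(*) from access_record_s3 group by datep;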

We now have an external table configured with three days' worth of access data. However, any query we run against this external table will be slow, as Hive must download all of the data from S3 each time a query runs. To speed things up, we can copy all of the data from the external table into a table stored directly on the EMR Hive cluster. First, create the local table:

Code Block
CREATE TABLE access_record_local (
  returnObjectId string,
  elapseMS int,
  timestamp int,
  via string,
  host string,
  threadId int,
  userAgent string,
  queryString string,
  sessionId string,
  xForwardedFor string,
  requestURL string,
  userId int,
  origin string,
  date string,
  method string,
  vmId string,
  instance string,
  stack string,
  success string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
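
To sanity-check the column definitions before loading any data, Hive's DESCRIBE statement prints the new table's schema:

Code Block
DESCRIBE access_record_local;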

Now copy all data from the external table into the local table:

Code Block
insert overwrite table access_record_local select 
returnObjectId,
elapseMS,
timestamp,
via,
host,
threadId,
userAgent,
queryString,
sessionId,
xForwardedFor,
requestURL,
userId,
origin,
date,
method,
vmId,
instance,
stack,
success from access_record_s3 where datep > '2013-09-14';
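
Once the copy completes, the same count query run against the local table should return the total we saw earlier (1833436) and finish much faster, since the data is now read from the cluster's HDFS storage rather than from S3:

Code Block
select count(*) from access_record_local;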