Log File Sleuthing
Here are instructions for searching through the logs we have gathered on S3. Note: these instructions are only useful as long as there isn't a real system for collating/map-reducing the logs.
Important points:
- Our log files are stored in the S3 US Standard region, so to avoid data transfer costs you should create any EC2 instances in the N. Virginia region.
- We can't currently (as of 9/20/2012) tell the difference between logs generated by a staging stack and logs generated by a production stack.
Basic outline:
- Create an EC2 instance
- Download the log files of interest to the instance
- Run analysis/searches over the files
- Clean up!
Set up the EC2 instance through the Amazon console, then SSH into it using your tool of choice (ssh for *nix-like systems, PuTTY for Windows).
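For example, from a *nix-like machine the SSH step might look something like this (the key file and host name are placeholders, and the login user depends on your AMI, e.g. ec2-user for Amazon Linux):
ssh -i ~/.ssh/my-ec2-key.pem ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com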
The tool I find easiest for getting files from S3 is s3cmd. It's a simple command-line utility that lets you configure a credential set and then behaves very much like the standard Unix CLI utilities. It's also easy to get on EC2, since it's available through a variety of Linux package managers. Look to the s3cmd website for installation instructions for the type of AMI you set up.
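For example, on a Debian/Ubuntu-style AMI the install is typically just the line below (on Red Hat-style AMIs you'd use yum instead, possibly after enabling an extra repository such as EPEL):
sudo apt-get install s3cmd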
Once you have s3cmd, you need to configure it by running:
s3cmd --configure
This will ask you a whole bunch of questions; you can accept the defaults for all of them. The credentials you should use are in the file work/platform/PasswordsAndCredentials/IAMUsers/credentials.csv, on the devServiceIAMUser line.
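Once configured, a quick sanity check is to list the buckets the credentials can see; if the configuration worked, this should return without an error:
s3cmd ls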
Now that it's configured, you can use it to access our S3 files. The bucket containing the log files is "logs.sagebase.org". You can list its contents like this:
s3cmd ls s3://logs.sagebase.org
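You can list sub-paths the same way; for example, assuming the prod/ prefix used in the examples below:
s3cmd ls s3://logs.sagebase.org/prod/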
To download files (to the current directory "./"), use get:
s3cmd get s3://logs.sagebase.org/<path-to-file> ./
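As a concrete example (the file name here is made up, just to show the shape of the path):
s3cmd get s3://logs.sagebase.org/prod/repo-activity-2012-09-19.log.gz ./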
Now, to download an entire folder, just add the --recursive option.
s3cmd get --recursive s3://logs.sagebase.org/prod/ ./
You can selectively include or exclude files using shell globs with the --include and --exclude options. Note that the *'s are enclosed in single quotes to prevent the shell from expanding them.
s3cmd get --recursive --include '*repo-activity*' --exclude '*repo-trace-profile*' s3://logs.sagebase.org/prod/ ./
You can also use regular expressions instead of shell globs with --rinclude and --rexclude.
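For example, here is a sketch that excludes everything and then re-includes only names matching a regular expression (the date pattern is illustrative and assumes the file names contain a date):
s3cmd get --recursive --exclude '*' --rinclude 'repo-activity.*2012-09-1[0-9]' s3://logs.sagebase.org/prod/ ./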
It's also a good idea to try out your command with the --dry-run option first, to make sure it's selecting the files you want.
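For example, tacking --dry-run onto the earlier command prints what would be transferred without downloading anything:
s3cmd get --dry-run --recursive --include '*repo-activity*' --exclude '*repo-trace-profile*' s3://logs.sagebase.org/prod/ ./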
Now, since we gzip all the files, you need to unzip them. I did this like so:
find ./ -name '*.gz' | xargs gunzip
Depending on the number of files, this could take quite a while.
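If it's slow, one option (assuming GNU xargs) is to run several gunzip processes in parallel:
find ./ -name '*.gz' | xargs -P 4 -n 8 gunzip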
Okay, now that you have the files, you can slice, dice, chop, and search to your heart's content using whatever tools you prefer (a sample search is sketched below). Don't forget to clean up your EC2 instance when you're done. Also, make sure you're REALLY done before you do.
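For instance, a simple search across all the unzipped logs might look like this (the search term is purely illustrative):
grep -rn 'ERROR' ./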