As of 11/28/2012, what we have is a set of log files that get pushed to S3 at one-minute intervals. Unfortunately, this produces a large number of very small files, which is a relatively difficult format to work with.
As a TEMPORARY measure, I've set up an EBS volume where we can mirror the S3 logs for faster computational access. Because the files are so small, the biggest barrier to accessing them is the time it takes to download them from S3, due to how many connections need to be set up and torn down. The advantage of the EBS volume is that it can be attached to a small or micro instance for updating with the latest files from S3, and then attached to a large instance for short periods of time to run queries over the data or otherwise process it.
So, the way I've been doing this is with StarCluster. There is a file at s3://logs.sagebase.org/conf/starcluster-config that defines a cluster named "logs", which will set up a single-instance cluster with the correct parameters, attach the EBS volume, and let you ssh into it.
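For orientation, the relevant parts of that config look roughly like this. All values below are illustrative placeholders (the volume ID and mount path in particular are my assumptions); the actual file on S3 is authoritative:

[aws info]
# AWS credentials; the real values are in the config file on S3
AWS_ACCESS_KEY_ID = <access key>
AWS_SECRET_ACCESS_KEY = <secret key>

[key PlatformKeyPairEast]
KEY_LOCATION = ~/.ssh/PlatformKeyPairEast.pem

[cluster logs]
KEYNAME = PlatformKeyPairEast
CLUSTER_SIZE = 1
NODE_INSTANCE_TYPE = m1.small
VOLUMES = logs

[volume logs]
# the EBS volume that mirrors the S3 logs
VOLUME_ID = vol-XXXXXXXX
MOUNT_PATH = /home/logs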
Step by step (a consolidated command sketch follows this list):
- Install StarCluster
- Download the config file from s3://logs.sagebase.org/conf/starcluster-config and save it into your ~/.starcluster/ folder under the name "config"
- Get the PlatformKeyPairEast.pem file from somebody (might be on sodo?)
- Edit the config file, replacing the path ~/.ssh/PlatformKeyPairEast.pem with the path to that actual file.
- run 'starcluster start logs' and wait for it to finish
- run 'starcluster sshmaster logs'
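Putting that together, a setup session looks roughly like this. This is a sketch, not a verified procedure: it assumes pip and a locally configured s3cmd (StarCluster can also be installed with easy_install, and you can fetch the config file through the AWS console instead):

# install StarCluster (assumes pip is available)
pip install StarCluster

# fetch the shared config (assumes s3cmd is already configured on your workstation)
mkdir -p ~/.starcluster
s3cmd get s3://logs.sagebase.org/conf/starcluster-config ~/.starcluster/config

# edit ~/.starcluster/config so KEY_LOCATION points at your copy of PlatformKeyPairEast.pem

# bring up the single-node cluster and log in
starcluster start logs
starcluster sshmaster logs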
Optionally, if you want to run computations on the logs, you should probably change the instance size from m1.small to m1.large or bigger.
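Assuming the cluster template uses NODE_INSTANCE_TYPE as sketched above, that's a one-line edit in the config, made before running 'starcluster start logs':

[cluster logs]
NODE_INSTANCE_TYPE = m1.large    # was m1.small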
To update with the most recent logs once you've ssh'd to the node:
- run 'aptitude install s3cmd'
- run 's3cmd --configure'
- configure it with the access key and secret key from the StarCluster config file
- determine which stack has been producing logs (we currently have all the logs from a-f)
- This next command should probably be run from inside a screen session, so you can detach from it and reattach later (there's an example session at the end of this section)
- run:
s3cmd sync -r --skip-existing s3://logs.sagebase.org/prod/{stack}/ /home/logs/{stack}
You may need to create the destination folder if it doesn't exist already. Don't be alarmed if the command produces no output for quite a while: its first stage is building a list of all the files it will download. Once it starts downloading there will be plenty of output.
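A typical sync session might look like the sketch below. The stack name 'a' and the screen session name 'logsync' are illustrative only; substitute the stack you identified above:

# see which stacks exist under the prod/ prefix
s3cmd ls s3://logs.sagebase.org/prod/

# create the destination folder, then start a named screen session
mkdir -p /home/logs/a
screen -S logsync

# inside screen, kick off the sync
s3cmd sync -r --skip-existing s3://logs.sagebase.org/prod/a/ /home/logs/a

# detach with Ctrl-a d; reattach later with:
screen -r logsync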