
Repository Backup & Restore V2

Problems with the current Backup/Restore process

To better understand the current problems, here is a review of how the current backup and restore process works (also see: http://sagebionetworks.jira.com/wiki/display/PLFM/Repository+Administration):

  1. Backup - The following is the current backup process:
    1. A daemon thread is started on the 'source' repository tomcat server.
    2. The backup daemon will stream all entity, ACL, user, and group data to a single temporary zip file. This zip file is composed of a directory tree that matches the tree structure of the entities, with the root project at the root. There is one xml file for each entity containing the ACL and basic non-versionable entity data, and one xml file for each revision of an entity containing all versionable data, such as annotations and references (see the illustration after this list).
    3. Once the backup file is complete it is pushed to the stack's S3 bucket.
  2. Restore - The following is the current restore process:
    1. A daemon thread is started on the 'destination' repository tomcat server.
    2. The restore daemon starts by downloading a backup file from its stack's S3 bucket to a temporary file.
    3. The restore daemon will then stream from the temporary file starting at the root of the directory tree. It will then write one entity at a time (each entity is committed in its own DB transaction) to the database, doing any data migration required as it goes.
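
As an illustration, the backup zip for a small project might be laid out as follows (the entity ids and file names here are hypothetical):

project-4492/
    project-4492.xml        <- ACL and basic non-versionable entity data
    revision-1.xml          <- versionable data: annotations and references
    dataset-4493/
        dataset-4493.xml
        revision-1.xml
        revision-2.xml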

When we set up this process we knew that starting daemon threads on a tomcat server would only work if the daemon was fairly short lived (say, less than a few minutes). With a short process it is simple to recover from failures such as elastic instance restarts. The problem is that we had a data explosion in a very short period of time: we went from about 2K entities to over 52K. At that scale it takes about 15 minutes to create a backup file, and an estimated 5 hours to complete the restore process. Since the current process is all-or-none, failing 2 hours into the restore job means restarting from the beginning. More importantly, the current process requires that the repository services be put in a read-only state during the entire migration process. Bringing our servers down for 5 hours of migration is less than ideal.

Goals for the Backup & Restore V2

  1. Migrate data with a minimum amount of down time. Ideally the servers would be in 'read-only' mode for only a few minutes or less.
  2. The migration process needs to be more robust. Failure should be recoverable and not require a restart of the process.

Plan for Backup & Restore V2

Client Daemon

Rather than using long-running daemon threads on the repository tomcat servers, we are proposing a stateless client daemon to drive both backup and restore. At a high level, the client daemon will communicate with both a 'source' repository and a 'destination' repository, incrementally triggering data to migrate from the 'source' to the 'destination', a small batch of entities at a time. We will need to determine the optimum number of entities to migrate per batch.

Note: The source repository can remain in a read/write mode while the client daemon is running.

The client daemon will run a very simple loop. Here is a description of the loop (a code sketch follows the list):

  1. Find a single entity in the source that does not exist in the destination and call this entity 'n'.
    1. If all entities from the source already exist in the destination, find a single entity from the source that has a mismatched eTag in the destination and call this entity 'n'.
      1. If all entities exist and there are no mismatched eTags, find any entity in the destination that does not exist in the source and delete it.
  2. If n's parent entity does not exist in the destination, add the parent to the batch (this is recursive up to the root).
  3. Add n to the batch of entities to migrate.
  4. If the batch is at the optimum size, trigger the migration of the batch.
  5. Sleep.
  6. Continue from the beginning of the loop.
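
For illustration, here is a minimal sketch of that loop in Java. The AdminClient interface and all of its method names are hypothetical stand-ins for whatever administrative web services the repositories will expose, and the batch size and sleep interval are placeholders to be tuned:

import java.util.ArrayList;
import java.util.List;

public class ClientDaemon {

    /** Hypothetical wrapper around one repository's administrative services. */
    public interface AdminClient {
        /** Returns the id of one entity missing from 'other', or null if none. */
        String findEntityMissingFrom(AdminClient other);
        /** Returns the id of one entity whose eTag differs in 'other', or null. */
        String findEntityWithMismatchedEtag(AdminClient other);
        /** Returns the given entity's ancestors missing from 'other', root first. */
        List<String> findMissingAncestors(String entityId, AdminClient other);
        void deleteEntity(String entityId);
    }

    private static final int OPTIMUM_BATCH_SIZE = 50; // to be determined
    private static final long SLEEP_MS = 1000;

    private final AdminClient source;
    private final AdminClient destination;
    private final List<String> batch = new ArrayList<String>();

    public ClientDaemon(AdminClient source, AdminClient destination) {
        this.source = source;
        this.destination = destination;
    }

    public void run() throws InterruptedException {
        while (true) {
            // 1. Prefer an entity that is missing from the destination.
            String n = source.findEntityMissingFrom(destination);
            if (n == null) {
                // 1a. Otherwise look for a mismatched eTag.
                n = source.findEntityWithMismatchedEtag(destination);
            }
            if (n == null) {
                // 1a-i. Nothing to copy: delete any destination-only entity.
                String orphan = destination.findEntityMissingFrom(source);
                if (orphan != null) {
                    destination.deleteEntity(orphan);
                }
            } else {
                // 2. Parents must migrate before children (recursive to root).
                batch.addAll(source.findMissingAncestors(n, destination));
                // 3. Add n itself.
                batch.add(n);
            }
            // 4. Trigger migration once the batch reaches the optimum size.
            if (batch.size() >= OPTIMUM_BATCH_SIZE) {
                migrateBatch(batch);
                batch.clear();
            }
            // 5. Sleep, then 6. continue the loop.
            Thread.sleep(SLEEP_MS);
        }
    }

    private void migrateBatch(List<String> entityIds) {
        // Backup on the source, restore on the destination; see
        // "Migration of a Batch of Entities" below.
    }
}

Note that the loop itself needs no special failure handling: because each pass recomputes the difference between the source and the destination, the daemon can simply be restarted after any crash.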

The client daemon should print its status to a log. By monitoring this status we should be able to detect when the source and destination repositories are close to being in sync. Once this occurs, the source repository can be put into read-only mode. Once the daemon reports that the source and destination are in sync while the source is in read-only mode, we are ready to shut down the source repository and swap the CNAME over to the destination repository. If the swap occurs at a low-traffic time, the downtime should be minimal.

Migration of a Batch of Entities

We are proposing that this work exactly like a full repository backup, but on a smaller scale. Rather than creating a zip file that contains all entities from a repository, we create a zip file containing all entities in a given batch. Here is the order of events to migrate a batch of entities (a sketch of the sequence follows the list):

  1. Start a backup daemon on the source repository, passing the list of entity ids to be included in the backup file.
  2. Monitor the progress of the backup daemon.
    1. If the backup fails or stops making progress, go back to step 1.
  3. Once the backup daemon finishes, start a restore daemon on the destination repository, passing the name of the backup file reported in step 2.
  4. Wait for the restore daemon to finish.
    1. If the restore fails or stops making progress, go back to step 3.
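
Here is a sketch of that sequence, again in Java with hypothetical wrappers. The startBackupDaemon, startRestoreDaemon, and getDaemonStatus method names are assumptions; the 'id', 'status', and 'backupUrl' fields mirror the daemon status JSON shown below; detection of a stalled daemon is omitted for brevity:

import java.util.List;

public class BatchMigrator {

    /** Minimal view of the daemon status documents shown below. */
    public static class DaemonStatus {
        public String id;
        public String status;    // e.g. "STARTED", "COMPLETED", or a failure
        public String backupUrl; // set once a BACKUP completes
    }

    /** Hypothetical wrapper around the backup/restore daemon services. */
    public interface AdminClient {
        DaemonStatus startBackupDaemon(List<String> entityIds);
        DaemonStatus startRestoreDaemon(String backupFileName);
        DaemonStatus getDaemonStatus(String daemonId);
    }

    private static final long POLL_MS = 2000;

    public static void migrateBatch(AdminClient source, AdminClient destination,
            List<String> entityIds) throws InterruptedException {
        // Steps 1-2: run the backup on the source; retry from step 1 on failure.
        DaemonStatus backup;
        do {
            backup = waitForDaemon(source, source.startBackupDaemon(entityIds));
        } while (!"COMPLETED".equals(backup.status));

        // Steps 3-4: run the restore on the destination, passing the backup
        // file name; retry from step 3 on failure.
        String fileName = fileNameFromUrl(backup.backupUrl);
        DaemonStatus restore;
        do {
            restore = waitForDaemon(destination,
                    destination.startRestoreDaemon(fileName));
        } while (!"COMPLETED".equals(restore.status));
    }

    private static DaemonStatus waitForDaemon(AdminClient client,
            DaemonStatus started) throws InterruptedException {
        DaemonStatus status = started;
        while ("STARTED".equals(status.status)) {
            Thread.sleep(POLL_MS);
            status = client.getDaemonStatus(status.id);
        }
        return status;
    }

    private static String fileNameFromUrl(String backupUrl) {
        // The restore daemon takes only the file name portion of the S3 URL.
        return backupUrl.substring(backupUrl.lastIndexOf('/') + 1);
    }
}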

Create batch backup

Use the administrator's token to start the backup of a batch of entities. You must provide the list of entity ids that should be included in the batch:
Request

curl -i -k -H sessionToken:YourSessionToken -H Accept:application/json -H Content-Type:application/json -d '{
  "entityIds": ["123","456","789"]
}' http://localhost:8080/services-repository-0.6-SNAPSHOT/repo/v1/entity/backup/daemon

Response

HTTP/1.1 201 Created
Server: Apache-Coyote/1.1
Content-Type: application/json
Transfer-Encoding: chunked
Date: Thu, 18 Aug 2011 23:31:28 GMT

{
	"id":"6695",
	"type":"BACKUP",
	"status":"STARTED",
	"errorMessage":null,
	"progresssMessage":"Starting...",
	"progresssCurrent":0,
	"progresssTotal":0,
	"errorDetails":null,
	"backupUrl":null,
	"totalTimeMS":0,
	"startedBy":"platform@sagebase.org",
	"startedOn":1313707136615
}

Once the daemon is started, its progress can be monitored using the 'id' returned from the call:
Request

curl -i -k -H sessionToken:<your admin token> -H Accept:application/json -H Content-Type:application/json  https://staging-reposervice.elasticbeanstalk.com/repo/v1/entity/backup/daemon/6695

Response

HTTP/1.1 200 OK
Content-Type: application/json
Date: Thu, 18 Aug 2011 22:46:06 GMT
Server: Apache-Coyote/1.1
Content-Length: 1114
Connection: keep-alive

{
	"id":"6696",
	"type":"BACKUP",
	"status":"COMPLETED",
	"errorMessage":null,
	"progresssMessage":"Finished: BACKUP",
	"progresssCurrent":863,
	"progresssTotal":863,
	"errorDetails":null,
	"backupUrl":"https://s3.amazonaws.com/stagingdata.sagebase.org/BackupDaemonJob6696-911306061719227050.zip",
	"totalTimeMS":24880,
	"startedBy":"platform@sagebase.org",
	"startedOn":1313708374613
}

We can see that the backup 'status' is 'COMPLETED', that the 'backupUrl' is no longer null, and that the entire backup took ~25 seconds. We can now use the file found at the 'backupUrl' to push the batch to our destination repository.

Pushing a batch backup to the destination repository

Restoring a Repository Service from a backup is just the reverse of a backup: a restore daemon is started that downloads the backup file from the service's S3 bucket and then streams the data into the repository.
You must provide the daemon with the name of the backup file found on S3:
Request

curl -i -k -H sessionToken:YourSessionToken -H Accept:application/json -H Content-Type:application/json -d '{
  "url": "BackupDaemonJob6696-911306061719227050.zip"
}' http://localhost:8080/services-repository-0.6-SNAPSHOT/repo/v1/entity/restore/daemon

Response

HTTP/1.1 201 Created
Server: Apache-Coyote/1.1
Content-Type: application/json
Transfer-Encoding: chunked
Date: Thu, 18 Aug 2011 23:31:28 GMT

{
	"id":"4",
	"type":"RESTORE",
	"status":"STARTED",
	"progresssMessage":"Starting...",
	"progresssCurrent":0,
	"progresssTotal":0,
	"errorMessage":null,
	"errorDetails":null,
	"backupUrl":null,
	"totalTimeMS":0,
	"startedBy":"platform@sagebase.org",
	"startedOn":1313710288153
}

Once the daemon is started, its progress can be monitored in the same way as the backup daemon, using the 'id' provided by entity/restore/daemon:
Request

curl -i -k -H sessionToken:YourSessionToken -H Accept:application/json -H Content-Type:application/json  http://localhost:8080/services-repository-0.6-SNAPSHOT/repo/v1/entity/restore/daemon/4

Response

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json
Transfer-Encoding: chunked
Date: Thu, 18 Aug 2011 23:59:13 GMT

{
	"id":"5",
	"type":"RESTORE",
	"status":"COMPLETED",
	"progresssMessage":"Finished: RESTORE",
	"progresssCurrent":1164611,
	"progresssTotal":1164611,
	"errorMessage":null,
	"errorDetails":null,
	"backupUrl":"https://s3.amazonaws.com/devdata.sagebase.org/BackupDaemonJob6696-5911306061719227050.zip",
	"totalTimeMS":46800,
	"startedBy":"platform@sagebase.org",
	"startedOn":1313711855784
}

We can see that the restore 'status' is 'COMPLETED' and that the entire restore took ~47 seconds. We have successfully migrated all entities in the batch from the source repository to the destination repository.