Instant Exporting
This is documentation for the Bridge Instant Exporting service.
Scenarios for Bridge Exporter
Migration strategy:
As a bootstrapping step, modify the original code so that the daily and hourly exporters update lastExportDateTime for the studies they export -- this way, the ddb exportTime table will contain the correct, most-up-to-date last export date time;
Also, add two new fields to the sqs request, "exportType" and "ignoreLastExportTime" (sketched after this list):
- exportType: specifies what kind of task the request should perform: DAILY, HOURLY, INSTANT or s3override (note: s3override leaves this field null);
- ignoreLastExportTime: in v1 Instant Exporting, it is used to determine whether the exporter needs to modify the exportTime table -- if it is set to true (the re-export and table re-drive cases), the exporter will not modify that table at all;
- Note: for v1, we need startDateTime and endDateTime for both DAILY and HOURLY, but we will not remove the 'date' field in daily export, to mitigate side effects;
then we proceed with the migration --- the next time it exports under v2 instant exporting, the exporter will be able to use the correct time range as expected;
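For reference, a minimal sketch of the two new request fields on the Exporter side; only the field names "exportType" and "ignoreLastExportTime" and the three enum values come from this design, everything else (class name, getters) is an illustrative assumption, not the real request class:
// hypothetical sketch -- only the two field names and the enum values come from this design
public class ExportRequestSketch {
    public enum ExportType { DAILY, HOURLY, INSTANT }   // s3override requests leave exportType null

    private ExportType exportType;          // null for s3override
    private boolean ignoreLastExportTime;   // true for re-export and table re-drive

    public ExportType getExportType() { return exportType; }
    public void setExportType(ExportType exportType) { this.exportType = exportType; }
    public boolean getIgnoreLastExportTime() { return ignoreLastExportTime; }
    public void setIgnoreLastExportTime(boolean ignoreLastExportTime) { this.ignoreLastExportTime = ignoreLastExportTime; }
}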
normal daily exporting:
get the study id list by scanning the ddbStudyTable (extract the study id from each study) -- this way we can be sure the list contains the newest studies;
look up the last export date time for each study id in the export time table and put the study id into a new map, with study id as key and last export date time as value;
if there is no such study in the export time table, determine the startDateTime by looking at the 'exportType' field in the request:
if it is 'DAILY' -- set startDateTime to 24 hours before the given endDateTime (a start-time sketch follows the sqs msg example below);
then range-key query the studyUploadedOnIndex and return the records to export;
finally, update the export time table with the new export date time for each study;
Note: if there is a one-time export in the middle of the day, since each export task modifies the lastExportDateTime value in the exportTime table, the daily export will only query records from lastExportDateTime to the given endDateTime. And since it updates the exportTime table as well, every export task thereafter only needs to query from the previous endDateTime (stored as lastExportDateTime in the exportTime table) to its given endDateTime;
Daily exporting sqs msg example:
{
"endDateTime":"2016-10-04T23:59:59Z",
"exportType":"DAILY",
"tag":"test exporter"
}
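A minimal sketch of the daily start-time resolution described above, assuming the AWS document API and lastExportDateTime stored as epoch milliseconds (the attribute names follow this design; everything else is illustrative):
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import org.joda.time.DateTime;

public class DailyStartTimeSketch {
    // use lastExportDateTime when the study already has a row in the export time table,
    // otherwise default to 24 hours before the given endDateTime (the DAILY default)
    static DateTime getDailyStartDateTime(Table exportTimeTable, String studyId, DateTime endDateTime) {
        Item item = exportTimeTable.getItem("studyId", studyId);
        if (item != null) {
            return new DateTime(item.getLong("lastExportDateTime"));
        }
        return endDateTime.minusHours(24);
    }
}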
normal hourly exporting: -- will have a study whitelist
look up the last export date time for each study id in the export time table and put the study id into a new map, with study id as key and last export date time as value;
if there is no such study in the export time table, look at the 'exportType' field in the sqs request -- if it is 'HOURLY', set startDateTime to 1 hour before the endDateTime;
then range-key query the studyUploadedOnIndex and return the records to export;
finally, update the export time table with the new export date time for each study;
Note: if scheduled one-hour exports run out of order, e.g. the export for 3-4pm runs before the export for 2-3pm: since each export covers all records from the last export date time, the 3-4pm export will export all records from 2-4pm, and the out-of-order 2-3pm export will export nothing when it does run (it does not throw an exception; see the guard sketch after the sqs msg example below) -- a similar case applies to daily export;
Note: similarly, if a one-time export happens between two one-hour exports, e.g. a one-hour export for 1-2pm, then a one-time export at 2:40pm, and then a one-hour export for 2-3pm, the second one-hour export will export all records from 2:40pm to 3pm but no records before 2:40pm;
Hourly exporting sqs request example:
{
"endDateTime":"2016-10-04T23:59:59Z",
"exportType":"HOURLY",
"studyWhitelist":["api"],
"tag":"test exporter"
}
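To illustrate the out-of-order note above, a hedged sketch of the guard -- when the request's endDateTime is not after the stored lastExportDateTime, the study simply exports nothing:
import org.joda.time.DateTime;

public class OutOfOrderGuardSketch {
    // e.g. the 3-4pm export already ran and moved lastExportDateTime to 4pm;
    // a late-arriving 2-3pm request then has endDateTime (3pm) before
    // lastExportDateTime (4pm) and should export zero records, not throw
    static boolean shouldSkipStudy(DateTime lastExportDateTime, DateTime endDateTime) {
        return lastExportDateTime != null && !endDateTime.isAfter(lastExportDateTime);
    }
}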
one-time exporting: -- only has one study id in the study whitelist
Data workflow for instant-exporting
BSM: instant exporting button in BSM: when the button is clicked, only the current study’s data is exported to Synapse;
BridgePF:
API: POST /v3/instantExport
both RESEARCHER and DEVELOPER can call this API
Controller: InstantExportController extends BaseController
method: setInstantExportService(InstantExportService instantExportService);
method: requestInstantExport(String endDateTimeStr);
return Result object;
Service: interface InstantExportService
Service: InstantExportViaSqsService implements InstantExportService:
method: void export(@Nonnull StudyIdentifier studyIdentifier, @Nonnull DateTime endDateTime);
logic:
wrap the current studyId into a JSON array node and send it to the given sqs url -- see the example message and the service sketch below:
One-time exporting sqs msg example:
{
"studyWhitelist":["api"],
"exportType":"INSTANT",
"tag":"test exporter"
}
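A minimal sketch of what InstantExportViaSqsService.export() might look like, assuming a plain AWS SDK SQS client and Jackson; the config wiring, the tag text, and the simplified String studyId parameter are assumptions (the real signature takes a StudyIdentifier and DateTime):
import com.amazonaws.services.sqs.AmazonSQS;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class InstantExportViaSqsServiceSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    private AmazonSQS sqsClient;   // injected
    private String sqsQueueUrl;    // from config (assumed)

    // wrap the current study id in a one-element studyWhitelist, mark the request
    // as an INSTANT export, and post it to the Exporter's SQS queue
    public void export(String studyId) throws JsonProcessingException {
        ObjectNode requestNode = MAPPER.createObjectNode();
        requestNode.putArray("studyWhitelist").add(studyId);
        requestNode.put("exportType", "INSTANT");
        requestNode.put("tag", "instant export for study " + studyId);
        sqsClient.sendMessage(sqsQueueUrl, MAPPER.writeValueAsString(requestNode));
    }
}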
Config:
add a bean ddbExportTimeTable
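A hedged sketch of the new bean, assuming the AWS document API; the table name shown is illustrative, not the real environment-specific name:
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ExportTimeTableConfigSketch {
    // expose the ExportTime ddb table as a named bean for the services that need it
    @Bean(name = "ddbExportTimeTable")
    public Table ddbExportTimeTable(DynamoDB ddbClient) {
        return ddbClient.getTable("ExportTime");
    }
}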
Bridge Exporter:
look up the last export date time for the given study id in the export time table and use it as the startDateTime;
put this study id into a new map with study id as key and last export date time as value;
if there is no such study in the export time table, put a default startDateTime as the map's value (e.g. the midnight of the given endDateTime);
set the query endDateTime to 1 minute before right now -- to avoid clock skew issues in distributed systems;
then range-key query the studyUploadedOnIndex and return the records to export;
finally, update the export time table with the new export date time for the study (see the sketch below);
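A minimal sketch of the instant-export time range, under the same storage assumptions as the earlier sketches:
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class InstantExportRangeSketch {
    // for INSTANT requests the query end is pinned to one minute before "now"
    // to avoid clock skew; the start is lastExportDateTime if present,
    // otherwise the start of the end date time's day
    static DateTime[] getInstantRange(Table exportTimeTable, String studyId) {
        DateTime endDateTime = DateTime.now(DateTimeZone.UTC).minusMinutes(1);
        Item item = exportTimeTable.getItem("studyId", studyId);
        DateTime startDateTime = (item != null)
                ? new DateTime(item.getLong("lastExportDateTime"))
                : endDateTime.withTimeAtStartOfDay();
        return new DateTime[] { startDateTime, endDateTime };
    }
}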
re-export:
with the ignoreLastExportTime flag set to true;
add an optional start date time field to indicate the time range;
startDateTime cannot exist with exportType in one request;
note: if there is no start date time, a default start date time is derived from the given exportType (see the sketch after the example below);
for daily re-export:
Similar to a normal daily export, but instead of looking up lastExportDateTime, it simply sets startDateTime to 24 hours before the given endDateTime, or uses the given startDateTime if one is provided;
for hourly re-export:
Similar to the normal case, but instead of looking up lastExportDateTime, it simply sets startDateTime to 1 hour before the given endDateTime, or uses the given startDateTime if one is provided;
no use case for one-time re-export;
then range-key query the studyUploadedOnIndex and return the records to export;
and do not update the export time table at all;
Re-export sqs request msg example:
{
"endDateTime":"2016-10-04T23:59:59Z",
"startDateTime":"2016-10-03T23:59:59Z",
"exportType":"DAILY",
"ignoreLastExportTime":"true",
"tag":"test exporter"
}
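A hedged sketch of the re-export start-time resolution: an explicit startDateTime wins, otherwise the exportType default applies (and, per the flag, the export time table is never written). The error handling shown is an assumption:
import org.joda.time.DateTime;

public class ReExportStartTimeSketch {
    // startDateTime from the request wins; otherwise fall back to the
    // exportType default (24 hours for DAILY, 1 hour for HOURLY)
    static DateTime resolveStartDateTime(DateTime requestStartDateTime, String exportType, DateTime endDateTime) {
        if (requestStartDateTime != null) {
            return requestStartDateTime;
        }
        if ("DAILY".equals(exportType)) {
            return endDateTime.minusHours(24);
        }
        if ("HOURLY".equals(exportType)) {
            return endDateTime.minusHours(1);
        }
        throw new IllegalArgumentException("re-export needs a startDateTime or a DAILY/HOURLY exportType");
    }
}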
re-drive:
for redriving tables, set the ignoreLastExportTime flag to true;
then identical to the re-export logic;
change code in ExportWorkerManager;
identical sqs msg as shown in re-export;
for redriving records, do not change anything (since it goes through s3override);
failed export:
since a failed export will not be able to update the export time table in ddb, the subsequent export task, whatever type it is (one-time or daily), will export all records from the last export date time to its given end date time -- that means it will also export all the records that should have been exported by the failed task, and a retried failed export will then export 0 records;
if the retried failed export completes before the next new request arrives, it will update lastExportDateTime and the next request can proceed as normal;
do not need to change any code;
identical export tasks
since we assume the Exporter is a single-machine server that can only handle one request at a time, if the exporter receives two identical requests at the same time, it can only process one of them; if that one exports successfully, lastExportDateTime has already been updated, so the other identical request will export 0 records;
no code needs to change;
normal override export:
because the override export does not use a time range at all, it does not use lastExportDateTime by default;
it will not update the last export date time;
it proceeds with the original logic;
special case: delete study:
should remove the corresponding entry in the export time table as well;
change code in BridgePF deleteStudy (see the sketch below);
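A one-line sketch of the cleanup in BridgePF's deleteStudy, assuming the document-API table bean from the Config section:
import com.amazonaws.services.dynamodbv2.document.Table;

public class DeleteStudyCleanupSketch {
    // when a study is deleted, drop its row from the export time table so a
    // stale lastExportDateTime does not linger
    static void deleteExportTimeEntry(Table exportTimeTable, String studyId) {
        exportTimeTable.deleteItem("studyId", studyId);
    }
}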
General requirements
add the sqs dependency in BridgePF -- refer to udd;
export only from last export date time to now:
need to create a table separate from the study table -- updating it is a frequent operation, which conflicts with what we want for the study table:
only contains two columns: studyId, lastExportDateTime (a table sketch follows this list);
every time the exporter finishes exporting, update the new table above;
note: all sqs requests will NOT contain startDateTime, but add an extra field "exportType" to indicate which type the request is -- 'instant', 'daily' or 'hourly' --
so that when we need to re-export (i.e. set ignoreLastExportTime to true), we can determine what time range the exporter needs to query for that request;
normal cases will not use 'exportType' at all and will always fetch lastExportDateTime from the ddb table;
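A hedged sketch of the two-column table as a DynamoDBMapper class; the attribute types (string study id, epoch-millisecond timestamp) are assumptions consistent with the sketches above:
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

@DynamoDBTable(tableName = "ExportTime")
public class DynamoExportTimeSketch {
    private String studyId;
    private Long lastExportDateTime;   // epoch milliseconds of the last successful export

    @DynamoDBHashKey(attributeName = "studyId")
    public String getStudyId() { return studyId; }
    public void setStudyId(String studyId) { this.studyId = studyId; }

    @DynamoDBAttribute(attributeName = "lastExportDateTime")
    public Long getLastExportDateTime() { return lastExportDateTime; }
    public void setLastExportDateTime(Long lastExportDateTime) { this.lastExportDateTime = lastExportDateTime; }
}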
Exporter changes
mostly in RecordIdSourceFactory:
all daily, hourly and instant export requests have the field 'exportType';
Daily and hourly requests without override have an end date time;
check if it has a whitelist:
if yes: just use the study whitelist as the study id list;
if no: scan the whole Study table to get the study id list;
then get the last export date time from the export time table:
if the given study id does not exist: just use the generated start date time as the last export date time;
query items by iterating over the study id list with the given last export date time and end date time;
if endDateTime is before lastExportDateTime, just return an empty list for that study (see the sketch below);
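A condensed sketch of the per-study query described above; the hash/range attribute names ("studyId", "uploadedOn") are assumptions, and startDateTime is assumed to be the already-resolved lastExportDateTime or default:
import com.amazonaws.services.dynamodbv2.document.Index;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import org.joda.time.DateTime;
import java.util.ArrayList;
import java.util.List;

public class RecordIdSourceFactorySketch {
    // query one study's records between the resolved start and the request's
    // endDateTime via the uploadedOn range key
    static List<Item> queryStudyRecords(Index studyUploadedOnIndex, String studyId,
            DateTime startDateTime, DateTime endDateTime) {
        // out-of-order or duplicate request: nothing new to export for this study
        if (!endDateTime.isAfter(startDateTime)) {
            return new ArrayList<>();
        }
        QuerySpec spec = new QuerySpec()
                .withHashKey("studyId", studyId)
                .withRangeKeyCondition(new RangeKeyCondition("uploadedOn")
                        .between(startDateTime.getMillis(), endDateTime.getMillis()));
        List<Item> records = new ArrayList<>();
        for (Item item : studyUploadedOnIndex.query(spec)) {
            records.add(item);
        }
        return records;
    }
}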
BridgeExporterRecordProcessor:
update the export time table with the new last export date time if the export succeeds (and if ignoreLastExportTime is false) -- see the sketch below;
modify ExportWorkerManager
build the request with the ignoreLastExportTime flag if it is a re-drive;
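A hedged sketch of the post-export bookkeeping: the export time table is written only when the export succeeded and the request did not set ignoreLastExportTime:
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import org.joda.time.DateTime;

public class ExportTimeUpdateSketch {
    // record the request's endDateTime as the study's new lastExportDateTime,
    // but only for successful, non-re-export / non-re-drive runs
    static void updateExportTime(Table exportTimeTable, String studyId, DateTime endDateTime,
            boolean exportSucceeded, boolean ignoreLastExportTime) {
        if (!exportSucceeded || ignoreLastExportTime) {
            return;
        }
        exportTimeTable.putItem(new Item()
                .withPrimaryKey("studyId", studyId)
                .withLong("lastExportDateTime", endDateTime.getMillis()));
    }
}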
Scheduler changes
for daily exporting:
add endDateTime;
add the field 'exportType' set to 'DAILY';
remove ‘Date’ field;
for hourly exporting:
remove startDateTime;
add the field 'exportType' set to 'HOURLY';
Bridge Docs and Java SDK
add the new API in the usual way;
BridgePF integration test
add an extra instant exporting test;
Test Cases
normal case: daily export without ignoreLastExportTime
check if it uses both the date and the datetime range
lastExportDateTime exists -- check if it uses lastExportDateTime;
does not exist -- check if it uses the correct startDateTime;
normal case: one-time export without ignoreLastExportTime
lastExportDateTime exists -- check if it uses it;
does not exist -- check if it uses the correct startDateTime;
normal case: hourly export without ignoreLastExportTime
lastExportDateTime exists -- check if it uses it;
does not exist -- check if it uses the correct startDateTime;
special case: daily export with ignoreLastExportTime
check if it uses the correct startDateTime instead of lastExportDateTime;
check if it does not modify the ddb table (a test sketch follows this list)
special case: hourly export with ignoreLastExportTime
check if it uses the correct startDateTime instead of lastExportDateTime;
check if it does not modify the ddb table
special case: s3-override:
check if it does not modify the ddb table
special case: re-drive:
for redriving records -- check if it does not modify the ddb table;
for redriving tables:
check if it uses the correct startDateTime instead of lastExportDateTime;
check if it does not modify the ddb table;
special case: endDateTime is before lastExportDateTime:
check if it modifies the export time table to endDateTime;
special case: failed export: the only way is to test it manually in a local environment
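For the ignoreLastExportTime cases above, a hedged JUnit/Mockito sketch of the kind of assertion intended (it exercises the ExportTimeUpdateSketch helper from the Exporter changes section, not the real record processor):
import static org.mockito.Mockito.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.never;
import static org.mockito.Mockito.verify;

import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import org.joda.time.DateTime;
import org.junit.Test;

public class IgnoreLastExportTimeTest {
    @Test
    public void reExportDoesNotTouchExportTimeTable() {
        Table mockTable = mock(Table.class);
        // ignoreLastExportTime=true should leave the export time table untouched
        ExportTimeUpdateSketch.updateExportTime(mockTable, "api",
                DateTime.parse("2016-10-04T23:59:59Z"), true, true);
        verify(mockTable, never()).putItem(any(Item.class));
    }
}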