
Notes on getting the size of files within a set of projects + the associated downloads within a time period

 

For the storage size of the projects, we get all the file entities in those projects and follow them through their revisions to the file handles to compute the total sizes.
Note 1: What about files that are not entities?

# Sum file sizes for file entities in the given projects ($0.03/GB-month assumed S3 storage rate; 'with rollup' adds a grand-total row)
select n.PROJECT_ID, count(*) as file_count, sum(CONTENT_SIZE)/1024/1024/1024 as size_gb, sum(CONTENT_SIZE)/1024/1024/1024*0.03 as monthly_cost
from JDONODE n
join JDOREVISION r on r.OWNER_NODE_ID=n.ID
join FILES f on f.ID = r.FILE_HANDLE_ID
where n.PROJECT_ID in (1867134, 2759792, 2787333, 4921369) and n.NODE_TYPE = 'file'
	and f.METADATA_TYPE in ('S3')
group by n.PROJECT_ID with rollup;

For the downloads, we get from the warehouse all the entities and versions downloaded during the time interval; version 0 indicates the current version.
Note 1: The version that was current at the time of the download may differ from the version that is current when the query is run.
Note 2: How should bulk downloads be handled?

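# Entity downloads over the last 30 days; VERSION is parsed from the request URL, 0 when no explicit version was requested (i.e. the current version)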
select par.ENTITY_ID,
	if (instr(ar.`REQUEST_URL`, 'version')=0,
		0,
        substr(
			ar.`REQUEST_URL`,
			instr(ar.`REQUEST_URL`, '/version/')+length('/version/'),
            (instr(rtrim(ar.`REQUEST_URL`), '/file') - (instr(ar.`REQUEST_URL`, '/version/') + length('/version/'))))) as VERSION
from PROCESSED_ACCESS_RECORD par
join ACCESS_RECORD ar on ar.`SESSION_ID`= par.`SESSION_ID` and ar.`TIMESTAMP`=par.`TIMESTAMP`
where par.`TIMESTAMP` > unix_timestamp(date(now()) - interval 30 day) * 1000
	and par.`NORMALIZED_METHOD_SIGNATURE` in ('GET /entity/#/version/#/file', 'GET /entity/#/file')
    and ar.RESPONSE_STATUS in (200, 307)
;

Load the results from the query above (saved as a comma-separated file) into a table in the prod database and replace the 'current' version (0) by the actual current version. Then join on the file data above to compute the total size of the downloads.

Note 1: If the version that was downloaded has since been deleted, it won't be accounted for.

# Load download stats from the warehouse, replacing the 'current' version (0) with the actual current version
drop table if exists X_FILE_DOWNLOADS;
create table X_FILE_DOWNLOADS(ENTITY_ID bigint(20), VERSION int);
#truncate table X_FILE_DOWNLOADS;
load data local infile '/Users/xavier/temp/file_downloads.csv' into table X_FILE_DOWNLOADS fields terminated by ',';
update X_FILE_DOWNLOADS d, (select OWNER_NODE_ID, max(NUMBER) as CURRENT_VER from JDOREVISION group by OWNER_NODE_ID) as mx
set d.VERSION=mx.CURRENT_VER
where mx.OWNER_NODE_ID=d.ENTITY_ID and d.VERSION = 0;

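# Sum downloaded bytes per project by joining the download records to the file handles ($0.09/GB assumed data-transfer-out rate)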
select n.PROJECT_ID, sum(CONTENT_SIZE)/1024/1024/1024 as size_gb, sum(CONTENT_SIZE)/1024/1024/1024*0.09 as transfer_cost
from JDONODE n
join JDOREVISION r on r.OWNER_NODE_ID=n.ID
join FILES f on f.ID = r.FILE_HANDLE_ID
join X_FILE_DOWNLOADS d on d.ENTITY_ID=r.OWNER_NODE_ID and d.VERSION=r.NUMBER
where n.PROJECT_ID in (1867134, 2759792, 2787333, 4921369) and n.NODE_TYPE = 'file'
	and f.METADATA_TYPE in ('S3')
group by n.PROJECT_ID with rollup
;

TODO: Turn both queries into stored procedures once we have all the pieces in the warehouse.
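
As a starting point for that TODO, here is a minimal sketch of the storage-size query wrapped as a procedure. The procedure name X_PROJECT_STORAGE_SIZE, the comma-separated project-id parameter, and the use of find_in_set are assumptions, not a settled design; the download-size query would get the same treatment once the prod tables are available in the warehouse.

# Sketch only: storage-size query as a stored procedure (name and parameter are placeholders)
DELIMITER //
CREATE PROCEDURE X_PROJECT_STORAGE_SIZE(IN project_ids varchar(1024))
BEGIN
	select n.PROJECT_ID, count(*) as file_count,
		sum(CONTENT_SIZE)/1024/1024/1024 as size_gb,
		sum(CONTENT_SIZE)/1024/1024/1024*0.03 as monthly_cost
	from JDONODE n
	join JDOREVISION r on r.OWNER_NODE_ID=n.ID
	join FILES f on f.ID = r.FILE_HANDLE_ID
	where find_in_set(n.PROJECT_ID, project_ids)
		and n.NODE_TYPE = 'file'
		and f.METADATA_TYPE in ('S3')
	group by n.PROJECT_ID with rollup;
END //
DELIMITER ;

# Example call:
# call X_PROJECT_STORAGE_SIZE('1867134,2759792,2787333,4921369');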