Export dataset file information to data warehouse
This document outlines the approach for exporting information about dataset files to a data warehouse.
Dataset
A dataset is a type of view defined by a flat list of entities, and these entities are files. In Synapse a dataset is considered an entity, and all entities are stored in the Node and Node_Revision table. The Synapse dataset includes details about a list of file IDs with their versions, the overall number of files in the dataset, the total size of the files in bytes, and the checksum of the files.
NodeSnapshots Table
In data warehouse already have nodesnapshots table. Which contain snapshots of nodes (Nodes are project, dataset, file, folders...). Snapshots are taken when nodes are created, updated or deleted. Snapshots are also taken periodically and independently of the changes. The nodesnapshots table provides details about the dataset but does not capture information about the files within that dataset.
Current nodesnapshot table schema:
[
{
"Name": "change_type",
"Type": "string",
"Comment": "The type of change that occurred on the node, e.g., CREATE, UPDATE, DELETE."
},
{
"Name": "change_timestamp",
"Type": "timestamp",
"Comment": "The time when the change (created/updated/deleted) on the node is pushed to the queue for snapshotting."
},
{
"Name": "change_user_id",
"Type": "bigint",
"Comment": "The unique identifier of the user who made the change to the node."
},
{
"Name": "snapshot_timestamp",
"Type": "timestamp",
"Comment": "The time when the snapshot was taken. (It is usually after the change happened)."
},
{
"Name": "id",
"Type": "bigint",
"Comment": "The unique identifier of the node."
},
{
"Name": "benefactor_id",
"Type": "bigint",
"Comment": "The identifier of the (ancestor) node which provides the permissions that apply to this node. Can be the id of the node itself."
},
{
"Name": "project_id",
"Type": "bigint",
"Comment": "The project where the node resides. It will be empty for the change type DELETE."
},
{
"Name": "parent_id",
"Type": "bigint",
"Comment": "The unique identifier of the parent in the node hierarchy."
},
{
"Name": "node_type",
"Type": "string",
"Comment": "The type of the node. Allowed node types are : project, folder, file, table, link, entityview, dockerrepo, submissionview, dataset, datasetcollection, materializedview, virtualtable."
},
{
"Name": "created_on",
"Type": "timestamp",
"Comment": "The creation time of the node."
},
{
"Name": "created_by",
"Type": "bigint",
"Comment": "The unique identifier of the user who created the node."
},
{
"Name": "modified_on",
"Type": "timestamp",
"Comment": "The most recent change time of the node."
},
{
"Name": "modified_by",
"Type": "bigint",
"Comment": "The unique identifier of the user who last modified the node."
},
{
"Name": "version_number",
"Type": "bigint",
"Comment": "The version of the node on which the change occurred, if applicable."
},
{
"Name": "file_handle_id",
"Type": "bigint",
"Comment": "The unique identifier of the file handle if the node is a file, null otherwise."
},
{
"Name": "name",
"Type": "string",
"Comment": "The name of the node."
},
{
"Name": "is_public",
"Type": "boolean",
"Comment": "If true, READ permission is granted to all the Synapse users, including the anonymous user, at the time of the snapshot."
},
{
"Name": "is_controlled",
"Type": "boolean",
"Comment": "If true, an access requirement managed by the ACT is set on the node."
},
{
"Name": "is_restricted",
"Type": "boolean",
"Comment": "If true, a terms-of-use access requirement is set on the node."
},
{
"Name": "effective_ars",
"Type": "array<bigint>",
"Comment": "The list of access requirement ids that apply to the entity at the time the snapshot was taken."
},
{
"Name": "snapshot_date",
"Type": "date",
"Comment": "The data is partitioned for fast and cost effective queries. The snapshot_timestamp field is converted into a date and stored in the snapshot_date field for partitioning. The date should be used as a condition (WHERE CLAUSE) in the queries.",
"PartitionKey": "Partition (0)"
}
]
In order to include file information in the nodesnapshots table for a dataset, four columns—namely, list_of_items, count_of_items, size_of_items and checksum_of_items—need to be added to the table.
New Columns
list_of_items : The list of files of a dataset with id and version.
{ "Name": "list_of_items", "Type": "ARRAY < STRUCT <entityId: string,versionNumber: bigint>>", "Comment": "The list of files of a dataset." },
count_of_items : The total number of files in a dataset.
{ "Name": "count_of_items", "Type": "bigint", "Comment": "The total number of files in a dataset." }
size_of_items : The sum of size of all the files of a dataset.
4 checksum_of_items : The md5 of concatenate sorted md5 of all the files of a dataset.
Additional file information
Additional details about the files can be retrieved by performing a join between the nodesnapshot table and the filesnapshots table like storage location, file type, name of file etc.