
Table Schema Changes

Use Cases

Schema Change with No Row Changes

For a table that already exists with a schema and rows:

  1. User selects the edit schema button and is presented with the edit schema dialog.
  2. User adds a new column.
  3. User deletes an existing column.
  4. User changes the type of an existing column.
  5. User moves the newly added column to the top of the list.
  6. User selects save.

All changes from steps 2-5 should be saved in a single atomic operation. This includes column additions, deletes, updates, and column order changes, as sketched below.
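
A minimal sketch of what the atomic save might send, assuming a single table-update request that carries the full ordered column list (the field names here are illustrative, not an existing API):

{
  tableId = 1,
  // the new ordered list of column IDs; additions, deletes, and order
  // changes are all expressed by comparing this list to the current schema
  columnIds = [ 16, 12, 14 ]   // 16 is the newly added column, moved to the top
}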

Schema Change with CSV Upload

For a table that already exists with a schema and rows:

  1. User selects a CSV file to upload to the table.
  2. User is presented with a preview of the uploaded CSV's schema and informed that the CSV schema does not match that of the table. The preview is similar to the column list from the edit schema dialog and contains the columns found in the CSV.
  3. From the above list, the user removes one column.
  4. From the above list, the user renames a column.
  5. User selects the upload button.

All changes from steps 3-5 should be saved in a single atomic operation. This includes column additions, deletes, updates, and column order changes. Since the new schema does not exactly match the CSV, a mapping between column IDs and the column indices in the CSV must be provided, as sketched below. The update should also include row additions, deletes, and updates from the CSV.
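
Roughly, the upload request might then look like the following, combining the change-list structure proposed further down with a column mapping (all field names are illustrative; the rename is shown as a hypothetical UPDATE change type, which the DELETE/APPEND/CONVERT set below does not yet cover):

{
  tableId = 1,
  csvFileHandleId = 987,                  // the uploaded CSV
  changes = [
    { change = DELETE, columnId = 12 },   // column removed in the preview
    { change = UPDATE, columnId = 13 }    // column renamed in the preview (hypothetical change type)
  ],
  // maps each remaining column ID to its zero-based index in the CSV
  columnMapping = [
    { columnId = 13, csvIndex = 0 },
    { columnId = 14, csvIndex = 2 }
  ]
}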


Delete Column

The user wishes to clear the contents of a column. They do this as follows:

  1. Delete the column from the schema using a command line client.
  2. Add the same column ID back to the table at the same index using a command line client.

All cells in the column from above should be empty or set to the default value of the column.
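
Expressed with the change-list structure proposed below, this could collapse into a single request (sketch only; whether a DELETE followed by an APPEND of the same column ID is legal in one atomic call is an open design point):

{
  tableId = 1,
  [
    { change = DELETE, columnId = 12 },   // drops the column and all of its cell values
    { change = APPEND, columnId = 12 }    // re-adds it; cells come back empty or at the column default
  ]
}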


There are a number of use cases where we want table schema changes as part of the lifecycle of a table. Most requested is certainly the easy case of increasing the size of a string column. More complex is changing a string column to an int column, or even changing a string column to a file handle. The other use case is where a partially different dataset is appended to an existing table and it magically merges and rinses the data into the new table format.

Magic aside, I propose the following:
- in addition to the current schema changes possible (currently you can update a table entity with a new set of columns, and we presume that non-matching column IDs mean you meant to append or delete those columns), we add a new call to explicitly change the columns of a table. A structure like:
{
  tableId = 1,
  [
    { change = DELETE, columnId = 12 },
    { change = APPEND, columnId = 13 },
    { change = CONVERT, fromColumnId = 14, toColumnId = 15 }
  ]
}
is passed in to an async call (because validation can take a long time); if an error occurred, the result is a (truncated!) list of the row values that did not convert. Conversion only happens if all values can be converted to the new type.
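
For example, a failed conversion might come back roughly like this (shape and field names are a guess, not a settled API):

{
  tableId = 1,
  // truncated after the first N failures
  errors = [
    { rowId = 101, columnId = 14, value = "not a number" },
    { rowId = 205, columnId = 14, value = "1.2.3" }
  ]
}
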
- for CSV uploads we add this same change list to the upload request, which allows for creation or modification of a table at the same time the data is appended.
Internally, we currently store rows as a series of rowset deltas. We augment that by adding column modifications to that same list of deltas, so that we can replay the data and schema changes in the same order they were received in the first place. (This implies that any change made to the columns allows for an equivalent SQL modification of the table, which I don't think poses any restrictions.)
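
As a sketch, the unified delta list might interleave entries like these (entry shapes are illustrative), with replay in sequence order reproducing data and schema changes in their original order:

[
  { seq = 41, type = ROWSET,        rows = [ ... ] },
  { seq = 42, type = COLUMN_CHANGE, change = { change = CONVERT, fromColumnId = 14, toColumnId = 15 } },
  { seq = 43, type = ROWSET,        rows = [ ... ] }   // rows written under the new schema
]
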
Some questions:
1. should we allow nulling out of non-convertible cell values (default answer: no, too dangerous)
2. should we allow column modifications as part of the CSV, i.e. specially formatted lines at the top of the CSV, before (or instead of) the header (default answer: no, it makes the CSV non-standard)
Let me know if you have questions or concerns, or if this needs a meeting.