Update Synapse tables for integer types

Description

Tried to download a Synapse table as a Pandas data frame and then upload it back to Synapse.

The following round-trip example succeeded as expected. However, if the query picks up one particular row with a missing integer value in the "heartAgeDataBloodGlucose" column, the upload fails.
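
A hedged reconstruction of the round trip (the original snippet was not preserved in the ticket; syn123 and the query are placeholders):

    import synapseclient
    from synapseclient import Table

    syn = synapseclient.login()

    # Download the query results as a pandas DataFrame
    results = syn.tableQuery("SELECT * FROM syn123")
    df = results.asDataFrame()

    # Upload the DataFrame back to the same table
    syn.store(Table("syn123", df))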

The Synapse error message reads: Value at [9,9] was not a valid INTEGER. For input string: "100.0".

The reason is that the pandas integer type does not support NaN. When a column has a missing value, its type falls back to float, and the string representation of the resulting float (e.g. "100.0" instead of "100") is incompatible with the Synapse table's INTEGER type.
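
A minimal demonstration of the fallback:

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 3])
    print(s.dtype)    # int64

    s = pd.Series([1, 2, np.nan])
    print(s.dtype)    # float64 -- NaN forces the column to float
    print(str(s[0]))  # '1.0' -- not a valid INTEGER for Synapse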

Environment

None

Activity

Chris Bare
March 7, 2016, 8:27 PM

It doesn't solve the issue above, but it will probably help to pass dtypes to pd.read_csv in CsvFileTable.asDataFrame to ensure that FileHandleID columns are not inferred to be integers or floats, but stay as strings (see the sketch below).

We may want to allow the user to pass in their own dtypes argument.

Still, numpy int64s don't have a representation for missing data, so we can't help losing that information.
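
A sketch of the dtype override, assuming the file-handle column names are taken from the table's schema ('attachment' is a hypothetical column name and 'query_results.csv' a placeholder path):

    import pandas as pd

    # Keep FILEHANDLEID columns as strings instead of letting read_csv
    # infer an integer or float type for them.
    dtype_overrides = {'attachment': str}
    df = pd.read_csv('query_results.csv', dtype=dtype_overrides)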

Ziming Dong
June 15, 2017, 9:00 PM

A possible solution is to check the dtype of every DataFrame column and collect the set of column names with floats, then check the schema of the Synapse table to gather the set of INTEGER column names. The intersection of these two sets is the set of INTEGER columns containing missing values. Since we can't convert NaN to int, we have to first convert all the values to strings, perform a split('.') on each string, and use the leftmost part. We shouldn't need to worry about the right-hand side of the decimal point, since the values were supposedly converted from integers anyway (they should all end in '.0').
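
A sketch of that conversion (the helper name and its arguments are hypothetical):

    import pandas as pd

    def cast_float_columns_to_int_strings(df, integer_column_names):
        """Convert float-typed columns that are INTEGER in the Synapse
        schema to strings with the trailing '.0' stripped; missing
        values become empty strings."""
        float_columns = {c for c in df.columns if df[c].dtype.kind == 'f'}
        for col in float_columns & set(integer_column_names):
            df[col] = df[col].apply(
                lambda v: '' if pd.isnull(v) else str(v).split('.')[0])
        return df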

The issue is that in the Schema we are only given columnIds, meaning we must fetch the column type data from Synapse. That means we can only resolve the float/integer inconsistency during store(), via _synapse_store(). As a result, a float rather than an int will still be written to the user's CSV after from_dataframe() is called, and it can only be corrected by a subsequent call to store().

Since this int-to-float edge case really only happens when the user fetches table data from Synapse and then calls asDataFrame(), I propose attaching an extra attribute to the returned DataFrame containing the column names and their Synapse types (df._synapse_column_types). Then, in from_dataframe, we can check for the existence of that attribute and try to resolve the types.
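
A rough sketch of the proposal, reusing the hypothetical helper above (the attribute name follows the comment; everything else is illustrative):

    # In CsvFileTable.asDataFrame: tag the DataFrame with the schema's types.
    df._synapse_column_types = {'heartAgeDataBloodGlucose': 'INTEGER'}

    # In from_dataframe: if the tag is present, resolve float columns
    # that the schema says are INTEGER.
    column_types = getattr(df, '_synapse_column_types', None)
    if column_types is not None:
        integer_columns = [c for c, t in column_types.items() if t == 'INTEGER']
        df = cast_float_columns_to_int_strings(df, integer_columns)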

One possible downside is that if the user decided to change the schema of the Synapse table, we would be using outdated column information. I'm not sure how much work is involved in modifying the schema of a table that already contains data, or whether users even do that very often.

Bruce Hoff
January 23, 2018, 4:17 PM

Installed the 'develop' branch:
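
Presumably something along these lines (the exact command was not preserved):

    pip install git+https://github.com/Sage-Bionetworks/synapsePythonClient.git@develop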

Then queried a table with an integer column having a missing value, and stored it:
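
A hedged reconstruction, essentially the round trip from the description (syn123 is again a placeholder):

    import synapseclient
    from synapseclient import Table

    syn = synapseclient.login()
    results = syn.tableQuery("SELECT * FROM syn123")
    df = results.asDataFrame()

    # With the fix, the INTEGER column with a missing value survives
    # the round trip instead of failing to parse as "100.0".
    syn.store(Table("syn123", df))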

Assignee

Ziming Dong

Reporter

Bruce Hoff

Labels

Validator

Bruce Hoff

Development Area

None

Release Version History

None

Components

Fix versions

Affects versions

Priority

Critical