What happened?
When a CSV column holds integer-looking values but also contains missing values, the workflow crashes inside any pandas-based Python operator (e.g. Sort).
Root cause chain:
- CSV File Scan auto-infers such a column as integer (inferSchemaFromRows in AttributeTypeUtils.scala), and there is no per-column type override in the UI.
- Python operators run on pandas. With its default NumPy-backed dtypes, pandas up-casts an integer column that contains any missing value to float64, because int64 cannot represent NaN. So 121 becomes 121.0.
- On output, the Python worker validates each tuple against the declared schema, which still says INTEGER. The actual value is a float, so it raises:
TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead.
File ".../core/models/tuple.py", line 361, in validate_schema (called from finalize -> on_finish)
This affects every "integer column that also has missing values" — the user must hit the error, find the column, and manually cast it. In our dataset (diabetes.csv) 4 columns are affected (weight, waist, hip, time.ppn).
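The up-cast in the second step can be reproduced outside the workflow. The following is a minimal sketch, assuming pandas with its default NumPy-backed dtypes. The row values are illustrative, and the frame is built directly rather than through the actual Python worker.

```python
import pandas as pd

# Illustrative rows: "weight" holds integers plus one missing value (None).
df = pd.DataFrame({"age": [46, 29, 58], "weight": [121, None, 119]})

# "age" has no missing values and keeps an integer dtype;
# "weight" contains a missing value and is up-cast to float64.
dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(dtypes)

# Sorting does not change the dtype, so the values leave the operator as floats.
sorted_weights = df.sort_values("age")["weight"].tolist()
print(sorted_weights)
```

Any value from the weight column is now a Python float (or NaN), which is what the schema check then rejects against the declared INTEGER type.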
Expected:
Integer columns containing nulls should be handled gracefully instead of crashing. Either:
- (a) the Python worker's schema validation should coerce an integral float (e.g. 119.0) back to INTEGER and NaN to null, or
- (b) CSV File Scan should infer a null-containing integer column as DOUBLE, or
- (c) the UI should expose a per-column type override on CSV File Scan.
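Options (a) and (b) could look roughly like the sketch below. Both functions are hypothetical and use only the Python standard library: coerce_to_integer and infer_column_type are names made up for illustration, not existing functions in tuple.py or AttributeTypeUtils.scala, and the type names are plain strings standing in for the real AttributeType values.

```python
import math
from typing import Any, List, Optional


def coerce_to_integer(value: Any) -> Optional[int]:
    """Option (a), hypothetical: integral float -> int, NaN -> None."""
    if value is None:
        return None
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError(f"expected INTEGER, got {value!r}")
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            return int(value)
    # Non-integral floats (e.g. 119.5) and other types are still errors.
    raise TypeError(f"expected INTEGER, got {value!r} ({type(value)})")


def infer_column_type(cells: List[str]) -> str:
    """Option (b), hypothetical: integer-looking column with blanks -> DOUBLE."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return "STRING"
    has_missing = len(non_empty) < len(cells)

    def all_parse(parser) -> bool:
        # True if every non-empty cell is accepted by the given parser.
        try:
            for c in non_empty:
                parser(c)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "DOUBLE" if has_missing else "INTEGER"
    if all_parse(float):
        return "DOUBLE"
    return "STRING"
```

With (a), a value such as 119.5 still raises, so genuine type mismatches stay visible. With (b), the declared schema matches what pandas produces, at the cost of reporting the column as DOUBLE downstream.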
Current workaround: insert a Type Casting operator and manually cast every affected integer column to double. This works but is manual and error-prone (casting to integer instead of double silently reproduces the bug).
How to reproduce?
- Prepare a CSV with an integer-valued column that contains at least one empty cell, e.g. diabetes.csv where weight is all integers except one blank.
- Build workflow: CSV File Scan -> Sort. In Sort, sort by any column (e.g. age).
- Run the workflow.
- The Sort operator fails on finish with:
TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead.
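For the first step, a small input file can be generated instead of using diabetes.csv. This is a sketch using only the Python standard library. The file name and row values are made up for illustration; only the column names come from this report.

```python
import csv
import os
import tempfile

# Three rows; the second has a blank "weight" cell (the missing value).
rows = [
    {"age": 46, "weight": 121},
    {"age": 29, "weight": ""},
    {"age": 58, "weight": 119},
]

path = os.path.join(tempfile.mkdtemp(), "int_with_blank.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["age", "weight"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the blank cell was written as an empty field.
with open(path, newline="") as f:
    content = f.read()
print(content)
```

Feeding this file to CSV File Scan -> Sort should be enough to trigger the error, assuming weight is inferred as integer as described above.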
Workaround that fixes it:
CSV File Scan -> Type Casting (cast weight/waist/hip/time.ppn -> double) -> Sort, then re-run.
Version/Branch
1.3.0-incubating-SNAPSHOT (main)
Commit Hash (Optional)
No response
What browsers are you seeing the problem on?
No response
Relevant log output