Python operators (e.g. Sort) crash on integer columns containing nulls: pandas promotes them to float, failing strict output schema validation #5935

Description

@eugenegujing

What happened?

When a CSV column holds integer-looking values but also contains missing values, the workflow crashes inside any pandas-based Python operator (e.g. Sort).

Root cause chain:

  1. CSV File Scan auto-infers such a column as integer (inferSchemaFromRows in AttributeTypeUtils.scala), and there is no per-column type override in the UI.

  2. Python operators run on pandas. With its default NumPy-backed dtypes, pandas cannot store NaN in an integer column, so an integer column that contains any missing value is automatically up-cast to float64. So 121 becomes 121.0.

  3. On output, the Python worker validates each tuple against the declared schema, which still says INTEGER. The actual value is a float, so it raises:

    TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead.
    
    File ".../core/models/tuple.py", line 361, in validate_schema (called from finalize -> on_finish)
    

This affects every integer column that also has missing values: the user must hit the error, find the column, and manually cast it. In our dataset (diabetes.csv) 4 columns are affected (weight, waist, hip, time.ppn).
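
The up-cast in step 2 can be reproduced with pandas alone, outside of any workflow. The snippet below is a minimal sketch: the values are made up for illustration, and it does not go through the Python worker, which may build its DataFrame differently.

```python
import pandas as pd

# Illustrative values only; not taken from diabetes.csv.
clean = pd.Series([121, 119, 130])       # no missing values
with_gap = pd.Series([121, None, 119])   # one missing value

clean_dtype = str(clean.dtype)       # an integer dtype (int64 on typical installs)
gap_dtype = str(with_gap.dtype)      # float64: the None became NaN and forced the up-cast
first_value = with_gap.iloc[0]       # 121.0, a float rather than an int

print(clean_dtype, gap_dtype, first_value)
```

Any value read back from the second series is a float, which is what the schema check in step 3 then rejects.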

Expected:
Integer columns containing nulls should be handled gracefully instead of crashing. Either:

  • (a) the Python worker's schema validation should coerce an integral float (e.g. 119.0) back to INTEGER and NaN to null, or
  • (b) CSV File Scan should infer a null-containing integer column as DOUBLE, or
  • (c) the UI should expose a per-column type override on CSV File Scan.
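
To make option (a) concrete, here is a rough sketch of the coercion it describes. `coerce_to_int` is a hypothetical helper written for this issue, not an existing function in the Python worker, and where it would be called from (for example before `validate_schema`) is an assumption.

```python
import math
import numbers
from typing import Any, Optional


def coerce_to_int(value: Any) -> Optional[int]:
    """Hypothetical helper: map a value destined for an INTEGER field to int or None."""
    if value is None:
        return None
    if isinstance(value, bool):
        # bool is a subclass of int; reject it so True/False are not silently accepted.
        raise TypeError(f"cannot coerce {value!r} to INTEGER")
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, float):
        if math.isnan(value):
            return None            # NaN stands for a missing value
        if value.is_integer():
            return int(value)      # 119.0 -> 119, no information lost
    # Non-integral floats (119.5), infinities, and other types are still errors.
    raise TypeError(f"cannot coerce {value!r} to INTEGER without losing information")


print(coerce_to_int(119.0), coerce_to_int(float("nan")))
```

Only integral floats and NaN are accepted, so a genuinely fractional value in an INTEGER column would still fail loudly rather than being truncated.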

Current workaround: insert a Type Casting operator and manually cast every affected integer column to double. This works but is manual and error-prone (casting to integer instead of double silently reproduces the bug).

How to reproduce?

  1. Prepare a CSV with an integer-valued column that contains at least one empty cell, e.g. diabetes.csv where weight is all integers except one blank.
  2. Build workflow: CSV File Scan -> Sort. In Sort, sort by any column (e.g. age).
  3. Run the workflow.
  4. The Sort operator fails on finish with:
    TypeError: Unmatched type for field 'weight', expected AttributeType.INT, got 119.0 (<class 'float'>) instead.
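
If diabetes.csv is not at hand, a small stand-in file for step 1 can be generated as sketched below. The file name `int_with_blank.csv` and the row values are placeholders chosen for illustration; I have only confirmed the crash with diabetes.csv itself.

```python
import csv

csv_path = "int_with_blank.csv"  # placeholder file name

# Placeholder rows: weight is integer-valued except for one blank cell.
rows = [
    {"age": 46, "weight": 121},
    {"age": 29, "weight": ""},
    {"age": 58, "weight": 119},
]

with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["age", "weight"])
    writer.writeheader()
    writer.writerows(rows)
```

Uploading the resulting file to CSV File Scan and continuing with step 2 should exercise the same inference path.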
    

Workaround that fixes it:
CSV File Scan -> Type Casting (cast weight/waist/hip/time.ppn -> double) -> Sort, then re-run.

Version/Branch

1.3.0-incubating-SNAPSHOT (main)

Commit Hash (Optional)

No response

What browsers are you seeing the problem on?

No response

Relevant log output
