Objective
Design and implement the project's Intermediate Representation (IR), the shared data model that bridges document understanding and information extraction.
The IR is designed as a structured curation workspace rather than a database schema. It preserves reported scientific observations, provenance, uncertainty, and transformation history while remaining suitable for downstream extraction, validation, and export.
Planned Work
IR Schema
Implement immutable Pydantic models for:
- Citation
- Site
- Treatment
- Species
- Management event
- Observation
Observation Model
The observation model will preserve:
- reported value
- reported units
- converted value (when applicable)
- statistical encoding
- temporal information
- provenance
- confidence
- observation level
- unresolved status
Reported values will never be overwritten by derived values.
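The "never overwritten" rule can be illustrated with a minimal sketch. It uses a frozen stdlib dataclass as a dependency-free stand-in for the planned Pydantic model, and it covers only the value and unit fields. The field names and the `with_conversion` helper are hypothetical; the plan names the concepts, not the identifiers.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Observation:
    # Hypothetical field names; only the value/unit subset of the model is shown.
    reported_value: float
    reported_units: str
    converted_value: Optional[float] = None
    converted_units: Optional[str] = None

    def with_conversion(self, factor: float, units: str) -> "Observation":
        """Return a new Observation with a derived value; reported fields are copied unchanged."""
        return replace(
            self,
            converted_value=self.reported_value * factor,
            converted_units=units,
        )


# The reported value stays as written in the paper; the conversion lives beside it.
obs = Observation(reported_value=2.5, reported_units="t/ha")
converted = obs.with_conversion(factor=1000.0, units="kg/ha")
```

Because the instance is frozen, assigning to `obs.reported_value` raises an error, so a derived value can only be attached by creating a new instance.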
Provenance
Every extracted entity must retain:
- originating document object
- page
- section
- source span
- table cell or figure reference (when applicable)
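One possible shape for such a provenance record is sketched below, again with a frozen stdlib dataclass standing in for a Pydantic model. All field names are assumptions, as are the choices to reference the document by an identifier string, to count pages from 1, and to store the source span as character offsets.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    # Hypothetical field names mirroring the list above.
    document_id: str               # assumed: a string reference to the originating document object
    page: int                      # assumed: 1-based page number
    section: str
    source_span: Tuple[int, int]   # assumed: (start, end) character offsets, start < end
    table_cell: Optional[str] = None   # set only when the value came from a table
    figure_ref: Optional[str] = None   # set only when the value came from a figure

    def __post_init__(self) -> None:
        start, end = self.source_span
        if self.page < 1:
            raise ValueError("page must be 1 or greater")
        if not (0 <= start < end):
            raise ValueError("source_span must satisfy 0 <= start < end")


# Illustrative values only; not taken from any real paper.
prov = Provenance(
    document_id="paper-01",
    page=4,
    section="Results",
    source_span=(120, 168),
    table_cell="Table 2, row 3, col 2",
)
```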
Uncertainty
Support explicit representation of uncertainty, including:
- bounded date intervals
- unresolved fields
- confidence metadata
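A bounded date interval with an explicit unresolved state might look like the sketch below. It uses a frozen stdlib dataclass rather than Pydantic, and the names `DateInterval`, `earliest`, `latest`, `is_exact`, and `is_unresolved` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DateInterval:
    # Either bound may be missing; both missing means the date is unresolved.
    earliest: Optional[date] = None
    latest: Optional[date] = None

    def __post_init__(self) -> None:
        if self.earliest and self.latest and self.earliest > self.latest:
            raise ValueError("earliest must not be after latest")

    @property
    def is_exact(self) -> bool:
        """True when both bounds are present and name the same day."""
        return self.earliest is not None and self.earliest == self.latest

    @property
    def is_unresolved(self) -> bool:
        """True when the source gives no usable date information."""
        return self.earliest is None and self.latest is None


# A paper that reports only "May 2019" is kept as a month-wide interval.
may_2019 = DateInterval(earliest=date(2019, 5, 1), latest=date(2019, 5, 31))
exact_day = DateInterval(earliest=date(2019, 5, 14), latest=date(2019, 5, 14))
unknown = DateInterval()
```

Keeping both bounds avoids inventing a precise day that the source never stated.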
Calibration Protocol
Implement the protocol-specific design decisions discussed during project planning, including:
- date min/max representation
- observation-level distinction between treatment means and aggregated summaries
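The observation-level distinction could be carried as an enumerated field, as in the sketch below. The enum members, the `LeveledObservation` class, and the sample values are all hypothetical, and stdlib dataclasses are used in place of Pydantic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ObservationLevel(str, Enum):
    # Hypothetical labels for the two levels named in the plan.
    TREATMENT_MEAN = "treatment_mean"
    AGGREGATED_SUMMARY = "aggregated_summary"


@dataclass(frozen=True)
class LeveledObservation:
    label: str
    value: float
    level: ObservationLevel


def treatment_means(observations: List[LeveledObservation]) -> List[LeveledObservation]:
    """Keep only observations reported as treatment means."""
    return [o for o in observations if o.level is ObservationLevel.TREATMENT_MEAN]


# Illustrative values only.
sample = [
    LeveledObservation("plot A yield", 4.2, ObservationLevel.TREATMENT_MEAN),
    LeveledObservation("site-wide yield", 3.9, ObservationLevel.AGGREGATED_SUMMARY),
]
means_only = treatment_means(sample)
```

Making the level a required field means an aggregated summary cannot silently pass as a treatment mean during export.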
Deliverables
- Complete IR schema
- Pydantic validation for the IR models
- Comprehensive test suite
- Fixture IR instances for all five ground-truth papers