Skip to content

docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance#183

Open
dimitri-yatsenko wants to merge 5 commits into
mainfrom
feat/provenance-trinity-spec
Open

docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance#183
dimitri-yatsenko wants to merge 5 commits into
mainfrom
feat/provenance-trinity-spec

Conversation

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Summary

Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3. Implementation against this spec follows in three sub-tasks (T2.2.a/b/c):

  • `Diagram.trace(table_expr)` (#1423) — upstream mirror of `Diagram.cascade()`. Walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the just-merged Cascade Specification.
  • `self.upstream` (#1424) — property on `AutoPopulate` that the framework sets to `Diagram.trace(self & key)` before each `make()`. Pre-restricted ancestor access for ergonomic, provenance-safe reads.
  • `dj.config["strict_provenance"]` (#1425) — runtime flag (default `False`). When `True`, gates `make()` so reads must go through `self.upstream` and writes must target `self` or `self`'s Parts with key-consistent primary keys.

Spec structure

Section What it covers
Why this exists The convention/enforcement gap; why three pieces are needed together
Concepts trace as cascade's mirror; OR convergence; the `make()` provenance boundary; shared upward rules
§1 `Diagram.trace` Signature, behavior, `trace[T]` indexing (class + string), `counts()`/`heading()`/`iter`, worked examples
§2 `self.upstream` Construction lifecycle, allowed table set, per-`key` behavior, examples
§3 `strict_provenance` Config key, read/write enforcement tables, key-consistency rule, opt-in rationale, compliant + violating examples
Integration End-to-end strict-mode example, properties guaranteed
Migration path Staged adoption from default-off to full enforcement
Out of scope Static analysis, default flip, downstream provenance-metadata persistence

Examples use core DataJoint types (`int32`, `float64`, `longblob`) per project convention.

Why a spec-first approach

The three sub-tasks (T2.2.a/b/c) form a tightly-coupled feature where partial shipping is worse than not shipping (you can't enforce against an upstream view that doesn't exist; you can't expose `self.upstream` ergonomically without the trace primitive). Writing the spec first locks the design contract that the three PRs implement against, surfaces design questions early, and gives reviewers a single coherent artifact to evaluate before the code arrives.

Cross-links

Sequencing

Reviewable now as a design document. The three implementation PRs (datajoint-python) will land against this spec. Order:

  1. This PR (spec) — review and merge
  2. T2.2.a: `Diagram.trace()` implementation (depends on #1468 cascade-rules landing first)
  3. T2.2.b: `self.upstream` implementation (depends on T2.2.a)
  4. T2.2.c: `strict_provenance` implementation (depends on T2.2.b)

Test plan

  • `mkdocs serve` renders the new spec under Reference → Specifications → Data Operations
  • Cross-links resolve (cascade.md, autopopulate.md, diagram.md, entity-integrity.md, the three issues)
  • Examples use core DataJoint types
  • Read enforcement table and write enforcement table are precise enough to be implementable as-is

…venance

Detailed normative specification for the canonical provenance trinity
landing in DataJoint 2.3:

- Diagram.trace(table_expr) — upstream mirror of Diagram.cascade(),
  walks the FK graph from a restricted seed to every ancestor with OR
  convergence. Reuses the upward propagation rules (U1/U2/U3) defined
  in the Cascade Specification. Supports trace[TableClass] returning a
  pre-restricted QueryExpression, and trace[str] returning a FreeTable.
- self.upstream — property on AutoPopulate that the framework sets to
  Diagram.trace(self & key) before each make() invocation. Pre-restricted
  ancestor access for ergonomic, provenance-safe reads.
- dj.config["strict_provenance"] — runtime flag (default False). When
  True, gates make() so reads must go through self.upstream and writes
  must target self or self's Parts with key-consistent primary keys.

Spec structure:
- Why this exists — the convention/enforcement gap, why three pieces.
- Concepts — trace as cascade's mirror, OR convergence, the make()
  provenance boundary, why upward rules are shared with cascade.
- §1 Diagram.trace — signature, behavior, indexing (class + string),
  counts() / heading() / iter, worked examples.
- §2 self.upstream — construction lifecycle, allowed table set,
  per-key behavior, examples.
- §3 strict_provenance — config key, read enforcement table, write
  enforcement table, key-consistency rule, opt-in rationale, examples
  of compliant + violating make() bodies.
- Integration — end-to-end strict-mode example, properties guaranteed.
- Migration path — staged adoption from on-default-off to enforcement.
- Out of scope — static analysis, default flip, downstream provenance
  metadata persistence.

Examples use core DataJoint types (int32, float64, longblob) per
project convention. Cross-links to cascade.md (shared upward rules),
autopopulate.md, diagram.md, entity-integrity.md.

Nav entry under Reference > Specifications > Data Operations between
AutoPopulate and Job Metadata.

Slated for DataJoint 2.3. T2.2.a (#1423), T2.2.b (#1424), T2.2.c
(#1425) implementations land against this spec.
@dimitri-yatsenko dimitri-yatsenko marked this pull request as draft June 10, 2026 23:14

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read carefully. This is a thoughtful spec — the three-piece architecture (graph operation + ergonomic surface + enforcement layer) is well-motivated, and the migration path gives teams a concrete 5-step rollout. A few observations worth raising before the implementation work begins:

Context

This is a forward-looking spec for three unshipped features. All three referenced issues (#1423/#1424/#1425) exist as OPEN in datajoint-python. No implementation exists in source yet — grep returned zero matches for load_all_upstream, def trace, self.upstream, strict_provenance across the relevant modules.

Sequencing concern

cascade.md cross-link will be broken until #182 lands. The spec links to cascade.md repeatedly:

These are load-bearing — the upward rules are defined exclusively in #182. Sequencing requirement: #182 must merge first (or batch together).

Observations

1. Forward-looking surface needs a more prominent banner. The current "version-added" admonition at line 11 helps but doesn't convey "implementation lands across three separate PRs — this is the design target, not shipped behavior". An "Implementation status" admonition near the top (like the one #177 eventually adopted) would make this clearer. A reader could mistake the current framing for "available now in 2.3".

2. load_all_upstream is also unshipped. Line 84 references connection.dependencies.load_all_upstream() as if it exists; it doesn't. The spec should note this is also new (or cite the issue/PR where it lands).

3. Dangling "Implementation notes" reference. Line 245 says "the value seen depends on thread-local context — see Implementation notes". There is no "Implementation notes" section in the spec. Either add the section or drop the reference.

4. AND-flavored upstream dismissed too quickly. Line 40: "The AND-flavored upstream analog — 'ancestors that contributed via every path' — is not a useful query and is not provided." Strong claim. Audit / intersection use cases might want it. Softer wording: "… not provided in 2.3" or "… not currently considered useful for typical compute pipelines".

5. Strict-provenance edge cases worth addressing:

  • Does connection.query(raw_sql) get blocked under strict?
  • What about an audit log inside make()? Under strict, a team can't write to AuditLog. Suggesting "log to a Part of self" or "use a temporary escape via context manager" would close this loop.
  • The runtime mechanism — overriding fetch/insert, descriptor, AST analysis? A brief implementation notes paragraph (resolving the dangling reference from #3) would close this.

6. make_kwargs mention without definition. Line 235: "make_kwargs and key are unaffected."make_kwargs isn't defined elsewhere in the spec. Readers unfamiliar with this internal will be confused.

7. Caching/lazy behavior of self.upstream. Line 187 says construction is lazy. If make() accesses self.upstream[Session] twice, are both fetches independent SELECTs, or is there per-call caching? Worth one sentence either way.

8. Concurrency clarification at line 245. "the value seen depends on thread-local context"dj.config["strict_provenance"] reads from a module-level dict, which is global, not thread-local. The implementation might need to make it thread-local but the spec should say so explicitly.

9. self.PartName vs self.PartName() inconsistency. Examples use both forms in different places (e.g., line 257 "Reading self or self.PartName" vs line 309 self.Bin.insert1(...)). Pick one canonical form.

10. Cosmetic: example error message at line 326-329 shows key['subject_id'] (Python expression) rather than the runtime value. Real error messages would have the value substituted in.

Strengths to highlight

  • The cascade↔trace comparison table (line 34) is a clear pedagogical anchor
  • Three-piece argument (line 20-26) makes the case for why all three are needed together
  • "Why opt-in" rationale (lines 285-289) is well-argued — production vs development is the right framing
  • 5-step migration path is concrete and incremental
  • "What is not in this specification" appropriately defers static analysis and default-flip

Holding off formal approval until the cross-link dependency on #182 is resolved and the forward-looking framing is sharpened. The spec content is solid; this is mostly about presentation and sequencing.

@ttngu207

ttngu207 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

I reviewed the spec against the shipped code on the three feature branches. The good news: the trace core, the self.upstream construction, the config key/default/type, write enforcement + key-consistency, and the "check at fetch time" framing are all accurately described. A handful of claims don't match the code, though, and I think they should be reconciled before this merges — a couple would actively mislead a reader into writing code that behaves differently than the spec says.

  1. Read enforcement (the important one). The §3 "Read enforcement" table row — (SomeTable & key).fetch(...) where SomeTable is an ancestor → Blocked — and the first ERROR block in the §3 Examples (with the message "Recording is reachable but read outside self.upstream (strict_provenance=True disallows direct reads from ancestor tables)") describe enforcement the code does not perform. assert_read_allowed only blocks tables outside the allowed set (ancestors + self + Parts); a direct read of an ancestor is allowed, and #1474 documents exactly this limitation in assert_read_allowed's docstring. That row and that example message should be corrected, and the limitation should be documented, so readers don't write make() code expecting direct ancestor reads to raise (they won't).

  2. trace.heading() is listed under §1 "Other methods" but isn't implemented — only counts() and __iter__ exist on the trace branch. Drop it or mark it as future work.

  3. counts() example keys by class objects ({Subject: 1, Session: 1, …}), but the code keys by full-table-name string (`imaging`.`subject`). Also the prose says "each ancestor's count" while the example includes Summary, the seed. Worth reconciling both.

  4. iter(trace) — the "deepest ancestors first" parenthetical is backwards; __iter__ yields parents/roots first (topological order).

  5. Minor: "The Parts of any ancestor … are accessible" overstates the strict-mode allowed set (a Part is only reachable if it's genuinely on the FK path to self); and there's a dangling "see Implementation notes" cross-reference with no such section (the mechanism is a contextvars.ContextVar, not thread-local).

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed against the three feature branches. @ttngu207's five points hold up — independently verified the two that matter most.

The read-enforcement mismatch (line 255) is the load-bearing one. assert_read_allowed on the strict_provenance branch (src/datajoint/provenance.py:82-122) explicitly documents at lines 98-104 that it does not distinguish direct ancestor reads from self.upstream reads — both are allowed if the table is in the allowed set. The spec's "(SomeTable & key).fetch(...) where SomeTable is an ancestor → Blocked" row and the matching §3 ERROR example describe enforcement the code doesn't perform, and would push readers to write make() code expecting reads to raise that in fact won't. Worth fixing before merge.

The other four are quick spec/code alignments: counts() returns dict[str, int] keyed by full table name, not class objects (line 176 example); trace.heading() isn't implemented (line 114 should be dropped or marked future work); iter(trace) yields parents-first per diagram.py:1046, so "deepest ancestors first" on line 115 reads backwards; and the concurrency mechanism is contextvars.ContextVar (provenance.py:22), not thread-local — the dangling "see Implementation notes" cross-reference at line 245 still has no target.

One positive update since my earlier pass: cascade.md landed via #182, so the cross-link now resolves cleanly.

The three feature PRs (#1471, #1473, #1474) referenced by this spec are all still open, so an "Implementation status" banner near the top would still help. Otherwise — get item #1 reconciled and the rest are mechanical fixes.

… align spec with shipped code

Reviewer reconciliation (ttngu207, MilagrosMarin) against the three feature
branches, plus a reframing of the enforcement model:

- strict_provenance: describe the runtime check honestly as a best-effort,
  client-side development guardrail, not an airtight boundary. Raw SQL, other
  clients, and direct DB access bypass it; comprehensive enforcement is a
  code-inspection problem (static where sound, agentic on the platform) and is
  positioned as the roadmap. Softened 'runtime guarantee' / 'provenance-safe by
  construction' language throughout.
- Read enforcement: corrected the table + violating example — a direct read of
  a declared ancestor is Allowed in 2.3 (only undeclared tables are blocked),
  matching assert_read_allowed. Added a note on what the check does/doesn't
  distinguish. Error messages now match provenance.py.
- trace.counts(): keyed by full table name (dict[str,int]), includes the seed.
- Dropped trace.heading() (not implemented).
- iter(trace): parents/roots first (topological), not 'deepest first'.
- Concurrency: contextvars.ContextVar, not thread-local; removed the dangling
  'see Implementation notes' reference.
- Noted load_all_upstream is new; glossed make_kwargs; documented self.upstream
  non-caching; narrowed the ancestor-Parts allowed-set wording.
@dimitri-yatsenko

dimitri-yatsenko commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Pushed a revision (d781814) reconciling the spec with the shipped code on the three feature branches, plus a reframing of the enforcement model.

Enforcement model — the substantive change. The spec previously described strict_provenance as a "runtime guarantee." That overclaims: assert_read_allowed runs from QueryExpression.cursor, so it only sees access through the DataJoint Python client — raw SQL (connection.query), other clients/processes, a DB console, or a BI tool all bypass it. §3 now describes the runtime check honestly as a best-effort development guardrail, not a boundary, with a new "Enforcement model and its limits" subsection. Comprehensive enforcement is positioned as a code-inspection problem (static analysis where sound, agentic platform-side review for the dynamic cases) on the roadmap, not something a client-side query gate can deliver. This resolves the read-enforcement finding at the root.

Addressing the specific points from both of you:

@ttngu207

  1. Read enforcement (the load-bearing one). Corrected. The table row for a direct ancestor read now reads Allowed (only undeclared tables are blocked), matching assert_read_allowed's documented limitation, and the violating example now reads from an undeclared table with the real error message. Added a note on what the check does/doesn't distinguish.
  2. trace.heading() — dropped; it isn't implemented.
  3. counts() — now dict[str, int] keyed by full table name (`imaging`.`subject`), and the prose notes the seed is included alongside ancestors.
  4. iter(trace) — fixed to "parents/roots first (topological)."
  5. Parts / dangling reference — narrowed the allowed-set wording to Parts genuinely on the FK path; the "see Implementation notes" reference is gone, replaced by the actual mechanism (contextvars.ContextVar).

@MilagrosMarin (numbering follows the observations in your review)

  • Concurrency (point 8) — now states the flag is read from global dj.config into a per-call context held in a ContextVar, and that concurrent same-process populates can't use different flag values. Not thread-local.
  • load_all_upstream (point 2) — flagged as new, introduced with trace (Implement Diagram.trace() for upstream restriction propagation datajoint-python#1423).
  • make_kwargs (point 6) — glossed inline (the optional populate(..., make_kwargs=...) passthrough).
  • Caching (point 7) — added a sentence: the trace diagram is built once per make(), but fetch results aren't cached, so two reads of the same ancestor issue two SELECTs.
  • AND-flavored (point 4) — softened to "not provided in 2.3."
  • cascade.md — resolves now that docs(#1429): add Cascade Specification #182 landed. 👍

Deliberately not done: no "Implementation status" banner — instead of tracking that in-doc, this PR will be held from merge until the implementation lands on main, so the spec and shipped behavior stay in lockstep. Two design questions you raised (whether connection.query(raw_sql) should be gated; how to write an audit log under strict mode) are now subsumed by the reframing — both are really "what belongs in the code-inspection layer vs. the runtime guardrail," and I'd rather settle them alongside the implementation than pre-commit here.

…nt CI/CD

Draw the boundary plainly: the open-source core ships only the runtime
guardrail; comprehensive code inspection (static + agentic) is performed by the
DataJoint platform's code-deployment CI/CD process, not the OSS framework.
Applied consistently across the intro, the enforcement-model subsection, the
integration properties, the migration path, and the out-of-scope list.

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The reframing to "best-effort development guardrail" is the right honest positioning, and all the smaller spec/code alignments (read row, counts(), iter(), contextvars, trace.heading(), load_all_upstream, make_kwargs, AND-flavored) are addressed. Comment link fix noted too — thanks.

Approving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants