Add configurable record_transform hook for post-query record processing #1237

Open
aawemove wants to merge 1 commit into geopython:master from aawemove:feature/record-transform-hook


Overview

pycsw currently offers no unified intervention point to modify records before they are returned to clients. Existing mechanisms are either format-specific (XSLT → CSW XML only, Jinja2 → OGC API HTML only) or absent entirely from the OGC API pipeline.

This PR introduces a server.record_transform configuration option that accepts a file path or dotted module name pointing to a user-supplied Python callable. The callable receives the SQLAlchemy ORM record object after database retrieval and before serialization, and must return it (optionally modified). A single hook covers all supported protocols and output formats:

Formats covered:

  • OGC API Records: JSON
  • OGC API Records: HTML
  • OGC API Records: application/xml
  • CSW 2.0.2: all output schemas
  • CSW 3.0: all output schemas
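To make the "file path or dotted module name" contract concrete, here is a minimal sketch of how such a value could be resolved to a callable. It is not the loader shipped in this PR: the function name load_record_transform and the internal module name _record_transform are assumptions chosen for illustration, and only standard importlib calls are used.

```python
import importlib
import importlib.util
import os


def load_record_transform(spec):
    """Hypothetical loader: resolve a file path or dotted module name
    to the module-level callable named ``record_transform``."""
    if not spec:
        return None  # option unset or empty: no hook is installed

    if os.path.isfile(spec):
        # Value points at a Python source file on disk
        module_spec = importlib.util.spec_from_file_location('_record_transform', spec)
        if module_spec is None or module_spec.loader is None:
            raise ImportError(f'cannot load a Python module from {spec}')
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
    else:
        # Otherwise treat the value as a dotted module name on sys.path
        module = importlib.import_module(spec)

    func = getattr(module, 'record_transform', None)
    if not callable(func):
        raise AttributeError(f'{spec} does not define a callable record_transform')
    return func
```

Under these assumptions, an unset option yields None, a missing module raises ImportError, and a module without the expected callable raises AttributeError, so misconfiguration surfaces at load time rather than on the first request.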

pycsw serves metadata records in multiple formats, each with its own transformation mechanism:

  • CSW XML — supports XSLT stylesheets via xslt: configuration
  • OGC API HTML — supports customisation via Jinja2 templates
  • OGC API JSON — no official transformation option exists
  • OGC API application/xml — no transformation option exists

Operators who need to modify record content before delivery — for example, stripping internal keywords added at ingest time, redacting sensitive fields, or normalising values — must implement separate solutions for each format. There is no single point where a transformation can be applied uniformly across all protocols and output formats.

This PR closes that gap with a single hook that runs after database retrieval and before any format-specific serialization, making it unnecessary to maintain parallel format-specific workarounds.
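The idea of one hook in front of every serializer can be illustrated with a toy pipeline. Everything here is a stand-in rather than pycsw code: serve, to_json, to_html, redact_contact and the contact_email attribute are hypothetical names, and SimpleNamespace substitutes for the SQLAlchemy record object.

```python
import json
from types import SimpleNamespace


def redact_contact(record):
    """Example transform: blank a sensitive field, then return the record."""
    record.contact_email = None
    return record


def to_json(record):
    # Stand-in for a JSON serializer
    return json.dumps({'id': record.identifier, 'contact': record.contact_email})


def to_html(record):
    # Stand-in for an HTML template
    return f'<li>{record.identifier}: {record.contact_email or "n/a"}</li>'


def serve(records, serializer, transform=None):
    """Apply the optional hook once per record, then hand off to any serializer."""
    if transform is not None:
        records = [transform(r) for r in records]
    return [serializer(r) for r in records]


records = [SimpleNamespace(identifier='a1', contact_email='x@example.org')]
print(serve(records, to_json, redact_contact))
```

Because the transform runs before the serializer is chosen, the same redaction applies whether the caller asks for JSON or HTML, which is the property the hook is meant to provide.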

Related Issue / Discussion

Additional Information

Full documentation for this feature is available in docs/record-transform.rst, covering configuration, the two-representation model (individual fields vs. record.xml), format-specific serialization behaviour, an ISO 19139 example with raw XML rewriting, Docker volume injection, dotted module path support, and security considerations.

Unit tests are located in tests/unittests/test_record_transform.py and cover loader behaviour (None/empty input, file loading, missing callable, missing module) as well as functional behaviour (keyword filtering, null keywords, identity passthrough).
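For readers who want a feel for the functional cases listed above, the following sketch shows how they could be expressed with unittest. It does not reproduce tests/unittests/test_record_transform.py: the transform strip_internal_keywords, the shortened prefix tuple, and the use of SimpleNamespace in place of a real record are all assumptions made for illustration.

```python
import unittest
from types import SimpleNamespace

# Shortened, illustrative prefix list
_PREFIXES = ('source:', 'catalog:')


def strip_internal_keywords(record):
    """Remove keywords that start with an internal prefix; return the same object."""
    if record.keywords:
        record.keywords = ','.join(
            kw for kw in record.keywords.split(',')
            if not kw.strip().startswith(_PREFIXES)
        )
    return record


class RecordTransformSketch(unittest.TestCase):
    def test_keyword_filtering(self):
        rec = SimpleNamespace(keywords='water,source:internal,soil')
        self.assertEqual(strip_internal_keywords(rec).keywords, 'water,soil')

    def test_null_keywords(self):
        # A record without keywords must pass through without error
        rec = SimpleNamespace(keywords=None)
        self.assertIsNone(strip_internal_keywords(rec).keywords)

    def test_identity_passthrough(self):
        # The hook must return the record it was given
        rec = SimpleNamespace(keywords='water')
        self.assertIs(strip_internal_keywords(rec), rec)
```

The cases can be run with python -m unittest against whichever file the sketch is saved in.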

Example transform.py

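import logging

from lxml import etree

LOGGER = logging.getLogger(__name__)

# Keyword prefixes added at ingest time that must not reach clients
_PREFIXES = ('source:', 'catalog:', 'organisation:', 'transaction:', 'sub_organisation:')

_NS = {
    'gmd': 'http://www.isotc211.org/2005/gmd',
    'gco': 'http://www.isotc211.org/2005/gco',
}


def record_transform(record):
    """Strip internal keywords from the keywords field and from the raw ISO XML."""
    # 1. Comma-separated keywords field
    if record.keywords:
        record.keywords = ','.join(
            kw for kw in record.keywords.split(',')
            if not kw.strip().startswith(_PREFIXES)
        )

    # 2. Raw XML document, which is serialized independently of the field above
    if record.xml:
        try:
            was_bytes = isinstance(record.xml, bytes)
            # Parse from bytes: lxml rejects str input that carries an encoding declaration
            xml_in = record.xml if was_bytes else record.xml.encode('utf-8')
            root = etree.fromstring(xml_in)
            for md_kw in root.findall('.//gmd:MD_Keywords', _NS):
                for kw_el in md_kw.findall('gmd:keyword', _NS):
                    cs = kw_el.find('gco:CharacterString', _NS)
                    if cs is not None and cs.text and cs.text.strip().startswith(_PREFIXES):
                        md_kw.remove(kw_el)
                # Drop the enclosing wrapper element once no keywords remain
                if not md_kw.findall('gmd:keyword', _NS):
                    wrapper = md_kw.getparent()
                    if wrapper is not None and wrapper.getparent() is not None:
                        wrapper.getparent().remove(wrapper)
            # Write the result back in the same type that was read
            record.xml = (etree.tostring(root) if was_bytes
                          else etree.tostring(root, encoding='unicode'))
        except Exception as err:
            # Best effort: on any failure, record.xml is left unchanged
            LOGGER.warning('record_transform XML error: %s', err)

    return record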

Contributions and Licensing

(as per https://github.com/geopython/pycsw/blob/master/CONTRIBUTING.rst#contributions-and-licensing)

  • I'd like to contribute feature/record-transform-hook to pycsw. I confirm that my contributions to pycsw will be compatible with the pycsw license guidelines at the time of contribution.
  • I have already previously agreed to the pycsw Contributions and Licensing Guidelines

Introduces a configurable server.record_transform setting that accepts a file path or dotted module name. The referenced module must define a callable named record_transform which receives the SQLAlchemy ORM record object after database retrieval and must return it (optionally modified).

The hook is applied before serialization across all supported protocols and output formats:
- OGC API Records (JSON, HTML, application/xml)
- CSW 2.0.2 and CSW 3.0 (all output schemas and profiles)

Common use cases include filtering internal keywords, redacting sensitive field values, and normalizing output before it reaches clients.

Adds unit tests and documentation.