
Support 'ben' as a prefix particle for Hebrew/Arabic names #183

Description

@derek73

Background

ben (Hebrew/Arabic "son of") functions as a last-name prefix particle in names like "Ahmad ben Husain", exactly like van or von. It was removed from PREFIXES in v0.2.5 because it conflicts with the common English given/middle name "Ben" (short for Benjamin) — e.g. "Alex Ben Johnson" would incorrectly eat "Ben" as a prefix.

Proposed approach

A case-sensitive heuristic in is_prefix(): treat ben as a prefix only when it appears already lowercase in an otherwise mixed-case name. In "Ahmad ben Husain" the lowercase ben is a strong signal it's a particle; in "Alex Ben Johnson" the capitalized Ben signals a given name.

This is consistent with the existing precedent in is_an_initial(), which uses original casing to distinguish initials from other tokens.
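A minimal sketch of the casing check described above, written as a standalone function rather than a patch to parser.py. The name looks_like_ben_particle and its signature are hypothetical and do not exist in nameparser; "otherwise mixed-case" is interpreted here as "the remaining tokens are neither all lowercase nor all uppercase", which is one possible reading.

```python
def looks_like_ben_particle(name: str, index: int) -> bool:
    """Hypothetical helper: is the token at `index` a lowercase 'ben'
    inside a name whose other tokens are mixed case?"""
    tokens = name.split()
    # Only an already-lowercase 'ben' counts as a particle signal.
    if tokens[index] != "ben":
        return False
    # Look at the casing of everything except the candidate token.
    rest = " ".join(t for i, t in enumerate(tokens) if i != index)
    # All-lowercase or all-uppercase input carries no casing signal.
    return bool(rest) and not rest.islower() and not rest.isupper()


print(looks_like_ben_particle("Ahmad ben Husain", 1))
print(looks_like_ben_particle("Alex Ben Johnson", 1))
print(looks_like_ben_particle("ahmad ben husain", 1))
```

With this reading, only the first call treats the token as a particle: the second has a capitalized Ben, and the third is all lowercase, so the casing of ben tells us nothing.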

Why it's non-trivial

is_prefix() is called from five places in parser.py:

  • line 250: initials computation
  • line 448: _split_last() for last_base/last_prefixes
  • line 1054: main prefix-join loop during parsing
  • line 1075: chained prefix lookahead
  • line 1106: cap_word() during capitalization

Making is_prefix() case-sensitive globally would break the capitalization path (line 1106), where a capitalized Van in an all-caps input being normalized still needs to be recognized as a prefix and lowercased. A narrower fix, special-casing ben only in the parse-flow call sites and not in cap_word, would work but requires more surgical changes.
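One way the narrower fix could look, sketched as a toy is_prefix() with an opt-in keyword argument. Everything here is an assumption for illustration: the respect_case parameter, the CASE_SENSITIVE_PREFIXES set, and the three-item PREFIXES subset are not part of nameparser, and the real method lives on the parser class and takes different arguments. The sketch checks only the token's own casing and leaves out the "otherwise mixed-case name" condition.

```python
# Illustrative subset only; the library's real prefix set is much larger.
PREFIXES = {"van", "von", "de"}
# Hypothetical: particles that count as prefixes only when already lowercase.
CASE_SENSITIVE_PREFIXES = {"ben"}


def is_prefix(piece: str, *, respect_case: bool = False) -> bool:
    """Toy stand-in for the parser's prefix check.

    Parse-flow call sites would pass respect_case=True; the
    capitalization path would keep the default and stay case-insensitive.
    """
    lowered = piece.lower()
    if lowered in PREFIXES:
        return True
    if lowered in CASE_SENSITIVE_PREFIXES:
        # With respect_case, require the token to be lowercase as typed.
        return piece == lowered if respect_case else True
    return False


# Parse flow: casing decides whether 'ben' is a particle.
print(is_prefix("ben", respect_case=True), is_prefix("Ben", respect_case=True))
# Capitalization path: ordinary prefixes still match in any case.
print(is_prefix("Van"), is_prefix("VAN"))
```

Keeping the default case-insensitive means existing callers such as cap_word would not change behavior, and only the four parse-flow call sites would need to opt in. Whether cap_word should lowercase a capitalized Ben at all is a separate question this sketch does not settle.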

Workaround

Users with Hebrew/Arabic name datasets can add it themselves:

from nameparser.config import CONSTANTS
CONSTANTS.prefixes.add('ben')
