Background
ben (Hebrew/Arabic "son of") functions as a last-name prefix particle in names like "Ahmad ben Husain", exactly like van or von. It was removed from PREFIXES in v0.2.5 because it conflicts with the common English given/middle name "Ben" (short for Benjamin) — e.g. "Alex Ben Johnson" would incorrectly eat "Ben" as a prefix.
Proposed approach
A case-sensitive heuristic in is_prefix(): treat ben as a prefix only when it appears already lowercase in an otherwise mixed-case name. In "Ahmad ben Husain" the lowercase ben is a strong signal it's a particle; in "Alex Ben Johnson" the capitalized Ben signals a given name.
This is consistent with the existing precedent in is_an_initial(), which uses original casing to distinguish initials from other tokens.
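A minimal sketch of what such a heuristic could look like, written as a standalone function rather than as a patch to nameparser. The names CASE_SENSITIVE_PREFIXES, is_mixed_case and is_lowercase_particle are hypothetical and do not exist in the library, and "otherwise mixed-case" is read here as "the remaining tokens contain both upper- and lowercase letters".

```python
# Hypothetical sketch; none of these names exist in nameparser.
CASE_SENSITIVE_PREFIXES = {"ben"}


def is_mixed_case(tokens):
    """True when the tokens together contain both upper- and lowercase letters."""
    letters = [c for t in tokens for c in t if c.isalpha()]
    return any(c.isupper() for c in letters) and any(c.islower() for c in letters)


def is_lowercase_particle(tokens, i):
    """Treat tokens[i] as a prefix only if it is written in lowercase
    while the rest of the name is mixed case."""
    token = tokens[i]
    if token.lower() not in CASE_SENSITIVE_PREFIXES:
        return False
    rest = tokens[:i] + tokens[i + 1:]
    return token.islower() and is_mixed_case(rest)


print(is_lowercase_particle("Ahmad ben Husain".split(), 1))
print(is_lowercase_particle("Alex Ben Johnson".split(), 1))
```

Under this reading, all-lowercase or all-uppercase input never triggers the special case, because the casing carries no signal there.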
Why it's non-trivial
is_prefix() is called from five places in parser.py:
- line 250: initials computation
- line 448: _split_last() for last_base/last_prefixes
- line 1054: main prefix-join loop during parsing
- line 1075: chained prefix lookahead
- line 1106: cap_word() during capitalization
Making is_prefix() case-sensitive globally would break the capitalization path (line 1106), where a capitalized Van in an all-caps input being normalized needs to still be recognized as a prefix and lowercased. A narrower fix — special-casing ben only in the parse-flow call sites, not in cap_word — would work but requires more surgical changes.
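One way the narrower fix might be structured, as a rough sketch only: leave the case-insensitive check untouched for cap_word() and give the parse-flow call sites a separate, stricter helper. The PREFIXES subset, the CASE_SENSITIVE_PREFIXES set, the is_prefix_in_parse_flow helper and its name_is_mixed_case argument are all assumptions made for illustration, and is_prefix() here is a simplified stand-in, not the library's code.

```python
# Simplified stand-ins, not nameparser's actual implementation.
PREFIXES = {"van", "von", "de"}        # illustrative subset
CASE_SENSITIVE_PREFIXES = {"ben"}      # hypothetical new set


def is_prefix(piece):
    """Case-insensitive check, which the capitalization path relies on."""
    return piece.lower() in PREFIXES


def is_prefix_in_parse_flow(piece, name_is_mixed_case):
    """Stricter check for the parse-flow call sites only."""
    if is_prefix(piece):
        return True
    # Exact match against the lowercase form, so "Ben" never qualifies.
    return name_is_mixed_case and piece in CASE_SENSITIVE_PREFIXES


print(is_prefix("VAN"))
print(is_prefix_in_parse_flow("ben", True))
print(is_prefix_in_parse_flow("Ben", True))
```

The cost of this split is that each parse-flow call site has to know whether the original input was mixed case, which is the "surgical" part of the change.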
Workaround
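Users with Hebrew/Arabic name datasets can add it themselves, accepting that a given or middle name "Ben" (as in "Alex Ben Johnson") will then be treated as a prefix too: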
from nameparser.config import CONSTANTS
CONSTANTS.prefixes.add('ben')