File: CLAUDE.md

normality is a performance-sensitive text normalisation library. It is a hot
path in data pipelines that process millions of entity values from sanctions
lists, PEP databases, and corporate registries.

## Performance

- This code runs on every property value in large entity graphs. Avoid
  introducing per-character Python loops where a regex or str.translate() can
  do the work.
- Prefer compiled regexes and precomputed lookup structures (frozenset, dict,
  translate tables) over inline computation.
- When changing cleaning or normalisation logic, consider benchmarking against
  a large realistic corpus before and after.
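The guidance above can be sketched as follows. This is a minimal illustration, not code from normality: `squash_slow`, `squash_fast`, `benchmark`, and the set of replaced characters are hypothetical names and values chosen for the example. It contrasts a per-character loop with a precomputed translate table plus a compiled regex, and shows one way to time both on a corpus with `timeit`.

```python
import re
import timeit
from typing import Dict, List

# Built once at import time. The character set is illustrative only.
_REPLACE_WITH_SPACE = "\t\n\r-_/.,;:"
_TRANSLATE_TABLE: Dict[int, str] = {ord(ch): " " for ch in _REPLACE_WITH_SPACE}
_COLLAPSE_WS = re.compile(r"\s+")


def squash_slow(text: str) -> str:
    """Per-character Python loop: the pattern to avoid on a hot path."""
    out: List[str] = []
    for ch in text:
        out.append(" " if ch in _REPLACE_WITH_SPACE else ch)
    return " ".join("".join(out).split())


def squash_fast(text: str) -> str:
    """Same result via str.translate() and a compiled regex."""
    return _COLLAPSE_WS.sub(" ", text.translate(_TRANSLATE_TABLE)).strip()


def benchmark(corpus: List[str], number: int = 5) -> Dict[str, float]:
    """Return total seconds per function for `number` passes over the corpus."""
    return {
        fn.__name__: timeit.timeit(lambda: [fn(v) for v in corpus], number=number)
        for fn in (squash_slow, squash_fast)
    }
```

Calling `benchmark(values)` before and after a change, with `values` drawn from a realistic corpus rather than a handful of test strings, gives a rough comparison; absolute timings will vary by machine and Python version.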

## Text and script coverage

normality must handle text (often names and addresses) in every script used by
official languages worldwide, with particular focus on data found in sanctions
and PEP (Politically Exposed Persons) screening:

- Latin (English, French, Spanish, Portuguese, German, Dutch, Polish, Swedish,
  Norwegian, Danish, Finnish, Estonian, Lithuanian, Hungarian, Turkish)
- Cyrillic (Russian, Ukrainian)
- Arabic
- CJK (Simplified Chinese, Japanese, Korean)

Changes to character handling, Unicode normalisation, or transliteration must
be verified not to break any of these scripts.
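One way to make that verification repeatable is a small per-script regression check. The sketch below is an assumption about how such a check could look, not part of normality: `SAMPLES`, `scripts_of`, `normalise_stub`, and `check_samples` are hypothetical, and `normalise_stub` is a stand-in (NFC plus whitespace collapsing) for whichever function is being changed. It flags any sample whose output is empty or whose set of scripts differs from the input.

```python
import unicodedata
from typing import Callable, Dict, List, Set

# Illustrative samples only; a real check would use a much larger corpus.
SAMPLES: Dict[str, List[str]] = {
    "Latin": ["Müller", "Łukasz", "İstanbul", "Mu\u0308ller"],
    "Cyrillic": ["Владимир", "Київ"],
    "Arabic": ["محمد"],
    "CJK": ["北京", "東京", "서울"],
}


def scripts_of(text: str) -> Set[str]:
    """First word of each letter's Unicode name, e.g. LATIN, CYRILLIC, CJK."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in text
        if ch.isalpha()
    }


def normalise_stub(text: str) -> str:
    """Stand-in normaliser: NFC composition and whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def check_samples(normalise: Callable[[str], str]) -> List[str]:
    """Return a description of every sample the normaliser damages."""
    failures: List[str] = []
    for label, values in SAMPLES.items():
        for value in values:
            result = normalise(value)
            if len(result) == 0 or scripts_of(result) != scripts_of(value):
                failures.append(f"{label}: {value!r} -> {result!r}")
    return failures
```

A function that is expected to transliterate (for example to Latin) would need a different assertion, since changing script is then the intended behaviour.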

## Python

- Generate fully-typed, minimal Python code.
- Always check `if x is None:` explicitly, not `if x:`.
- Run tests using `pytest tests/`
- Run typechecking using `mypy --strict normality`
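The `is None` rule matters because falsy values such as `0` and `""` are often valid input. A minimal, fully-typed sketch, with `truncate` and `truncate_buggy` as hypothetical helpers rather than normality functions:

```python
from typing import Optional


def truncate(text: str, max_length: Optional[int] = None) -> str:
    """Cut text to max_length characters; None means no limit."""
    if max_length is None:
        return text
    return text[:max_length]


def truncate_buggy(text: str, max_length: Optional[int] = None) -> str:
    """Truthiness check: wrongly treats max_length=0 as 'no limit'."""
    if not max_length:
        return text
    return text[:max_length]
```

Here `truncate("abc", 0)` returns an empty string, while `truncate_buggy("abc", 0)` returns the full input because `0` is falsy.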