File: TODO.md

- Should string `default` values be passed through `type` etc. like in
  argparse?
- Rethink how the original exception data is attached to `FieldTypeError`s
    - Include everything from `sys.exc_info()`?
- Rename `NormalizedDict.normalized_dict()` to something that doesn't imply it
  returns a `NormalizedDict`?
- Add docstrings to private classes and attributes
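
For reference on the first item above: `argparse` passes a string `default` through `type` when the option is absent from the command line, and stores a non-string default untouched. A minimal demonstration (the option names are illustrative only):

```python
import argparse

parser = argparse.ArgumentParser()
# String default: converted by `type` when the option is not given.
parser.add_argument("--count", type=int, default="42")
# Non-string default: stored as-is; `type` is not applied to it.
parser.add_argument("--ratio", type=int, default=1.5)

args = parser.parse_args([])
print(type(args.count).__name__, args.count)
print(type(args.ratio).__name__, args.ratio)
```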

- Write more tests
    - different header name normalizers (identity, hyphens=underscores,
      titlecase?, etc.)
    - `add_additional`
        - calling `add_additional` multiple times (sometimes with
          `allow=False`)
        - `add_additional(False, extra arguments ...)`
        - `add_additional` when a header has a `dest` that's just a normalized
          form of one of its names
    - calling `add_field`/`add_additional` on a `HeaderParser` after a previous
      call raised an error
    - scanning & parsing Unicode
    - normalizer that returns a non-string
    - non-string keys in `NormalizedDict` with the default normalizer
    - equality of `HeaderParser` objects
    - Test that `HeaderParser.parse_stream()` won't choke on non-string inputs
    - passing scanner options to `HeaderParser`
    - scanning files not opened in universal newlines mode

- Improve documentation & examples
    - Contrast handling of multi-occurrence fields with that of the standard
      library
    - Draw attention to the case-insensitivity of field names when parsing and
      when retrieving from the dict
    - Give examples of custom normalization (or at least explain what it is and
      why it's worth having)
    - Add `action` examples
    - Add example recipes to the documentation of `HeaderParser`s for common
      mail-like formats
    - Write more user-friendly documentation that walks through `HeaderParser`
      feature by feature, the way `attrs`' documentation does
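
As a starting point for the custom-normalization documentation item, here is a rough sketch of what a normalizer does. `hyphens_as_underscores` and `TinyNormalizedDict` are hypothetical names invented for this illustration; the latter is a stand-in, not the library's `NormalizedDict`.

```python
def hyphens_as_underscores(name):
    """Hypothetical normalizer: ignore case and treat "-" like "_"."""
    return name.lower().replace("-", "_")


class TinyNormalizedDict:
    """Illustrative stand-in only; not headerparser's NormalizedDict."""

    def __init__(self, normalizer):
        self._normalizer = normalizer
        self._data = {}

    def __setitem__(self, key, value):
        # Keys are normalized on the way in ...
        self._data[self._normalizer(key)] = value

    def __getitem__(self, key):
        # ... and on the way out, so equivalent spellings share one entry.
        return self._data[self._normalizer(key)]

    def __contains__(self, key):
        return self._normalizer(key) in self._data


fields = TinyNormalizedDict(hyphens_as_underscores)
fields["Content-Type"] = "text/plain"
print(fields["content_type"])
```

With the library itself, a function like this would presumably be supplied as the `normalizer` argument of `HeaderParser` or `NormalizedDict`.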


Features
========
- Add some sort of handling for "From " lines
    - Give `NormalizedDict` a `from_line` attribute
    - Give the scanner a `from_line_regex` parameter; if the first line of a
      stanza matches the regex, it is assumed to be a "From" line
    - Create a "`SpecialHeader`" enum with `FromLine` and `Body` values for use
      as the first element of `(header, value)` pairs yielded by the scanner
      representing "From " lines and bodies
        - Use the enum values as keys in `NormalizedDict`s instead of having
          dedicated `from_line` and `body` attributes?
    - Give the parser an option for requiring a "From " line
    - Export premade regexes for matching Unix mail "From " lines, HTTP
      request lines, and HTTP response status lines
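
One possible shape for the "From " line ideas above, as a sketch only. `scan_stanza` and `UNIX_FROM_RX` are hypothetical names, the regex is deliberately loose rather than a faithful mbox grammar, and folding is left out for brevity.

```python
import re
from enum import Enum


class SpecialHeader(Enum):
    FromLine = "from_line"
    Body = "body"


# Assumption: a loose pattern for Unix mail "From " lines, not a full grammar.
UNIX_FROM_RX = re.compile(r"From \S+ .+")


def scan_stanza(lines, from_line_regex=None):
    """Yield (name, value) pairs for one stanza; no folding support."""
    lines = iter(lines)
    for index, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        if (
            index == 0
            and from_line_regex is not None
            and from_line_regex.fullmatch(line)
        ):
            yield (SpecialHeader.FromLine, line)
        elif line == "":
            # A blank line ends the headers; the rest is the body.
            yield (SpecialHeader.Body, "".join(lines))
            return
        else:
            # A line without a colon becomes a header with an empty value.
            name, _, value = line.partition(":")
            yield (name, value.strip())


TEXT = (
    "From alice@example.com Sat Jan  3 01:05:34 1996\n"
    "Subject: Hi\n"
    "To: bob@example.com\n"
    "\n"
    "Hello\n"
    "World\n"
)
pairs = list(scan_stanza(TEXT.splitlines(keepends=True), UNIX_FROM_RX))
```

Using enum members as the first element keeps the special entries from colliding with a real header that happens to be named "Body" or "From".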

- Write an entry point for converting RFC822-style files/headers to JSON
    - name: `mail2json`? `headers2json`?
    - include options for:
        - parsing multiple stanzas into an array of JSON objects
        - setting the key name for the "message body"
        - handling of multiple occurrences of the same header in a single
          stanza; choices:
            - raise an error
            - combine multi-occurrence headers into an array of values
            - use an array of values for all headers regardless of multiplicity
              (default?)
            - output an array of `{"header": ..., "value": ...}` objects
        - handling of non-ASCII characters and the various ways in which they
          can be escaped
        - handling of "From " lines (and/or other non-header headers like the
          first line of an HTTP request or response?)
        - handling of header lettercases?
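
To make the multiple-occurrence choices concrete, here is a rough sketch built on the standard library's `email` parser rather than on this package. The function name `stanza_to_obj`, the mode names, and the decision to lowercase header names are all assumptions made for the example; the `{"header": ..., "value": ...}` output style is omitted.

```python
import json
from email import message_from_string


def stanza_to_obj(text, multiple="all", body_key="body"):
    """Convert one RFC 822-style stanza to a JSON-ready dict."""
    msg = message_from_string(text)
    grouped = {}
    for name, value in msg.items():
        # Assumption: header names are lowercased before grouping.
        grouped.setdefault(name.lower(), []).append(value)
    if multiple == "error":
        dupes = sorted(k for k, v in grouped.items() if len(v) > 1)
        if dupes:
            raise ValueError("repeated headers: " + ", ".join(dupes))
        obj = {k: v[0] for k, v in grouped.items()}
    elif multiple == "combine":
        # Only headers that actually repeat become arrays.
        obj = {k: (v if len(v) > 1 else v[0]) for k, v in grouped.items()}
    elif multiple == "all":
        # Every header maps to an array, regardless of multiplicity.
        obj = grouped
    else:
        raise ValueError(f"unknown mode: {multiple!r}")
    # Note: this overwrites any header whose name equals body_key.
    obj[body_key] = msg.get_payload()
    return obj


TEXT = "Received: from a\nReceived: from b\nSubject: Hello\n\nBody text\n"
combined = stanza_to_obj(TEXT, multiple="combine")
print(json.dumps(combined, indent=2))
```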

Scanning
--------
- Give the scanner options for:
    - definition of "whitespace" for purposes of folding (standard: 0x20 and
      TAB)
    - line separator/terminator (default: CR, LF, and CRLF; standard: only
      CRLF, with lone CR and LF being obsolete)
    - using Unicode definitions of line endings and horizontal whitespace
    - stripping leading whitespace from folded lines? (standard: no)
    - handling "From " lines and the like
    - ignoring all blank lines?
    - comments? (cf. robots.txt)
    - internationalization of header names
    - treating `---` as a blank line?
    - Error handling:
        - header lines without a colon or indentation (options: error, header
          with empty value, or start of body)
        - empty header name (options: error, header with empty name, look for
          next colon, or start of body)
        - all-whitespace line (considered obsolete by RFC 5322)
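
Two of the scanner options above (the folding-whitespace definition and stripping leading whitespace from folded lines) can be sketched as follows. `unfold_lines` and its parameter names are hypothetical, and the input is assumed to be a header section only, with no blank line or body.

```python
def unfold_lines(lines, fold_chars=" \t", strip_leading=False):
    """Attach continuation lines to the header line they extend."""
    logical = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # A continuation line starts with one of the folding characters.
        if line and line[0] in fold_chars and logical:
            cont = line.lstrip(fold_chars) if strip_leading else line
            # This sketch keeps the line break inside the joined value.
            logical[-1] += "\n" + cont
        else:
            logical.append(line)
    return logical


LINES = ["Subject: a\n", "  b\n", "\tc\n", "To: x\n"]
print(unfold_lines(LINES))
print(unfold_lines(LINES, strip_leading=True))
print(unfold_lines(LINES, fold_chars=" "))
```

With `fold_chars=" "`, the tab-indented line is no longer a continuation, which is the kind of input the error-handling options would then have to deal with.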

Parsing
-------
- Include utility callables for header types:
    - RFC822 dates, addresses, etc.
    - Content-Type-style "parameterized" headers
        - Include an `object_pairs_hook` for the parameters?
        - cf. `cgi.parse_header()`
    - internationalized strings
    - converting lines with just '.' to blank lines
    - Somehow support the types in `email.headerregistry`
    - Provide a `Normalizer` class with options for casing, trimming
      whitespace, squashing whitespace, converting hyphens and underscores to
      the same character, squashing hyphens & underscores, etc.
    - unfolding if & only if the first line of the value contains any
      non-whitespace? (cf. most multiline fields in Debian control files)
    - DKIM headers?
    - removing RFC 822 comments?
    - comma-and-space-separated lists?
        - cf. `urllib.request.parse_http_list()`?
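
Several of these utilities could probably be thin wrappers around the standard library. The sketch below shows candidate `type` callables; `parse_parameterized` and `parse_address_list` are hypothetical names, and the first only handles values shaped like `Content-Type`. (`cgi.parse_header()` is not used here because the `cgi` module was removed in Python 3.13.)

```python
from email.message import EmailMessage
from email.utils import getaddresses, parsedate_to_datetime


def parse_parameterized(value):
    """Split a Content-Type-style value into (main value, params dict)."""
    msg = EmailMessage()
    msg["Content-Type"] = value
    return msg.get_content_type(), dict(msg["Content-Type"].params)


def parse_address_list(value):
    """Return a list of (display name, address) pairs."""
    return getaddresses([value])


ctype = parse_parameterized('text/plain; charset="utf-8"; format=flowed')
addrs = parse_address_list("Alice <alice@example.com>, bob@example.com")
# RFC 822-style dates parse to timezone-aware datetime objects.
when = parsedate_to_datetime("Mon, 20 Nov 1995 19:12:08 -0500")
print(ctype, addrs, when.isoformat())
```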

- New `add_field` and `add_additional` options to add:
    - `default_action=callable` for defining what to do when a header is absent
    - `multiple_type` and `multiple_action` — like `type` and `action`, but
      called on a list of all values encountered for a `multiple` field
    - `i18n=bool` — turns on decoding of internationalized mail headers before
      passing to `type` (Do this via a custom type instead?)

- Give `add_additional` an option for controlling whether to normalize
  additional header names before adding them to the dict?

- Add options for requiring/forbidding nonempty/non-whitespace bodies

- Add public methods for removing, inspecting, & modifying header definitions
    - Make the `body`, `scanner_opts`, etc. attributes public

- Support constructing a complete `HeaderParser` in a single expression from a
  `dict` rather than having to make multiple calls to `add_field`
    - Support converting a `HeaderParser` instance to such a `dict`
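
A sketch of what the `dict`-based construction might look like. `build_parser` and the `"altnames"` key are hypothetical; the only thing assumed about the parser object is the existing `add_field(name, *altnames, **kwargs)` calling convention, so a small recording stand-in is used to keep the example self-contained.

```python
def build_parser(parser, spec):
    """Call parser.add_field() once per entry of a {name: options} mapping."""
    for name, options in spec.items():
        options = dict(options)  # copy so the spec itself is not mutated
        altnames = options.pop("altnames", ())
        parser.add_field(name, *altnames, **options)
    return parser


class RecordingParser:
    """Stand-in that records add_field() calls; used only for this demo."""

    def __init__(self):
        self.calls = []

    def add_field(self, name, *altnames, **kwargs):
        self.calls.append((name, altnames, kwargs))


SPEC = {
    "Name": {"required": True},
    "Type": {"choices": ["example", "demonstration"], "default": "example"},
    "Tag": {"altnames": ["Tags"], "multiple": True},
}
demo = build_parser(RecordingParser(), SPEC)
for call in demo.calls:
    print(call)
```

Converting a parser back to such a `dict` would need the field definitions to be inspectable, which ties in with the public-inspection item above.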

- Support modifying a `HeaderParser`'s field definitions after they're defined?

- Allow two different named fields to have the same `dest` if they both have
  `multiple=True`? (or both `multiple=False`?)

- Give `add_additional` an argument for putting all additional fields in a
  given subdict (or a presupplied arbitrary mapping object?) so that named
  fields can still use custom dests?

- Give parsers a way to store parsed fields in a presupplied arbitrary mapping
  object (or one created from a `dict_factory`/`dict_cls` callable?) instead of
  creating a new `NormalizedDict`?

- Give `HeaderParser` an option (`body_key`?) for storing the body in a given
  `dict` key

- Create a `BODY` token to use as a `dict` key for storing bodies instead of
  storing them as an attribute?

- Add an option/method for ignoring & discarding any unknown/"additional"
  fields

- Add handling for fields that can either occur in the header or be the body
  (e.g., "Description" in Python packaging METADATA)

- Require scanner options to be passed to `HeaderParser`'s constructor in a
  `scanner_opts={}` `dict` instead of as `**kwargs`