File: DIAGNOSTICS-AND-FORMATTING.md

package info (click to toggle)
debputy 0.1.82
links: PTS, VCS
area: main
in suites: sid
size: 4,796 kB
sloc: python: 67,023; sh: 164; perl: 155; makefile: 57; ansic: 16
file content (365 lines) | stat: -rw-r--r-- 14,767 bytes
parent folder | download | duplicates (3)
from debputy.linting.lint_util import LintState

# Contributing to diagnostics or reformatting (such as `debputy lint`)

This document is aimed at introducing a potential new contributor to the
diagnostics and reformatting framework in `debputy`.

## Prerequisites

The reader is expected to:

 * Be familiar with Python coding
 * Have some basic understanding of diagnostics ("linting") or
   formatting (`debputy reformat`)

## Context and important modules

In `debputy`, linting is built as a part of the language server protocol
(LSP) feature. This means that most of the linting code starts in the
`debputy.lsp` package with a few helper utilities in `debputy.linting`.

Finally, the entry point is in
`debputy.commands.debputy_cmd.lint_and_lsp_cmds.py`, which is in turn
imported by `debputy.commands.debputy_cmd` for setting up the command
line.


The supported languages are defined in `debputy.lsp.languages`. Each
language provides a number of annotated functions that defines the
entry point for the relevant LSP feature. The relevant functions will
be the one annotated with `@lint_diagnostics(_DISPATCH_RULE)` for diagnostics.
For formatting, the functions are the ones annotated with:

 * `@lsp_cli_reformat_document(_DISPATCH_RULE)`
 * `@lsp_format_document(_DISPATCH_RULE)`
 * `@lsp_will_save_wait_until(_DISPATCH_RULE)`

These three should generally delegate to a common base method that does the
actual formatting by converting arguments to a `LintState`. Long term, the
three should be merged so writing formatting for a language requires less
coding.

In either case, the function is invoked at least once per file of this type
being processed. In LSP mode, a file can be processed multiple times as the
file is being edited or saved leading to multiple calls for the same file.


The tests live in `tests/lint_tests/lint` (diagnostics) and
`tests/lint_tests/reformat` (reformatting). For historical reasons,
they both share the prefix `tests/lint_tests`. A better name there is
welcome.

### Restrictions

It is important that the CLI features and the LSP module share the same
code base, so the logic is in one place.

Because the code base is shared with the LSP module, it is important to
**never** access files directly. In the LSP mode, files might have unsaved
content that is not visible in the file system.

Additionally, due to how the LSP module works, the code must always track
very offset and line perfectly and *never* mutate it in place as it will
corrupt the state for future calculations.

For the files that can have multiple formats or versions (such as
`debian/copyright`), the code must start by validating the file is the
correct format first.  The `deb822`-based files have a built-in mechanism
for this. Other formats must roll their own.

## Key entities

Here are the following key entities in `debputy`'s diagnostics module:

 * The `LintState`
 * `TERange` / `TEPosition` vs. Range` / `Position`
 * A `Diagnostic`
 * A `Quickfix`
 * A `TextEdit`

### `LintState`

The `LintState` provides accessors to the diagnostics and reformatting.
It abstracts away many of the language server specific APIs vs. CLI
support to simplify writing the logic. It also simplifies the setup
for testing as a `LintState` has considerably fewer moving parts than
the `LanguageServer` / `DebputyLanguageServer` classes.

The `LintState` will provide you with:

 * `content` / `lines`: The raw contents of the file being processed
   right now.

 * `path` / `doc_uri`: The path or URI to the file being processed.
   Note: There is **no** guarantee that exists or contains the same
   as `content` / `lines`. A file can be open in the editor having
   a full and valid path without ever being stored to disk.

 * `parsed_deb822_file_content`: If the file is `deb822` based,
   then this attribute will contain a parsed version of the file.

#### Diagnostic specific features for `LintState` (linting only)

For diagnostics, the following attributes or members might be
of interest:

 * `emit_diagnostic`: The central pillar of the linting engine as
   is makes the diagnostic available to the user.

 * `slow_iter`: Wraps an iterable in an awaitable that occasionally
   yields. This is mainly useful for the LSP case where `debputy`
   should still respond to editor requests.

 * `source_package` / `binary_packages`: Access to the fields of
   of the `debian/control` file for the `Source` stanza or one
   of the `Package` stanzas.

 * `related_diagnostic_information`: If a diagnostic should
   highlight some related content, this function can be called
   to setup the related ranges. Useful for things like
   "X and Y conflict" cases, etc.

 * `spellchecker`: Provides a spellchecker if the file context
   human-readable test to be spellchecked. A placeholder will
   be provided if spellchecking is disabled, so the code does
   not does have to care for that.

 * `dh_sequencer_data` / `debputy_metadata`: Data related to
   `dh` or `debputy` that might be useful if the diagnostic
   depends on the attributes provided by these.

#### Formatting specific features for `LintState` (reformat only)

 * `position_codec`: Codec for mapping between `debputy` and the editor.
   This is only relevant for `TextEdit` or when creating LSP objects
   directly. See Range / Position for details*.

 * `effective_preference`: The active formatting preference.

### `TERange` / `TEPosition` vs. Range` / `Position`

Positions and ranges are central in everything. All diagnostics
and formatting (`TextEdit`s) involve selecting a range. For
diagnostics, the range decides what the editor (or CLI) should
highlight. For reformatting, the formatting is done by creating
`TextEdit`s and every edit involves a range, which should be
replaced by a piece of text. Here an empty range is used for
inserting text and empty replacement for deleting.

Every range is in turn made up of two positions. A start position
and an end position.

Unfortunately, this gets a bit more complex here, since some more
aspects on how define positions.

#### `python-debian` vs LSP position and ranges

In the diagnostics, almost everything uses `python-debian` position.
They are imported as `TERange` and `TEPosition`. These are always in
"server encoding" (see next section)

However, for a few concern cases in diagnostics and for all `TextEdit`s
in reformatting, the code will need LSP position or ranges. Typically,
the code will start with a `TERange` and then convert it to a
LSP `Range` (`server encoded`) via `te_range_to_lsp`. Ditto for `TEPosition`
to LSP `Position` via `te_position_to_lsp`. Finally, the entity has to
be converted to the client encoding.

Note: `debputy` could swap to LSP position/ranges internally with no
loss of functionality. However, any `deb822` format would still need
to interface with `TERange` and `TEPosition` since those are the positions
used by `python-debian` (the parser for the `deb822` formats).

#### `server encoding` vs. `client encoding`

The language server protocol operates with a client encoding and a server
encoding of the text. The two sides decides their own internal way of
encoding strings. However, when transferring position and ranges, they
agree on one of `UTF-16`, `UTF-32`, or `UTF-8` as the encoding and all
positions are based on that regardless of what the server/client uses.

In the code base, the term `server encoding` is used for positions or
ranges based on Python's string indexes. As an example, the `1:2` in
`"æøå"[1:2]` is a valid Python range for the string and will extract
the letter `ø`. Assuming this constitutes the first line of a file,
then it would be defined as the following LSP Range (`server encoding`)
plus how to convert it to encoding that the editor supports.

```python

import lsprotocol.types as types
from debputy.linting.lint_util import LintState


lint_state: LintState = ...

sever_encoded_range = types.Range(
    types.Position(0, 1),
    types.Position(0, 2),
)
client_encoded_range = lint_state.position_codec.range_to_client_units(
    lint_state.lines,
    sever_encoded_range,
)
```

The general advice is stay in the `python-debian` entities
(`TERange` + `TEPosition`) whenever possible. Then, convert to
LSP format using `te_range_to_lsp` and finally convert to client
range. When naming variables containing LSP entities, there is
a convention to use `_server_units` as suffix for variables
containing server encoded variables.

Caveat: The example above is only valid when the range being
encoded is from the file currently being processed. While it is
the most common case, it will not suffice for cross-file
diagnostics. The `lint_state.lines` should match the file the
range is from.


### `Diagnostic` (linting only)

A `Diagnostic` is used to flag something as incorrect. The `debputy`
code internally uses the Language Server Protocol (LSP) definition of a
diagnostic, because the module aims at providing diagnostics via
the LSP. The linting code does not use the LSP `Diagnostic` directly
(using instead `LintState.emit_diagnostic`), which enables `debputy`
to add some extra features mostly transparently. Nevertheless, the API is
constrained by the LSP spec in some ways and that is generally by design.

Every diagnostic has 5 key pieces of information:

 * The file it applies to. This is generally *implicit* in `debputy`
   hidden inside the `LintState` object itself.

 * A text range in the file, that should be highlighted. Unlike most
   other linters, a simple line number is insufficient. The code must
   identify the *range* to be marked as incorrect (accounting for
   whitespace, comments, etc.).

   The `emit_diagnostics` expects a `TERange`.

 * A severity (`error`, `warning`, ...) that is used to classify how
   important the issue is. The `debputy` framework has more severities
   than the LSP. These severities are visible in `debputy lint`. In the
   LSP mode, the extra severities are mapped to an existing LSP severity
   transparently to the linting code.

 * A text message to show to the user for the diagnostic in question.
   This is a one-liner and this limitation comes from the LSP.

 * A source/authority of the diagnostic. This is also shown to the
   user with the goal of aiding them in understanding where the
   requirement comes from. The authority is typically something like
   `Policy 3.4`.

   The code base checks for known authorities to ensure consistency,
   so you may need to add a new authority. See
   `DIAG_SOURCE_WITHOUT_SECTIONS` and `DIAG_SOURCE_WITH_SECTIONS`
   in `lint_util.py` for known authorities.

Diagnostics are emitted via the `LintState.emit_diagnostic` method.

#### Quick fixes

Beyond the bare minimum, a `Diagnostic` can also benefit from quick-fixes.

Each diagnostic can have a list of quick-fixes. Each element in that list
defines an option for how to fix the diagnostic. Additionally, the quick-fixes
may be used for auto-fixing with `debputy lint --auto-fix` mode.

The list should generally be the return value of one of:

 * `propose_correct_text_quick_fix`: When used, the range provided in the
    diagnostic will be provided with the given string passed to this function.
 * `propose_insert_text_on_line_after_diagnostic_quick_fix`: When used, insert
    the text provided to this function on the fist line starting immediately
    after the end position of the diagnostic range. Remember a trailing newline
    if you intend to create a new line.
 * `propose_remove_range_quick_fix`: When used, the range provided to
   `emit_diagnostic` will be removed. Anything remaining on the line
    (such as trailing newlines) will remain.
 * `propose_remove_line_quick_fix`: When used, the range provided to
   `emit_diagnostic` must span no more than a single line. Said line will
    be deleted.

Note: Feel free to add more of these quick fixes to `debputy.lsp.quickfixes`.

If the fix provided is not good enough for non-interactive fixing such as
`debputy lint --auto-fix`, then pass `enable_non_interactive_auto_fix=False`
to `LintState.emit_diagnostic`. The common case is with spellchecking since
it has a lot of false positives.


# Testing

Most of the relevant tests live in `tests/lint_tests`. There might be some
special-cases or very complex tests in `tests/lsp_tests`.

## Testing diagnostics

When testing diagnostics, the test will generally need a small file that
triggers the issue. Each existing format with tests will define a
`line_linter` fixture that takes a list of lines and runs it through
the diagnostics code.

It is important to note that the diagnostics returned are LSP diagnostics.
As a consequence, they use LSP severities which are not 1:1 with `debputy`
severities. As an example `"pedantic"` and `"spelling"` both become
`DiagnosticSeverity.Hint`.

Useful tricks:

 * Ranges can be turned into a string for a simpler comparison than creating
   a full LSP Range object.
 * The `diag_range_to_text` helper can convert the lines + range into the
   concrete text matched by that range.
 * It is possible to test the quick fix with a bit of extra lifting. See
   `test_dch_lint_signoff_line_email_date_spacing` from
   `test_lint_changelog.py`
 * If the contents of `debian/control` is relevant (and it is not the `dcpy` test),
   remember to set `line_linter.dctrl_lines`.

An example from `tests/lint_tests/lint/test_lint_debputy.py`:

```python
def test_debputy_lint_syntax_error_a(line_linter: LintWrapper) -> None:
    lines = textwrap.dedent(
        """\
    manifest-version: '0.1'
    foo
    """
    ).splitlines(keepends=True)

    diagnostics = line_linter(lines)
    assert len(diagnostics) == 1
    issue = diagnostics[0]

    # We do not check the error message here, since it is directly from the YAML library
    # and therefore an implementation detail.
    assert f"{issue.range}" == "1:0-1:3"
    assert issue.severity == DiagnosticSeverity.Error
    assert diag_range_to_text(lines, issue.range) == "foo"
```

## Testing reformatting

When testing reformatting, the test will generally need a small file
that should be reformatted. Each existing format with tests will define a
`reformater` fixture that takes a list of lines and runs it through
the reformatting code. Note the test has to provide the desired style
for the formatting to do anything.

Generally:

 * Often the test should do at least two passes. The first run without a
   style to ensure the code backs  out without a style. The second run can
   then be done with the `black` style or the style being tested.

 * Edits can be applied via the `apply_formatting_edits` and then compared
   against an expected content.

An example is `test_dctrl_reformat_preserve_comments` from
`tests/lint_tests/reformat/test_reformat_dctrl.py`.