1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
|
---
tags: python, intro, technical, properties
date: 2016-04-16 06:00
title: The Encode/Decode invariant
author: drmaciver
---
One of the simplest types of invariant to find once you move past
[just fuzzing your code](../getting-started-with-hypothesis/) is asserting that two
different operations should produce the same result, and one of the simplest instances of
*that* is looking for encode/decode pairs. That is, you have some function that takes a
value and encodes it as another value, and another that is supposed to reverse the process.
This is ripe for testing with Hypothesis because it has a natural completely defined
specification: Encoding and then decoding should be exactly the same as doing nothing.
Lets look at a concrete example.
<!--more-->
The following code is a lightly reformatted version of
an implementation of [Run Length Encoding](https://en.wikipedia.org/wiki/Run-length_encoding)
taken [from Rosetta Code](http://rosettacode.org/wiki/Run-length_encoding).
```python
def encode(input_string):
count = 1
prev = ""
lst = []
for character in input_string:
if character != prev:
if prev:
entry = (prev, count)
lst.append(entry)
count = 1
prev = character
else:
count += 1
else:
entry = (character, count)
lst.append(entry)
return lst
def decode(lst):
q = ""
for character, count in lst:
q += character * count
return q
```
We can test this using Hypothesis and py.test as follows:
```python
from hypothesis import given
from hypothesis.strategies import text
@given(text())
def test_decode_inverts_encode(s):
assert decode(encode(s)) == s
```
This asserts what we described above: If we encode a string as run length encoded and then
decode it, we get back to where we started.
This test finds a bug, not through the actual invariant. Instead it finds one through pure
fuzzing: The code does not correctly handle the empty string.
```
Falsifying example: test_decode_inverts_encode(s='')
UnboundLocalError: local variable 'character' referenced before assignment
```
One of the nice features of testing invariants is that they incorporate the fuzzing you
could be doing anyway, more or less for free, so even trivial invariants can often
find interesting problems.
We can fix this bug by adding a guard to the encode function:
```python
if not input_string:
return []
```
The test now passes, which isn't very interesting, so lets break the code. We'll delete
a line from our implementation of encode which resets the count when the character changes:
```python
def encode(input_string):
if not input_string:
return []
count = 1
prev = ""
lst = []
for character in input_string:
if character != prev:
if prev:
entry = (prev, count)
lst.append(entry)
# count = 1 # Missing reset operation
prev = character
else:
count += 1
else:
entry = (character, count)
lst.append(entry)
return lst
```
Now the test fails:
```
@given(text())
def test_decode_inverts_encode(s):
> assert decode(encode(s)) == s
E assert '1100' == '110'
E - 1100
E ? -
E + 110
test_encoding.py:35: AssertionError
------------------------------------ Hypothesis ------------------------------------
Falsifying example: test_decode_inverts_encode(s='110')
```
Not resetting the count did indeed produce unintended data that doesn't translate back
to the original thing. Hypothesis has given us the shortest example that could trigger
it - two identical characters followed by one different one. It's not *quite* the
simplest example according to Hypothesis's preferred ordering - that would be '001' -
but it's still simple enough to be quite legible, which helps to rapidly diagnose
the problem when you see it in real code.
Encode/decode loops like this are *very* common, because you will frequently want to
serialize your domain objects to other representations - into forms, into APIs, into
the database, and these are things that are so integral to your applications that it's
worth getting all the edge cases right.
Other examples of this:
* [This talk by Matt Bacchman](https://speakerdeck.com/bachmann1234/property-based-testing-hypothesis)
in which he discovers an eccentricity of formats for dates.
* Mercurial bugs [4927](https://bz.mercurial-scm.org/show_bug.cgi?id=4927) and [5031](https://bz.mercurial-scm.org/show_bug.cgi?id=5031)
were found by applying this sort of testing to their internal UTF8b encoding functions.
* [This test](https://github.com/The-Compiler/qutebrowser/blob/24a71e5c2ebbffd9021694f32fa9ec51d0046d5a/tests/unit/browser/test_webelem.py#L652).
Has caught three bugs in Qutebrowser's JavaScript escaping ([1](https://github.com/The-Compiler/qutebrowser/commit/73e9fd11188ce4dddd7626e39d691e0df649e87c),
[2](https://github.com/The-Compiler/qutebrowser/commit/93d27cbb5f49085dd5a7f5e05f2cc45cc84f94a4),
[3](https://github.com/The-Compiler/qutebrowser/commit/24a71e5c2ebbffd9021694f32fa9ec51d0046d5a)), which could have caused data loss if a user had run
into them.
|