---
tags: python, technical, intro
date: 2016-04-19 07:00
title: Rule Based Stateful Testing
author: drmaciver
---

Hypothesis's standard testing mechanisms are very good for testing things that can be
considered direct functions of data. But suppose you have some complex stateful
system or object that you want to test. How can you do that?

In this article we'll see how to use Hypothesis's *rule based state machines* to define
tests that generate not just simple data, but entire programs using some stateful
object. These will give the same level of boost to testing the behaviour of the
object as you get to testing the data it accepts.

<!--more-->

The model of a stateful system we'll be using is a [priority queue](https://en.wikipedia.org/wiki/Priority_queue)
implemented as a binary heap.

We have the following operations:

* heapnew() - returns a new heap
* heappush(heap, value) - place a new value into the heap
* heappop(heap) - remove and return the smallest value currently on the heap. Errors if the heap is empty.
* heapempty(heap) - return True if the heap has no elements, else False.

We'll use the following implementation of these:

```python
def heapnew():
    return []


def heapempty(heap):
    return not heap


def heappush(heap, value):
    heap.append(value)
    index = len(heap) - 1
    while index > 0:
        parent = (index - 1) // 2
        if heap[parent] > heap[index]:
            heap[parent], heap[index] = heap[index], heap[parent]
            index = parent
        else:
            break


def heappop(heap):
    return heap.pop(0)
```

(Note that this implementation is *wrong*: heappop as implemented will return the smallest element
if the heap currently satisfies the heap property, but it does not rebalance the heap afterwards,
so it may leave the heap in an invalid state.)

We could test this readily enough using @given with something like the following:


```python
from hypothesis import given
from hypothesis.strategies import integers, lists


@given(lists(integers()))
def test_pop_in_sorted_order(ls):
    h = heapnew()
    for l in ls:
        heappush(h, l)
    r = []
    while not heapempty(h):
        r.append(heappop(h))
    assert r == sorted(ls)
```

And this indeed finds the bug:

```
>       assert r == sorted(ls)
E       assert [0, 1, 0] == [0, 0, 1]
E         At index 1 diff: 1 != 0
E         Use -v to get the full diff

binheap.py:74: AssertionError
----- Hypothesis -----
Falsifying example: test_pop_in_sorted_order(ls=[0, 1, 0])
```

So we replace heappop with a correct implementation which rebalances the heap:

```python
def heappop(heap):
    if len(heap) == 0:
        raise ValueError("Empty heap")
    if len(heap) == 1:
        return heap.pop()
    result = heap[0]
    heap[0] = heap.pop()
    index = 0
    while index * 2 + 1 < len(heap):
        children = [index * 2 + 1, index * 2 + 2]
        children = [i for i in children if i < len(heap)]
        assert children
        children.sort(key=lambda x: heap[x])
        for c in children:
            if heap[index] > heap[c]:
                heap[index], heap[c] = heap[c], heap[index]
                index = c
                break
        else:
            break
    return result
```

But how do we know this is enough? Might some combination of mixing pushes and pops break the
invariants of the heap in a way that this simple pattern of pushing everything then popping
everything cannot witness?

This is where the rule based state machines come in. Instead of just letting Hypothesis give
us data which we feed into a fixed structure of test, we let Hypothesis choose which operations
to perform on our data structure:

```python
from hypothesis.stateful import RuleBasedStateMachine, precondition, rule


class HeapMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.heap = []

    @rule(value=integers())
    def push(self, value):
        heappush(self.heap, value)

    @rule()
    @precondition(lambda self: self.heap)
    def pop(self):
        correct = min(self.heap)
        result = heappop(self.heap)
        assert correct == result
```

@rule is a slightly restricted version of @given that only works for methods on a RuleBasedStateMachine.

However it has one *major* difference from @given: multiple rules can be chained together.
A test using this state machine doesn't just run each rule in isolation; it creates an instance of
the machine and then runs multiple rules in succession.

The @precondition decorator constrains when a rule is allowed to fire: We are not allowed to pop from
an empty heap, so the pop rule may only fire when there is data to be popped.

We can run this by getting a standard unit test TestCase object out of it to be picked up by unittest
or py.test as normal:

```python
TestHeaps = HeapMachine.TestCase
```

With our original broken heappop we find the same bug as before:

```
E       AssertionError: assert 0 == 1

binheap.py:90: AssertionError
----- Captured stdout call -----
Step #1: push(value=1)
Step #2: push(value=0)
Step #3: push(value=0)
Step #4: pop()
Step #5: pop()
```

With the fixed implementation the test passes.

As it currently stands, this is already very useful. It's particularly good for testing single standalone
objects or services like storage systems.

But one limitation of the test as we have written it is that it only concerns itself with a single heap. What
if we wanted to combine two heaps? For example, suppose we wanted a heap merging operation that takes two
heaps and returns a new heap containing the values from both of the original two.

As before, we'll start with a broken implementation:

```python
def heapmerge(x, y):
    x, y = sorted((x, y))
    return x + y
```
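
To see why this is broken, note that concatenating two heaps does not in general produce a heap; merging a small heap with itself already violates the heap property (repeating the definition so the sketch runs standalone):

```python
def heapmerge(x, y):
    # The broken version: simply concatenates the two heaps.
    x, y = sorted((x, y))
    return x + y


merged = heapmerge([0, 1, 1], [0, 1, 1])
# merged == [0, 1, 1, 0, 1, 1]: the 0 at index 3 is smaller than its
# parent, the 1 at index 1, so the result is not a valid heap.
```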

We can't just write a strategy for heaps, because each heap would be a fresh object and thus it would not
preserve the stateful aspect.

What we instead do is use the other big feature of Hypothesis's rule based state machines: Bundles.

Bundles allow rules to return as well as accept values. A bundle is a strategy which generates anything
a rule has previously provided to it. They are used as follows:


```python
from hypothesis.stateful import Bundle


class HeapMachine(RuleBasedStateMachine):
    Heaps = Bundle("heaps")

    @rule(target=Heaps)
    def newheap(self):
        return []

    @rule(heap=Heaps, value=integers())
    def push(self, heap, value):
        heappush(heap, value)

    @rule(heap=Heaps.filter(bool))
    def pop(self, heap):
        correct = min(heap)
        result = heappop(heap)
        assert correct == result

    @rule(target=Heaps, heap1=Heaps, heap2=Heaps)
    def merge(self, heap1, heap2):
        return heapmerge(heap1, heap2)
```

So now instead of a single heap we manage a collection of heaps, and each of our previous operations
is parameterised by the particular heap it acts on.

Note the use of filter: A bundle is a strategy you can use like any other. In this case the filter replaces
our use of a precondition because we now only care about whether this *specific* heap is empty.

This is sufficient to find the fact that our implementation is wrong:

```
    @rule(heap=Heaps.filter(bool))
    def pop(self, heap):
        correct = min(heap)
        result = heappop(heap)
>       assert correct == result
E       AssertionError: assert 0 == 1

binheap.py:105: AssertionError

----- Captured stdout call -----

Step #1: v1 = newheap()
Step #2: push(heap=v1, value=0)
Step #3: push(heap=v1, value=1)
Step #4: push(heap=v1, value=1)
Step #5: v2 = merge(heap2=v1, heap1=v1)
Step #6: pop(heap=v2)
Step #7: pop(heap=v2)
```

We create a small heap, merge it with itself, and rapidly discover that it has become unbalanced.

We could fix this by replacing heapmerge with a correct implementation:

```python
def heapmerge(x, y):
    result = list(x)
    for v in y:
        heappush(result, v)
    return result
```

But that's boring. Let's introduce a more *interestingly* broken implementation instead:

```python
def heapmerge(x, y):
    result = []
    i = 0
    j = 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            result.append(x[i])
            i += 1
        else:
            result.append(y[j])
            j += 1
    result.extend(x[i:])
    result.extend(y[j:])
    return result
```

This merge operation selectively splices two heaps together as if we were merging two
sorted lists (heaps aren't actually sorted, but the code still runs regardless; it
just doesn't do anything very meaningful).

This is wrong, but it turns out to work surprisingly well for small heaps and it's
not completely straightforward to find an example showing that it's wrong.

Here's what Hypothesis comes up with:

```
Step #1: v1 = newheap()
Step #2: push(heap=v1, value=0)
Step #3: v2 = merge(heap1=v1, heap2=v1)
Step #4: v3 = merge(heap1=v2, heap2=v2)
Step #5: push(heap=v3, value=-1)
Step #6: v4 = merge(heap1=v1, heap2=v2)
Step #7: pop(heap=v4)
Step #8: push(heap=v3, value=-1)
Step #9: v5 = merge(heap1=v1, heap2=v2)
Step #10: v6 = merge(heap1=v5, heap2=v4)
Step #11: v7 = merge(heap1=v6, heap2=v3)
Step #12: pop(heap=v7)
Step #13: pop(heap=v7)
```

Through a careful sequence of heap creations and merges, Hypothesis manages to find a series
of merges that produce an unbalanced heap. Every heap prior to v7 is balanced, but v7 looks
like this:

```
>>> v7
[-1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0]
```

This doesn't satisfy the heap property because of that -1 far down in the list.
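
We can check that mechanically with a small helper that tests the heap property directly (`is_heap` is not part of the article's code, just an illustration):

```python
def is_heap(heap):
    # The heap property: every element is <= both of its children, where
    # the children of index i live at indices 2*i + 1 and 2*i + 2.
    return all(
        heap[i] <= heap[c]
        for i in range(len(heap))
        for c in (2 * i + 1, 2 * i + 2)
        if c < len(heap)
    )


v7 = [-1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0]
# is_heap(v7) is False: the -1 at index 7 is smaller than its parent,
# the 0 at index 3.
```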

I don't know about you, but I would never have come up with that example. There's probably
a simpler one given a different set of operations - e.g. one thing that would probably improve
the quality of this test is to let Hypothesis instantiate a new heap with a list of elements
which it pushes onto it.

But the nice thing about rule based stateful testing is that I don't *have* to come up with
the example. Instead Hypothesis explores the combinations of operations on my objects for me,
and can flush out some remarkably subtle bugs in the process.

Because after all, if it takes this complicated an example to demonstrate that a completely
wrong implementation is wrong, how hard can it sometimes be to demonstrate subtle bugs?

### Real world usage

This feature is currently somewhat under-documented, so it hasn't seen as widespread adoption as
it could. However, there are at least two interesting real world examples:

1. Hypothesis uses it to test itself. Hypothesis has [tests of its example database](
   https://github.com/HypothesisWorks/hypothesis-python/blob/master/tests/cover/test_database_agreement.py)
   which work very much like the above, and [a small model of its test API](
   https://github.com/HypothesisWorks/hypothesis-python/blob/master/tests/nocover/test_strategy_state.py)
   which generates random strategies and runs tests using them.
2. It's being used to [test Mercurial](https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-February/080037.html),
   generating random sequences of commands to run against it. So far it's found [bug 5112](https://bz.mercurial-scm.org/show_bug.cgi?id=5112) and
   [bug 5113](https://bz.mercurial-scm.org/show_bug.cgi?id=5113). The usage pattern on Mercurial is such
   that the stateful testing will probably need more resources, more rules and more work on deployment
   before it finds much more than that, though.