File: howitworks.md

package info (click to toggle)
sqlreduce 1.3-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 12,172 kB
  • sloc: python: 1,026; sql: 40; sh: 22; makefile: 14
file content (275 lines) | stat: -rw-r--r-- 11,063 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
SQLreduce: Reduce verbose SQL queries to minimal examples
=========================================================

![SQLreduce logo](sqlreduce.png)

Developers often face very large SQL queries that raise some error. SQLreduce
is a tool to reduce that complexity to a minimal query.

## SQLsmith generates random SQL queries

[SQLsmith](https://github.com/anse1/sqlsmith) is a tool that generates random SQL
queries and runs them against a PostgreSQL server (and other DBMS types). The
idea is that by fuzz-testing the query parser and executor, corner-case bugs
can be found that would otherwise go unnoticed in manual testing or with the
fixed set of test cases in PostgreSQL's regression test suite. It has proven to
be an
[effective tool](https://github.com/anse1/sqlsmith/wiki#score-list)
with over 100 bugs found in different areas in the PostgreSQL server and
other products since 2015, including security bugs, ranging from executor
bugs to segfaults in type and index method implementations.
For example, in 2018, SQLsmith found that the following query
[triggered a segfault in PostgreSQL](https://www.postgresql.org/message-id/flat/87woxi24uw.fsf%40ansel.ydns.eu):

```
select
  case when pg_catalog.lastval() < pg_catalog.pg_stat_get_bgwriter_maxwritten_clean() then case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       else case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       end as c0,
  case when (select intervalcol from public.brintest limit 1 offset 1)
         >= cast(null as "interval") then case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       else case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       end as c1,
  ref_0.a as c2,
  (select a from public.idxpart1 limit 1 offset 5) as c3,
  ref_0.b as c4,
    pg_catalog.stddev(
      cast((select pg_catalog.sum(float4col) from public.brintest)
         as float4)) over (partition by ref_0.a,ref_0.b,ref_0.c order by ref_0.b) as c5,
  cast(nullif(ref_0.b, ref_0.a) as int4) as c6, ref_0.b as c7, ref_0.c as c8
from
  public.mlparted3 as ref_0
where true;
```

However, just like in this 40-line, 2.2kB example, the random queries generated
by SQLsmith that trigger some error are most often very large and contain a lot
of noise that does not contribute to the error. So far, manual inspection of
the query and tedious editing was required to reduce the example to a minimal
reproducer that developers can use to fix the problem.

## Reduce complexity with SQLreduce

This issue is solved by [SQLreduce](https://github.com/df7cb/sqlreduce).
SQLreduce takes as input an arbitrary SQL query which is then run against a
PostgreSQL server. Various simplification steps are applied, checking after
each step that the simplified query still triggers the same error from
PostgreSQL. The end result is a SQL query with minimal complexity.

SQLreduce is effective at reducing the queries from
[original error reports from SQLsmith](https://github.com/anse1/sqlsmith/wiki#score-list)
to queries that match manually-reduced queries. For example, SQLreduce can
effectively reduce the above monster query to just this:

```
SELECT pg_catalog.stddev(NULL) OVER () AS c5 FROM public.mlparted3 AS ref_0
```

Note that SQLreduce does not try to derive a query that is semantically identical
to the original, or produces the same query result - the input is assumed to be
faulty, and we are looking for the minimal query that produces the same error
message from PostgreSQL when run against a database. If the input query happens
to produce no error, the minimal query output by SQLreduce will just be
`SELECT`.

## How it works

We'll use a simpler query to demonstrate how SQLreduce works and which steps
are taken to remove noise from the input. The query is bogus and contains a bit
of clutter that we want to remove:

```
$ psql -c 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'
ERROR:  column pg_database.reltuples does not exist
```

Let's pass the query to SQLreduce:

```
$ sqlreduce 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'
```

SQLreduce starts by parsing the input using
[pglast](https://github.com/lelit/pglast) and
[libpg_query](https://github.com/pganalyze/libpg_query) which expose the
original PostgreSQL parser as a library with Python bindings. The result is a
parse tree that is the basis for the next steps.
The parse tree looks like this:

```
selectStmt
├── targetList
│   └── /
│       ├── pg_database.reltuples
│       └── 1000
├── fromClause
│   ├── pg_database
│   └── pg_class
├── whereClause
│   └── <
│       ├── 0
│       └── /
│           ├── pg_database.reltuples
│           └── 1000
├── orderClause
│   └── 1
└── limitCount
    └── 10
```

Pglast also contains a query renderer that can render back the parse tree as
SQL, shown as the regenerated query below. The input query is run against
PostgreSQL to determine the result, in this case
`ERROR:  column pg_database.reltuples does not exist`.

```
Input query: select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10
Regenerated: SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC LIMIT 10
Query returns: ✔ ERROR:  column pg_database.reltuples does not exist
```

SQLreduce works by deriving new parse trees that are structurally simpler,
generating SQL from that, and run these queries against the database. The first
simplification steps work on the top level node, where SQLreduce tries to
remove whole subtrees to quickly find a result. The first reduction tried is to
remove `LIMIT 10`:

```
SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC ✔
```

The query result is still `ERROR:  column pg_database.reltuples does not
exist`, indicated by a ✔ check mark. Next, `ORDER BY 1` is removed, again
successfully:

```
SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔
```

Now the entire target list is removed:

```
SELECT FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔
```

This shorter query is still equivalent to the original regarding the error
message returned when it is run against the database. Now the first unsuccessful
reduction step is tried, removing the entire `FROM` clause:

```
SELECT WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"
```

That query is also faulty, but triggers a different error message, so the
previous parse tree is kept for the next steps. Again a whole subtree is
removed, now the `WHERE` clause:

```
SELECT FROM pg_database, pg_class ✘ no error
```

We have now reduced the input query so much that it doesn't error out any more. The previous parse tree
is still kept which now looks like this:

```
selectStmt
├── fromClause
│   ├── pg_database
│   └── pg_class
└── whereClause
    └── <
        ├── 0
        └── /
            ├── pg_database.reltuples
            └── 1000
```

Now SQLreduce starts digging into the tree. There are several entries in the
`FROM` clause, so it tries to shorten the list. First, `pg_database` is
removed, but that doesn't work, so `pg_class` is removed:

```
SELECT FROM pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"
SELECT FROM pg_database WHERE 0 < ((pg_database.reltuples / 1000)) ✔
```

Since we have found a new minimal query, recursion restarts at top-level with
another try to remove the `WHERE` clause. Since that doesn't work, it tries to
replace the expression with `NULL`, but that doesn't work either.

```
SELECT FROM pg_database ✘ no error
SELECT FROM pg_database WHERE NULL ✘ no error
```

Now a new kind of step is tried: expression pull-up. We descend into `WHERE`
clause, where we replace `A < B` first by `A` and then by `B`.

```
SELECT FROM pg_database WHERE 0 ✘ ERROR:  argument of WHERE must be type boolean, not type integer
SELECT FROM pg_database WHERE pg_database.reltuples / 1000 ✔
SELECT WHERE pg_database.reltuples / 1000 ✘ ERROR:  missing FROM-clause entry for table "pg_database"
```

The first try did not work, but the second one did. Since we simplified the
query, we restart at top-level to check if the `FROM` clause can be removed,
but it is still required.

From `A / B`, we can again pull up `A`:

```
SELECT FROM pg_database WHERE pg_database.reltuples ✔
SELECT WHERE pg_database.reltuples ✘ ERROR:  missing FROM-clause entry for table "pg_database"
```

SQLreduce has found the minimal query that still raises `ERROR:  column
pg_database.reltuples does not exist` with this parse tree:

```
selectStmt
├── fromClause
│   └── pg_database
└── whereClause
    └── pg_database.reltuples
```

At the end of the run, the query is printed along with some statistics:

```
Minimal query yielding the same error:
SELECT FROM pg_database WHERE pg_database.reltuples

Pretty-printed minimal query:
SELECT
FROM pg_database
WHERE pg_database.reltuples

Seen: 15 items, 915 Bytes
Iterations: 19
Runtime: 0.107 s, 139.7 q/s
```

This minimal query can now be inspected to fix the bug in PostgreSQL or in the
application.