# Tabular file compare
zsv's `compare` command compares multiple tables and outputs differences in
tabular or JSON format.
## Background
Examples where multiple tables may contain differences that need to be
identified include:
- preliminary vs final data, where a transaction or activity involves a set of
resources that may have changed
- transactions involving assets that are known to multiple parties, each with
independent systems of record
- predictive model updates where values in a column of outputs may change when a
model version changes
## Challenges
Comparing tables of data presents several challenges, some of which are
addressed by `zsv compare` and others that are not.
Challenges that `zsv compare` addresses include:
- Comparing multiple tables is somewhat achievable today using a collection
of common utilities, but requires different solutions for different operating
systems and platforms, and typically requires custom scripting that depends on
both the platform and the specific input schema. Searching for "table
comparison" on StackOverflow, for example, yields multiple questions whose
answers each involve a different set of tools and custom scripts (many of
which assume the tables already reside in a relational database), and few of
which consider performance. A better solution would be consistent across
platforms, support canned options for common use cases, have few or no
additional dependencies, and have well-defined performance expectations and
limitations
- Achieving high performance and scalability with bounded memory (for
pre-sorted input)
- Columns might not appear in the same order across all inputs
- Column population may not be the same across all inputs
- Column names might be duplicated within any one input
- Values may have leading whitespace that should be ignored
- Matching might be 1-1, but alternatively might be by one or more 'key' columns
- Row population may not be identical; one or more input(s) may have rows (as
identified by 'key' columns) that one or more other input(s) do not
- The number of inputs to compare might exceed 2
- Inputs may be large and memory may be limited
- Desired output format may be any of myriad tabular, JSON or other formats
- Case-insensitive matching across a full Unicode character set
- Desired output may include additional columns for context
- User-defined numeric tolerance e.g. treat 1.1999999 and 1.2 as equal
Challenges that `zsv compare` aims to solve for limited cases include:
- Input data might be unsorted, but small enough to sort with reasonable
performance using [vanilla sqlite3
sort](https://www.sqlite.org/eqp.html#temporary_sorting_b_trees)
- Numerical formats can differ in text while representing the same values.
zsv handles basic numeric format normalization to the extent supported by
native C functions (such as "1.2e-2" vs "0.012"); it does not handle other
formats (such as treating "1,000" or "$1000" as 1000)
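As an illustrative sketch of these behaviors (in Python, not zsv's actual C implementation), whitespace trimming, basic numeric normalization, and an optional user-defined tolerance might look like:

```python
def values_equal(a: str, b: str, tol: float = 0.0) -> bool:
    """Compare two cell values, treating numerically equal text as equal."""
    a, b = a.strip(), b.strip()  # ignore surrounding whitespace
    if a == b:
        return True
    try:
        # strtod-style parsing: "1.2e-2" and "0.012" parse to the same number
        return abs(float(a) - float(b)) <= tol
    except ValueError:
        return False  # not numeric; only exact text matches are equal

print(values_equal("1.2e-2", "0.012"))             # True
print(values_equal("1.1999999", "1.2", tol=1e-6))  # True
print(values_equal("1,000", "1000"))               # False: "1,000" is not normalized
```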
Challenges that `zsv compare` does not try to solve include:
- The name of any given column to compare across inputs might differ across
inputs (e.g. input 1 contains "My Column X" that we want to compare against
input 2's "My_Column_X")
- Exact comparison may be undesirable when content differs cosmetically but not
substantively, e.g. in scale ("70" vs "0.70"), format ("1/1/2023" vs
"2023-01-01", or "70%" vs "0.7"), enumeration ("Washington" vs "WA"),
precision ("5.2499999999999" vs "5.25") and/or other
- When comparing large, unsorted datasets in order to sort prior to comparison,
a high-performance sort that offers parallelization, multi-threading,
algorithm control and/or other high-performance sort features may be desired
(If you are interested in solutions to these kinds of problems, please
contact <a href="mailto:info@liquidaty.com">Liquidaty</a> and/or check out
<https://hub.liquidaty.com>.)
## Matching and sorting
Row matching and sorting are handled as follows:
- Rows are matched across inputs either by row number or by one or more
specified key columns
- Input is assumed to be pre-sorted, in which case comparison runs in bounded
memory
- Unsorted input can still be processed via the `--sort` option, which sorts
using the [sqlite3
API](https://www.sqlite.org/eqp.html#temporary_sorting_b_trees)
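For pre-sorted inputs, the bounded-memory comparison amounts to a standard merge join over the key columns. The following is an illustrative Python sketch of the idea only, not zsv's C implementation:

```python
def compare_sorted(rows1, rows2, keyfn):
    """Merge-compare two key-sorted row iterators using O(1) memory.

    Yields (key, row_from_1, row_from_2); a missing side is None.
    """
    it1, it2 = iter(rows1), iter(rows2)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None or r2 is not None:
        k1 = keyfn(r1) if r1 is not None else None
        k2 = keyfn(r2) if r2 is not None else None
        if r2 is None or (r1 is not None and k1 < k2):
            yield k1, r1, None              # key only in input 1
            r1 = next(it1, None)
        elif r1 is None or k2 < k1:
            yield k2, None, r2              # key only in input 2
            r2 = next(it2, None)
        else:
            if r1 != r2:
                yield k1, r1, r2            # same key, differing values
            r1, r2 = next(it1, None), next(it2, None)

# Tiny hypothetical sample: rows are (country, city, longitude)
t1 = [("de", "placken", "8.433333"), ("ru", "chishmabash", "53.8996")]
t2 = [("cn", "fulongling", "115.628584"), ("de", "placken", "10.4")]
for key, a, b in compare_sorted(sorted(t1), sorted(t2), lambda r: r[:2]):
    print(key, a, b)
```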
## Example
Imagine we want to compare the following two tables:
### Table 1 ([t1.csv](../data/compare/t1.csv))
| Country | City | AccentCity | Region | Population | Latitude | Longitude |
| ------- | -------------- | -------------- | ------ | ---------- | ---------- | ----------- |
| de | placken | Placken | 07 | | 52.15 | 8.433333 |
| gb | duffryn | Duffryn | Z3 | | 51.4375 | -3.304444 |
| id | kemitang | Kemitang | 07 | | -7.544444 | 109.183333 |
| ie | oggal | Oggal | 02 | | 54.2597222 | -7.9266667 |
| mk | cvetovo | Cvetovo | 92 | | 41.8580556 | 21.4097222 |
| pl | ciesle male | Ciesle Male | 86 | | 52.176861 | 17.649143 |
| ru | chishmabash | Chishmabash | 73 | | 55.4708 | 53.8996 |
| tz | lituhi mission | Lituhi Mission | 14 | | -10.55 | 34.6 |
| us | steep falls | Steep Falls | ME | | 43.7938889 | -70.6530556 |
| za | hota | Hota | 05 | | -31.640685 | 27.681357 |
| zr | kakova | Kakova | 09 | | 0.7333333 | 29.0166667 |
### Table 2 ([t2.csv](../data/compare/t2.csv))
| Region | Population | Latitude | Longitude | Country | City | AccentCity |
| ------ | ---------- | ---------- | ----------- | ------- | ------------------- | -------------- |
| 92 | | 41.8580556 | 21.4097222 | mk | cvetovo | Cvetovo |
| 86 | | 52.176861 | 17.649143 | pl | ciesle male | Ciesle XXX |
| 39 | | 41.677473 | 27.067859 | tr | yenioe | Yenioe |
| 14 | | -10.55 | 34.6 | tz | lituhi mission | Lituhi Mission |
| ME | | 43.7938889 | -70.6530556 | us | steep falls | Steep Falls |
| 05 | | -31.640685 | 27.681357 | za | hota | Hota |
| XX | | 0.7333333 | 29.0166667 | zr | kakova | Kakova |
| 03 | | 26.934908 | 115.628584 | cn | fulongling | Fulongling |
| 07 | | 52.15 | 10.4 | de | placken | Placken |
| Z3 | | 51.4375 | -3.304444 | gb | duffryn | Duffryn |
| 07 | | -7.544444 | 109.183333 | id | kemitang | Kemitang |
| 12 | | 52.9911111 | -6.8938889 | ie | burtown cross roads | Burtown XXX |
| 02 | | 54.2597222 | -7.9266667 | ie | oggal | Oggal |
| 16 | | 34.788889 | 127.670833 | kr | chusamdong | Chusamdong |
Then:
- rows should be matched on the key combination of city and country
- the input will need to be sorted
- each table contains rows, as identified by the key columns, that the other
does not, and some shared rows have differing values
We can run:
```shell
zsv compare t1.csv t2.csv --sort -k country -k city
```
and get the following output:
| country | city | Column | t1.csv | t2.csv |
| ------- | ------------------- | ----------- | ----------- | ---------- |
| cn | fulongling | <key> | Missing | |
| de | placken | Longitude | 8.433333 | 10.4 |
| ie | burtown cross roads | <key> | Missing | |
| kr | chusamdong | <key> | Missing | |
| pl | ciesle male | AccentCity | Ciesle Male | Ciesle XXX |
| ru | chishmabash | <key> | | Missing |
| tr | yenioe | <key> | Missing | |
| zr | kakova | Region | 09 | XX |
Or in either of two JSON formats:
```shell
zsv compare --json t1.csv t2.csv --sort -k country -k city
```
to get:
```json
[
[
"country",
"city",
"Column",
"t1.csv",
"t2.csv"
],
[
"cn",
"fulongling",
"<key>",
"Missing",
null
],
[
"de",
"placken",
"Longitude",
"8.433333",
"10.4"
],
...
]
```
or:
```shell
zsv compare --json-object t1.csv t2.csv --sort -k country -k city
```
to get:
```json
[
{
"country": "cn",
"city": "fulongling",
"Column": "<key>",
"t1.csv": "Missing"
},
{
"country": "de",
"city": "placken",
"Column": "Longitude",
"t1.csv": "8.433333",
"t2.csv": "10.4"
},
...
]
```
In each case, if we want to include additional data in the output for context,
we can do so using `--add`, e.g.:
```shell
zsv compare --add accentcity t1.csv t2.csv --sort -k country -k city
```
which outputs:
| country | city | accentcity | Column | t1.csv | t2.csv |
| ------- | ------------------- | -------------- | ---------- | ----------- | ---------- |
| cn | fulongling | Placken | <key> | Missing | |
| de | placken | Placken | Longitude | 8.433333 | 10.4 |
| ie | burtown cross roads | Oggal | <key> | Missing | |
| kr | chusamdong | Cvetovo | <key> | Missing | |
| pl | ciesle male | Ciesle Male | AccentCity | Ciesle Male | Ciesle XXX |
| ru | chishmabash | Chishmabash | <key> | | Missing |
| tr | yenioe | Lituhi Mission | <key> | Missing | |
| zr | kakova | Kakova | Region | 9 | XX |
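The `--json-object` form is convenient for downstream processing. As a sketch (assuming a Python consumer, which is not part of zsv; the sample data is taken from the example above), the report can be split into missing rows and value differences like this:

```python
import json

# A sample of `zsv compare --json-object` output, as shown above
report = json.loads("""
[
  {"country": "cn", "city": "fulongling", "Column": "<key>", "t1.csv": "Missing"},
  {"country": "de", "city": "placken", "Column": "Longitude",
   "t1.csv": "8.433333", "t2.csv": "10.4"}
]
""")

# Rows flagged "<key>" exist in only some inputs; all others are value diffs
missing = [d for d in report if d["Column"] == "<key>"]
changed = [d for d in report if d["Column"] != "<key>"]
print(f"{len(missing)} missing row(s), {len(changed)} value difference(s)")
for d in changed:
    print(f'{d["country"]}/{d["city"]}: {d["Column"]} {d["t1.csv"]} -> {d["t2.csv"]}')
```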
## Performance
No rigorous benchmarking has yet been performed, but preliminary testing yields
reasonable performance and memory usage.
On a 2019 MacBook Air, comparing two 40MB CSV files, each a table of 100,000
rows with 61 columns and containing approximately 60,000 differences, took
about 5.5 seconds and used a maximum of about 1.8MB of RAM.
The same test with sorting used significantly more memory (up to ~40MB) and took
about 7.8 seconds to complete.
## Usage details
```shell
zsv compare -h
Usage: compare [options] [file1.csv] [file2.csv] [...]
Options:
-h,--help : show usage
-k,--key <colname> : specify a column to match rows on
can be specified multiple times
-a,--add <colname> : specify an additional column to output
will use the [first input] source
--sort : sort on keys before comparing
--sort-in-memory : for sorting, use in-memory instead of temporary db
(see https://www.sqlite.org/inmemorydb.html)
--json : output as JSON
--json-compact : output as compact JSON
--json-object : output as an array of objects
--print-key-colname: when outputting key column diffs,
print column name instead of <key>
NOTES
If no keys are specified, each row from each input is compared to the
row in the corresponding position in each other input (all the first rows
from each input are compared to each other, all the second rows are compared to
each other, etc).
If one or more key is specified, each input is assumed to already be
lexicographically sorted in ascending order; this is a necessary condition
for the output to be correct (unless the --sort option is used). However, it
is not required for each input to contain the same population of row keys
The --sort option uses sqlite3 (unindexed) sort and is intended to be a
convenience rather than performance feature. If you need high performance
sorting, other solutions, such as a multi-threaded parallel sort, are likely
superior. For handling quoted data, `2tsv` can be used to convert to a delimited
format without quotes, that can be directly parsed with common UNIX utilities
(such as `sort`), and `select --unescape` can be used to convert back
```
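The positional (no-key) matching described in the notes can be sketched as follows (an illustrative Python sketch with made-up data, not zsv's implementation); rows are paired by position, and a shorter input is padded with missing rows:

```python
from itertools import zip_longest

# With no --key columns, rows are matched purely by position:
# row 1 of each input vs row 1 of every other input, and so on
t1 = [["a", "1"], ["b", "2"]]
t2 = [["a", "1"], ["b", "3"], ["c", "4"]]

diffs = []
for rownum, (r1, r2) in enumerate(zip_longest(t1, t2), start=1):
    if r1 != r2:  # a None side means the row is missing from that input
        diffs.append((rownum, r1, r2))
        print(rownum, r1, r2)
```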