File: PKG-INFO

Metadata-Version: 2.1
Name: dirhash
Version: 0.5.0
Summary: Python module and CLI for hashing of file system directories.
Home-page: https://github.com/andhus/dirhash-python
Author: Anders Huss
Author-email: andhus@kth.se
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: scantree>=0.0.4


[![codecov](https://codecov.io/gh/andhus/dirhash-python/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash-python)

# dirhash

A lightweight Python module and CLI for computing the hash of any
directory based on its files' structure and content.

- Supports all hashing algorithms of Python's built-in `hashlib` module.
- Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
- Multiprocessing for up to [6x speed-up](#performance).

The hash is computed according to the [Dirhash Standard](https://github.com/andhus/dirhash), which is designed to allow for consistent and collision-resistant generation/verification of directory hashes across implementations.
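Since any algorithm known to `hashlib` is accepted (first bullet above), you can list the names that are valid on your platform directly from the standard library:

```python
import hashlib

# Guaranteed to be available on every platform; hashlib.algorithms_available
# additionally lists algorithms provided by the local OpenSSL build.
print(sorted(hashlib.algorithms_guaranteed))
```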

## Installation

From PyPI:

```commandline
pip install dirhash
```

Or directly from source:

```commandline
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash-python/
```

## Usage

Python module:

```python
from dirhash import dirhash

dirpath = "path/to/directory"

# Hash the full directory: file names, structure and content all contribute.
dir_md5 = dirhash(dirpath, "md5")

# Restrict the hash to python source files only.
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])

# Exclude hidden files (".*") and everything under hidden directories (".*/").
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
```
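A common pattern is the round-trip sketched below: compute the hash once, store it, and compare on later runs. The cache-file name is just an illustration, not part of the `dirhash` API:

```python
from pathlib import Path

from dirhash import dirhash

dirpath = "path/to/directory"
stored = Path("directory.md5")  # hypothetical file holding the expected hash

current = dirhash(dirpath, "md5")
if stored.exists():
    # Any change to file names, locations or content changes the hash.
    assert current == stored.read_text().strip(), "directory contents changed!"
else:
    stored.write_text(current)  # first run: record the reference hash
```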

CLI:

```commandline
dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*"  ".*/"
```

## Why?

If you (or your application) need to verify the integrity of a set of files as well
as their names and locations, you might find this useful. Use-cases range from
verification of your image classification dataset (before spending GPU-$$$ on
training your fancy Deep Learning model) to validation of generated files in
regression testing.

There isn't really a standard way of doing this. There are plenty of recipes out
there (see e.g. these SO-questions for [linux](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)
and [python](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python))
but I couldn't find one that is properly tested (there are some gotchas to cover!)
and documented with a compelling user interface. `dirhash` was created with this
goal.

[checksumdir](https://github.com/cakepietoast/checksumdir) is another Python
module/tool with similar intent (and the inspiration for this project), but it lacks
much of the functionality offered here (most notably, including file names/structure
in the hash) and has no tests.

## Performance

The Python `hashlib` implementations of common hashing algorithms are highly
optimized. `dirhash` mainly parses the file tree, pipes data to `hashlib` and
combines the output. Reasonable measures have been taken to minimize the overhead,
and for common use-cases the majority of time is spent reading data from disk
and executing `hashlib` code.

The main effort to boost performance is support for multiprocessing, where the
reading and hashing are parallelized over individual files.
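In the Python API the number of worker processes is set with the `jobs` argument (see `dirhash -h` for the corresponding CLI option). A minimal sketch, assuming a large directory at a made-up path:

```python
from dirhash import dirhash

# Same directory, same algorithm: the hash is identical regardless of the
# number of workers; only the wall-clock time differs.
serial = dirhash("path/to/big/directory", "md5")
parallel = dirhash("path/to/big/directory", "md5", jobs=8)
assert serial == parallel
```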

As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/cli.py)
with the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`

which is the top answer to the SO-question
[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents).
Results for two test cases are shown below. Both contain 1 GiB of random data: in
"flat_1k_1MB" it is split into 1,000 files (1 MiB each) in a flat structure; in
"nested_32k_32kB" it is split into 32,000 files (32 KiB each) spread over the 256
leaf directories of a binary tree of depth 8.

| Implementation        | Test Case       | Time (s) | Speed-up |
| --------------------- | --------------- | -------: | -------: |
| shell reference       | flat_1k_1MB     |     2.29 |      1.0 |
| `dirhash`             | flat_1k_1MB     |     1.67 |     1.36 |
| `dirhash` (8 workers) | flat_1k_1MB     |     0.48 | **4.73** |
| shell reference       | nested_32k_32kB |     6.82 |      1.0 |
| `dirhash`             | nested_32k_32kB |     3.43 |     2.00 |
| `dirhash` (8 workers) | nested_32k_32kB |     1.14 | **6.00** |

The benchmark was run on a MacBook Pro (2018); further details and source code [here](https://github.com/andhus/dirhash-python/tree/master/benchmark).

## Documentation

Please refer to `dirhash -h`, the Python [source code](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/__init__.py) and the [Dirhash Standard](https://github.com/andhus/dirhash).