# Running the Gauntlet
## Maintaining a Gem Mirror
I use rubygems-mirror to keep an archive of all the latest rubygems on
an external disk. Here is the config:
```
---
- from: https://rubygems.org
to: /Volumes/StuffA/gauntlet/mirror
parallelism: 10
retries: 3
delete: true
skiperror: true
hashdir: true
```
And I update using rake:
```
% cd GIT/rubygems/rubygems-mirror
% git down
% rake mirror:latest
% /Volumes/StuffA/gauntlet/bin/cleanup.rb -y -v
```
This rather quickly updates my mirror to the latest versions of
everything and then deletes all old versions. I then run a cleanup
script that resets the file dates to their publication dates and
deletes any gems whose specs are invalid. The cleanup can argue with
the mirror a bit, but the churn is pretty minimal (currently ~20 bad
gems).
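The cleanup script itself isn't shown here, but its two jobs can be
sketched in ruby. This is a rough stand-in, not the real `cleanup.rb`:
the helper name, the generic rescue, and the exact spec check are my
assumptions.

```ruby
require "rubygems/package"

# Sketch of a mirror cleanup pass (NOT the real cleanup.rb): reset
# each gem's mtime to its spec's publication date, and delete gems
# whose specs won't load. Returns [kept, deleted] counts.
def cleanup_mirror root
  kept = deleted = 0
  Dir.glob(File.join(root, "**", "*.gem")) do |path|
    begin
      spec = Gem::Package.new(path).spec
      File.utime spec.date, spec.date, path # mtime = publication date
      kept += 1
    rescue StandardError => e
      warn "bad gem, deleting: #{path} (#{e.class})"
      File.delete path
      deleted += 1
    end
  end
  [kept, deleted]
end

cleanup_mirror "/Volumes/StuffA/gauntlet/mirror" if $0 == __FILE__
```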
## Curating an Archive of Ruby Files
Next, I process the gem mirror into a much more digestible structure
using `unpack_gems.rb`.
```
% cd RP/gauntlet
% time caffeinate /Volumes/StuffA/gauntlet/bin/unpack_gems.rb -v [-a] ; say done
... waaaait ...
% DIR=gauntlet.$(today).(all|new).noindex
% mv hashed.noindex $DIR
% tar vc -T <(fd -tf . $DIR | sort) | zstd -5 -T0 --long > archives/$DIR.tar.zst ; say done
% ./bin/sync.sh
```
This script filters the gems down to the newer ones (< 1 year old,
unless `-a` is used), unpacks them, finds all the files that look like
they're valid ruby, verifies that they actually are (by compiling them
with the current version of ruby), and then moves them into a SHA dir
structure that looks something like this:
```
hashed.noindex/a/b/c/<full_file_sha>.rb
```
This removes all duplicates and puts everything in a fairly even,
wide, flat directory layout.
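The validate-then-hash step can be sketched like this. The helper name
is mine, and SHA-1 with a three-hex-digit fan-out is an assumption
about what `unpack_gems.rb` really does; the layout matches the
structure above.

```ruby
require "digest/sha1"

# Sketch of the validate-then-hash step: syntax-check the source with
# the current ruby (compile raises SyntaxError on bad input), then fan
# the file out by the first three hex digits of its SHA.
def hashed_path contents, root = "hashed.noindex"
  RubyVM::InstructionSequence.compile contents # SyntaxError if not valid ruby
  sha = Digest::SHA1.hexdigest contents
  File.join root, sha[0], sha[1], sha[2], "#{sha}.rb"
end
```

Since the destination is a pure function of the file's contents,
identical files land on the same path, which is how the duplicates
disappear.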
This process takes a very long time, even with a lot of
parallelization. There are currently about 160k gems in the mirror.
Unpacking, validating, and SHA'ing everything is disk- and CPU-intensive.
The `.noindex` extension stops Spotlight from indexing the continuous
churn of files being unpacked and moved, which saves time.
Finally, I rename and archive it all up (currently using zstd to
compress).
### Stats
```
% find gauntlet.$(today).noindex -type f | wc -l
561270
3.5G gauntlet.2021-08-06.noindex
239M gauntlet.2021-08-06.noindex.tar.zst
```
So I wind up with a little over half a million unique ruby files to
parse. It's about 3.5 GB but compresses very nicely down to 240 MB.
## Running the Gauntlet
Assuming you're starting from scratch, unpack the archive once:
```
% zstdcat gauntlet.$(today).noindex.tar.zst | tar x
```
Then, either run a single process (easier to read):
```
% ./gauntlet/bin/gauntlet.rb gauntlet/*.noindex/?
```
Or max out your machine using xargs (note the `-P 16` and choose accordingly):
```
% ls -d gauntlet/*.noindex/?/? | time xargs -n 1 -P 16 ./gauntlet/bin/gauntlet.rb
```
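I don't show `gauntlet.rb` here, since it wraps whatever parser is
under test. As a rough sketch of the driver's shape, here is a
stand-in that uses stdlib Ripper in place of the real parser: it
parses every file under the given directories and deletes the ones
that succeed, so only the failures remain. (Deleting on success is my
inference from the monitoring loop below, which counts the files
left.)

```ruby
require "ripper"

# Minimal stand-in for a gauntlet driver (NOT the real gauntlet.rb):
# try to parse each .rb file in the given directories; delete files
# that parse cleanly so only failures are left behind for inspection.
def run_gauntlet dirs, parser: ->(src) { !Ripper.sexp(src).nil? }
  failures = []
  dirs.each do |dir|
    Dir.glob(File.join(dir, "**", "*.rb")) do |path|
      if parser.call File.read(path)
        File.delete path  # parsed ok: done with this file
      else
        failures << path  # keep it around to debug
      end
    end
  end
  failures
end

run_gauntlet ARGV if $0 == __FILE__
```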
In another terminal I usually monitor the progress like so:
```
% while true ; do
    clear
    # prune directories that have been emptied out
    fd . -t d -t e gauntlet/*.noindex -X rmdir -p 2> /dev/null
    for D in gauntlet/*.noindex/? ; do
      echo -n "$D: "
      fd .rb $D | wc -l
    done
    echo
    sleep 30
  done
```