# Running the Gauntlet

## Maintaining a Gem Mirror

I use rubygems-mirror to keep an archive of all the latest rubygems on
an external disk. Here is the config:

```
---
- from: https://rubygems.org
  to: /Volumes/StuffA/gauntlet/mirror
  parallelism: 10
  retries: 3
  delete: true
  skiperror: true
  hashdir: true
```

And I update using rake:

```
% cd GIT/rubygems/rubygems-mirror
% git down
% rake mirror:latest
% /Volumes/StuffA/gauntlet/bin/cleanup.rb -y -v
```

This rather quickly updates my mirror to the latest versions of
everything and then deletes all old versions. I then run a cleanup
script that fixes each file's date to its publication date and deletes
any gems that have invalid specs. The cleanup can argue with the
mirror a bit (the next mirror run may re-fetch things it deleted), but
the churn is pretty minimal (currently ~20 bad gems).
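
The cleanup script itself is simple at heart. Here is a minimal
sketch of the idea, not the real `cleanup.rb`: the mirror path is
assumed from the config above, and the spec's own date stands in for
the publication date.

```
#!/usr/bin/env ruby
# Sketch only -- NOT the real cleanup.rb. Sets each gem's mtime to the
# date recorded in its spec (an approximation of publication date) and
# deletes any gem whose spec won't load or validate.

require "rubygems/package"

mirror = "/Volumes/StuffA/gauntlet/mirror/gems" # assumed location

Dir.glob(File.join(mirror, "**", "*.gem")) do |path|
  spec = Gem::Package.new(path).spec # raises if the gem won't load
  spec.validate                      # raises if the spec is invalid
  File.utime(spec.date, spec.date, path)
rescue StandardError => e
  warn "deleting #{path}: #{e.class}"
  File.unlink(path)
end
```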

## Curating an Archive of Ruby Files

Next, I process the gem mirror into a much more digestible structure
using `unpack_gems.rb`.

```
% cd RP/gauntlet
% time caffeinate ./bin/unpack_gems.rb -v [-a] ; say done
... waaaait ...
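# `today` prints the date as YYYY-MM-DD; pick "all" or "new" to match whether -a was used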
% DIR=gauntlet.$(today).(all|new).noindex
% mv hashed.noindex $DIR
% tar vc -T <(fd -tf . $DIR | sort) | zstdmt -12 --long > archives/$DIR.tar.zst ; say done
% ./bin/sync.sh
```

This script filters down to the newer (< 1 year old) gems (unless
`-a` is used to take them all), unpacks them, finds all the files
that look like ruby, verifies that they actually are valid ruby (by
compiling them with the current version of ruby), and then moves them
into a SHA dir structure that looks something like this:

```
hashed.noindex/a/b/c/<full_file_sha>.rb
```

This removes all duplicates and puts everything in a fairly even,
wide, flat directory layout.
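
The placement step is easy to sketch. Something like the following,
where `RubyVM::InstructionSequence.compile` stands in for "compile
with the current ruby" (the real `unpack_gems.rb` surely does more):

```
require "digest/sha1"
require "fileutils"

# Sketch: validate one candidate file, then file it by SHA1 so that
# duplicate sources collapse onto a single path.
def place(path, dest = "hashed.noindex")
  src = File.binread(path)
  RubyVM::InstructionSequence.compile(src)      # raises SyntaxError if invalid
  sha = Digest::SHA1.hexdigest(src)
  dir = File.join(dest, sha[0], sha[1], sha[2]) # a/b/c = first chars of the sha
  FileUtils.mkdir_p(dir)
  FileUtils.mv(path, File.join(dir, "#{sha}.rb"))
rescue SyntaxError
  # not valid under the current ruby; skip it
end
```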

This process takes a very long time, even with a lot of
parallelization. There are currently about 160k gems in the mirror.
Unpacking, validating, and SHA'ing everything is disk and CPU
intensive. The `.noindex` extension stops Spotlight from indexing the
continuous churn of files being unpacked and moved, and saves time.

Finally, I rename and archive it all up (currently using zstd to
compress).

### Stats

```
9696 % fd -tf . gauntlet.$(today).noindex | wc -l
  561270
3.5G gauntlet.2021-08-06.noindex
239M gauntlet.2021-08-06.noindex.tar.zst
```

So I wind up with a little over half a million unique ruby files to
parse. It's about 3.5 GB, but it compresses very nicely down to about
240 MB (roughly 15:1).

## Running the Gauntlet

Assuming you're starting from scratch, unpack the archive once:

```
% tar xf gauntlet.$(today).noindex.tar.zst
```

(BSD tar, and apparently newer GNU tars, can detect and decompress
most compression formats.)

Then, either run a single process (easier to read):

```
% ./gauntlet/bin/gauntlet.rb gauntlet/*.noindex/?
```

Or max out your machine using xargs (note the `-P 16` and adjust it
to your core count):

```
% ls -d gauntlet/*.noindex/?/? | time xargs -n 1 -P 16 ./gauntlet/bin/gauntlet.rb
```

In another terminal I usually monitor the progress like so:

```
% while true ; do
    clear
    # remove directories as they empty out
    fd . -td -te gauntlet/*.noindex -X rmdir -p 2> /dev/null
    # count the files remaining in each top-level shard
    for D in gauntlet/*.noindex/? ; do
      echo -n "$D: "; fd .rb $D | wc -l
    done
    echo
    sleep 30
  done
```

Once the run is done, there will be files left over that couldn't be
parsed, plus a directory with a name like `gauntlet.slow.1` holding
files that timed out. I generally wait for the first run to end, then
increase the timeout and run again on the timeout dir:

```
$ ls -d gauntlet.slow.1/*.noindex/?/? | RP_TIMEOUT=30 time xargs -n 1 -P 16 ./gauntlet/bin/gauntlet.rb
# or:
$ RP_TIMEOUT=30 time ./gauntlet/bin/gauntlet.rb gauntlet.slow.*
$ RP_TIMEOUT=60 time ./gauntlet/bin/gauntlet.rb gauntlet.slow.*
$ fd -tf . gauntlet.slow.60/
gauntlet.slow.60/gauntlet.2025-10-22.new.noindex/2/f/f/2ff00bbd2ee63b2145d247570c130823dce2b9fe.rb
gauntlet.slow.60/gauntlet.2025-10-22.new.noindex/a/a/4/aa44d5a214217036425bf8fce5a7ab5b0e04fd92.rb
```

For the most part, you wind up with absurdly large generated ruby files:

```
10022 $ wc -l gauntlet.slow.60/*/?/?/?/*.rb
  412444 gauntlet.slow.60/gauntlet.2025-10-22.new.noindex/2/f/f/2ff00bbd2ee63b2145d247570c130823dce2b9fe.rb
  295249 gauntlet.slow.60/gauntlet.2025-10-22.new.noindex/a/a/4/aa44d5a214217036425bf8fce5a7ab5b0e04fd92.rb
  707693 total
```

and I don't care so much about these.
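
As an aside, the `RP_TIMEOUT` mechanism is easy to picture. A minimal
sketch, assuming a per-file time limit and the `gauntlet.slow.<limit>`
naming seen above (this is not the actual `gauntlet.rb`):

```
require "timeout"
require "fileutils"

# Sketch only -- not the actual gauntlet.rb. Parses each file under a
# time limit; successes are removed, timeouts move to
# gauntlet.slow.<limit>/ for a later, slower pass.
limit = Integer(ENV["RP_TIMEOUT"] || "10") # default is a guess

ARGV.each do |dir|
  Dir.glob(File.join(dir, "**", "*.rb")) do |path|
    Timeout.timeout(limit) do
      RubyVM::InstructionSequence.compile(File.binread(path)) # stand-in for the real parse
    end
    File.unlink(path) # parsed fine, so only problem files remain behind
  rescue Timeout::Error
    slow = File.join("gauntlet.slow.#{limit}", File.dirname(path))
    FileUtils.mkdir_p(slow)
    FileUtils.mv(path, slow)
  rescue SyntaxError
    # files that fail to parse stay put for inspection
  end
end
```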