File: concept.md

---
title: Concept and Design
teaser: Thinc's conceptual model and how it works
next: /docs/install
---

Thinc is built on a fairly simple conceptual model that's a little bit different
from other neural network libraries. On this page, we build up the library from
first principles, so you can see how everything fits together. This page assumes
some conceptual familiarity with [backpropagation](/docs/backprop101), but you
should be able to follow along even if you're hazy on some of the details.

## The model composition problem

The central problem for a neural network implementation is this: during the
**forward pass**, you compute results that will later be useful during the
**backward pass**. How do you keep track of this arbitrary state, while making
sure that layers can be cleanly composed?

Instead of starting with the problem directly, let's start with a simple and
obvious approach, so that we can run into the problem more naturally. The most
obvious idea is that we have something called a `model`, and this thing holds
some parameters ("weights") and has a method to predict from some inputs to some
outputs using the current weights. So far so good. But we also need a way to
update the weights. The most obvious API for this is to add an `update` method,
which will take a batch of inputs and a batch of correct labels, and compute the
weight update.

```python
class UncomposableModel:
    def __init__(self, W):
        self.W = W

    def predict(self, inputs):
        return inputs @ self.W.T

    def update(self, inputs, targets, learn_rate=0.001):
        guesses = self.predict(inputs)
        d_guesses = (guesses-targets) / targets.shape[0]  # gradient of loss w.r.t. output
        # The @ is newish Python syntax for matrix multiplication
        d_inputs = d_guesses @ self.W
        dW = d_guesses.T @ inputs  # gradient of parameters
        self.W -= learn_rate * dW  # update weights
        return d_inputs
```

This API design works in itself, but the `update()` method only works as the
outer-level API. You wouldn't be able to put another layer with the same API
after this one and backpropagate through both of them. Let's look at the steps
for backpropagating through two matrix multiplications:

```python
def backprop_two_layers(W1, W2, inputs, targets):
    hiddens = inputs @ W1.T
    guesses = hiddens @ W2.T
    d_guesses = (guesses-targets) / targets.shape[0]  # gradient of loss w.r.t. output
    dW2 = d_guesses.T @ hiddens  # gradient of second-layer weights
    d_hiddens = d_guesses @ W2
    dW1 = d_hiddens.T @ inputs   # gradient of first-layer weights
    d_inputs = d_hiddens @ W1
    return dW1, dW2, d_inputs
```

In order to update the first layer, we need to know the gradient with respect to
its output. We can't calculate that value until we've finished the full forward
pass, calculated the gradient of the loss, and then backpropagated through the
second layer. This is why the `UncomposableModel` is uncomposable: the `update`
method expects the input and the target to both be available. That only works
for the outermost API – the same API can't work for intermediate layers.
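
To see the problem concretely, here's a hypothetical attempt to stack two
`UncomposableModel` instances. The shapes and variable names are made up for
illustration:

```python
import numpy

n_batch, n_in, n_hidden, n_out = 8, 4, 5, 3
inputs = numpy.random.uniform(size=(n_batch, n_in))
targets = numpy.random.uniform(size=(n_batch, n_out))
layer1 = UncomposableModel(numpy.random.uniform(size=(n_hidden, n_in)))
layer2 = UncomposableModel(numpy.random.uniform(size=(n_out, n_hidden)))

# The forward direction composes fine:
hiddens = layer1.predict(inputs)
# Updating the outer layer also works: it returns the gradient with respect
# to its input, which is exactly what the first layer would need...
d_hiddens = layer2.update(hiddens, targets)
# ...but layer1.update() expects (inputs, targets), not a gradient, so there
# is no way to pass d_hiddens back into it. The first layer can never be
# trained as an inner layer with this API.
```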

Although nobody thinks of it this way, reverse-mode auto-differentiation (as
supported by PyTorch, TensorFlow, etc.) can be seen as a solution to this API
problem. The solution is to base the API around the `predict` method, which
doesn't have the same composition problem: there's no problem with writing
`model3.predict(model2.predict(model1.predict(X)))`, or
`model3.predict(model2.predict(X) + model1.predict(X))`, etc. We can easily
build a larger model from smaller functions when we're programming the forward
computations, and that's exactly the API that reverse-mode auto-differentiation
was invented to offer.

The key idea behind Thinc is that it's possible to just fix the API problem
directly, so that models can be composed cleanly both forwards and backwards.
This results in an interestingly different developer experience: the code is far
more explicit and there are very few details of the framework to consider.
You get more flexibility, but potentially at the cost of some performance, and
sometimes with more opportunities to make mistakes.

We don't want to suggest that Thinc's approach is uniformly better than a
high-performance computational graph engine such as PyTorch or TensorFlow. It
isn't. The trick is to use them together: you can use PyTorch, TensorFlow or
some other library to do almost all of the actual computation, while doing
almost all of your programming with a much more transparent, flexible and
simpler system. Here's how it works.

## No (explicit) computational graph – just higher-order functions

The API design problem we're facing here is actually pretty basic. We're trying
to compute two values, but before we can compute the second one, we need to pass
control back to the caller, so they can use the first value to give us an extra
input. The general solution to this type of problem is a **callback**, and in
fact a callback is exactly what we need here.

Specifically, we need to make sure our model functions return a result, and then
a callback that takes a gradient of outputs, and computes the corresponding
gradient of inputs.

```python
def forward(X: InT) -> Tuple[OutT, Callable[[OutT], InT]]:
    Y: OutT = _do_whatever_computation(X)

    def backward(dY: OutT) -> InT:
        dX: InT = _do_whatever_backprop(dY, X)
        return dX

    return Y, backward
```

To make this less abstract, here are two [layers](/docs/api-layers) following
this signature. For now, we'll stick to layers that don't introduce any
trainable weights, to keep things simple.

```python
### reduce_sum layer
def reduce_sum(X: Floats3d) -> Tuple[Floats2d, Callable[[Floats2d], Floats3d]]:
    Y = X.sum(axis=1)
    X_shape = X.shape

    def backprop_reduce_sum(dY: Floats2d) -> Floats3d:
        dX = zeros(X_shape)
        dX += dY.reshape((dY.shape[0], 1, dY.shape[1]))
        return dX

    return Y, backprop_reduce_sum
```

```python
### Relu layer
def relu(inputs: Floats2d) -> Tuple[Floats2d, Callable[[Floats2d], Floats2d]]:
    mask = inputs >= 0
    def backprop_relu(d_outputs: Floats2d) -> Floats2d:
        return d_outputs * mask
    return inputs * mask, backprop_relu

```

Notice that the `reduce_sum` layer's output is a different shape from its input.
The forward pass runs from input to output, while the backward pass runs from
gradient-of-output to gradient-of-input. This means that we'll always have two
matching pairs: `(input_to_forward, output_of_backprop)` and
`(output_of_forward, input_of_backprop)`. These pairs must match in type. If our
functions obey this invariant, we'll be able to write
[combinator functions](/docs/api-layers#combinators) that can wire together
layers in standard ways.

The most basic way we'll want to combine layers is a feed-forward relationship.
We call this combinator `chain`, after the chain rule:

```python
### Chain combinator
def chain(layer1, layer2):
    def forward_chain(X):
        Y, get_dX = layer1(X)
        Z, get_dY = layer2(Y)

        def backprop_chain(dZ):
            dY = get_dY(dZ)
            dX = get_dX(dY)
            return dX

        return Z, backprop_chain

    return forward_chain
```

We can use the `chain` combinator to build a function that runs our `reduce_sum`
and `relu` layers in succession:

```python
chained = chain(reduce_sum, relu)
X = uniform((2, 10, 6)) # (batch_size, sequence_length, width)
dZ = uniform((2, 6))    # (batch_size, width)
Z, get_dX = chained(X)
dX = get_dX(dZ)
assert dX.shape == X.shape
```

Our `chain` combinator works easily because our layers return callbacks. The
callbacks ensure that there is no distinction in API between the outermost layer
and a layer that's part of a larger network. We can see this clearly by
imagining the alternative, where the function expects the gradient with respect
to the output along with its input:

```python
### Problem without callbacks {highlight="15-19"}
def reduce_sum_no_callback(X, dY):
    Y = X.sum(axis=1)
    X_shape = X.shape
    dX = zeros(X_shape)
    dX += dY.reshape((dY.shape[0], 1, dY.shape[1]))
    return Y, dX

def relu_no_callback(inputs, d_outputs):
    mask = inputs >= 0
    outputs = inputs * mask
    d_inputs = d_outputs * mask
    return outputs, d_inputs

def chain_no_callback(layer1, layer2):
    def chain_forward_no_callback(X, dZ):
        # How do we call layer1? We can't, because its signature expects dY
        # as part of its input – but we don't know dY yet! We can only
        # compute dY once we have Y. That's why layers must return callbacks.
        raise CannotBeImplementedError()
```

The `reduce_sum` and `relu` layers are easy to work with, because they don't
introduce any parameters. But networks without any parameters aren't very
useful. So how should we handle parameters? We can't just treat them as another
type of input variable, because that's not how we want to use the network. We
want the parameters of a layer to be an internal detail – **we don't want to
have to pass in the parameters on each input**.

Parameters need to be handled differently from input variables, because we want
to specify them at different times. We'd like to specify the parameters once
when we create the function, and then have them be an internal detail that
doesn't affect the function's signature. The most direct approach is to
introduce another layer of closures, and make the parameters and their gradients
arguments to the outer layer. The gradients can then be incremented during the
backward pass:

```python
def Linear(W, b, dW, db):
    def forward_linear(X):

        def backward_linear(dY):
            # The nonlocal declaration is needed: the augmented assignments
            # below would otherwise make dW and db local names and raise
            # UnboundLocalError. The arrays themselves are updated in-place.
            nonlocal dW, db
            dW += dY.T @ X
            db += dY.sum(axis=0)
            return dY @ W

        return X @ W.T + b, backward_linear
    return forward_linear

n_batch = 128
n_in = 16
n_out = 32
W = uniform((n_out, n_in))
b = uniform((n_out,))
dW = zeros(W.shape)
db = zeros(b.shape)
X = uniform((n_batch, n_in))
Y_true = uniform((n_batch, n_out))

linear = Linear(W, b, dW, db)
Y_out, get_dX = linear(X)

# Now we could calculate a loss and backpropagate
dY = (Y_out - Y_true) / Y_true.shape[0]
dX = get_dX(dY)

# Now we could do an optimization step like
W -= 0.001 * dW
b -= 0.001 * db
dW.fill(0.0)
db.fill(0.0)
```

While the above approach would work, handling the parameters and their gradients
explicitly will quickly get unmanageable. To make things easier, we need to
introduce a `Model` class, so that we can **keep track of the parameters,
gradients, dimensions** and other attributes that each layer might require.
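
As a rough illustration of the bookkeeping such a class takes over, here's a
simplified sketch. This is not Thinc's actual `Model` class – the name
`SketchModel` and its attributes are made up for the example:

```python
from typing import Callable, Dict, Optional

class SketchModel:
    """Toy stand-in for a model class: it owns the parameters, their
    gradients and the dimensions, so layer code doesn't have to thread
    them through by hand."""

    def __init__(self, name: str, forward: Callable,
                 params: Optional[Dict] = None,
                 dims: Optional[Dict[str, int]] = None):
        self.name = name
        self._forward = forward
        self.params = dict(params or {})                 # e.g. {"W": array, "b": array}
        self.grads = {key: None for key in self.params}  # accumulated gradients
        self.dims = dict(dims or {})                     # e.g. {"nO": 32, "nI": 16}

    def __call__(self, X):
        # Delegate to the forward function, passing the model in so it can
        # read parameters and write gradients.
        return self._forward(self, X)
```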

The most obvious thing to do at this point would be to introduce one class per
layer type, with the forward pass implemented as a method on the class. While
this approach would work reasonably well, we've preferred a slightly different
implementation that relies on composition rather than inheritance. The
implementation of the [`Linear` layer](/docs/api-layers#linear) provides a good
example.

Instead of defining a subclass of `thinc.model.Model`, the layer provides a
function `Linear` that constructs a [`Model` instance](/docs/api-model), passing
in the function `forward` in `thinc.layers.linear`:

```python
def forward(model: Model, X: InputType, is_train: bool):
```

The function receives a `model` instance as its first argument, which gives you
access to the dimensions, parameters, gradients, attributes and layers. The
second argument is the input data, and the third argument is a boolean that lets
layers run differently during training and prediction – an important requirement
for layers like dropout and batch normalization.
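
To make this concrete, here's a sketch of what such a `forward` function can
look like for a linear layer. It loosely follows the pattern of
`thinc.layers.linear`, but treat it as an illustration (including the
`linear_forward` name) rather than the exact implementation:

```python
from typing import Callable, Tuple
from thinc.api import Model
from thinc.types import Floats2d

def linear_forward(model: Model, X: Floats2d, is_train: bool) -> Tuple[Floats2d, Callable]:
    # A plain linear layer behaves the same way during training and
    # prediction, so is_train isn't used here.
    W = model.get_param("W")
    b = model.get_param("b")
    Y = model.ops.gemm(X, W, trans2=True) + b  # i.e. X @ W.T + b

    def backprop(dY: Floats2d) -> Floats2d:
        # Accumulate the parameter gradients on the model...
        model.inc_grad("W", model.ops.gemm(dY, X, trans1=True))
        model.inc_grad("b", dY.sum(axis=0))
        # ...and return the gradient with respect to the input.
        return model.ops.gemm(dY, W)

    return Y, backprop
```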

As well as the `forward` function, the `Model` also lets you pass in a function
`init`, allowing us to support **shape inference**.

```python
### Linear {highlight="3-4"}
model = Model(
    "linear",
    forward,
    init=init,
    dims={"nO": nO, "nI": nI},
    params={"W": None, "b": None},
)
```

We want to be able to define complex networks concisely, passing in **only
genuine configuration** — we shouldn't have to pass in a lot of variables whose
values are dictated by the rest of the network. The more redundant the
configuration, the more ways the values we pass in can be invalid. In the
closure-based `Linear` example earlier, there are many different ways for the
inputs to be invalid: the `W` and `dW` variables could be different shapes, the
size of `b`
could fail to match the first dimension of `W`, the second dimension of `W`
could fail to match the second dimension of the input, etc. With inputs like
these, there's no way we can expect functions to validate their inputs reliably,
leading to unpredictable logic errors that make the calling code difficult to
debug.

In a network with two `Linear` layers, only one dimension is an actual
hyperparameter. The input size to the first layer and the output size of the
second layer are both **determined by the shape of the data**. The only choice
to make is the number of "hidden units", which will determine the output size of
the first layer and the input size of the second layer. So we want to be able to
write something like this:

```python
model = chain(Linear(nO=n_hidden), Linear())
```

... and have the missing dimensions **inferred later**, based on the input and
output data. In order to make this work, we need to specify initialization logic
for each layer we define. For example, here's the initialization logic for the
`Linear` and `chain` layers:

```python
### Initialization logic
from typing import Optional
from thinc.api import Model, glorot_uniform_init
from thinc.types import Floats2d
from thinc.util import get_width

def init(model: Model, X: Optional[Floats2d] = None, Y: Optional[Floats2d] = None) -> None:
    if X is not None:
        model.set_dim("nI", get_width(X))
    if Y is not None:
        model.set_dim("nO", get_width(Y))
    W = model.ops.alloc2f(model.get_dim("nO"), model.get_dim("nI"))
    b = model.ops.alloc1f(model.get_dim("nO"))
    W = glorot_uniform_init(model.ops, W.shape)  # returns the initialized array
    model.set_param("W", W)
    model.set_param("b", b)
```