---
title: Variable-length sequences
teaser: Dataclasses for ragged, padded, paired and list-based sequences
next: /docs/usage-type-checking
---

Thinc's built-in layers support several ways to **encode variable-length
sequence data**. The encodings are designed to avoid losing information, so you
can compose operations smoothly and easily build hierarchical models over
structured inputs. This page summarizes the different formats, their advantages
and disadvantages, and which built-in layers accept and return them. There are
no restrictions on what objects your own models can accept and return, so
you're free to invent your own data types.

## Background and motivation {#background}

The
[`numpy.ndarray`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html)
object represents a multi-dimensional table of data. In the simplest case,
there's only one dimension, or "axis", so the size of the array is equal to its
length:

```python
import numpy

array1d = numpy.ndarray((10,))
assert array1d.size == 10
assert array1d.shape == (10,)
```

To make a two-dimensional array, we can instead write
`array2d = numpy.ndarray((10, 16))`. This will be a table with 10 rows and 16
columns, and a total size of 160 items. However, the `ndarray` object does not
have a native way to represent data with a **variable number of columns per
row**. If the last row of your data only has 15 items rather than 16, you cannot
create a two-dimensional array with only 159 items, where rows one to nine have
16 columns and the last row has 15 columns. The limitation makes a lot of sense:
the rest of the `numpy` API presents operations defined in terms of
regularly-shaped arrays, and there's often no obvious generalization to
irregularly shaped data.
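
Continuing the example above, the regular two-dimensional case looks like this:

```python
import numpy

array2d = numpy.ndarray((10, 16))
assert array2d.size == 160
assert array2d.shape == (10, 16)
```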

While you could not represent the 159 items in a two-dimensional array, there's
no reason why you couldn't keep the data together in a flat format, all in one
dimension. You could keep track of the intended number of columns separately,
and reshape the data to do various operations according to your intended
definitions. This is essentially the approach that Thinc takes for
variable-length sequences.
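
For instance, here's a minimal sketch of that flat-plus-lengths idea in plain
`numpy` (the variable names are just illustrative, not part of Thinc's API):

```python
import numpy

# 159 items stored flat, with the intended row lengths tracked separately
data = numpy.arange(159)
lengths = numpy.array([16] * 9 + [15])
# Row boundaries follow from the cumulative lengths
starts = numpy.concatenate(([0], lengths.cumsum()[:-1]))
rows = [data[start:start + length] for start, length in zip(starts, lengths)]
assert len(rows) == 10
assert rows[-1].size == 15
```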

Inputs very often are irregularly shaped. For instance, texts vary in length,
often significantly. You might also want to represent your texts hierarchically:
each word can be seen as a variable-length sequence of characters, each sentence
a variable-length sequence of words, each paragraph a variable-length sequence
of sentences, and each text a variable-length sequence of paragraphs. A single,
padded `ndarray` is a poor choice for this type of hierarchical representation,
as each dimension would need to be padded to its longest item. If the longest
word in your batch is 10 characters, you will need to use 10 characters for
every word. If the longest sentence in your batch has 40 words, every sentence
will need to be 40 words. The inefficiency will be multiplied along each
dimension, so that the vast majority of the final structure is empty space.

Unfortunately, there is no single best solution that is most efficient for every
situation. It depends on the shapes of the data, and the hardware being used. On
GPU devices, it is often better to use **padded representations**, so long as
there is only padding along one dimension. However, on CPU, **denser
representations** are often more efficient, as maintaining parallelism is less
important. Different libraries also introduce different considerations. For
[JAX](https://github.com/google/jax), operations over irregularly-sized arrays
are extremely expensive, as a new kernel will need to be compiled for every
combination of shapes you provide.

Thinc therefore provides a number of **different sequence formats**, with
utility layers that convert between them. Thinc also provides layers that
represent reversible transformations. The
[`with_*` layers](/docs/api-layers#with_array) accept a layer as an argument,
and transform inputs on the way into the layer, and then perform the opposite
transformation on the way out. For instance, the
[`with_padded`](/docs/api-layers#with_padded) wrapper will allow you to
temporarily convert to a [`Padded`](/docs/api-types#padded) representation, for
the scope of the layer being wrapped.
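
For example, here's a minimal sketch of wrapping an
[`LSTM`](/docs/api-layers#lstm) with `with_padded`, so it can be called
directly with a list of arrays (the sizes used here are arbitrary):

```python
from thinc.api import LSTM, get_current_ops, with_padded

ops = get_current_ops()
# The wrapped LSTM receives a Padded batch; the wrapper converts the list
# on the way in and restores a list of arrays on the way out.
model = with_padded(LSTM(nO=2, nI=4)).initialize()
sequences = [ops.alloc2f(5, 4), ops.alloc2f(3, 4)]
outputs = model.predict(sequences)
assert outputs[0].shape == (5, 2)
assert outputs[1].shape == (3, 2)
```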

---

## Padded {#padded}

The [`Padded`](/docs/api-types#padded) dataclass represents a **padded batch of
sequences** sorted by descending length. The data is formatted in "sequence
major" format, i.e. the first dimension represents the sequence position, and
the second dimension represents the batch index. The `Padded` type uses three
auxiliary integer arrays to keep track of the actual sequence lengths and the
original positions, so that the original structure can be restored. The third
auxiliary array, `size_at_t`, allows the padded batch to be sliced to currently
active sequences at different time steps.

Although the underlying data is sequence-major, the `Padded` dataclass supports
**getting items** or **slices along the batch dimension**: you can write
`padded[1:3]` to retrieve a `Padded` object with batch items one and two. The
`Padded` format is well-suited to LSTMs and other RNN models.

```python
### Example
from thinc.api import get_current_ops, Padded

ops = get_current_ops()
sequences = [
    ops.alloc2f(7, 5) + 1,
    ops.alloc2f(2, 5) + 2,
    ops.alloc2f(4, 5) + 3,
]
padded = ops.list2padded(sequences)
assert padded.data.shape == (7, 3, 5)
# Data from sequence 0 is first, as it was the longest
assert (padded.data[:, 0] == 1).all()
# Data from sequence 2 is second, and it's padded on dimension 0
assert (padded.data[:4, 1] == 3).all()
# Data from sequence 1 is third, also padded on dimension 0
assert (padded.data[:2, 2] == 2).all()
# Original positions
assert list(padded.indices) == [0, 2, 1]
# Getting items and slices refers to the batch dimension
assert isinstance(padded[0], Padded)
assert padded[0].data.shape == (7, 1, 5)
```

|                |                                                                                                                                             |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **Operations** | [`Ops.list2padded`](/docs/api-backends#list2padded), [`Ops.padded2list`](/docs/api-backends#padded2list)                                    |
| **Transforms** | [`padded2list`](/docs/api-layers#padded2list), [`list2padded`](/docs/api-layers#list2padded), [`with_padded`](/docs/api-layers#with_padded) |
| **Layers**     | [`LSTM`](/docs/api-layers#lstm), [`PyTorchLSTM`](/docs/api-layers#lstm)                                                                     |

## Ragged {#ragged}

The [`Ragged`](/docs/api-types#ragged) dataclass represents a **concatenated
batch of sequences**. An auxiliary array is used to keep track of the lengths.
The `Ragged` format is memory efficient, and works well for some operations.
However, it is not supported directly by most underlying operations.
Per-sequence operations such as sequence transposition and matrix multiplication
are relatively expensive, but Thinc does provide custom CPU and CUDA kernels for
more efficient reduction (aka pooling) operations on ragged arrays.

The `Ragged` format makes it easy to ignore the sequence structure of your data
for some operations, such as word embeddings or feed-forward layers. These
layers do not accept the `Ragged` object directly, but you can wrap the layer
using the [`with_array`](/docs/api-layers#with_array) transform to make them
compatible without requiring copy operations. The `with_array` transform will
pass the underlying array data into the layer, and return the outputs as a
`Ragged` object so that the sequence information remains available to the rest
of your network.

```python
### Example
from thinc.api import get_current_ops, Ragged, Linear, list2ragged

ops = get_current_ops()
sequences = [
    ops.alloc2f(7, 5) + 1,
    ops.alloc2f(2, 5) + 2,
    ops.alloc2f(4, 5) + 3,
]
list2ragged_model = list2ragged()
ragged = list2ragged_model.predict(sequences)
assert ragged.data.shape == (13, 5)
# This will always be true:
assert ragged.data.shape[0] == ragged.lengths.sum()
# Data from sequence 0 is in the first 7 rows, followed by seqs 1 and 2
assert (ragged.data[:7] == 1).all()
assert (ragged.data[7:9] == 2).all()
assert (ragged.data[9:] == 3).all()
# Indexing gets the batch item, and returns a Ragged object
assert ragged[0].data.shape == (7, 5)
# You can pass the data straight into dense layers
model = Linear(6, 5).initialize()
output = model.predict(ragged.data)
ragged_out = Ragged(output, ragged.lengths)
# Internally, data is reshaped to 2d. The original shape is accessible via
# the dataXd property.
sequences3d = [ops.alloc3f(5, 6, 7), ops.alloc3f(10, 6, 7)]
ragged3d = list2ragged_model.predict(sequences3d)
assert ragged3d.data.shape == (15, 42)
assert ragged3d.dataXd.shape == (15, 6, 7)
```

|                |                                                                                                                                                                                                                                                                    |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Operations** | [`Ops.ragged2list`](/docs/api-backends#ragged2list), [`Ops.list2ragged`](/docs/api-backends#list2ragged), [`Ops.reduce_sum`](/docs/api-backends#reduce_sum), [`Ops.reduce_mean`](/docs/api-backends#reduce_sum), [`Ops.reduce_max`](/docs/api-backends#reduce_sum) |
| **Transforms** | [`with_ragged`](/docs/api-layers#with_ragged), [`ragged2list`](/docs/api-layers#ragged2list), [`list2ragged`](/docs/api-layers#list2ragged)                                                                                                                        |
| **Layers**     | [`reduce_sum`](/docs/api-layers#reduce_sum), [`reduce_mean`](/docs/api-layers#reduce_mean), [`reduce_max`](/docs/api-layers#reduce_max)                                                                                                                            |

## List[ArrayXd] {#list-array}

A list of arrays is often the most convenient input and output format for
sequence data, especially for runtime usage of the model. However, most
mathematical operations require the data to be passed in **as a single array**,
so you will usually need to transform the array list into another format to pass
it into various layers. A common pattern is to use `list2padded` or
`list2ragged` as the first layer of your network, and `ragged2list` or
`padded2list` as the final layer. You could then opt to strip these
transformations from the network during training, so that the conversion is
performed just once at the beginning of training. However, this does mean that
you'll be training on the same batches of data in each epoch, since the
sequences can no longer be reshuffled into new batch combinations, which may
lead to lower accuracy. A sketch of this pattern is shown below.

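As a rough sketch of this pattern (the layer and its dimensions are arbitrary
choices for illustration), the network below converts a list of arrays to
`Ragged`, applies a dense layer to the concatenated data, and restores the
list structure on the way out:

```python
### Example
from thinc.api import Linear, chain, get_current_ops, list2ragged, ragged2list, with_array

ops = get_current_ops()
sequences = [ops.alloc2f(7, 5), ops.alloc2f(2, 5)]
# list -> Ragged -> dense layer over the concatenated rows -> list
model = chain(list2ragged(), with_array(Linear(nO=4, nI=5)), ragged2list())
model.initialize(X=sequences)
outputs = model.predict(sequences)
assert outputs[0].shape == (7, 4)
assert outputs[1].shape == (2, 4)
```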

|                |                                                                                                                                                                                                                                                                                          |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Operations** | [`Ops.ragged2list`](/docs/api-backends#ragged2list), [`Ops.list2ragged`](/docs/api-backends#list2ragged) , [`Ops.padded2list`](/docs/api-backends#padded2list), [`Ops.list2padded`](/docs/api-backends#list2padded)                                                                      |
| **Transforms** | [`ragged2list`](/docs/api-layers#ragged2list), [`list2ragged`](/docs/api-layers#list2ragged), [`padded2list`](/docs/api-layers#padded2list), [`list2padded`](/docs/api-layers#list2padded), [`with_ragged`](/docs/api-layers#with_ragged), [`with_padded`](/docs/api-layers#with_padded) |
| **Layers**     | [`reduce_sum`](/docs/api-layers#reduce_sum), [`reduce_mean`](/docs/api-layers#reduce_mean), [`reduce_max`](/docs/api-layers#reduce_max)                                                                                                                                                  |

## List[List[Any]] {#nested-list}

Nested lists are a useful format for many types of **hierarchically structured
data**. Often you'll need to write your own layers for these situations, but
Thinc does provide a helpful utility, the `with_flatten` transform. This
transform can be applied around a layer in your network, and the layer will be
called with a flattened representation of your list data. The outputs are then
repackaged into lists, with arrays divided as needed.

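The underlying flatten-apply-unflatten idea is easy to sketch in plain Python.
The helper below is purely illustrative (`apply_to_flat` is a hypothetical
name, not Thinc's implementation):

```python
from typing import Any, Callable, List

def apply_to_flat(
    func: Callable[[List[Any]], List[Any]], nested: List[List[Any]]
) -> List[List[Any]]:
    # Flatten the nested lists, remembering the inner lengths
    lengths = [len(inner) for inner in nested]
    flat_output = func([item for inner in nested for item in inner])
    # Repackage the outputs into lists of the original lengths
    output: List[List[Any]] = []
    start = 0
    for length in lengths:
        output.append(flat_output[start:start + length])
        start += length
    return output

assert apply_to_flat(lambda xs: [x * 2 for x in xs], [[1, 2], [3]]) == [[2, 4], [6]]
```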

|                |                                                 |
| -------------- | ----------------------------------------------- |
| **Transforms** | [`with_flatten`](/docs/api-layers#with_flatten) |

<!-- TODO:

## List[Any] {#object-list}

-->

## Array {#array}

Most Thinc layers that work on sequences do **not** expect plain arrays, because
the array does not include any representation of where the sequences begin and
end, which makes the semantics of some operations unclear. For instance, there's
no way to accurately do mean pooling on an array of padded sequences without
knowing where the sequences actually end. Max pooling is often also difficult,
depending on the padding value. If the array represents sequences, you should
**maintain the metadata** that lets it be interpreted as a sequence, using one
of the formats above.
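
To make the problem concrete, here's a small illustration with plain `numpy`
(the values are arbitrary):

```python
import numpy

# One sequence of length 2, zero-padded to length 4
padded = numpy.array([1.0, 3.0, 0.0, 0.0])
# Without the true length, the padding values drag the mean down
assert padded.mean() == 1.0
# With the length metadata, the correct mean can be recovered
true_length = 2
assert padded[:true_length].mean() == 2.0
```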