File: datasets.rst

torchtext.datasets
==================

.. currentmodule:: torchtext.datasets


.. _datapipes_warnings:

.. warning::

    The datasets supported by torchtext are datapipes from the `torchdata
    project <https://pytorch.org/data/beta/index.html>`_, which is still in Beta
    status. This means that the API is subject to change without deprecation
    cycles. In particular, we expect a lot of the current idioms to change with
    the eventual release of ``DataLoaderV2`` from ``torchdata``.

    Here are a few recommendations regarding the use of datapipes:

    - For shuffling the datapipe, do so in the DataLoader: ``DataLoader(dp, shuffle=True)``.
      You do not need to call ``dp.shuffle()``, because ``torchtext`` has
      already added that step for you. Note, however, that the datapipe won't be
      shuffled unless you explicitly pass ``shuffle=True`` to the DataLoader.

    - When using multi-processing (``num_workers=N``), use the built-in ``worker_init_fn``::

            from torch.utils.data.backward_compatibility import worker_init_fn
            DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True)

      This will ensure that data isn't duplicated across workers.

    - We also recommend using ``drop_last=True``. Without it, the batch sizes
      at the end of an epoch may be very small in some cases (smaller than with
      other map-style datasets). This can significantly affect accuracy, especially
      when batch norm is used. ``drop_last=True`` ensures that all batch sizes
      are equal.

    - Distributed training with ``DistributedDataParallel`` is not yet entirely
      stable / supported, and we don't recommend it at this point. It will be
      better supported in DataLoaderV2. If you still wish to use DDP, make sure
      that:

      - All workers (DDP workers *and* DataLoader workers) see a different part
        of the data. The datasets are already wrapped inside `ShardingFilter
        <https://pytorch.org/data/main/generated/torchdata.datapipes.iter.ShardingFilter.html>`_
        and you may need to call ``dp.apply_sharding(num_shards, shard_id)`` in order to shard the
        data across ranks (DDP workers) and DataLoader workers. One way to do this
        is to create a ``worker_init_fn`` that calls ``apply_sharding`` with the appropriate
        number of shards (DDP workers * DataLoader workers) and shard id (inferred through the
        rank and the worker ID of the corresponding DataLoader within the rank); a sketch of
        this approach appears after this list. Note, however, that this assumes an equal
        number of DataLoader workers for all ranks.
      - All DDP workers work on the same number of batches. One way to do this
        is to limit the size of the datapipe within each worker to
        ``len(datapipe) // num_ddp_workers``, but this might not suit all
        use-cases.
      - The shuffling seed is the same across all workers. You might need to
        call ``torch.utils.data.graph_settings.apply_shuffle_seed(dp, rng)``.
      - The shuffling seed is different across epochs.
      - The rest of the RNG (typically used for transformations) is
        **different** across workers, for maximal entropy and optimal accuracy.
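
      A rough sketch of the ``worker_init_fn`` approach described above. The
      rank and world size are captured in the main process (where
      ``torch.distributed`` is initialized); the ``apply_sharding`` call may
      need adapting to your datapipe graph::

          import torch.distributed as dist
          from torch.utils.data import get_worker_info

          def make_worker_init_fn(rank, world_size):
              def worker_init_fn(worker_id):
                  info = get_worker_info()
                  # Total shards = DDP workers * DataLoader workers per rank
                  num_shards = world_size * info.num_workers
                  # Shard id combines the DDP rank and this worker's id
                  shard_id = rank * info.num_workers + worker_id
                  # info.dataset is this worker's copy of the datapipe
                  info.dataset.apply_sharding(num_shards, shard_id)
              return worker_init_fn

          # DataLoader(dp, num_workers=4,
          #            worker_init_fn=make_worker_init_fn(dist.get_rank(),
          #                                               dist.get_world_size()))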

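Putting the first three recommendations together, a minimal single-process
setup might look like the following sketch (the dataset, batch size, and
worker count are illustrative)::

    from torch.utils.data import DataLoader
    from torch.utils.data.backward_compatibility import worker_init_fn
    from torchtext.datasets import AG_NEWS

    train_dp = AG_NEWS(split='train')

    # Shuffling is requested through the DataLoader rather than on the
    # datapipe itself; drop_last=True keeps all batch sizes equal
    dataloader = DataLoader(train_dp, batch_size=8, shuffle=True,
                            num_workers=4, worker_init_fn=worker_init_fn,
                            drop_last=True)
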
General use cases are as follows::

    # Import a dataset and obtain its training split as a datapipe
    from torchtext.datasets import IMDB

    train_iter = IMDB(split='train')

    # Each sample is a (label, line) pair; only the text is tokenized
    def tokenize(label, line):
        return line.split()

    tokens = []
    for label, line in train_iter:
        tokens += tokenize(label, line)

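Building on this, a vocabulary can be constructed from the raw iterator. The
sketch below uses ``get_tokenizer`` and ``build_vocab_from_iterator``; the
``<unk>`` special token is an illustrative choice::

    from torchtext.data.utils import get_tokenizer
    from torchtext.vocab import build_vocab_from_iterator
    from torchtext.datasets import IMDB

    tokenizer = get_tokenizer('basic_english')
    train_iter = IMDB(split='train')

    def yield_tokens(data_iter):
        # Each sample is a (label, line) pair; only the text is tokenized
        for _, line in data_iter:
            yield tokenizer(line)

    vocab = build_vocab_from_iterator(yield_tokens(train_iter),
                                      specials=['<unk>'])
    vocab.set_default_index(vocab['<unk>'])
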
The following datasets are currently available. If you would like to contribute
new datasets to the repo or work with your own custom datasets, please refer to the
`CONTRIBUTING_DATASETS.md <https://github.com/pytorch/text/blob/main/CONTRIBUTING_DATASETS.md>`_ guide.

.. contents:: Datasets
    :local:


Text Classification
^^^^^^^^^^^^^^^^^^^

AG_NEWS
~~~~~~~

.. autofunction:: AG_NEWS

AmazonReviewFull
~~~~~~~~~~~~~~~~

.. autofunction:: AmazonReviewFull

AmazonReviewPolarity
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: AmazonReviewPolarity

CoLA
~~~~

.. autofunction:: CoLA

DBpedia
~~~~~~~

.. autofunction:: DBpedia

IMDB
~~~~

.. autofunction:: IMDB

MNLI
~~~~

.. autofunction:: MNLI

MRPC
~~~~

.. autofunction:: MRPC

QNLI
~~~~

.. autofunction:: QNLI

QQP
~~~~

.. autofunction:: QQP

RTE
~~~~

.. autofunction:: RTE

SogouNews
~~~~~~~~~

.. autofunction:: SogouNews

SST2
~~~~

.. autofunction:: SST2

STSB
~~~~

.. autofunction:: STSB

WNLI
~~~~

.. autofunction:: WNLI

YahooAnswers
~~~~~~~~~~~~

.. autofunction:: YahooAnswers

YelpReviewFull
~~~~~~~~~~~~~~

.. autofunction:: YelpReviewFull

YelpReviewPolarity
~~~~~~~~~~~~~~~~~~

.. autofunction:: YelpReviewPolarity


Language Modeling
^^^^^^^^^^^^^^^^^

PennTreebank
~~~~~~~~~~~~

.. autofunction:: PennTreebank

WikiText-2
~~~~~~~~~~

.. autofunction:: WikiText2
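
Each item yielded by the language-modeling datasets is a raw line of text; a
minimal consumption sketch (blank-line filtering is an illustrative choice)::

    from torchtext.datasets import WikiText2

    train_iter = WikiText2(split='train')

    # Keep non-empty lines; tokenization and numericalization are up to you
    lines = [line for line in train_iter if line.strip()]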

WikiText-103
~~~~~~~~~~~~

.. autofunction:: WikiText103


Machine Translation
^^^^^^^^^^^^^^^^^^^

IWSLT2016
~~~~~~~~~

.. autofunction:: IWSLT2016

IWSLT2017
~~~~~~~~~

.. autofunction:: IWSLT2017

Multi30k
~~~~~~~~

.. autofunction:: Multi30k
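
A short usage sketch; ``language_pair`` selects the (source, target) languages,
here German to English::

    from torchtext.datasets import Multi30k

    train_iter = Multi30k(split='train', language_pair=('de', 'en'))

    # Each sample is a (source, target) pair of raw sentences
    src_sentence, tgt_sentence = next(iter(train_iter))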


Sequence Tagging
^^^^^^^^^^^^^^^^

CoNLL2000Chunking
~~~~~~~~~~~~~~~~~

.. autofunction:: CoNLL2000Chunking

UDPOS
~~~~~

.. autofunction:: UDPOS


Question Answering
^^^^^^^^^^^^^^^^^^

SQuAD 1.0
~~~~~~~~~

.. autofunction:: SQuAD1
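
A usage sketch, assuming the datapipe yields ``(context, question, answers,
answer_start_positions)`` tuples as in recent torchtext releases::

    from torchtext.datasets import SQuAD1

    train_iter = SQuAD1(split='train')

    # answers and ans_pos are parallel lists: candidate answer strings
    # and their character offsets within the context
    context, question, answers, ans_pos = next(iter(train_iter))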


SQuAD 2.0
~~~~~~~~~

.. autofunction:: SQuAD2


Unsupervised Learning
^^^^^^^^^^^^^^^^^^^^^

CC100
~~~~~

.. autofunction:: CC100

EnWik9
~~~~~~

.. autofunction:: EnWik9