File: index.rst

S3Fs
====

S3Fs is a Pythonic file interface to S3.  It builds on top of botocore_.  The project is hosted on `GitHub <https://github.com/fsspec/s3fs>`_.

The top-level class :py:class:`.S3FileSystem` holds connection information and allows
typical file-system style operations like ``cp``, ``mv``, ``ls``, ``du``,
``glob``, etc., as well as put/get of local files to/from S3.

The connection can be anonymous - in which case only publicly-available,
read-only buckets are accessible - or via credentials explicitly supplied
or in configuration files.

Calling ``open()`` on a :py:class:`.S3FileSystem` (typically using a context manager)
provides an :py:class:`.S3File` for read or write access to a particular key. The object
emulates the standard ``File`` protocol (``read``, ``write``, ``tell``,
``seek``), such that functions expecting a file can access S3. Only binary read
and write modes are implemented, with blocked caching.

S3Fs uses and is based upon `fsspec`_.

.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/

Examples
--------

A simple example, locating and reading a file:

.. code-block:: python

   >>> import s3fs
   >>> s3 = s3fs.S3FileSystem(anon=True)
   >>> s3.ls('my-bucket')
   ['my-file.txt']
   >>> with s3.open('my-bucket/my-file.txt', 'rb') as f:
   ...     print(f.read())
   b'Hello, world'

(see also ``walk`` and ``glob``)

Reading with delimited blocks:

.. code-block:: python

   >>> s3.read_block(path, offset=1000, length=10, delimiter=b'\n')
   b'A whole line of text\n'

Writing with blocked caching:

.. code-block:: python

   >>> s3 = s3fs.S3FileSystem(anon=False)  # uses default credentials
   >>> with s3.open('mybucket/new-file', 'wb') as f:
   ...     f.write(2*2**20 * b'a')
   ...     f.write(2*2**20 * b'a') # data is flushed and file closed
   >>> s3.du('mybucket/new-file')
   {'mybucket/new-file': 4194304}

Because S3Fs faithfully copies the Python file interface, it can be used
smoothly with other projects that consume the file interface, like ``gzip`` or
``pandas``.

.. code-block:: python

   >>> import gzip
   >>> import pandas as pd
   >>> with s3.open('mybucket/my-file.csv.gz', 'rb') as f:
   ...     g = gzip.GzipFile(fileobj=f)  # Decompress data with gzip
   ...     df = pd.read_csv(g)           # Read CSV file with Pandas

Integration
-----------

The libraries ``intake``, ``pandas`` and ``dask`` accept URLs with the prefix
"s3://", and will use s3fs to complete the IO operation in question. The
IO functions take an argument ``storage_options``, which will be passed
to :py:class:`.S3FileSystem`, for example:

.. code-block:: python

   df = pd.read_excel("s3://bucket/path/file.xls",
                      storage_options={"anon": True})

This gives the chance to pass any credentials or other necessary
arguments needed to s3fs.


Async
-----

``s3fs`` is implemented using ``aiobotocore``, and offers async functionality.
A number of methods of :py:class:`.S3FileSystem` are ``async`` and carry a
leading ``_`` in their names; for each of these, there is also a synchronous
version with the same name minus the ``_`` prefix.

If you wish to call ``s3fs`` from async code, you should pass
``asynchronous=True`` (and optionally ``loop=``, if you wish to use both
async and sync methods) to the constructor. You must also explicitly
await the client creation before making any S3 call.

.. code-block:: python

    import asyncio
    import s3fs

    async def run_program():
        s3 = s3fs.S3FileSystem(..., asynchronous=True)
        session = await s3.set_session()
        ...  # perform work
        await session.close()

    asyncio.run(run_program())  # or call from your async code

Concurrent async operations are also used internally for bulk operations
such as ``pipe/cat``, ``get/put``, ``cp/mv/rm``. The async calls are
hidden behind a synchronisation layer, so are designed to be called
from normal code. If you are *not*
using async-style programming, you do not need to know about how this
works, but you might find the implementation interesting.
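The general shape of such a synchronisation layer can be sketched in plain
``asyncio`` (a simplified illustration of the technique only, not s3fs's
actual implementation; the names ``cat`` and ``fetch`` are placeholders):

```python
import asyncio
import threading

# A dedicated event loop running in a background daemon thread: the same
# general idea s3fs uses to let synchronous code drive async coroutines.
_loop = asyncio.new_event_loop()
_thread = threading.Thread(target=_loop.run_forever, daemon=True)
_thread.start()


async def _cat_many(paths):
    # Stand-in for concurrent S3 fetches; gather() runs them concurrently.
    async def fetch(p):
        await asyncio.sleep(0)  # placeholder for real network I/O
        return f"contents of {p}"

    return await asyncio.gather(*(fetch(p) for p in paths))


def cat(paths):
    # Synchronous wrapper: submit the coroutine to the background loop
    # and block until the result is available.
    future = asyncio.run_coroutine_threadsafe(_cat_many(paths), _loop)
    return future.result()


results = cat(["bucket/a", "bucket/b"])
```

Because the loop lives on its own thread, ``cat`` can be called from ordinary
synchronous code without the caller ever touching ``await``.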


Multiprocessing
---------------

When using Python's `multiprocessing`_, the start method must be set to either
``spawn`` or ``forkserver``. ``fork`` is not safe to use because of the open sockets
and async thread used by s3fs, and may lead to
hard-to-find bugs and occasional deadlocks. Read more about the available
`start methods`_.

.. _multiprocessing: https://docs.python.org/3/library/multiprocessing.html
.. _start methods: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
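The recommended pattern can be sketched with an explicit ``spawn`` context (a
minimal stdlib sketch; unlike ``set_start_method``, ``get_context`` leaves the
global default untouched, so it will not interfere with other libraries):

```python
import multiprocessing as mp

# A context bound to the fork-safe "spawn" start method. Pass ctx.Process
# or ctx.Pool wherever worker processes are created.
ctx = mp.get_context("spawn")

# Spawned workers re-import the main module, so process creation belongs
# under an `if __name__ == "__main__":` guard in real scripts, e.g.:
#
#     if __name__ == "__main__":
#         with ctx.Pool(4) as pool:
#             results = pool.map(work, items)
```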

Limitations
-----------

This project is meant for convenience, rather than feature completeness.
The following are known current omissions:

- file access is always binary (although ``readline`` and iterating by line
  are possible)

- no permissions/access-control (i.e., no ``chmod``/``chown`` methods)
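As the first item notes, binary-mode line iteration works the same as for any
Python binary file object: lines arrive as ``bytes`` with the newline
included. ``io.BytesIO`` below stands in for the file protocol an
:py:class:`.S3File` emulates:

```python
import io

# BytesIO stands in here for a file opened with s3.open(..., 'rb')
f = io.BytesIO(b"alpha\nbeta\ngamma\n")
first = f.readline()         # a single bytes line, newline included
rest = [line for line in f]  # iterating yields the remaining bytes lines
```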


Logging
-------

The logger named ``s3fs`` provides information about the operations of the file
system.  To quickly see all messages, you can set the environment variable
``S3FS_LOGGING_LEVEL=DEBUG``.  The presence of this environment variable will
install a handler for the logger that prints messages to stderr and set the log
level to the given value.  More advanced logging configuration is possible using
Python's standard `logging framework`_.

.. _logging framework: https://docs.python.org/3/library/logging.html
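For instance, an equivalent manual configuration through the standard
``logging`` module (the format string here is an assumption for illustration,
not necessarily the exact one the environment variable installs):

```python
import logging

# The logger s3fs emits all of its operational messages on
logger = logging.getLogger("s3fs")
logger.setLevel(logging.DEBUG)

# Print messages to stderr, roughly as S3FS_LOGGING_LEVEL=DEBUG would
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
logger.addHandler(handler)
```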

Errors
------

The ``s3fs`` library includes a built-in mechanism to automatically retry
operations when specific transient errors occur. You can customize this behavior
by adding specific exception types or defining complex logic via custom handlers.

Default Retryable Errors
~~~~~~~~~~~~~~~~~~~~~~~~

By default, ``s3fs`` will retry the following exception types:

- ``socket.timeout``
- ``HTTPClientError``
- ``IncompleteRead``
- ``FSTimeoutError``
- ``ResponseParserError``
- ``aiohttp.ClientPayloadError`` (if available)

Registering Custom Error Types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To include additional exception types in the default retry logic, use the
``add_retryable_error`` function. This is useful for simple type-based retries.

.. code-block:: python

    >>> class MyCustomError(Exception):
    ...     pass
    >>> s3fs.add_retryable_error(MyCustomError)

Implementing Custom Error Handlers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more complex scenarios, such as retrying based on an error message rather than
just the type, you can register a custom error handler using ``set_custom_error_handler``.

The handler should be a callable that accepts an exception instance and returns ``True``
if the error should be retried, or ``False`` otherwise.

.. code-block:: python

    >>> def my_handler(e):
    ...     return isinstance(e, MyCustomError) and "some condition" in str(e)
    >>> s3fs.set_custom_error_handler(my_handler)

Handling AWS ClientErrors
~~~~~~~~~~~~~~~~~~~~~~~~~

``s3fs`` provides specialized handling for ``botocore.exceptions.ClientError``.
While ``s3fs`` checks these against internal patterns (like throttling),
you can extend this behavior using a custom handler. Note that the internal
patterns will still be checked and handled before the custom handler.

.. code-block:: python

    >>> def another_handler(e):
    ...     return isinstance(e, ClientError) and "Throttling" in str(e)
    >>> s3fs.set_custom_error_handler(another_handler)


Credentials
-----------

The AWS key and secret may be provided explicitly when creating an :py:class:`.S3FileSystem`.
A more secure way, not including the credentials directly in code, is to allow
boto to establish the credentials automatically. Boto will try the following
methods, in order:

- ``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY``, and ``AWS_SESSION_TOKEN``
  environment variables

- configuration files such as ``~/.aws/credentials``

- for nodes on EC2, the IAM metadata provider

You can specify a profile using ``s3fs.S3FileSystem(profile='PROFILE')``.
Otherwise ``s3fs`` will use authentication via `boto environment variables`_.

.. _boto environment variables: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables

In a distributed environment, it is not expected that raw credentials should
be passed between machines. In the explicitly provided credentials case, the
method :py:meth:`.S3FileSystem.get_delegated_s3pars` can be used to obtain temporary credentials.
When not using explicit credentials, it should be expected that every machine
also has the appropriate environment variables, config files or IAM roles
available.

If none of the credential methods are available, only anonymous access will
work, and ``anon=True`` must be passed to the constructor.

Furthermore, :py:meth:`.S3FileSystem.current` will return the most-recently created
instance, so this method could be used in preference to the constructor in
cases where the code must be agnostic of the credentials/config used.

S3 Compatible Storage
---------------------

To use ``s3fs`` against an S3 compatible storage, like `MinIO`_ or
`Ceph Object Gateway`_, you'll probably need to pass extra parameters when
creating the ``s3fs`` filesystem. Here are some sample configurations:

For a self-hosted MinIO instance:

.. code-block:: python

   # When relying on auto discovery for credentials
   >>> s3 = s3fs.S3FileSystem(
   ...    anon=False,
   ...    endpoint_url='https://...'
   ... )
   # Or passing the credentials directly
   >>> s3 = s3fs.S3FileSystem(
   ...    key='miniokey...',
   ...    secret='asecretkey...',
   ...    endpoint_url='https://...'
   ... )

It is also possible to set credentials through environment variables:

.. code-block:: python

   # export FSSPEC_S3_ENDPOINT_URL=https://...
   # export FSSPEC_S3_KEY='miniokey...'
   # export FSSPEC_S3_SECRET='asecretkey...'
   >>> s3 = s3fs.S3FileSystem()
   # or ...
   >>> f = fsspec.open("s3://minio-bucket/...")


For Storj DCS via the `S3-compatible Gateway <https://docs.storj.io/dcs/getting-started/quickstart-aws-sdk-and-hosted-gateway-mt>`_:

.. code-block:: python

   # When relying on auto discovery for credentials
   >>> s3 = s3fs.S3FileSystem(
   ...    anon=False,
   ...    endpoint_url='https://gateway.storjshare.io'
   ... )
   # Or passing the credentials directly
   >>> s3 = s3fs.S3FileSystem(
   ...    key='accesskey...',
   ...    secret='asecretkey...',
   ...    endpoint_url='https://gateway.storjshare.io'
   ... )

For a Scaleway s3-compatible storage in the ``fr-par`` zone:

.. code-block:: python

   >>> s3 = s3fs.S3FileSystem(
   ...    key='scaleway-api-key...',
   ...    secret='scaleway-secretkey...',
   ...    endpoint_url='https://s3.fr-par.scw.cloud',
   ...    client_kwargs={
   ...       'region_name': 'fr-par'
   ...    }
   ... )

For an OVH s3-compatible storage in the ``GRA`` zone:

.. code-block:: python

   >>> s3 = s3fs.S3FileSystem(
   ...    key='ovh-s3-key...',
   ...    secret='ovh-s3-secretkey...',
   ...    endpoint_url='https://s3.GRA.cloud.ovh.net',
   ...    client_kwargs={
   ...       'region_name': 'GRA'
   ...    },
   ...    config_kwargs={
   ...       'signature_version': 's3v4'
   ...    }
   ... )


.. _MinIO: https://min.io
.. _Ceph Object Gateway: https://docs.ceph.com/docs/master/radosgw/

Requester Pays Buckets
----------------------

Some buckets, such as the `arXiv raw data
<https://arxiv.org/help/bulk_data_s3>`__, are configured so that the
requester of the data pays any transfer fees.  You must be
authenticated to access these buckets, and (because these charges may be
unexpected) Amazon requires an additional flag on many of the API
calls. To enable ``RequesterPays``, create your file system as


.. code-block:: python

   >>> s3 = s3fs.S3FileSystem(anon=False, requester_pays=True)


Serverside Encryption
---------------------

For some buckets/files you may want to use some of S3's server-side encryption
features. ``s3fs`` supports these in a few ways:


.. code-block:: python

   >>> s3 = s3fs.S3FileSystem(
   ...     s3_additional_kwargs={'ServerSideEncryption': 'AES256'})

This will create an S3 filesystem instance that appends the
``ServerSideEncryption`` argument to all S3 calls (where applicable).

The same applies for ``s3.open``.  Most of the methods on the filesystem object
will also accept and forward keyword arguments to the underlying calls.  The
most recently specified argument is applied last in the case where both
``s3_additional_kwargs`` and a method's ``**kwargs`` are used.

The ``s3fs.utils.SSEParams`` class provides some convenient helpers for the
server-side encryption parameters in particular.  An instance can be passed
instead of a regular Python dictionary as the ``s3_additional_kwargs`` parameter.


Bucket Version Awareness
------------------------

If your bucket has object versioning enabled, then you can add version-aware support
to ``s3fs``.  This ensures that if a file is opened at a particular point in time,
that version will be used for reading.

This mitigates the issue where more than one user is concurrently reading and writing
to the same object.

.. code-block:: python

   >>> s3 = s3fs.S3FileSystem(version_aware=True)
   # Open the file at the latest version
   >>> fo = s3.open('versioned_bucket/object')
   >>> versions = s3.object_version_info('versioned_bucket/object')
   # Open the file at a particular version
   >>> fo_old_version = s3.open('versioned_bucket/object', version_id='SOMEVERSIONID')

In order for this to function, the user must have the necessary IAM permissions
to perform a ``GetObjectVersion`` call.


Contents
========

.. toctree::
   :maxdepth: 2

   install
   development
   api
   changelog
   code-of-conduct


.. _botocore: https://botocore.readthedocs.io/en/latest/

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`