# Extending `smart_open`

This document targets potential contributors to `smart_open`.
Currently, there are two main directions for extending existing `smart_open` functionality:

1. Add a new transport mechanism
2. Add a new compression format

The first is by far the more challenging, and also the more welcome.

## New transport mechanisms

Each transport mechanism lives in its own submodule.
For example, currently we have:

- [smart_open.local_file](smart_open/local_file.py)
- [smart_open.s3](smart_open/s3.py)
- [smart_open.ssh](smart_open/ssh.py)
- ... and others

So, to implement a new transport mechanism, you need to create a new module.
Your module must expose the following (see [smart_open.http](smart_open/http.py) for the full implementation):

```python
SCHEME = ...
"""The name of the mechanism, e.g. s3, ssh, etc.

This is the part that goes before the `://` in a URL, e.g. `s3://`.
If your transport handles several schemes, define a SCHEMES tuple instead."""

URI_EXAMPLES = ('xxx://foo/bar', 'zzz://baz/boz')
"""This will appear in the documentation of the `parse_uri` function."""

MISSING_DEPS = False
"""Wrap transport-specific imports in a try/except and set this to True if
any imports are not found. Setting MISSING_DEPS to True will cause the library
to suggest installing its dependencies with an example pip command.

If your transport has no external dependencies, you can omit this variable.
"""

def parse_uri(uri_as_str):
    """Parse the specified URI into a dict.

    At a bare minimum, the dict must have a `scheme` member.
    """
    return dict(scheme=SCHEME, ...)


def open_uri(uri_as_str, mode, transport_params):
    """Return a file-like object pointing to the URI.

    Parameters:

    uri_as_str: str
        The URI to open
    mode: str
        Either "rb" or "wb".  You don't need to implement text modes,
        `smart_open` does that for you, outside of the transport layer.
    transport_params: dict
        Any additional parameters to pass to the `open` function (see below).

    """
    #
    # Parse the URI using parse_uri
    # Consolidate the parsed URI with transport_params, if needed
    # Pass everything to the open function (see below).
    #
    ...


def open(..., mode, param1=None, param2=None, paramN=None):
    """This function does the hard work.

    The keyword parameters are the transport_params from the `open_uri`
    function.

    """
    ...
```
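
The `MISSING_DEPS` convention described in the template can be sketched as follows. The package name `examplecloud_sdk_not_installed` is made up for illustration, so the import is expected to fail and the flag ends up `True`:

```python
# Hypothetical optional dependency: this package name is invented for the
# sketch and is not a real requirement of smart_open.
try:
    import examplecloud_sdk_not_installed as sdk
except ImportError:
    # Keep the module importable; the library can then tell the user
    # which extra dependencies to install.
    sdk = None
    MISSING_DEPS = True
else:
    MISSING_DEPS = False

print(MISSING_DEPS)
```

The point of the pattern is that importing your transport module never fails outright; the missing dependency is only reported when someone tries to use the transport.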

Have a look at the existing mechanisms to see how they work.
You may define other functions and classes as necessary for your implementation.
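
As a rough illustration of this contract, here is a minimal sketch of a transport for a made-up `mem://` scheme that keeps objects in a process-local dict. Everything specific to it is hypothetical and not part of `smart_open`: the scheme name, `_STORAGE`, `_Writer`, and the `default` parameter. The `SCHEME` attribute name and the `scheme` dict key are also assumptions of this sketch.

```python
import io
import urllib.parse

SCHEME = 'mem'  # hypothetical scheme, used only for this sketch
URI_EXAMPLES = ('mem://bucket/key',)

_STORAGE = {}  # maps (bucket, key) to bytes; stands in for a remote service


def parse_uri(uri_as_str):
    """Split mem://bucket/key into its parts."""
    split = urllib.parse.urlsplit(uri_as_str)
    if split.scheme != SCHEME:
        raise ValueError('unsupported scheme: %r' % split.scheme)
    return dict(scheme=SCHEME, bucket=split.netloc, key=split.path.lstrip('/'))


class _Writer(io.BytesIO):
    """Buffer writes in memory and store them when the file is closed."""

    def __init__(self, bucket, key):
        super().__init__()
        self._target = (bucket, key)

    def close(self):
        if not self.closed:
            _STORAGE[self._target] = self.getvalue()
        super().close()


def open(bucket, key, mode, default=None):
    """Do the actual work; `default` is a hypothetical transport parameter."""
    if mode == 'rb':
        if (bucket, key) in _STORAGE:
            return io.BytesIO(_STORAGE[(bucket, key)])
        if default is not None:
            return io.BytesIO(default)
        raise FileNotFoundError('mem://%s/%s' % (bucket, key))
    if mode == 'wb':
        return _Writer(bucket, key)
    raise NotImplementedError('unsupported mode: %r' % mode)


def open_uri(uri_as_str, mode, transport_params):
    """Parse the URI, merge in transport_params, and delegate to open."""
    parsed = parse_uri(uri_as_str)
    return open(parsed['bucket'], parsed['key'], mode, **transport_params)


with open_uri('mem://bucket/key', 'wb', {}) as fout:
    fout.write(b'hello')
with open_uri('mem://bucket/key', 'rb', {}) as fin:
    print(fin.read())
```

Note that only binary modes are handled, as the template requires; text decoding is left to the layer above the transport.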

Once your module is working, register it in the [smart_open.transport](smart_open/transport.py) submodule.
The `register_transport()` function updates a mapping from schemes to the modules that implement functionality for them.
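
To make the idea of a scheme-to-module mapping concrete, here is a simplified, self-contained stand-in for such a registry. It is not the library's implementation: the names `register_transport_sketch` and `get_transport_sketch` are hypothetical, and the sketch assumes a module advertises its schemes through a `SCHEME` or `SCHEMES` attribute.

```python
import types

_REGISTRY = {}  # scheme -> module-like object implementing that scheme


def register_transport_sketch(submodule):
    """Record which module handles which scheme(s)."""
    if hasattr(submodule, 'SCHEME'):
        schemes = [submodule.SCHEME]
    elif hasattr(submodule, 'SCHEMES'):
        schemes = list(submodule.SCHEMES)
    else:
        raise ValueError('%r declares no scheme' % submodule)

    # Every transport must expose the three functions from the template.
    for name in ('open', 'open_uri', 'parse_uri'):
        if not callable(getattr(submodule, name, None)):
            raise ValueError('%r is missing %r' % (submodule, name))

    for scheme in schemes:
        if scheme in _REGISTRY:
            raise ValueError('scheme %r is already registered' % scheme)
        _REGISTRY[scheme] = submodule


def get_transport_sketch(scheme):
    """Look up the module responsible for a scheme."""
    try:
        return _REGISTRY[scheme]
    except KeyError:
        raise NotImplementedError('no transport for scheme %r' % scheme)


# A module-like object standing in for a real transport submodule.
fake_module = types.SimpleNamespace(
    SCHEME='mem',
    open=lambda *args, **kwargs: None,
    open_uri=lambda uri, mode, transport_params: None,
    parse_uri=lambda uri: dict(scheme='mem'),
)
register_transport_sketch(fake_module)
print(get_transport_sketch('mem') is fake_module)
```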

Once you've registered your new transport module, the following will happen automagically:

1. `smart_open` will be able to open any URI supported by your module
2. The docstring for the `smart_open.open` function will contain a section
   detailing the parameters for your transport module.
3. The docstring for the `parse_uri` function will include the schemes and
   examples supported by your module.

You can confirm the documentation changes by running:

    python -c 'help("smart_open")'

and verifying that the documentation for your new submodule shows up.

### What's the difference between the `open_uri` and `open` functions?

There are several key differences between the two.

First, the parameters to `open_uri` are the same for _all transports_.
On the other hand, the parameters to the `open` function can differ from transport to transport.

Second, the responsibilities of the two functions are also different.
The `open` function opens the remote object.
The `open_uri` function deals with parsing transport-specific details out of the URI, and then delegates to `open`.
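
The split in responsibilities can be sketched like this. The example is an illustration under assumptions, not code from any existing transport: `supported_kwargs`, `open_sketch`, and `open_uri_sketch` are hypothetical names, and filtering `transport_params` with `inspect.signature` is just one possible way to keep the generic function tolerant of parameters meant for other transports.

```python
import inspect
import io
import urllib.parse


def supported_kwargs(func, kwargs):
    """Keep only the keyword arguments that func declares by name."""
    named = (inspect.Parameter.POSITIONAL_OR_KEYWORD,
             inspect.Parameter.KEYWORD_ONLY)
    accepted = {
        name for name, param in inspect.signature(func).parameters.items()
        if param.kind in named
    }
    return {key: value for key, value in kwargs.items() if key in accepted}


def open_sketch(host, path, mode, timeout=None, headers=None):
    """Transport-specific: the keyword parameters differ per transport."""
    summary = '%s %s %s timeout=%r headers=%r' % (mode, host, path, timeout, headers)
    return io.BytesIO(summary.encode('utf-8'))


def open_uri_sketch(uri_as_str, mode, transport_params):
    """Generic signature: parse the URI, then delegate to open_sketch."""
    split = urllib.parse.urlsplit(uri_as_str)
    kwargs = supported_kwargs(open_sketch, transport_params)
    return open_sketch(split.netloc, split.path, mode, **kwargs)


params = {'timeout': 5, 'unrelated': True}
fin = open_uri_sketch('xxx://example.com/data.bin', 'rb', params)
print(fin.read().decode('utf-8'))
# prints: rb example.com /data.bin timeout=5 headers=None
```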

The `open` function contains documentation for transport parameters.
This documentation gets parsed by the `doctools` module and appears in various docstrings.

Some of these differences are by design; others are a consequence of how the library evolved.

## New compression mechanisms

The compression layer is self-contained in the `smart_open.compression` submodule.

To add support for a new compressor:

- Create a new function to handle your compression format (given an extension)
- Add your compressor to the registry

For example:

```python
def _handle_xz(file_obj, mode):
    import lzma
    return lzma.LZMAFile(filename=file_obj, mode=mode)


register_compressor('.xz', _handle_xz)
```

There are many compression formats out there, and supporting all of them is beyond the scope of `smart_open`.
We want our code's functionality to cover the bare minimum required to satisfy 80% of our users.
We leave the remaining 20% of users with the ability to deal with compression in their own code, using the trivial mechanism described above.

## Documentation

Once you've contributed your extension, please add it to the documentation so that it is discoverable for other users.
Some notable files:

- setup.py: See the `description` keyword.  Not all contributions will affect this.
- README.rst
- howto.md (if your extension solves a specific problem that doesn't get covered by other documentation)