File: index.rst

azure-datalake-store
====================

A pure-Python interface to the Azure Data Lake Store system, providing
Pythonic file-system and file objects, seamless transitions between Windows
and POSIX remote paths, and a high-performance uploader and downloader.

This software is under active development and not yet recommended for general
use.

Installation
------------
Using ``pip``::

    pip install azure-datalake-store

Manually (bleeding edge):

* Download the repo from https://github.com/Azure/azure-data-lake-store-python
* check out the ``dev`` branch
* install the requirements (``pip install -r dev_requirements.txt``)
* install in develop mode (``python setup.py develop``)
* optionally: build the documentation (including this page) by running ``make html`` in the docs directory.


Auth
----

Although users can generate and supply their own tokens to the base
file-system class, and there is a password-based function in the ``lib``
module for generating tokens, the most convenient way to supply credentials
is via environment variables; this is the method the library uses by
default. It expects the following variables (a token sketch follows the
list):

* azure_tenant_id
* azure_username
* azure_password
* azure_store_name
* azure_url_suffix (optional)
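
As an alternative, a token can be generated explicitly with the
password-based ``lib.auth`` function and handed to the file-system class. A
minimal sketch, following the calling convention used in the next section
(the credential and store-name values are placeholders):

.. code-block:: python

    from azure.datalake.store import core, lib

    # placeholder credentials; in practice these may instead come from the
    # environment variables listed above
    token = lib.auth('my_tenant_id', 'my_username', 'my_password')
    adl = core.AzureDLFileSystem('my_store_name', token)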

Pythonic Filesystem
-------------------

The ``AzureDLFileSystem`` object is the main API for library usage of this
package. It provides typical file-system operations on the remote Azure
store.

.. code-block:: python

    from azure.datalake.store import core, lib

    token = lib.auth(tenant_id, username, password)
    adl = core.AzureDLFileSystem(store_name, token)
    # alternatively, adl = core.AzureDLFileSystem() picks up the
    # environment variables listed above

    print(adl.ls())  # list files in the root directory
    for item in adl.ls(detail=True):
        print(item)  # same, but with file details as dictionaries
    print(adl.walk(''))  # list all files at any directory depth
    print('Usage:', adl.du('', deep=True, total=True))  # total bytes usage
    adl.mkdir('newdir')  # create directory
    adl.touch('newdir/newfile') # create empty file
    adl.put('/home/myuser/localfile', 'remotefile')  # upload local file (source, then remote destination)

In addition, the file system generates file objects that are compatible with
the Python file interface, ensuring compatibility with libraries that
operate on Python files. The recommended way to use these is with a context
manager (otherwise, be sure to call ``close()`` on the file object).

.. code-block:: python

    with adl.open('newfile', 'wb') as f:
        f.write(b'index,a,b\n')
        f.tell()   # now at position 10
        f.flush()  # forces data upstream
        f.write(b'0,1,True')

    with adl.open('newfile', 'rb') as f:
        print(f.readlines())

    import pandas as pd  # assumes pandas is installed

    with adl.open('newfile', 'rb') as f:
        df = pd.read_csv(f)  # read directly into a pandas DataFrame

To handle remote path representations seamlessly across all supported
platforms, the main API accepts several path types: ``str``,
``PurePath``/``Path``, and ``AzureDLPath``. On Windows in particular, paths
may be written with either forward slashes or backslashes.

.. code-block:: python

    import pathlib                # Python >= 3.4
    # import pathlib2 as pathlib  # backport for Python <= 3.3

    from azure.datalake.store.core import AzureDLPath

    # possible remote paths to use on API
    p1 = '\\foo\\bar'
    p2 = '/foo/bar'
    p3 = pathlib.PurePath('\\foo\\bar')
    p4 = pathlib.PureWindowsPath('\\foo\\bar')
    p5 = pathlib.PurePath('/foo/bar')
    p6 = AzureDLPath('\\foo\\bar')
    p7 = AzureDLPath('/foo/bar')

    # p1, p3, and p6 only work on Windows
    for p in [p1, p2, p3, p4, p5, p6, p7]:
        with adl.open(p, 'rb') as f:
            print(f.readlines())

Performant up-/down-loading
---------------------------

The ``ADLUploader`` and ``ADLDownloader`` classes chunk large files and
transfer many files to and from Azure using multiple threads. A whole
directory tree, files matching a specific glob pattern, or a single file can
be transferred.

.. code-block:: python

    from azure.datalake.store.multithread import ADLDownloader

    # download the whole directory tree using 5 threads and 16 MB chunks
    ADLDownloader(adl, '', 'my_temp_dir', 5, 2**24)
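
Uploads work the same way through ``ADLUploader``, which also takes the
remote path first and the local path second. A minimal sketch reusing the
connection ``adl`` from above (``remote_dir`` and ``my_local_dir`` are
placeholder paths):

.. code-block:: python

    from azure.datalake.store.multithread import ADLUploader

    # upload a local directory tree using 5 threads and 16 MB chunks
    ADLUploader(adl, 'remote_dir', 'my_local_dir', 5, 2**24)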