# Using `Store`

A `Store` is a wrapper for accessing data from a data source. That data source is typically a MongoDB collection, but it could also be an Amazon S3 bucket, a GridFS collection, or a folder of files on disk. `maggma` makes interacting with all of these data sources feel the same (see the [`Store` interface](#the-store-interface), below). A `Store` can also perform logic; for instance, two or more `Store`s can be concatenated so that they look like one data source.

The benefit of the `Store` interface is that you only have to write a `Builder` once. As your data moves or evolves, you simply point the `Builder` to a different `Store` without having to change your processing code.

## Structuring `Store` data

Because `Store` is built around a MongoDB-like query syntax, data that goes into a `Store` needs to be structured similarly to MongoDB data. In Python terms,
that means **the data in a `Store` must be structured as a `list` of `dict`**,
where each `dict` represents a single record (called a 'document').

```python
data = [{"AM": "sunrise"}, {"PM": "sunset"}, ...]
```

Note that this structure is very similar to the widely used [JSON](https://en.wikipedia.org/wiki/JSON) format, so structuring your data in this manner
enables highly flexible storage options -- you can easily write it to a `.json`
file, place it in a `Store`, insert it into a Mongo database, etc. `maggma` is
designed to facilitate this.

In addition to being structured as a `list` of `dict`, **every document (`dict`)
must have a key that uniquely identifies it.** By default, this key is `task_id`, but it can be set to any field name you
like using the `key` argument when you instantiate a `Store`.

```python
data = [{"task_id": 1, "AM": "sunrise"}, {"task_id": 2, "PM": "sunset"}, ...]
```

To emphasize: **every document must have the key field (`task_id` by default), and its value must be unique for every document**. The rest of the document structure
is up to you, but `maggma` works best when every document follows a pre-defined
schema (i.e., all `dict` have the same set of keys / same structure).
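
The two requirements above (a `list` of `dict`, each with a unique key) can be checked in plain Python before the data goes into a `Store`. The sketch below uses only the standard library; `find_duplicate_keys` is a hypothetical helper for illustration, not part of `maggma`.

```python
import json
from collections import Counter


def find_duplicate_keys(docs, key="task_id"):
    """Return the key values that appear in more than one document.

    Hypothetical helper for illustration; not part of maggma.
    Raises KeyError if a document lacks the key field.
    """
    counts = Counter(doc[key] for doc in docs)
    return sorted(value for value, n in counts.items() if n > 1)


data = [
    {"task_id": 1, "AM": "sunrise"},
    {"task_id": 2, "PM": "sunset"},
]

# An empty result means every document has a unique key value.
duplicates = find_duplicate_keys(data)

# A list of dicts with JSON-compatible values round-trips through JSON.
round_tripped = json.loads(json.dumps(data))
```

Running a check like this before calling `update` makes key collisions visible instead of letting one document silently overwrite another.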

## The `Store` interface

All `Store`s provide a number of basic methods that facilitate querying, updating, and removing data:

- `query`: Standard MongoDB-style `find` method that lets you search the store. See [Understanding Queries](query_101.md) for more details about the query syntax.
- `query_one`: Same as `query`, but returns only the first document that matches your query. Very useful for understanding the structure of the returned data.
- `count`: Counts documents in the `Store`.
- `distinct`: Returns a list of distinct values of a field.
- `groupby`: Similar to `query`, but performs a grouping operation and returns sets of documents.
- `update`: Updates (inserts) documents into the `Store`. This will overwrite documents if the key field matches.
- `remove_docs`: Removes documents from the underlying data source.
- `newer_in`: Finds all documents that are newer in the target collection and returns their `key`s. This is a very useful way of performing incremental processing.
- `ensure_index`: Creates an index on the underlying data source for fast querying.
- `last_updated`: Finds the most recent `last_updated_field` value and returns it. Useful for knowing how old a data source is.
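
To make the overwrite behavior of `update` and the incremental-processing idea behind `newer_in` concrete, here is a toy model in plain Python. It is not how `maggma` implements these methods: the dict-backed "collections", the function names `upsert` and `newer_keys`, and the choice to treat documents missing from the current collection as newer are all assumptions of this sketch.

```python
from datetime import datetime


def upsert(collection, docs, key="task_id"):
    """Insert each doc, replacing any stored doc with the same key value."""
    for doc in docs:
        collection[doc[key]] = doc


def newer_keys(current, target, field="last_updated"):
    """Return keys of docs in `target` that are newer than in `current`.

    Docs missing from `current` are treated as newer (an assumption of
    this sketch).
    """
    result = []
    for k, doc in target.items():
        old = current.get(k)
        if old is None or doc[field] > old[field]:
            result.append(k)
    return result


current = {}
upsert(current, [
    {"task_id": 1, "state": "old", "last_updated": datetime(2020, 1, 1)},
    {"task_id": 2, "state": "old", "last_updated": datetime(2020, 1, 1)},
])

target = {}
upsert(target, [
    {"task_id": 1, "state": "new", "last_updated": datetime(2021, 1, 1)},
    {"task_id": 2, "state": "old", "last_updated": datetime(2020, 1, 1)},
    {"task_id": 3, "state": "new", "last_updated": datetime(2021, 1, 1)},
])

# Only these keys need to be reprocessed.
stale = newer_keys(current, target)

# Upserting docs whose keys already exist overwrites them in place.
upsert(current, [target[k] for k in stale])
```

Only documents 1 and 3 are copied over; document 2 is untouched because its timestamp did not change.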

!!! Note
    If you are familiar with `pymongo`, you may find the comparison table below
    helpful. This table illustrates how `maggma` method and argument names map
    onto `pymongo` concepts.


    | `maggma`        | `pymongo` equivalent |
    | --------------- | -------------------- |
    | **methods**     |                      |
    | `query_one`     | `find_one`           |
    | `query`         | `find`               |
    | `count`         | `count_documents`    |
    | `distinct`      | `distinct`           |
    | `groupby`       | `group`              |
    | `update`        | `insert`             |
    | **arguments**   |                      |
    | `criteria={}`   | `filter={}`          |
    | `properties=[]` | `projection=[]`      |


## Creating a Store

All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`.

The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents).

`last_updated_field` tells the `Store` how to order documents by date. Its values are typically `datetime` objects, but ISO 8601 strings (e.g., `2009-05-28T16:15:00`) are also accepted.

`Store`s can also take a `Validator` object to make sure the data going into them obeys some schema.
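
Because `last_updated_field` values may be either `datetime` objects or ISO 8601 strings, it can help to normalize them before comparing dates in your own code. A minimal standard-library sketch follows; `to_datetime` is a hypothetical helper, not a `maggma` function.

```python
from datetime import datetime


def to_datetime(value):
    """Return a datetime for either a datetime or an ISO 8601 string."""
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


# Mixed representations of the same kind of timestamp.
stamps = [datetime(2010, 1, 1), "2009-05-28T16:15:00"]

# After normalization, the values can be compared directly.
latest = max(to_datetime(s) for s in stamps)
```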

In the example below, we create a `MongoStore`, which connects to a MongoDB database.
To create this store, we have to provide `maggma` the connection details to the
database like the hostname, collection name, and authentication info. Note that
we've set `key="name"` because we want to use the `name` field as our unique identifier.

```python
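from maggma.stores import MongoStore

# All connection details below are placeholders.
store = MongoStore(
    database="my_db_name",
    collection_name="my_collection_name",
    username="my_username",
    password="my_password",
    host="my_hostname",
    port=27017,
    key="name",  # field used as the unique identifier
)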
```

The specific arguments required to create a `Store` depend on the underlying
format. For example, the `MemoryStore`, which just loads data into memory,
requires no arguments to instantiate. Refer to the [list of Stores](#list-of-stores)
below (and their associated documentation) for specific details.

## Connecting to a `Store`

You must connect to a store by running `store.connect()` before querying or updating it.
If you are using a store inside other code, it is recommended to use the built-in context manager, e.g.:

```python
with MongoStore(...) as store:
    store.query()
```

This calls `connect()` automatically and ensures that the
connection is closed properly once the store tasks are complete.
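
The connect-on-enter, close-on-exit pattern can be sketched with a toy class. `ToyStore` is a hypothetical stand-in that only tracks a connection flag; it illustrates the context-manager protocol and is not `maggma`'s implementation.

```python
class ToyStore:
    """Stand-in for a Store; only tracks whether it is connected."""

    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True

    def close(self):
        self.connected = False

    def __enter__(self):
        # Entering the `with` block opens the connection.
        self.connect()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the block closes it, even if an exception was raised.
        # Returning None lets any exception propagate to the caller.
        self.close()


with ToyStore() as store:
    inside = store.connected

outside = store.connected
```

The advantage over calling `connect()` and `close()` by hand is that the connection is released even when the code inside the `with` block fails.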

## List of Stores

Currently working and tested `Store`s include the following. Click the name of
each store for more detailed documentation.

- [`MongoStore`](/maggma/reference/stores/#maggma.stores.mongolike.MongoStore): interfaces to a MongoDB Collection using port and hostname.
- [`MongoURIStore`](/maggma/reference/stores/#maggma.stores.mongolike.MongoURIStore): interfaces to a MongoDB Collection using a "mongodb+srv://" URI.
- [`MemoryStore`](/maggma/reference/stores/#maggma.stores.mongolike.MemoryStore): just a Store that exists temporarily in memory
- [`JSONStore`](/maggma/reference/stores/#maggma.stores.mongolike.JSONStore): builds a MemoryStore and then populates it with the contents of the given JSON files
- [`FileStore`](/maggma/reference/stores/#maggma.stores.file_store.FileStore): query and add metadata to files stored on disk as if they were in a database
- [`GridFSStore`](/maggma/reference/stores/#maggma.stores.gridfs.GridFSStore): interfaces to GridFS collection in MongoDB using port and hostname.
- [`GridFSURIStore`](/maggma/reference/stores/#maggma.stores.gridfs.GridFSURIStore): interfaces to GridFS collection in MongoDB using a "mongodb+srv://" URI.
- [`S3Store`](/maggma/reference/stores/#maggma.stores.aws.S3Store): provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- [`ConcatStore`](/maggma/reference/stores/#maggma.stores.compound_stores.ConcatStore): concatenates several Stores together so they look like one Store
- [`VaultStore`](/maggma/reference/stores/#maggma.stores.advanced_stores.VaultStore): uses Vault to get credentials for a MongoDB database
- [`AliasingStore`](/maggma/reference/stores/#maggma.stores.advanced_stores.AliasingStore): aliases keys from the underlying store to new names
- `SandboxStore`: provides permission control to documents via a `_sbxn` sandbox key
- [`JointStore`](/maggma/reference/stores/#maggma.stores.compound_stores.JointStore): joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection
- [`AzureBlobStore`](/maggma/reference/stores/#maggma.stores.azure.AzureBlobStore): provides an interface to Azure Blobs for the storage of large amounts of data
- [`MontyStore`](/maggma/reference/stores/#maggma.stores.mongolike.MontyStore): provides an interface to [montydb](https://github.com/davidlatwe/montydb) for in-memory or filesystem-based storage
- [`MongograntStore`](/maggma/reference/stores/#maggma.stores.advanced_stores.MongograntStore): (DEPRECATED) uses Mongogrant to get credentials for a MongoDB database