# Experimental features
This section contains documentation for experimental Zarr Python features. The features described here are exciting and potentially useful, but also volatile -- we might change them at any time. Take this into account if you consider depending on these features.
## `CacheStore`
Zarr Python 3.1.4 adds `zarr.experimental.cache_store.CacheStore`, a dual-store caching implementation
that can be wrapped around any Zarr store to improve performance for repeated data access.
This is particularly useful when working with remote stores (e.g., S3, HTTP) where network
latency can significantly impact data access speed.
The CacheStore implements a cache that uses a separate Store instance as the cache backend,
providing persistent caching capabilities with time-based expiration, size-based eviction,
and flexible cache storage options. It automatically evicts the least recently used items
when the cache reaches its maximum size.
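The least-recently-used eviction policy can be illustrated with a minimal sketch. This is an illustration of the policy only, not the actual `CacheStore` implementation; the class and its names here are hypothetical:

```python
from collections import OrderedDict

class LRUSketch:
    """Minimal size-bounded LRU cache, illustrating the eviction policy
    described above (hypothetical, not the real CacheStore)."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.current_size = 0
        self._items: OrderedDict[str, bytes] = OrderedDict()

    def get(self, key: str):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key: str, value: bytes) -> None:
        if key in self._items:
            self.current_size -= len(self._items.pop(key))
        self._items[key] = value
        self.current_size += len(value)
        # Evict least recently used entries until we fit within max_size
        while self.current_size > self.max_size:
            _, evicted = self._items.popitem(last=False)
            self.current_size -= len(evicted)

cache = LRUSketch(max_size=10)
cache.set("a", b"12345")
cache.set("b", b"12345")
cache.get("a")            # touch "a", so "b" becomes least recently used
cache.set("c", b"12345")  # exceeds max_size, evicts "b"
print(sorted(cache._items))  # ['a', 'c']
```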
Because the `CacheStore` uses an ordinary Zarr `Store` object as the caching layer, you can reuse the data stored in the cache later.
> **Note:** The CacheStore is a wrapper store that maintains compatibility with the full
> `zarr.abc.store.Store` API while adding transparent caching functionality.
## Basic Usage
Creating a CacheStore requires both a source store and a cache store. The cache store
can be any Store implementation, providing flexibility in cache persistence:
```python exec="true" session="experimental" source="above" result="ansi"
import zarr
from zarr.storage import LocalStore
import numpy as np
from tempfile import mkdtemp
from zarr.experimental.cache_store import CacheStore
# Create a local store and a separate cache store
local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore() # In-memory cache
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024  # 256MB cache
)
# Create an array using the cached store
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
# Write some data to force chunk creation
zarr_array[:] = np.random.random((100, 100))
```
The dual-store architecture allows you to use different store types for source and cache,
such as a remote store for source data and a local store for persistent caching.
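Conceptually, the read path checks the cache store first and falls back to the source store on a miss, populating the cache for the next access. A minimal sketch of that logic, using plain dicts as stand-ins for the two stores (hypothetical names, not the real `Store` API):

```python
class DualStoreSketch:
    """Illustrative cache-then-source read path; dicts stand in for the
    source and cache stores (hypothetical, not the real CacheStore)."""

    def __init__(self, source: dict, cache: dict):
        self.source = source
        self.cache = cache
        self.hits = 0
        self.misses = 0

    def get(self, key: str):
        if key in self.cache:           # fast path: serve from cache
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.source.get(key)    # slow path: fetch from source
        if value is not None:
            self.cache[key] = value     # populate cache for next time
        return value

store = DualStoreSketch(source={"chunk/0.0": b"data"}, cache={})
store.get("chunk/0.0")  # miss: fetched from source, then cached
store.get("chunk/0.0")  # hit: served from cache
print(store.hits, store.misses)  # 1 1
```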
## Performance Benefits
The CacheStore provides significant performance improvements for repeated data access:
```python exec="true" session="experimental" source="above" result="ansi"
import time
# Benchmark reading with cache
start = time.time()
for _ in range(100):
    _ = zarr_array[:]
elapsed_cache = time.time() - start
# Compare with direct store access (without cache)
zarr_array_nocache = zarr.open(local_store_path, mode='r')
start = time.time()
for _ in range(100):
    _ = zarr_array_nocache[:]
elapsed_nocache = time.time() - start
# Cache provides speedup for repeated access
speedup = elapsed_nocache / elapsed_cache
print(f"Speedup: {speedup:.2f}x")
```
Cache effectiveness is particularly pronounced with repeated access to the same data chunks.
## Cache Configuration
The CacheStore can be configured with several parameters:
**max_size**: Controls the maximum size of cached data in bytes
```python exec="true" session="experimental" source="above" result="ansi"
# 256MB cache with size limit
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)
# Unlimited cache size (use with caution)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=None
)
```
**max_age_seconds**: Controls time-based cache expiration
```python exec="true" session="experimental" source="above" result="ansi"
# Cache expires after 1 hour
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds=3600
)
# Cache never expires
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_age_seconds="infinity"
)
```
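Conceptually, time-based expiration just compares an entry's age against the configured maximum. A small illustrative sketch of that check (the function here is hypothetical, not part of the `CacheStore` API):

```python
import time

def is_expired(cached_at, max_age_seconds, now=None):
    """Illustrative expiration check: an entry is stale once its age
    exceeds max_age_seconds; "infinity" disables expiration entirely."""
    if max_age_seconds == "infinity":
        return False
    if now is None:
        now = time.monotonic()
    return (now - cached_at) > max_age_seconds

print(is_expired(cached_at=0.0, max_age_seconds=3600, now=10.0))        # False
print(is_expired(cached_at=0.0, max_age_seconds=3600, now=4000.0))      # True
print(is_expired(cached_at=0.0, max_age_seconds="infinity", now=1e12))  # False
```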
**cache_set_data**: Controls whether written data is cached
```python exec="true" session="experimental" source="above" result="ansi"
# Cache data when writing (default)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=True
)
# Don't cache written data (read-only cache)
cache = CacheStore(
    store=source_store,
    cache_store=cache_store,
    cache_set_data=False
)
```
## Cache Statistics
The CacheStore provides statistics to monitor cache performance and state:
```python exec="true" session="experimental" source="above" result="ansi"
# Access some data to generate cache activity
data = zarr_array[0:50, 0:50] # First access - cache miss
data = zarr_array[0:50, 0:50] # Second access - cache hit
# Get comprehensive cache information
info = cached_store.cache_info()
print(info['cache_store_type']) # e.g., 'MemoryStore'
print(info['max_age_seconds'])
print(info['max_size'])
print(info['current_size'])
print(info['tracked_keys'])
print(info['cached_keys'])
print(info['cache_set_data'])
```
The `cache_info()` method returns a dictionary with detailed information about the cache state.
## Cache Management
The CacheStore provides methods for manual cache management:
```python exec="true" session="experimental" source="above" result="ansi"
# Clear all cached data and tracking information
import asyncio
asyncio.run(cached_store.clear_cache())
# Check cache info after clearing
info = cached_store.cache_info()
assert info['tracked_keys'] == 0
assert info['current_size'] == 0
```
The `clear_cache()` method is an async method that clears both the cache store
(if it supports the `clear` method) and all internal tracking data.
## Best Practices
1. **Choose an appropriate cache store**: use `MemoryStore` for fast temporary caching or `LocalStore` for persistent caching
2. **Size the cache appropriately**: set `max_size` based on available storage and expected data access patterns
3. **Use with remote stores**: the cache provides the most benefit when wrapping slow remote stores
4. **Monitor cache statistics**: use `cache_info()` to tune cache size and access patterns
5. **Consider data locality**: group related data accesses together to improve cache efficiency
6. **Set appropriate expiration**: use `max_age_seconds` for time-sensitive data or `"infinity"` for static data
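To make cache sizing concrete, you can estimate how many chunks fit in a given `max_size` from the chunk shape and dtype. The numbers below assume the 100×100 float64 array with 10×10 chunks used earlier, and give an upper bound only, since compression and metadata overhead will change the real figure:

```python
import math

chunk_shape = (10, 10)
itemsize = 8  # bytes per element for dtype 'f8' (float64)

# Uncompressed size of one chunk in bytes
chunk_bytes = math.prod(chunk_shape) * itemsize
print(chunk_bytes)  # 800

# Upper bound on how many such chunks fit in a 256MB cache
max_size = 256 * 1024 * 1024
print(max_size // chunk_bytes)  # 335544
```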
## Working with Different Store Types
The CacheStore can wrap any store that implements the `zarr.abc.store.Store` interface
and use any store type for the cache backend:
### Local Store with Memory Cache
```python exec="true" session="experimental-memory-cache" source="above" result="ansi"
from zarr.storage import LocalStore, MemoryStore
from zarr.experimental.cache_store import CacheStore
from tempfile import mkdtemp
local_store_path = mkdtemp(suffix='.zarr')
source_store = LocalStore(local_store_path)
cache_store = MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=128*1024*1024
)
```
### Memory Store with Persistent Cache
```python exec="true" session="experimental-local-cache" source="above" result="ansi"
from tempfile import mkdtemp
from zarr.storage import MemoryStore, LocalStore
from zarr.experimental.cache_store import CacheStore
memory_store = MemoryStore()
local_store_path = mkdtemp(suffix='.zarr')
persistent_cache = LocalStore(local_store_path)
cached_store = CacheStore(
    store=memory_store,
    cache_store=persistent_cache,
    max_size=256*1024*1024
)
```
The dual-store architecture provides flexibility in choosing the best combination
of source and cache stores for your specific use case.
## Examples from Real Usage
Here's a complete example demonstrating cache effectiveness:
```python exec="true" session="experimental-final" source="above" result="ansi"
import numpy as np
import time
from tempfile import mkdtemp
import zarr
import zarr.storage
from zarr.experimental.cache_store import CacheStore
# Create test data with dual-store cache
local_store_path = mkdtemp(suffix='.zarr')
source_store = zarr.storage.LocalStore(local_store_path)
cache_store = zarr.storage.MemoryStore()
cached_store = CacheStore(
    store=source_store,
    cache_store=cache_store,
    max_size=256*1024*1024
)
zarr_array = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8', store=cached_store, mode='w')
zarr_array[:] = np.random.random((100, 100))
# Demonstrate cache effectiveness with repeated access
start = time.time()
data = zarr_array[20:30, 20:30] # First access (cache miss)
first_access = time.time() - start
start = time.time()
data = zarr_array[20:30, 20:30] # Second access (cache hit)
second_access = time.time() - start
# Check cache statistics
info = cached_store.cache_info()
assert info['cached_keys'] > 0 # Should have cached keys
assert info['current_size'] > 0 # Should have cached data
print(f"Cache contains {info['cached_keys']} keys with {info['current_size']} bytes")
```
This example shows how the CacheStore can significantly reduce access times for repeated
data reads, particularly important when working with remote data sources. The dual-store
architecture allows for flexible cache persistence and management.