1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
|
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---
# Date range partitioning
## Description
The scheme best supported by the GitLab migration helpers is date-range partitioning,
where each partition in the table contains data for a single month. In this case,
the partitioning key must be a timestamp or date column. For this type of
partitioning to work well, most queries must access data in a
certain date range.
For a more concrete example, consider using the `audit_events` table.
It was the first table to be partitioned in the application database. This
table tracks audit entries of security events that happen in the
application. In almost all cases, users want to see audit activity that
occurs in a certain time frame. As a result, date-range partitioning
was a natural fit for how the data would be accessed.
To look at this in more detail, imagine a simplified `audit_events` schema:
```sql
CREATE TABLE audit_events (
id SERIAL NOT NULL PRIMARY KEY,
author_id INT NOT NULL,
details jsonb NOT NULL,
created_at timestamptz NOT NULL);
```
Now imagine typical queries in the UI would display the data in a
certain date range, like a single week:
```sql
SELECT *
FROM audit_events
WHERE created_at >= '2020-01-01 00:00:00'
AND created_at < '2020-01-08 00:00:00'
ORDER BY created_at DESC
LIMIT 100
```
If the table is partitioned on the `created_at` column the base table would
look like:
```sql
CREATE TABLE audit_events (
id SERIAL NOT NULL,
author_id INT NOT NULL,
details jsonb NOT NULL,
created_at timestamptz NOT NULL,
PRIMARY KEY (id, created_at))
PARTITION BY RANGE(created_at);
```
NOTE:
The primary key of a partitioned table must include the partition key as
part of the primary key definition.
And we might have a list of partitions for the table, such as:
```sql
audit_events_202001 FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
audit_events_202002 FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
audit_events_202003 FOR VALUES FROM ('2020-03-01') TO ('2020-04-01')
```
Each partition is a separate physical table, with the same structure as
the base `audit_events` table, but contains only data for rows where the
partition key falls in the specified range. For example, the partition
`audit_events_202001` contains rows where the `created_at` column is
greater than or equal to `2020-01-01` and less than `2020-02-01`.
Now, if we look at the previous example query again, the database can
use the `WHERE` to recognize that all matching rows are in the
`audit_events_202001` partition. Rather than searching all of the data
in all of the partitions, it can search only the single month's worth
of data in the appropriate partition. In a large table, this can
dramatically reduce the amount of data the database needs to access.
However, imagine a query that does not filter based on the partitioning
key, such as:
```sql
SELECT *
FROM audit_events
WHERE author_id = 123
ORDER BY created_at DESC
LIMIT 100
```
In this example, the database can't prune any partitions from the search,
because matching data could exist in any of them. As a result, it has to
query each partition individually, and aggregate the rows into a single result
set. Because `author_id` would be indexed, the performance impact could
likely be acceptable, but on more complex queries the overhead can be
substantial. Partitioning should only be leveraged if the access patterns
of the data support the partitioning strategy, otherwise performance
suffers.
## Example
### Step 1: Creating the partitioned copy (Release N)
The first step is to add a migration to create the partitioned copy of
the original table. This migration creates the appropriate
partitions based on the data in the original table, and install a
trigger that syncs writes from the original table into the
partitioned copy.
An example migration of partitioning the `audit_events` table by its
`created_at` column would look like:
```ruby
class PartitionAuditEvents < Gitlab::Database::Migration[2.1]
include Gitlab::Database::PartitioningMigrationHelpers
def up
partition_table_by_date :audit_events, :created_at
end
def down
drop_partitioned_table_for :audit_events
end
end
```
After this has executed, any inserts, updates, or deletes in the
original table are also duplicated in the new table. For updates and
deletes, the operation only has an effect if the corresponding row
exists in the partitioned table.
### Step 2: Backfill the partitioned copy (Release N)
The second step is to add a post-deployment migration that schedules
the background jobs that backfill existing data from the original table
into the partitioned copy.
Continuing the above example, the migration would look like:
```ruby
class BackfillPartitionAuditEvents < Gitlab::Database::Migration[2.1]
include Gitlab::Database::PartitioningMigrationHelpers
disable_ddl_transaction!
restrict_gitlab_migration gitlab_schema: :gitlab_main
def up
enqueue_partitioning_data_migration :audit_events
end
def down
cleanup_partitioning_data_migration :audit_events
end
end
```
This step [queues a batched background migration](../batched_background_migrations.md#enqueue-a-batched-background-migration) internally with BATCH_SIZE and SUB_BATCH_SIZE as `50,000` and `2,500`. Refer [Batched Background migrations guide](../batched_background_migrations.md) for more details.
### Step 3: Post-backfill cleanup (Release N+1)
This step must occur at least one release after the release that
includes step (2). This gives time for the background
migration to execute properly in self-managed installations. In this step,
add another post-deployment migration that cleans up after the
background migration. This includes forcing any remaining jobs to
execute, and copying data that may have been missed, due to dropped or
failed jobs.
WARNING:
A required stop must occur between steps 2 and 3 to allow the background migration from step 2 to complete successfully.
Once again, continuing the example, this migration would look like:
```ruby
class CleanupPartitionedAuditEventsBackfill < Gitlab::Database::Migration[2.1]
include Gitlab::Database::PartitioningMigrationHelpers
disable_ddl_transaction!
restrict_gitlab_migration gitlab_schema: :gitlab_main
def up
finalize_backfilling_partitioned_table :audit_events
end
def down
# no op
end
end
```
After this migration completes, the original table and partitioned
table should contain identical data. The trigger installed on the
original table guarantees that the data remains in sync going forward.
### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)
This step replaces the non-partitioned table with its partitioned copy, this should be used only after all other migration steps have completed successfully.
Some limitations to this method MUST be handled before, or during, the swap migration:
- Secondary indexes and foreign keys are not automatically recreated on the partitioned table.
- Some types of constraints (UNIQUE and EXCLUDE) which rely on indexes, are not automatically recreated
on the partitioned table, since the underlying index will not be present.
- Foreign keys referencing the original non-partitioned table should be updated to reference the
partitioned table. This is not supported in PostgreSQL 11.
- Views referencing the original table are not automatically updated to reference the partitioned table.
```ruby
# frozen_string_literal: true
class SwapPartitionedAuditEvents < ActiveRecord::Migration[6.0]
include Gitlab::Database::PartitioningMigrationHelpers
def up
replace_with_partitioned_table :audit_events
end
def down
rollback_replace_with_partitioned_table :audit_events
end
end
```
After this migration completes:
- The partitioned table replaces the non-partitioned (original) table.
- The sync trigger created earlier is dropped.
The partitioned table is now ready for use by the application.
|