File: deduplicate_database_records.md

package info (click to toggle)
gitlab 17.6.5-19
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 629,368 kB
  • sloc: ruby: 1,915,304; javascript: 557,307; sql: 60,639; xml: 6,509; sh: 4,567; makefile: 1,239; python: 406
file content (161 lines) | stat: -rw-r--r-- 6,058 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# Deduplicate database records in a database table

This guide describes a strategy for introducing database-level uniqueness constraint (unique index) to existing database tables with data.

Requirements:

- Attribute modifications (`INSERT`, `UPDATE`) related to the columns happen only via ActiveRecord (the technique depends on AR callbacks).
- Duplications are rare and mostly happen due to concurrent record creation. This can be verified by checking the production database table via teleport (reach out to a database maintainer for help).

The total runtime mainly depends on the number of records in the database table. The migration will require scanning all records; to fit into the
post-deployment migration runtime limit (about 10 minutes), database table with less than 10 million rows can be considered a small table.

## Deduplication strategy for small tables

The strategy requires 3 milestones. As an example, we're going to deduplicate the `issues` table based on the `title` column where the `title` must be unique for a given `project_id` column.

Milestone 1:

1. Add a new database index (not unique) to the table via post-migration (if not present already).
1. Add model-level uniqueness validation to reduce the likelihood of duplicates (if not present already).
1. Add a transaction-level [advisory lock](https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS) to prevent creating duplicate records.

The second step on its own will not prevent duplicate records, see the [Rails guides](https://guides.rubyonrails.org/active_record_validations.html#uniqueness) for more information.

Post-migration for creating the index:

```ruby
def up
  add_concurrent_index :issues, [:project_id, :title], name: INDEX_NAME
end

def down
  remove_concurrent_index_by_name :issues, INDEX_NAME
end
```

The `Issue` model validation and the advisory lock:

```ruby
class Issue < ApplicationRecord
  validates :title, uniqueness: { scope: :project_id }
  before_validation :prevent_concurrent_inserts

  private

  # This method will block while another database transaction attempts to insert the same data.
  # After the lock is released by the other transaction, the uniqueness validation may fail
  # with record not unique validation error.

  # Without this block the uniqueness validation wouldn't be able to detect duplicated
  # records as transactions can't see each other's changes.
  def prevent_concurrent_inserts
    return if project_id.nil? || title.nil?

    lock_key = ['issues', project_id, title].join('-')
    lock_expression = "hashtext(#{connection.quote(lock_key)})"
    connection.execute("SELECT pg_advisory_xact_lock(#{lock_expression})")
  end
end
```

Milestone 2:

1. Implement the deduplication logic in a post deployment migration.
1. Replace the existing index with a unique index.

How to resolve duplicates (e.g., merge attributes, keep the most recent record) depends on the features built on top of the database table. In this example, we keep the most recent record.

```ruby
def up
  model = define_batchable_model('issues')

  # Single pass over the table
  model.each_batch do |batch|
    # find duplicated (project_id, title) pairs
    duplicates = model
      .where("(project_id, title) IN (#{batch.select(:project_id, :title).to_sql})")
      .group(:project_id, :title)
      .having('COUNT(*) > 1')
      .pluck(:project_id, :title)

    value_list = Arel::Nodes::ValuesList.new(duplicates).to_sql

    # Locate all records by (project_id, title) pairs and keep the most recent record.
    # The lookup should be fast enough if duplications are rare.
    cleanup_query = <<~SQL
    WITH duplicated_records AS MATERIALIZED (
      SELECT
        id,
        ROW_NUMBER() OVER (PARTITION BY project_id, title ORDER BY project_id, title, id DESC) AS row_number
      FROM issues
      WHERE (project_id, title) IN (#{value_list})
      ORDER BY project_id, title
    )
    DELETE FROM issues
    WHERE id IN (
      SELECT id FROM duplicated_records WHERE row_number > 1
    )
    SQL

    model.connection.execute(cleanup_query)
  end
end

def down
  # no-op
end
```

NOTE:
This is a destructive operation with no possibility of rolling back. Make sure that the deduplication logic is tested thoroughly.

Replacing the old index with a unique index:

```ruby
def up
  add_concurrent_index :issues, [:project_id, :title], name: UNIQUE_INDEX_NAME, unique: true
  remove_concurrent_index_by_name :issues, INDEX_NAME
end

def down
  add_concurrent_index :issues, [:project_id, :title], name: INDEX_NAME
  remove_concurrent_index_by_name :issues, UNIQUE_INDEX_NAME
end
```

Milestone 3:

1. Remove the advisory lock by removing the `prevent_concurrent_inserts` ActiveRecord callback method.

NOTE:
This milestone must be after a [required stop](required_stops.md).

## Deduplicate strategy for large tables

When deduplicating a large table we can move the batching and the deduplication logic into a [batched background migration](batched_background_migrations.md).

Milestone 1:

1. Add a new database index (not unique) to the table via post migration.
1. Add model-level uniqueness validation to reduce the likelihood of duplicates (if not present already).
1. Add a transaction-level [advisory lock](https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS) to prevent creating duplicate records.

Milestone 2:

1. Implement the deduplication logic in a batched background migration and enqueue it in a post deployment migration.

Milestone 3:

1. Finalize the batched background migration.
1. Replace the existing index with a unique index.
1. Remove the advisory lock by removing the `prevent_concurrent_inserts` ActiveRecord callback method.

NOTE:
This milestone must be after a [required stop](required_stops.md).