1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351
|
# Indexes - Optimize Queries
We just saw how to get some data `WHERE` a **condition** is true. For example, where the hero **name is "Deadpond"**.
If we just create the tables and the data as we have been doing, when we `SELECT` some data using `WHERE`, the database would have to **scan** through **each one of the records** to find the ones that **match**. This is not a problem with 3 heroes as in these examples.
But imagine that your database has **thousands** or **millions** of **records**, if every time you want to find the heroes with the name "Deadpond" it has to scan through **all** of the records to find all the possible matches, then that becomes problematic, as it would be too slow.
I'll show you how to handle it with a database **index**.
The change in the code is **extremely small**, but it's useful to understand what's happening behind the scenes, so I'll show you **how it all works** and what it means.
---
If you already executed the previous examples and have a database with data, **remove the database file** before running each example, that way you won't have duplicate data and you will be able to get the same results.
## No Time to Explain
Are you already a **SQL expert** and don't have time for all my explanations?
Fine, in that case, you can **sneak peek** the final code to create indexes here.
{* ./docs_src/tutorial/indexes/tutorial002_py310.py ln[0] *}
..but if you are not an expert, **continue reading**, this will probably be useful. 🤓
## What is an Index
In general, an **index** is just something we can have to help us **find things faster**. It normally works by having things in **order**. Let's think about some real-life examples before even thinking about databases and code.
### An Index and a Dictionary
Imagine a **dictionary**, a book with definitions of words. 📔 ...not a Python `dict`. 😅
Let's say that you want to **find a word**, for example the word "**database**". You take the dictionary, and open it somewhere, for example in the middle. Maybe you see some definitions of words that start with `m`, like `manual`, so you conclude that you are in the letter `m` in the dictionary.
<img src="/img/tutorial/indexes/dictionary001.drawio.svg">
You know that in the alphabet, the letter `d` for `database` comes **before** the letter `m` for `manual`.
<img src="/img/tutorial/indexes/dictionary002.drawio.svg">
So, you know you have to search in the dictionary **before** the point you currently are. You still don't know where the word `database` is, because you don't know exactly where the letter `d` is in the dictionary, but you know that **it is not after** that point, you can now **discard the right half** of the dictionary in your search.
<img src="/img/tutorial/indexes/dictionary003.drawio.svg">
Next, you **open the dictionary again**, but only taking into account the **half of the dictionary** that can contain the word you want, the **left part of the dictionary**. You open it in the middle of that left part and now you arrive maybe at the letter `f`.
<img src="/img/tutorial/indexes/dictionary004.drawio.svg">
You know that `d` from `database` comes before `f`. So it has to be **before** that. But now you know that `database` **is not after** that point, and you can discard the dictionary from that point onward.
<img src="/img/tutorial/indexes/dictionary005.drawio.svg">
Now you have a **small section of dictionary** to search (only a **quarter** of dictionary can have your word). You take that **quarter** of the pages at the start of the dictionary that can contain your word, and open it in the middle of that section. Maybe you arrive at the letter `c`.
<img src="/img/tutorial/indexes/dictionary005.drawio.svg">
You know the word `database` has to be **after** that and **not before** that point, so you can discard the left part of that block of pages.
<img src="/img/tutorial/indexes/dictionary007.drawio.svg">
You repeat this process **a few more times**, and you finally arrive at the letter `d`, you continue with the same process in that section for the letter `d` and you finally **find the word** `database`. 🎉
<img src="/img/tutorial/indexes/dictionary008.drawio.svg">
You had to open the dictionary a few times, maybe **5 or 10**. That's actually **very little work** compared to what it could have been.
/// note | Technical Details
Do you like **fancy words**? Cool! Programmers tend to like fancy words. 😅
That <abbr title="a recipe, a sequence of predefined steps that achieve a result">algorithm</abbr> I showed you above is called **Binary Search**.
It's called that because you **search** something by splitting the dictionary (or any ordered list of things) in **two** ("binary" means "two") parts. And you do that process multiple times until you find what you want.
///
### An Index and a Novel
Let's now imagine you are reading a **novel book**. And someone told you that at some point, they mention a **database**, and you want to find that chapter.
How do you find the word "*database*" there? You might have to read **the entire book** to find where the word "*database*" is located in the book. So, instead of opening the book 5 or 10 times, you would have to open each of the **500 pages** and read them one by one until you find the word. You might enjoy the book, though. 😅
But if we are only interested in **quickly finding information** (as when working with SQL databases), then reading each of the 500 pages is **too inefficient** when there could be an option to open the book in 5 or 10 places and find what you're looking for.
### A Technical Book with an Index
Now let's imagine you are reading a technical book. For example, with several topics about programming. And there's a couple of sections where it talks about a **database**.
This book might have a **book index**: a section in the book that has some **names of topics covered** and the **page numbers** in the book where you can read about them. And the topic names are **sorted** in alphabetic order, pretty much like a dictionary (a book with words, as in the previous example).
In this case, you can open that book in the end (or in the beginning) to find the **book index** section, it would have only a few pages. And then, you can do the same process as with the **dictionary** example above.
Open the index, and after **5 or 10 steps**, quickly find the topic "**database**" with the page numbers where that is covered, for example "page 253 in Chapter 5". Now you used the dictionary technique to find the **topic**, and that topic gave you a **page number**.
Now you know that you need to find "**page 253**". But by looking at the closed book you still don't know where that page is, so you have to **find that page**. To find it, you can do the same process again, but this time, instead of searching for a **topic** in the **index**, you are searching for a **page number** in the **entire book**. And after **5 or 10 more steps**, you find the page 253 in Chapter 5.
<img src="/img/tutorial/indexes/techbook001.drawio.svg">
After this, even though this book is not a dictionary and has some particular content, you were able to **find the section** in the book that talks about a "**database**" in a **few steps** (say 10 or 20, instead of reading all the 500 pages).
The main point is that the index is **sorted**, so we can use the same process we used for the **dictionary** to find the topic. And then that gives us a page number, and the **page numbers are also sorted**! 😅
When we have a list of sorted things we can apply the same technique, and that's the whole trick here, we use the same technique first for the **topics** in the index and then for the **page numbers** to find the actual chapter.
Such efficiency! 😎
## What are Database Indexes
**Database indexes** are very similar to **book indexes**.
Database indexes store some info, some keys, in a way that makes it **easy and fast to find** (for example sorted), and then for each key they **point to some data somewhere else** in the database.
Let's see a more clear example. Let's say you have this table in a database:
<table>
<tr>
<th>id</th><th>name</th><th>secret_name</th><th>age</th>
</tr>
<tr>
<td>1</td><td>Deadpond</td><td>Dive Wilson</td><td>null</td>
</tr>
<tr>
<td>2</td><td>Spider-Boy</td><td>Pedro Parqueador</td><td>null</td>
</tr>
<tr>
<td>3</td><td>Rusty-Man</td><td>Tommy Sharp</td><td>48</td>
</tr>
</table>
And let's imagine you have **many more rows**, many more heroes. Probably **thousands**.
If you tell the SQL database to get you a hero by a specific name, for example `Spider-Boy` (by using the `name` in the `WHERE` part of the SQL query), the database will have to **scan** all the heroes, checking **one by one** to find all the ones with a name of `Spider-Boy`.
In this case, there's only one, but there's nothing limiting the database from having **more records with the same name**. And because of that, the database would **continue searching** and checking each one of the records, which would be very slow.
But now let's say that the database has an index for the column `name`. The index could look something like this, we could imagine that the index is like an additional special table that the database manages automatically:
<table>
<tr>
<th>name</th><th>id</th>
</tr>
<tr>
<td>Deadpond</td><td>1</td>
</tr>
<tr>
<td>Rusty-Man</td><td>3</td>
</tr>
<tr>
<td>Spider-Boy</td><td>2</td>
</tr>
</table>
It would have each `name` field from the `hero` table **in order**. It would not be sorted by `id`, but by `name` (in alphabetical order, as the `name` is a string). So, first it would have `Deadpond`, then `Rusty-Man`, and last `Spider-Boy`. It would also include the `id` of each hero. Remember that this could have **thousands** of heroes.
Then the database would be able to use more or less the same ideas in the examples above with the **dictionary** and the **book index**.
It could start somewhere (for example, in the middle of the index). It could arrive at some hero there in the middle, like `Rusty-Man`. And because the **index** has the `name` fields in order, the database would know that it can **discard all the previous index rows** and **only search** in the following index rows.
<table>
<tr>
<th>name</th><th>id</th>
</tr>
<tr style="background-color: #F5F5F5; color: #999999;">
<td>Deadpond</td><td>1</td>
</tr>
<tr style="background-color: #F5F5F5; color: #999999;">
<td>Rusty-Man</td><td>3</td>
</tr>
<tr style="background-color: #FFF2CC;">
<td>Spider-Boy</td><td>2</td>
</tr>
</table>
And that way, as with the example with the dictionary above, **instead of reading thousands of heroes**, the database would be able to do a few steps, say **5 or 10 steps**, and arrive at the row of the index that has `Spider-Boy`, even if the table (and index) has thousands of rows:
<table>
<tr>
<th>name</th><th>id</th>
</tr>
<tr style="background-color: #F5F5F5; color: #999999;">
<td>Deadpond</td><td>1</td>
</tr>
<tr style="background-color: #F5F5F5; color: #999999;">
<td>Rusty-Man</td><td>3</td>
</tr>
<tr style="background-color: #D5E8D4;">
<td>✨ Spider-Boy ✨</td><td>2</td>
</tr>
</table>
Then by looking at **this index row**, it would know that the `id` for `Spider-Boy` in the `hero` table is `2`.
So then it could **search that `id`** in the `hero` table using more or less the **same technique**.
That way, in the end, instead of reading thousands of records, the database only had to do **a few steps** to find the hero we wanted.
## Updating the Index
As you can imagine, for all this to work, the index would need to be **up to date** with the data in the database.
If you had to update it **manually** in code, it would be very cumbersome and **error-prone**, as it would be easy to end up in a state where the index is not up to date and points to incorrect data. 😱
Here's the good news: when you create an index in a **SQL Database**, the database takes care of **updating** it **automatically** whenever it's necessary. 😎🎉
If you **add new records** to the `hero` table, the database will **automatically** update the index. It will do the **same process** of **finding** the right place to put the new index data (those **5 or 10 steps** described above), and then it will save the new index information there. The same would happen when you **update** or **delete** data.
Defining and creating an index is very **easy** with SQL databases. And then **using it** is even easier... it's transparent. The database will figure out which index to use automatically, the SQL queries don't even change.
So, in SQL databases **indexes are great**! And are super **easy to use**. Why not just have indexes for everything? .....Because indexes also have a "**cost**" in computation and storage (disk space).
## Index Cost
There's a **cost** associated with **indexes**. 💰
When you don't have an index and add a **new row** to the table `hero`, the database has to perform **1 operation** to add the new hero row at the end of the table.
But if you have an **index** for the **hero names**, now the database has to perform the same **1 operation** to add that row **plus** some extra **5 or 10 operations** in the index, to find the right spot for the name, to then add that **index record** there.
And if you have an index for the `name`, one for the `age`, and one for the `secret_name`, now the database has to perform the same **1 operation** to add that row **plus** some extra **5 or 10 operations** in the index **times 3**, for each of the indexes. This means that now adding one row takes something like **31 operations**.
This also means that you are **exchanging** the time it takes to **read** data for the time it takes to **write** data plus some extra **space** in the database.
If you have queries that get data out of the database comparing each one of those fields (for example using `WHERE`), then it makes total sense to have indexes for each one of them. Because **31 operations** while creating or updating data (plus the space of the index) is much, much better than the possible **500 or 1000 operations** to read all the rows to be able to compare them using each field.
But if you **never** have queries that find records by the `secret_name` (you never use `secret_name` in the `WHERE` part) it probably doesn't make sense to have an index for the `secret_name` field/column, as that will increase the computational and space **cost** of writing and updating the database.
## Create an Index with SQL
Phew, that was a lot of theory and explanations. 😅
The most important thing about indexes is **understanding** them, how, and when to use them.
Let's now see the **SQL** syntax to create an **index**. It is very simple:
```SQL hl_lines="3"
CREATE INDEX ix_hero_name
ON hero (name)
```
This means, more or less:
> Hey SQL database 👋, please `CREATE` an `INDEX` for me.
>
> I want the name of the index to be `ix_hero_name`.
>
> This index should be `ON` the table `hero`, it refers to that table.
>
> The column I want you to use for it is `name`.
## Declare Indexes with SQLModel
And now let's see how to define indexes in **SQLModel**.
The change in code is underwhelming, it's very simple. 😆
Here's the `Hero` model we had before:
{* ./docs_src/tutorial/where/tutorial001_py310.py ln[1:8] hl[6] *}
Let's now update it to tell **SQLModel** to create an index for the `name` field when creating the table:
{* ./docs_src/tutorial/indexes/tutorial001_py310.py ln[1:8] hl[6] *}
We use the same `Field()` again as we did before, and set `index=True`. That's it! 🚀
Notice that we didn't set an argument of `default=None` or anything similar. This means that **SQLModel** (thanks to Pydantic) will keep it as a **required** field.
/// info
SQLModel (actually SQLAlchemy) will **automatically generate the index name** for you.
In this case the generated name would be `ix_hero_name`.
///
## Query Data
Now, to query the data using the field `name` and the new index we don't have to do anything special or different in the code, it's just **the same code**.
The SQL database will figure it out **automatically**. ✨
This is great because it means that indexes are very **simple to use**. But it might also feel counterintuitive at first, as you are **not doing anything** explicitly in the code to make it obvious that the index is useful, it all happens in the database behind the scenes.
{* ./docs_src/tutorial/indexes/tutorial001_py310.py ln[34:39] hl[36] *}
This is exactly the same code as we had before, but now the database will **use the index** underneath.
## Run the Program
If you run the program now, you will see an output like this:
<div class="termy">
```console
$ python app.py
// Some boilerplate output omitted 😉
// Create the table
CREATE TABLE hero (
id INTEGER,
name VARCHAR NOT NULL,
secret_name VARCHAR NOT NULL,
age INTEGER,
PRIMARY KEY (id)
)
// Create the index 🤓🎉
CREATE INDEX ix_hero_name ON hero (name)
// The SELECT with WHERE looks the same
INFO Engine SELECT hero.id, hero.name, hero.secret_name, hero.age
FROM hero
WHERE hero.name = ?
INFO Engine [no key 0.00014s] ('Deadpond',)
// The resulting hero
secret_name='Dive Wilson' age=None id=1 name='Deadpond'
```
</div>
## More Indexes
We are going to query the `hero` table doing comparisons on the `age` field too, so we should **define an index** for that one as well:
{* ./docs_src/tutorial/indexes/tutorial002_py310.py ln[1:8] hl[8] *}
In this case, we want the default value of `age` to continue being `None`, so we set `default=None` when using `Field()`.
Now when we use **SQLModel** to create the database and tables, it will also create the **indexes** for these two columns in the `hero` table.
So, when we query the database for the `hero` table and use those **two columns** to define what data we get, the database will be able to **use those indexes** to improve the **reading performance**. 🚀
## Primary Key and Indexes
You probably noticed that we didn't set `index=True` for the `id` field.
Because the `id` is already the **primary key**, the database will automatically create an internal **index** for it.
The database always creates an internal index for **primary keys** automatically, as those are the primary way to organize, store, and retrieve data. 🤓
But if you want to be **frequently querying** the SQL database for any **other field** (e.g. using any other field in the `WHERE` section), you will probably want to have at least an **index** for that.
## Recap
**Indexes** are very important to improve **reading performance** and speed when querying the database. 🏎
Creating and using them is very **simple** and easy. The most important part is to understand **how** they work, **when** to create them, and for **which columns**.
|