MySQL Indexing 101

Storm Anning
27 min read · Jan 15, 2019


Indexing is very important; it allows us to:

  • Prevent queries from doing full scans of an entire dataset
  • Access fewer rows, and therefore lock fewer rows, during queries
  • Speed up queries drastically
  • Avoid sorting records after fetching them
  • Impose constraints, e.g. data uniqueness
  • Join datasets efficiently
  • And so on…

All in all, good indexing can drastically improve response times.

Common problems that surface when applying indexes are:

  • Having an index for every column in the table
  • Not utilizing composite (multicolumn) indexes
  • Using composite indexes but with ineffective column orderings that prevent the index being fully utilized
  • Creating indexes based on rules of thumb or heuristics, such as indexing all columns that appear in the WHERE clause

It’s very important to be able to reason about how indexes work and to choose them based on that understanding. In this article I’ll give a gentle introduction to the basics of how indexes work, and show how they can drastically improve query performance when used correctly.

With that out of the way, let’s begin!

Indexes Conceptually

Conceptually the main point of an index is to allow you to find data faster — regardless of its underlying implementation (B-Tree, Hash, R-Tree etc.)

With that in mind, here’s a very simple example. If you’ve used indexes before, feel free to skip this intro section. Also note that this is only a conceptual example of why sorting data assists lookups; it doesn’t directly reflect how indexes are actually stored in MySQL. We’ll cover those details later.

Imagine you’re trying to find your friend’s number. Unfortunately for you, all you have is an unsorted phonebook that simply stores first names, last names, and numbers in a completely random order. Pretty useless, right?

No indexes

Without any sorting, finding your friendly neighborhood Zuckerberg’s number could require looking at every single entry in the phonebook. This would be known as a table scan in the database world; an example of when a lookup uses no indexes. It’s highly inefficient, especially when there are billions of people in the world.

Adding a single column index:

If we instead sorted this phonebook by last name, it would at least allow us to easily navigate to the grouping of records we’re interested in:

This is an example of a single column index on (last_name).

ALTER TABLE `phonebook` ADD INDEX (`last_name`);
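For reference, the examples that follow could run against a minimal phonebook table like this (the column types are my assumptions, not from the original):

```sql
-- Hypothetical phonebook schema used throughout the examples:
CREATE TABLE `phonebook` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `first_name` VARCHAR(64) NOT NULL,
  `last_name` VARCHAR(64) NOT NULL,
  `phone_number` VARCHAR(20) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
```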

In a large phone book however there could still be a lot of people with the same last_name:

There are 513 people with the last name ‘Zuckerberg’, but over 4 million people with the last name ‘Smith’, meaning you could still end up having to scan through 4 million rows.

*Unless of course you’re lucky enough to be friends with a Mr. Cheesehead, who seems to be the only man blessed with this last name*

Clearly a single column index alone isn’t enough to be able to find our friend efficiently.

Why don’t we add an index on each of our columns? We could add multiple single column indexes like so:

ALTER TABLE `phonebook` ADD INDEX (`last_name`);
ALTER TABLE `phonebook` ADD INDEX (`first_name`);
ALTER TABLE `phonebook` ADD INDEX (`phone_number`);

This can be imagined as now having three phone books, each sorted by the column you indexed on. This doesn’t really help you much though! You could simulate an Index Merge (more on this later) by intersecting the results you find from each of the three, but this still isn’t really that effective!

Furthermore, MySQL will typically only use one index per table, per query, so things are going to be much more performant if we design a single index to satisfy our lookup!

A multiple column (composite) index to the rescue!

To speed this up, we can sort our list by last name and then by first name.

ALTER TABLE `phonebook` ADD INDEX (`last_name`, `first_name`);

This is what’s known as a composite index, because we’re indexing on a compound of multiple columns: in this case, a composite of (last_name, first_name).

This index would sort entries as you’d expect in a typical phonebook. By last name, and then by first name. If you performed a look up such as:

SELECT phone_number FROM phonebook WHERE first_name = 'john' AND last_name = 'south';

The way the index has sorted entries would allow you to easily find your friend’s number, probably following a path similar to the one highlighted.

By adding several columns into an index we can very quickly narrow down the rows we’re interested in, avoiding having to scan through large amounts of unsorted rows in a table-scan manner.

The most important thing to glean from this is the way the data has been sorted. We no longer need to do any table scanning, because the sorting allows us to be very efficient in narrowing down our data!

We can now easily find our mate Zuck!

Of course it’s easy to see that the order in which these columns are indexed, and thus how they are sorted, will limit how effective an index can be for a particular query.

As an example: the index we’ve just discussed (last_name, first_name) would not be beneficial for a query such as:

SELECT * FROM phonebook WHERE first_name = 'Donald';

This is because we haven’t specified any criteria for the last_name. The index we built doesn’t sort first names in any logical distribution that would allow us to quickly search through them, so instead we must fall back to a full table scan (refer back to the example pic above).

This is just one caveat, and I’ll come back to several of these in a second. So far we’ve learnt that:

  1. Using composite indexes is vital if you’re trying to speed up a particular query
  2. Index order is very VERY important

Examples:

INDEX (last, first)

  • WHERE last = … *** Good
  • WHERE last = … AND first = … *** Good
  • WHERE first = … AND last = … *** Good
  • WHERE first = … *** Index Useless

INDEX (a, b) VS INDEX (b, a)

  • WHERE a = 1 AND b = 3 *** Both work well
  • WHERE b = 2 *** Second only
  • WHERE a = 4 *** First only

INDEX(a), INDEX(b) is not the same as INDEX(a,b)

But using the two indexes (a),(b) can benefit from:
(SELECT … WHERE A …) UNION (SELECT … WHERE B …)
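A concrete sketch of that pattern, assuming a hypothetical table `t` with single-column indexes on `a` and `b`:

```sql
-- Each branch can use its own single-column index; UNION combines
-- (and de-duplicates) the two result sets:
(SELECT id FROM t WHERE a = 1)
UNION
(SELECT id FROM t WHERE b = 2);
```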

Now that you get the basic idea of how sorting data in particular ways makes lookups much easier, I’ll discuss how indexing works in MySQL.

MySQL Architecture

Firstly, MySQL’s design supports a wide range of underlying storage engines; here’s a simple picture to illustrate that.

Clients connect to MySQL and issue queries (which may or may not already be cached); these queries are parsed and optimized, and then MySQL, through a defined API, interacts with the chosen storage engine to retrieve/persist data. Each storage engine has different properties that make it suitable for different use cases. Details around each are best left for another doc.

InnoDB Indexes

A majority of our tables are set up using the InnoDB storage engine, which uses B-Tree based indexes, so we’ll be discussing these exclusively (there are also many other types of indexes, such as hash indexes, full-text indexes, and spatial R-tree indexes, all of which are out of scope).

Feel free to check the engines being used for your tables using:

SHOW TABLE STATUS\G
SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES;

Below is a visual representation of a B-Tree:

Similar to a binary search, B-trees allow effective lookups of key value pairs by sorting data into a data structure that can be traversed efficiently; avoiding the full scans we talked about earlier.

You start from the root node and follow down towards the leaf nodes. Each node, as shown in the example above, has lower and upper bounds on keys that guide your search, pruning down the set of leaves you’re interested in.

When we add a new index, a B-Tree is constructed using the respective keys (dictated by which columns you’ve indexed).

When you issue a query to your database the MySQL optimizer will evaluate among all relevant indexes which to use during lookups (the main cost metric is how much data the query will access).

An important thing to note is how Primary Key indexes and Secondary key indexes differ.

Primary Key Indexes

Primary keys use a clustered index: clustered because the actual row data is stored (clustered) together with the key:

In most cases this will be an FBID or a PK that you’ve allocated yourself and will be a unique identifier for a given row in your table. As you can see, at the leaf of this B-Tree the row data is clustered together with the keys.

In this case the clustered index literally is the table itself, and not a separate structure to the table.

Because the clustered index “is” the table in InnoDB, it’s important that you choose a suitable primary key, as this key will be used often, and restructuring can be very expensive.

Non-sequential primary keys can lead to fragmentation issues, causing page splits and disk fragmentation that add overhead to I/O operations.

You should strive to insert data in primary key order when using InnoDB, and you should try to use a clustering key that gives monotonically increasing values for each new row. This ensures that rows are inserted in sequential order and offers better performance for joins using primary keys.
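A common way to get such a key is an AUTO_INCREMENT surrogate; a sketch with hypothetical columns:

```sql
CREATE TABLE `messages` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `body` TEXT,
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
-- New rows always append at the "right edge" of the clustered index,
-- avoiding the page splits that randomly distributed keys tend to cause.
```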

Secondary Indexes:

Obviously there can only be one clustered index, because you can’t store the row data in two places at once; therefore secondary indexes (any indexes we apply that aren’t the primary) are not clustered, and are in fact separate structures to the table itself.

The leaf nodes of secondary indexes don’t store row data as the Primary Key B-Tree does; instead they simply store Primary Key values, which serve as “pointers” to the row data, as you can see below:

This typically means that when utilizing a secondary index, InnoDB will first use the B-Tree of the secondary index to retrieve the Primary Key values of the applicable rows, and then use those values in conjunction with the Primary Key B-Tree to fetch the row data!

Because the Primary Key is appended to every secondary index in InnoDB, don’t pick huge PKs. Ideally keep them short so they don’t take up too much memory, and remember all data will be clustered onto this primary key; therefore a bulky Primary Key will lead to bulky secondary indexes.

Hmm… is this slow?

It should be quite obvious that this additional lookup to follow the Primary Key ‘pointer’ from the secondary index has some overhead. It’s still quick because the primary key is indexed, but this is where the ‘covering index’ optimization comes into play:

Covering Indexes

If your secondary index holds all the data needed to satisfy your query (it ‘covers’ it) then you don’t need to follow the Primary Key values to fetch any additional data!

Going back to our phonebook example: if we had an index on (`last_name`, `first_name`) and we ran a query such as:

SELECT phone_number FROM phonebook WHERE last_name = 'wiggum' AND first_name = 'chief';

We’d be able to quickly retrieve the PK values for the records; however, to fetch the phone_number we would still need to follow the PK values in order to fetch row data from the clustered Primary Key index.

However! If we also added our phone_number column to our index, like so: (last_name, first_name, phone_number), the same query is now completely satisfied (covered) by the index, so no additional lookups are required, making it run much faster.

In general, an index is said to be covering if it can completely satisfy the data required by the query.
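Sketching the phonebook version of this: after extending the index, EXPLAIN’s Extra column should show “Using index” for the covered query.

```sql
ALTER TABLE `phonebook` ADD INDEX (`last_name`, `first_name`, `phone_number`);

-- Every column the query touches lives in the secondary index,
-- so InnoDB never needs to visit the clustered PK B-Tree:
EXPLAIN SELECT phone_number FROM phonebook
WHERE last_name = 'wiggum' AND first_name = 'chief';
```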

So far, indexes have proven useful in speeding up lookups on columns specified in the WHERE clause, and in data retrieval by covering the columns specified in your projections. But that isn’t all!

How Indexes Affect `ORDER BY` and `GROUP BY`

Another often overlooked factor is how indexes can be used during ordering and grouping!

Say you’ve only got an index on (last_name) in your phone book, and you run the query:

SELECT * FROM phonebook WHERE last_name = 'Simpson' ORDER BY first_name;

This will use the last_name index, as you’d hope, to quickly narrow down the records with that last name. Sadly, you now need to perform a sort on the resulting records in order to get them sorted by first name. This is because the index didn’t sort the results by first_name in any meaningful way.

This is known as a file sort: a sort that occurs after the rows are fetched; it requires pulling the data into a temporary buffer and sorting it before finally returning. This wouldn’t have been needed if the data was already sorted by the index in the way you wanted!

This applies even if you only wanted to read 5 rows. Say you ran the following:

SELECT last_name FROM phonebook WHERE first_name = 'Homer' ORDER BY last_name LIMIT 5;

You’ll still be fetching thousands of records, sorting them, and only after this returning the top 5, while discarding the rest of the records you spent time processing.

Clearly, had we indexed on (`last_name`, `first_name`) we wouldn’t have needed to perform this extra sort, because the records would already have been sorted for us in the order we wanted.

And this is another important use case for indexes: avoiding these file sorts.
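To check for this in practice: with the composite index in place, EXPLAIN for the Simpson query above should no longer report a filesort.

```sql
ALTER TABLE `phonebook` ADD INDEX (`last_name`, `first_name`);

-- Rows matching last_name = 'Simpson' are already stored in first_name
-- order within the index, so Extra should not contain "Using filesort":
EXPLAIN SELECT * FROM phonebook
WHERE last_name = 'Simpson' ORDER BY first_name;
```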

This also applies to GROUP BY statements. If we ran the following query with the composite index on (last_name, first_name):

SELECT * FROM phonebook WHERE last_name = 'Burns' GROUP BY first_name;

The records would already be sorted by last_name, allowing us to quickly filter down to the records with the last_name ‘Burns’. These results are also sorted by first_name, due to the second part of the index, so they are intrinsically already grouped for us! We don’t need to perform any additional sorting at the end, which would add further overhead to our query.

Conclusions from this:

Indexes are useful not just for navigating tables quickly, but also for speeding up ORDER BY and GROUP BYs.

Ordering the results by the index only works when the index’s order is exactly the same as the ​ORDER BY clause and all columns are sorted in the same direction (ascending or descending).

INDEX (a,b)

  • ORDER BY a ASC, b ASC *** Good
  • ORDER BY a DESC, b DESC *** Good
  • ORDER BY a ASC, b DESC *** Cannot use index

The most efficient order in which to retrieve records is the one the index sorts by. Any busy system should avoid ordering sets of records server-side, especially when paginating and reading subsets of ordered sets, because such queries will access thousands of records each time.

How Indexes Affect `JOINS`

Indexes also greatly affect the speed of JOIN operations. Let’s say you have two tables: users, and user_meta_data (which has a column user_id that refers to the users table’s PK, id).

If both tables have 1000 records and you run the following:

SELECT * FROM users JOIN user_meta_data ON user_meta_data.user_id = users.id;

For EVERY row in the users table, MySQL will perform a lookup in the user_meta_data table to join users.id onto user_meta_data.user_id.

Without any indexes on user_meta_data (specifically without an index on user_id) MySQL would have to perform a table scan, looking at all 1000 rows for each lookup.

This would require up to (1000 * 1000) = 1 million comparisons, so it’s going to be very slow.

By adding an index on the user_meta_data table’s user_id column we can avoid these 1000 comparisons per row, allowing the index’s B-Tree to be used for each join lookup!
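A sketch of the fix (the index name is my own choice):

```sql
ALTER TABLE `user_meta_data` ADD INDEX `idx_user_id` (`user_id`);

-- EXPLAIN should now show a ref lookup via idx_user_id for
-- user_meta_data instead of type: ALL (a full table scan):
EXPLAIN SELECT *
FROM users
JOIN user_meta_data ON user_meta_data.user_id = users.id;
```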

Deferred Join Optimization

Imagine a case where you have an index on (sex, rating) and you’re trying to run the following query:

SELECT sex, rating, age, height, <cols> FROM profiles WHERE sex = 'M' ORDER BY rating LIMIT 100000, 10;

(where this query is skipping the first 100,000 rows before returning just 10)

This will work as you might expect: it’ll use the index to scan over 100,010 records (no, it isn’t smart enough to simply hop to the offset), and because the index isn’t covering, it’ll be pulling the additional column data (e.g. age, height) using the PK B-Tree as we mentioned earlier. After pointlessly pulling all this data, 100,000 of the 100,010 records are thrown away, because we only wanted 10! Clearly, pulling the extra data for the 100,000 was unnecessary work.

A good strategy for optimizing such queries is to instead use a deferred join: using a covering index to retrieve just the primary key columns of the rows you’ll eventually need, thus avoiding the wasted overhead of pulling in those extra columns.

After you’ve retrieved these PK’s you can simply perform a join back onto the original table to retrieve all desired columns. This helps minimize the amount of work MySQL must do gathering data that it will only throw away later. Going back to our example:

SELECT sex, rating, age, height, <cols> FROM profiles INNER JOIN ( SELECT <primary key cols> FROM profiles WHERE sex = 'M' ORDER BY rating LIMIT 100000, 10 ) AS x USING(<primary key cols>);

The subquery only accesses the primary key columns, so no additional overhead occurs pulling in data not in the index; this is a covering index, and it’s going to be correspondingly quicker. Afterwards we’ll have the 10 primary key values we’re interested in. We then perform a join on these results, fetching the extra columns we wanted in the first place. In this case we only suffer the overhead of fetching the extra columns for 10 records, which is a lot quicker than doing it for all 100,010!

This is especially useful if a column you’re selecting is very large, e.g. TEXT/BLOB, because the overhead will be much more severe.

*A good mindset is to think in terms of data access. Typically, the fewer records you access, the less overhead you’ll incur, and thus the faster your response times will be.

So how can you check if your indexes are actually being used? This is usually easy: use the EXPLAIN statement! (INFORMATION_SCHEMA can also be useful, as mentioned later.)

The EXPLAIN Statement

The EXPLAIN statement, as it’s named, is super useful for explaining how your query executes, and therefore revealing why it could be slow. When you prepend EXPLAIN to your query, the MySQL optimizer will provide information about the statement execution plan:

  • The indexes it’s considering using.
  • The order in which it plans to join tables.
  • The indexes it actually used.
  • How many rows will be accessed.
  • Whether it used a filesort.
  • And so on…

EXPLAIN works with SELECT, DELETE, INSERT, REPLACE, and UPDATE statements.

You should use the data from the explain statement to drive your investigation. It can indicate which indexes are missing or not being used, as well as many other inefficiencies in your query/schema.

********************** 1. row **********************
id: 1
select_type: SIMPLE
table: categories
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 4
Extra:
1 row in set (0.00 sec)

I won’t go into detail about each of these columns, as that probably deserves its own doc; however, a few key ones to look at when evaluating your indexes are:

  • possible_keys: the indexes MySQL could use to find rows in the table; if this is NULL, no useful indexes were applicable.
  • key: the index MySQL actually used.
  • rows: the number of records examined to produce the output. This is especially relevant during joins.
  • key_len: the longest length of the key that was used (i.e. which parts of a composite index are being used); use this to tell how many of its columns were utilized.
  • ref: which columns or constants are compared to the index in order to select rows.

The only other EXPLAIN column I’ll mention is the ‘Extra’ column, which contains additional info:

What you would like to see in Extra:

  • Using index — MySQL was able to use a covering index
  • Distinct — MySQL stops searching after it finds the first matching row
  • Using index condition — index condition pushdown was applied

What you don’t want to see in Extra:

  • Using filesort — extra sorting was required, as discussed
  • Using temporary — a temp table was needed
  • Using join buffer — tables are processed in large batches of rows, instead of index lookups
  • Using where — after fetching rows from the storage engine, extra filtering happens for each row. This is OK, however, if only a very small number of rows are filtered.

*It should be clear by now that indexes need to be designed for the whole query, not just the WHERE clause.

More on Index Order

In a B-Tree the index is sorted first by the leftmost column, then by the next column, and so on, as we’ve already seen. This greatly affects how useful an index will be, and it’s one of the most important things to get correct.

How to choose a good order?

This depends on the query that will use the index — you need to choose an index that allows:

  • Your WHERE conditionals to effectively look-up data
  • Your rows to be sorted and grouped in a way beneficial to the query
  • Your JOINS to be efficient etc.

Another thing that may impact ordering choice is selectivity:

Index selectivity is the ratio of the number of distinct indexed values (the ​cardinality) to the total number of rows in the table (#T).

  • It ranges from 1/#T to 1. A unique index has a selectivity of 1, which is as good as it gets.
  • Using the phonebook example: an index of (first_name, last_name) might be less effective than (last_name, first_name), because first names are much less distinct than last names, meaning the first column narrows down fewer results.

Therefore, if order isn’t otherwise important between two columns, put the most selective one (the one that narrows down to the fewest records) first.
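You can estimate selectivity directly; a quick sketch against the phonebook example:

```sql
-- Selectivity = distinct values / total rows (closer to 1 is better):
SELECT COUNT(DISTINCT last_name) / COUNT(*)  AS last_name_selectivity,
       COUNT(DISTINCT first_name) / COUNT(*) AS first_name_selectivity
FROM phonebook;
```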

Range Queries

Similar to the leftmost-prefix rule we’ve just seen, the moment you use a range condition on a column in your index, that column is the last part of the index that can be used. This should make sense: suppose you issued a range query such as the following, with this index in place:

ADD INDEX (last_name, first_name, phone_number)

SELECT * FROM phonebook WHERE last_name LIKE 'f%' AND first_name = 'Ned';

This would utilize the first part (last_name) of our index, allowing us to quickly satisfy the range condition and find all rows with a last_name beginning with ‘f’; after this, however, there isn’t any way our B-Tree can be further utilized to quickly filter on first_name.

*If you’re utilizing an index for range queries, try to make sure the column you’re performing the range over is ordered last within the index.
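For the query above that means: since the range is on last_name and the equality is on first_name, an index ordered (first_name, last_name) serves it better. A sketch:

```sql
ALTER TABLE `phonebook` ADD INDEX (`first_name`, `last_name`);

-- The equality on first_name narrows the index slice first; the range
-- on last_name can then scan within that slice, using both columns:
SELECT * FROM phonebook
WHERE first_name = 'Ned' AND last_name LIKE 'f%';
```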

Similarly, you can’t fully use an index to perform range queries on two columns, for the reasons already mentioned.

Index Condition Pushdown

That being said, there is a technique called index condition pushdown (ICP) that can help here. I won’t go into much detail about it, as this doc is just meant to be a beginner’s guide to indexing.

Index condition pushdown essentially allows index conditions to be pushed down to the storage engine, so that it doesn’t have to return irrelevant rows that would only be filtered out later by MySQL.

This means in some cases we can still use an index past a range condition. It’s best to test and see the results for yourself, but regardless of the situation, you’ll be most effective if you’re able to leave range columns towards the end of your index. ICP will be most useful for when you have no choice.
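To see whether ICP applied, check EXPLAIN’s Extra column; the optimization can also be toggled for comparison (available in MySQL 5.6+):

```sql
-- With INDEX (last_name, first_name), the first_name condition can't be
-- used for B-Tree traversal past the range, but ICP lets the storage
-- engine filter on it before returning rows to the server:
EXPLAIN SELECT * FROM phonebook
WHERE last_name LIKE 'f%' AND first_name = 'Ned';
-- Extra: "Using index condition" indicates ICP kicked in.

SET optimizer_switch = 'index_condition_pushdown=off';  -- disable to compare
```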

Conclusions: In general, we can typically only use a leftmost prefix of the index!

  • You may need to have indexes on the same columns in different orders depending on your queries.
  • Try to use as many columns as possible up to the first range in the query; after a range, no other index column can be used, so put the column that is likely to be ranged right at the end.

What about prefix ranges?

SELECT … WHERE last_name LIKE '%mithers'

This isn’t a usable range; we obviously can’t use a B-Tree to traverse it. Imagine yourself at the root of a tree with such a query: do you go left or right? You don’t know! This is exactly why an index can’t support such a query.

Examples:

INDEX (a, b) VS INDEX (b, a)

  • WHERE a = 1 AND b > 3 *** First is better
  • WHERE b = 5 AND a > 7 *** Second is better
  • WHERE a > 1 AND b > 3 *** Each stops after 1st Column
  • WHERE b = 2 *** Second only
  • WHERE b > 2 *** Second only
  • WHERE a = 4 *** First only
  • WHERE a > 4 *** First only

INDEX (a, b, c)

  • WHERE a > 1 AND b = 3 AND c = 4 *** Uses only first part of index
  • WHERE a = 1 AND b > 3 AND c = 4 *** Uses first 2 parts of index
  • WHERE a = 1 AND b = 3 AND c = 4 *** Uses all of index

A useful trick: SKIP SCAN

Let’s say you’ve got a table of users with columns: sex, last_name, first_name, age, etc.

If you had an index on (sex, last_name, first_name) and you wanted to run the query:

SELECT first_name, last_name FROM users WHERE last_name = 'Nahasapeemapetilon' AND first_name = 'Apu';

As you know, your index won’t facilitate this. Imagine being at the root of the tree: you can either traverse left for male and continue using the index, or right for female, but not both.

One way you can change your query to utilize our index is by instead rewriting it as:

SELECT first_name, last_name FROM users WHERE sex IN ('male','female') AND last_name = 'Nahasapeemapetilon' AND first_name = 'Apu';

You can think of this as allowing MySQL to enumerate each value (in this case `sex`) separately, allowing us to dive into the rest of the index from there. Upon finishing enumeration of each value we can simply UNION the results.

In this case: first we use ‘M’ as the sex, so we’re able to utilize the whole of the remaining index for the query; afterwards we use ‘F’ as our starting point, and again we’re able to utilize the whole index to satisfy the query. Finally, we simply union the resulting sets of ‘M’ and ‘F’ to get all the rows we’re interested in.

This works well when the column being used in the IN has a relatively small range of values; however, its efficiency may vary. If in doubt, measure the response times of your index changes, and add new indexes when needed. Definitely don’t force yourself into using a single index that attempts to satisfy everything for the sake of it!

Ambiguous queries

Ambiguous queries will be slow because MySQL cannot utilize indexes to prevent table scans:

INDEX ON phonebook (town, first_name, last_name)

SELECT * FROM phonebook WHERE town <> 'MPK' AND first_name = 'Bart';

SELECT * FROM phonebook WHERE town NOT IN ('MPK', 'SEA') AND first_name = 'Bart';

MySQL won’t be able to use the index for these queries; however, the optimizer could easily work with this instead:

SELECT * FROM phonebook WHERE town IN ('SPRINGFIELD', 'SEA', 'NYC') AND first_name = 'Bart';

In general all of the below could leave an index unusable:

  • !=
  • <>
  • NOT LIKE, NOT IN…
  • NOT EXISTS ( SELECT * … ) — essentially a LEFT JOIN, often efficient
  • NOT (expression)

Don’t use functions in your queries

MySQL generally can’t use indexes on columns unless the columns are isolated in the query. So don’t wrap columns in functions or expressions, e.g.:

The expression on the left-hand side of an operator should be a bare column, i.e. <column> <operator> <value>.

The second you write func(column) <operator> <value>, the index can’t be used and a full table scan will occur.

Examples:

  • WHERE id + 3 = 4; *** BAD
  • WHERE start_date + INTERVAL 1 YEAR > NOW() *** BAD
  • WHERE YEAR(start_date) = 2015 AND MONTH(start_date) = 1 *** BAD
  • WHERE number + 0 = 5; *** BAD
  • WHERE func(number) = n; *** BAD
  • WHERE number = 5+4; *** GOOD
  • WHERE number = func(n); *** GOOD
  • WHERE start_date > NOW() - INTERVAL 1 YEAR *** GOOD
  • WHERE start_date BETWEEN '2015-01-01' AND '2015-01-31' *** GOOD

Redundant indexes

Over-indexing can hurt performance: the main drawback of having too many indexes is the maintenance cost.

Adding new indexes might have a performance impact for INSERT, UPDATE, and DELETE operations, especially if a new index causes you to hit memory limits.

Every time you perform a write on a table, the indexes will need to be maintained. Furthermore when you run a query, each index must be considered by the MySQL Optimizer.

  • If there is an index on (A, B), adding another index (A) would be redundant because it is a prefix of the first index. That is, the index on (A, B) can already be used as an index for column A alone.
  • If there is an index on (A, PK_ID): as you already know, the PK_ID column is automatically included in every secondary index when using InnoDB, so listing it is redundant. Luckily MySQL won’t store it twice, so you’re safe to do it; you just don’t need to.

The only time we want a redundant index is when extending the existing index would make it much larger and thus reduce performance!

You can easily see which indexes are redundant, especially those that have never been used, by querying the INFORMATION_SCHEMA database.
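For example, on MySQL 5.7+ the sys schema exposes ready-made views for this (availability depends on performance_schema being enabled):

```sql
-- Indexes with no reads since server start (excludes PKs and unique indexes):
SELECT * FROM sys.schema_unused_indexes;

-- Candidate redundant/duplicate indexes:
SELECT * FROM sys.schema_redundant_indexes;
```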

That being said, don’t be afraid to add indexes that will actually be used! In a read-heavy application the costs will be negligible. It’s best to test whether an index is beneficial for yourself!

Index Merge

When I said MySQL only uses one index per query, per table, most of the time this is true! However, sometimes MySQL does in fact have the capability to use multiple single-column indexes.

For example: it could use several indexes to fetch primary key values, and then perform a union or intersection depending on the query. These are useful in situations where you cannot form a suitable multicolumn index e.g. in the case of several ‘OR’ conditionals in your query.

However, it is rare that these are actually used, and if you are able to form a suitable multicolumn index then you should, because it will typically outperform an index merge:

https://www.percona.com/blog/2009/09/19/multi-column-indexes-vs-index-merge/

Other use cases: Constraint Indexes

DO NOT ENFORCE UNIQUENESS AT APPLICATION LEVEL

Use unique indexes when you need something to be unique; doing this in application code is always going to cause issues further down the line:

  • Hidden code paths bypassing validations, especially with FBcode, WWW etc. developers need to be aware to run these checks before mutating the database.
  • People manipulating the DB directly.
  • Application code running in parallel in different transactions may have inconsistent views of the database, thus making enforcing constraints very difficult.
  • And so on..
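A minimal sketch, assuming a hypothetical users table with an email column that must be unique:

```sql
ALTER TABLE `users` ADD UNIQUE INDEX `uniq_email` (`email`);

-- Concurrent inserts of the same email now fail atomically with
-- ER_DUP_ENTRY instead of racing past application-level checks.
```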

Allowing Soft Deletes

If you need to support soft deletes you can achieve this by either:

  • Prefixing a NULLABLE `active` tinyint(1) (default 1) column to your unique index.
  • Adding a `deleted_timestamp` (default 0) within the unique index.
  • Adding a UUID deletion token (default some constant) within the unique index.
  • Creating a separate ‘deleted_’ table and moving records there when they are deleted.

Trade-offs

  • The advantage of the `active` column is that it’s very simply to configure. Note: Due to the way InnoDB works it won’t treat NULLS as duplicates, which is exactly why this method works. Non deleted records will have a value e.g. ‘1’ that prevent duplicates. Deleted records however will have a value of NULL allowing several ‘deleted’ duplicates.
  • The advantage of the delete_time_stamp / UUID token is that they don’t need to be prefixes, we can put them at the end of the index ordering. This could provide better performance when searching because our index columns will be initially more selective than the `active` column, allowing potentially faster inserts (each time we try and insert, this index will need to be traversed to check for a duplicate).
  • The downside to the UUID token is having to generate it, and due to it’s random distribution it could cause fragmentation overheads later on; furthermore its size may cause overheads.
  • The deleted_time_stamp provides good performance with little effort.
  • Creating a separate table provides great ‘read’ performance: reducing table sizes, allowing you to use a normal unique index, and allowing everyday searches to scan less records. The downside being you have to actually move these records between tables.
  • My recommendation is to use the `deleted_timestamp` approach; if you find a performance need later on, you can move to a separate ‘deleted_’ table.
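The `deleted_timestamp` approach might look like this in practice (the `users` table, `email` column, and index names here are hypothetical, chosen purely for illustration). Live rows all share `deleted_timestamp = 0`, so a second live row with the same email violates the unique index, while deleted rows carry their deletion time and never collide:

```sql
CREATE TABLE users (
  id                BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  email             VARCHAR(255) NOT NULL,
  deleted_timestamp INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (id),
  -- Uniqueness is only enforced among rows with the same deleted_timestamp,
  -- i.e. effectively among the live (deleted_timestamp = 0) rows.
  UNIQUE KEY uniq_email_deleted (email, deleted_timestamp)
);

-- "Soft delete" a row by stamping it with the deletion time:
UPDATE users SET deleted_timestamp = UNIX_TIMESTAMP() WHERE id = 42;
```

Note the timestamp column sits at the end of the index, keeping the more selective `email` column in the lead position as discussed above.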

So what columns should I index?

You should realize from all that we’ve discussed that it really depends…

  • What columns you’re going to query
  • What JOINs you’ll perform
  • What ORDER/GROUP BYs etc.
  • Find your slow queries (from slow query log) if you don’t have any in mind, and see what indexes might speed them up! Use EXPLAIN to see what indexes it’s currently using. Then use the lessons you have learnt to apply better indexing. Use EXPLAIN again to ensure your new index was effective.
  • Unsure about which column order? Use the most selective first, unless you need a range or ordering; those should typically go at the end.
  • Don’t index a column with low selectivity, e.g. `gender`, on its own. If WHERE gender = ‘F’ matches more than ~20% of rows, the index might not be that effective (the optimizer may favor a table scan instead). In these cases, run tests on response times to guide your decisions.
  • Query the INFORMATION_SCHEMA database to see stats on your indexes and tables; these can really help guide performance tuning!
  • Use the unused index script linked earlier to find index failures
  • Index design is not separate from table design. They go hand in hand, and you must design everything to suit your queries.
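As a starting point for that INFORMATION_SCHEMA digging, these two queries (substitute your own schema name for the placeholder `'your_database'`) show table and index sizes, plus the per-column layout and cardinality estimates of every index:

```sql
-- Approximate row counts and data/index sizes per table.
-- Note: for InnoDB these figures come from sampled statistics, not exact counts.
SELECT table_name,
       table_rows,
       ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
       ROUND(index_length / 1024 / 1024, 1) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = 'your_database';

-- Column-by-column index layout and cardinality (distinct-value) estimates:
SELECT table_name, index_name, seq_in_index, column_name, cardinality
FROM information_schema.STATISTICS
WHERE table_schema = 'your_database'
ORDER BY table_name, index_name, seq_in_index;
```

A cardinality close to the table’s row count suggests a highly selective column; a very low one is a hint the index may rarely be worth using.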

Star system

A good way to measure the quality of an index is by using the star system, the goal being to achieve all three stars:

  1. The index earns one star if it places relevant rows adjacent to each other, or at least as close to each other as possible. This minimizes the thickness of the index slice that must be scanned. The goal is to narrow down the number of rows that need to be scanned as much as possible. To gain this star, you could pick the columns from all equality predicates in the WHERE clause. Many people think the purpose of an index is to find individual rows, but finding single rows leads to random disk operations (which is slow). It’s much better to find groups of rows, all or most of which are interesting, than to find them one at a time. A good index therefore groups keys together in a way that achieves this.
  2. A second star is earned if the index sorts the rows in the order the query needs, avoiding additional file sorts.
  3. And a final star is earned if the index covers all the columns needed for the query, aka it’s a covering index. All the columns in the SELECT appear in the index.

Of course depending on your query it might not be possible to achieve all three!
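To make the three stars concrete, here’s a sketch against a hypothetical `orders` table (all table, column, and index names are invented for illustration):

```sql
-- The query we want to serve:
SELECT order_id, total
FROM orders
WHERE customer_id = 42      -- equality predicate  -> star 1
  AND status = 'SHIPPED'    -- equality predicate  -> star 1
ORDER BY created_at;        -- sort order          -> star 2

-- A three-star index for it: the equality columns lead (grouping the
-- relevant rows into one thin slice), the ORDER BY column follows (rows
-- come back pre-sorted, no filesort), and the selected columns are
-- appended so the index covers the query (star 3):
ALTER TABLE orders
  ADD INDEX idx_three_star (customer_id, status, created_at, order_id, total);
```

If the query used a range predicate (say `created_at > ?`) instead of an ORDER BY, `created_at` would still belong after the equality columns, but the sort star would no longer be free.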

Read more here: Relational Database Index Design and the Optimizers — Lahdenmaki, Leach

Other considerations

  • When rows get into the millions you must consider data types & indexes much closer. A lighter weight data type can save a lot of overhead for a large table! Especially if it allows more of it to be stored in memory.
  • When rows get into billions, summary tables (e.g. storing counts etc.) can be very effective.
  • Deleted flags etc. with low cardinality may perform badly or just not be used at all. A potential workaround could be to move deleted rows into another table.
  • If you need to index a large string column, consider indexing only a prefix or hash of that column, to avoid overly large indexes.
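MySQL supports such prefix indexes directly; here’s a sketch on a hypothetical `pages` table with a long `url` column, including a quick way to check how much selectivity a given prefix length retains:

```sql
-- Index only the first 20 characters of the column instead of the whole string:
ALTER TABLE pages ADD INDEX idx_url_prefix (url(20));

-- Ratio of distinct prefixes to distinct full values; the closer to 1.0,
-- the less selectivity the prefix sacrifices:
SELECT COUNT(DISTINCT LEFT(url, 20)) / COUNT(DISTINCT url) AS prefix_selectivity
FROM pages;
```

Try a few prefix lengths and pick the shortest one whose ratio is acceptably close to 1.0; note a prefix index cannot act as a covering index, since the full value isn’t stored in it.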

If you’re using an ORM

Look at the queries it generates before assuming what kind of indexes may help; they may surprise you and leave your index redundant.

As always the downside to an ORM is this abstraction from MySQL which can cause a slew of performance related problems, especially when things need to scale.

How will MySQL choose an index?

As said before, the MySQL optimizer evaluates between a set of indexes for each query to determine which one will be most useful based on a cost metric: this typically being how much data the query will access.

For single table queries, the set of indexes evaluated could be any index containing columns that appear in the WHERE clause. For joins on multiple tables, the MySQL optimizer will try to figure out which table can be narrowed down the most by the passed-in predicates (just like in the single table query), and then calculate how many rows the whole join would have to scan based on table statistics (e.g. table size/indexes). When no obviously best query plan exists, a change in these statistics may result in ambiguous plan choices.

You can see the indexes your query considers using EXPLAIN as mentioned earlier.
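Using EXPLAIN is as simple as prefixing the query (table and column names here are again hypothetical):

```sql
EXPLAIN
SELECT order_id, total
FROM orders
WHERE customer_id = 42
  AND status = 'SHIPPED';
```

In the output, inspect `possible_keys` for the indexes the optimizer considered, `key` for the one it chose, `rows` for the estimated number of rows to be examined, and `Extra` for notes like “Using index” (covering index) or “Using filesort”.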

Testing an index

  • When testing out your indexes, make sure you have copied a reasonable sample of production data to run tests on. Testing on fake data with distributions different from production will not reflect real performance.
  • Test your index with all the reasonable arguments you’re trying to capture, different query conditions can use very different indexes.
  • Be aware that there are various cases where the engine may do something better than expected.
  • Sometimes you may not be able to find a perfect index for your query, in which case you could consider rewriting your query.

Finally, as always — never take something for granted based on your assumptions. Always test out a new index, or test out changing a query to improve performance. Use actual measured response times to guide your decisions!

There are also many other factors that contribute to performance: e.g. how you write your queries, design your schemas etc; However these topics are better left for another article.

Thanks for reading!
