Guest post by Michael Herman, author of Real Python for the Web, developer at gun.io.
After writing, Learn SQL, dammit, I worked with a company on optimizing their database, primarily by adding indexes. Now, I must admit, indexing is not my fortay. In fact, I generally shy from adding indexes unless certain queries are causing a significant slowdown in overall performance and all other options have been exhausted (which I will touch on further down). This is a common practice.
Although I would caution against this, it's also quite common to add indexes preimptively if the data volume of a certain column or table will increase significantly.
Indexing must be handled carefully. I've seen developers and database admins add indexes without giving it much thought, which can easily slow down a database even further -- especially if the database has too many indexes.
Let's look at some conventions/best practices that I follow when addressing database indexing as well as a quick example.
Please Note: If you have little SQL and/or relational database experience, it's very difficult to determine if and where indexes should and shouldn't be added. In fact, if your only database experience comes from working through an ORM, then this article may be of little use to you. If that's the case, start with my other article and then take the time to learn how relational algebra and database engines in general work.
If your application is slow, and you suspect your database is the culprit, start with asking yourself the following questions:
Is your database normalized?
Have you tried cacheing?
Are there any current indexes in place? Make sure indexes are created for ALL primary and foreign keys. Most database engines will automatically create the former, while most ORMs (including Django ORM) will handle the latter.
Have you done any performance tests? With Django, for example, you can use the django-debug-toolbar to find slow queries. (New Relic is awesome too.) When testing, start without indexes to find the bottlenecks. Do not assume where the query slowdowns are or will occur. That said, do pay close attention to any queries that use WHERE, ORDER BY, FILTER BY, and any JOINs.
Have you used raw SQL? Before assigning indexes, ALWAYS rewrite your code to use raw SQL. Raw SQL queries are generally faster than using an ORM as they happen at a lower level. Complicated queries should almost always be written in straight SQL. For example:
SELECT d.Name AS 'Department' FROM departments d JOIN employees e ON e.department_id = d.Department_id GROUP BY d.department_id HAVING COUNT(e.employee_id) < 3
If you do decide to add indexes follow these steps:
Capture and record stats from your performance tests based on your production database.
Alter the data model in development.
Do not waste valuable resources by adding indexes to tables with a small amount of data.
When adding indexes, take cardinality into account.
Test. Compare the changes with the benchmarks. Be aware, though, that your application will perfrom much different in development than production.
Refactor (if necessary).
Continue to pull stats and conduct performance testing, comparing the results with the original results before indexes were added.
Going back to the work I did, let's look at one simple, yet often overlooked example:
class ContactForm(models.Model): name = models.CharField(max_length=150) email = models.EmailField(max_length=250) topic = models.CharField(max_length=200) message = models.CharField(max_length=1000) timestamp = models.DateTimeField(auto_now_add=True, default=datetime.datetime.now)
This simple table is used for storing the results of a contact form. At first glance it may look simple, but you need to pay attention to the queries being conducted.
In this case, one query sorted the timestamp by date and then aggregated both the name and email fields. Because there are over 10,000 rows of data, the sorting itself caused a significant bottleneck.
So, by altering the model, with South, I was able to quickly speed up performance:
class ContactForm(models.Model): name = models.CharField(db_index=True,max_length=150) email = models.EmailField(max_length=250) topic = models.CharField(max_length=200) message = models.CharField(max_length=1000) timestamp = models.DateTimeField(db_index=True,auto_now_add=True, default=datetime.datetime.now)
- start with no indexes,
- try to reduce bottlenecks using cacheing and/or adding raw SQL,
- think balance: indexes could slow down database writing (INSERT, UPDATE, or DELETE) , and
- stay away from indexing full text fields (look at the first few characters instead)
Did I miss something? How do you approach indexing?