The Django ORM is a very powerful tool, and one of the great attractions of Django. It makes writing simple queries trivial, and does a great job of abstracting away the database layer in your application. And sometimes, you shouldn't use it.
In this talk, given at a recent meeting of the San Francisco Django Meetup Group, Christophe Pettus, Chief Consultant and CEO of PostgreSQL Experts Inc., discusses situations where dropping down to direct SQL can yield real efficiencies. Even if you know only a little SQL, this can be a powerful technique for greatly improving your application’s performance without sacrificing maintainability or code readability.
Enjoy the video!
Video Transcript
I have an odd but quick question. Who here was born after 1992? Show of hands. God, I feel old. Okay. You are all younger than the first ORM I ever wrote: in early 1992, I wrote an ORM in C++. ORMs are very, very useful and they solve a very real problem, which is: I just want to store this object on disk, is that too much to ask? This is a really big, common problem in programming, and no one likes writing this code. This is awful code. It’s also in C++; I just scared everyone by dragging this out of the crypt. You create a cursor, execute a SELECT, fetch the results, and painfully copy every field into a data structure. No one likes writing that stuff. You want to write this: you fetch an object, you call a method on it, you call save() on it, and you’re done.
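Something like this, in Django terms; a minimal sketch, where the Order model and its mark_shipped() method are hypothetical stand-ins:

```python
# Minimal sketch of the ORM version; Order and mark_shipped() are
# hypothetical stand-ins.
from myapp.models import Order

order = Order.objects.get(pk=42)  # fetch an object
order.mark_shipped()              # call a method on it
order.save()                      # save it, and you're done
```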
So, I’m a PostgreSQL guy, and so I have an inherent bias about which database engine I like, of course. But I want to talk a little bit about database agnosticism here, because some of the things I’m going to talk about are not database agnostic. Database agnosticism comes up often as one of the main reasons to use an ORM. Say, we may be a $5B company with 8.7 million users running constantly, and we want to switch our engine out from under us, from PostgreSQL to MySQL, because that happens all the time. And my response to this is: it’s as if you want to paint your house, and you go into the paint store and they say, “So, which color would you like?” And you say, “Well, [inaudible 0:01:53]. I want all of them. I want to remain color agnostic.”
So if you’re building tooling that is intended to run on an arbitrary installation, like a reusable Django app or a package, then sure, you need it to run on any database engine, and that’s fine. But if you’re building a real-life application, just settle on your database engine and take advantage of it. Because database agnosticism is a lie. So, that’s out of the way.
I want to talk about two different kinds of ORM avoidance here. One is avoiding Django’s query language: you’re still using model objects, but you’re avoiding writing your queries in Django’s query syntax. The other is avoiding the model framework entirely. So, avoiding the query language. Reasons to do this: queries that generate sub-optimal SQL; queries that are just more easily or more efficiently expressed in SQL; or you want to take advantage of SQL features that are not available in the query language.
Django, as of 2.1, is a lot better about generating SQL than it used to be. There have been a lot of improvements, but you can still get some pretty bad SQL out of it, like very deep joins. This is a particularly bad one: using the __in operation with a literal list of, say, five thousand elements. That is not an optimal construct. Please don’t do that. As it happens, one hundred is in fact the magic number in PostgreSQL. There’s a hard-wired constant in PostgreSQL for how it does query optimization, and if you use __in with a list of one hundred or more elements, the query will not be optimized as well as it would be with fewer. So now you know.
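For concreteness, a sketch of the anti-pattern and one common alternative; the model names here are hypothetical, and the subquery form only applies when the IDs already live in a table:

```python
# Anti-pattern: a five-thousand-element literal list in an __in filter
# (Order and OrderLine are hypothetical models).
bad = Order.objects.filter(id__in=list(range(5000)))

# Often better, when the IDs already live in the database: pass a
# queryset instead, which Django renders as a subquery, keeping the
# giant literal list out of the SQL entirely.
good = Order.objects.filter(id__in=OrderLine.objects.values("order_id"))
```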
Big filter/exclude chains can generate some pretty gnarly SQL. So here’s an example. This is the world’s simplest ecommerce application: we have an Item, an Order, and an OrderLine, where the OrderLine is one-to-many from Order, giving us effectively many-to-many from orders to items. You get it. And the query we want is: find every order where they ordered two or more of something, and there’s nothing discontinued in that order. And what you get is this, which is really not great.
First it does a join on OrderLine, which, of course, is fine. And then it’s WHERE the quantity is greater than one. And then it veers off into the weeds and, with a loud crash, generates this NOT IN sub-select. What is going on here?
Now, I’m cheating a little bit, because I could have, for starters, put the not-equal condition directly into the filter, and the query we would have come up with would be much better. But when you chain together a lot of these things, it’s not hard to start generating SQL that is not the way I would have written it, let’s put it that way. I can see why it came up with that, because the query generator in Django really doesn’t do a separate optimization [inaudible 0:05:37]. It doesn’t take the whole thing and ask what the user is conceptually trying to do; it just stacks all this stuff up. A lot of these queries are just better expressed in SQL. These filter/exclude/filter chains are just not great the way Django generates them, and the Python code looks pretty bad too, once you have filter this, exclude this, this, this, this. At some point you have to say maybe it’s just easier to write this thing in SQL. You can often express it much more concisely that way.
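Here’s a sketch of both versions, assuming Order/OrderLine/Item models; the table names and the “lines” related_name are hypothetical:

```python
# The ORM version: a filter/exclude chain ("lines" is a hypothetical
# related_name from Order to OrderLine).
orders = (
    Order.objects
    .filter(lines__quantity__gte=2)
    .exclude(lines__item__discontinued=True)
)

# The same intent in hand-written SQL via raw(); table names are
# hypothetical stand-ins for Django's defaults.
orders = Order.objects.raw("""
    SELECT DISTINCT o.*
      FROM store_order o
      JOIN store_orderline ol ON ol.order_id = o.id
     WHERE ol.quantity >= 2
       AND NOT EXISTS (
             SELECT 1
               FROM store_orderline ol2
               JOIN store_item i ON i.id = ol2.item_id
              WHERE ol2.order_id = o.id AND i.discontinued
           )
""")
```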
Another reason to do this is that you want to take advantage of SQL features that simply aren’t available in the query language, like Common Table Expressions. For example, if you have a tree structure, you can handle it with a single query where you might otherwise have to iterate in Python. Or SQL functions and operators: PostgreSQL has a lot of operators, time and date functions, JSON path operators, all sorts of cool stuff, and there’s currently no way of expressing them directly in the Django query language. When you’re doing this, raw() is your friend. The addition of raw() to the model framework was one of the best things Django ever did, because you can write your custom SQL and still get model objects back, so you don’t have to write that horrible copy-in, copy-out code.
Raw() query sets have some limitations; for example, you can’t generate a raw() query set and then tack a filter onto it or anything like that. But one of the underused parts of it is that you can return values from these queries that aren’t part of the model definition. Here’s a super easy example: raw("SELECT *, now() FROM order"), and we’ve tacked a now() field onto the model that wasn’t there in the model definition. You can think of a million ways this could be horribly abused, but it’s very handy to have when you want it.
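A sketch of that; the table name store_order is a hypothetical stand-in, since “order” itself is a reserved word in SQL:

```python
# raw() can hand back columns that aren't in the model definition; the
# extra column shows up as an attribute on each returned instance.
for order in Order.objects.raw("SELECT *, now() FROM store_order"):
    print(order.id, order.now)  # "now" was computed by the database
```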
So let’s think about a tree structure (a nice tie-in to the GraphQL talk). We have the world’s simplest graph structure: the only thing it does is have a parent. So it’s a node that has a parent node, plus the implicit ID that Django creates, and all that. What we want to do is write a query that, given an ID, returns that node and all of the nodes that are descendants of that node. And I want a number for the generation, where the root node is generation zero, counting up as you go down the generations.
Now, you could do this by iterating, or writing a little recursive function, or something like that, but every single one of those calls is a round trip to the database, and that gets really, really expensive. Or you could do this; nobody panic, it’s your friend. You can do it with a single query. This WITH RECURSIVE construct has two parts, separated by a UNION. It does the first part first: it just selects the starting item, with zero as the generation, where id = 1. This is called a recursive common table expression; it’s really not recursive, it’s iterative. It then repeatedly evaluates the second part with the results of the previous time around, adding to what came before, so it shifts over and adds each generation in turn. The matching name in there refers to the previous iteration. I don’t expect anyone to suddenly know how recursive CTEs work just from that description, of course, but it does work. And the final SELECT * says: get me everything. You could also tack an ORDER BY generation on there so the rows come back sorted in generation order. So that’s pretty nice. This is a single query, one round trip to the database, so all the evaluation happens inside the database engine. Nice and efficient.
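A sketch of that query through raw(), assuming a Node model with a self-referencing parent field whose table is named node:

```python
# One round trip: WITH RECURSIVE does all the walking inside the engine.
nodes = Node.objects.raw("""
    WITH RECURSIVE tree AS (
        -- first part: the starting node, at generation zero
        SELECT node.*, 0 AS generation
          FROM node
         WHERE id = %s
        UNION ALL
        -- second part: re-run against the previous iteration's rows
        SELECT node.*, tree.generation + 1
          FROM node
          JOIN tree ON node.parent_id = tree.id
    )
    SELECT * FROM tree ORDER BY generation
""", [1])

for node in nodes:
    print(node.id, node.generation)  # generation is an extra attribute
```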
The other kind of ORM avoidance is maybe you don’t want to use the model framework at all. Because does everything have to be a model? The model framework is really powerful; it does all sorts of introspection magic and all sorts of complicated stuff, and that costs time. It’s quite expensive to instantiate a model object. And data coming back from the database, if you’re using PostgreSQL through the psycopg2 database interface, has to be converted into Python objects twice. First, the raw data coming back from the database becomes Python objects: Python strings, Python ints, Python this, Python that. Then all that stuff is stuffed into model objects. So this is not cheap, and there are times you just don’t want to do it, because what you’re doing doesn’t really require instantiating a bunch of model objects.
So here’s an example; I cut and pasted this from the lab. Revenue, trailing revenue, dates. What the graphing package in the browser got back was basically a JSON structure with two long lists in it (well, three): one for the dates, and one for each revenue series. It didn’t get back model objects, it didn’t get back any of this stuff, it just got back three arrays. You don’t want to create this by writing a for loop that queries for each date, gets a model object back, pulls the results out, and then this and then that. You don’t want to do that. Databases are really good at returning three arrays, all lined up, so just do that.
If what you’re doing is running a query, turning it into a JSON object, and feeding it back up to the browser, just return the JSON object directly: get the raw data out of the query, turn it into a JSON structure, and push it back. You don’t have to expensively create models in this case. One thing you can do is create the JSON directly in the database. PostgreSQL has really powerful JSON features, so you can have a single query (which I should have written here, but it’s really easy to write) that creates the JSON structure and hands it back to the application, and the application just hands it back to the browser in one step.
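Here’s the kind of query he means, as a sketch; the daily_revenue table and its columns are hypothetical stand-ins for the revenue graph:

```python
from django.db import connection

# One query: PostgreSQL assembles the three parallel arrays itself.
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT json_build_object(
            'dates',    array_agg(day ORDER BY day),
            'revenue',  array_agg(revenue ORDER BY day),
            'trailing', array_agg(trailing_revenue ORDER BY day)
        )
        FROM daily_revenue
    """)
    payload = cursor.fetchone()[0]
```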
So you’ll say, this is a great idea, Christophe, I’m just going to run off and do this. But wait, stop, before you do. If you return a JSON structure to your Python application, what’s going to happen is psycopg2 is going to see the JSON value and say, “Oh, a JSON object, great! I’m here for you. I’ve got you covered. Nobody panic.” It is going to turn it into Python lists for you, into a Python data structure, which you then have to turn around and feed back into a JSON encoder to turn back into a textual JSON structure. To avoid all that, return it as a text blob: just cast the JSON structure to text and feed the text blob back up. I mean, JSON is really text; there’s nothing magic about it, it’s just a big text string. That way you save that horrible pack-and-unpack step.
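Putting it together, a sketch of a view that does the cast, using the same hypothetical table as above:

```python
from django.db import connection
from django.http import HttpResponse

def revenue_json(request):
    # The ::text cast makes psycopg2 hand back a plain string, skipping
    # the decode-to-Python / re-encode-to-JSON round trip.
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT json_build_object(
                'dates',   array_agg(day ORDER BY day),
                'revenue', array_agg(revenue ORDER BY day)
            )::text
            FROM daily_revenue
        """)
        body = cursor.fetchone()[0]  # already textual JSON
    return HttpResponse(body, content_type="application/json")
```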
Another reason is you need to interface with non-model tables. For example, a very frequent thing is a table that’s a log: user activity, clicks, all that kind of stuff. It doesn’t need a primary key; in fact, you may not want it to have a primary key. If you just arbitrarily create a Django model off of this, it’s going to create an auto-incrementing integer primary key, which is promptly going to overflow, because it’s only 32 bits, and 32 bits isn’t that many clicks these days. So you may want to create a table that doesn’t have a standard primary key and insert into it. Don’t bother with a model there; just do an INSERT statement.
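A sketch of what that looks like; the clicks table and its columns are hypothetical, and note there’s no model and no primary key involved:

```python
from django.db import connection

def log_click(user_id, url):
    # Plain INSERT into a model-less log table; the parameters are
    # passed separately, so they're treated strictly as values.
    with connection.cursor() as cursor:
        cursor.execute(
            "INSERT INTO clicks (user_id, url, clicked_at) "
            "VALUES (%s, %s, now())",
            [user_id, url],
        )
```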
And of course, you frequently enough have legacy or external tables with non-Django-able primary keys. Sometimes these have multi-column primary keys, which Django has at best limited support for; or the table doesn’t have a primary key at all; or it has a type that doesn’t have any easy conversion into Python.
That being said, there are plenty of times you do want to use the ORM, and I want to take a moment here in praise of the ORM, because the ORM is just right for some things. Basic CRUD operations: no reason not to use the ORM, it’s really good at that. So, your basic web form that accepts a bunch of data, munges it, and saves it as a row: use the ORM. Don’t write a bunch of INSERT and DELETE statements unless you’re amusing yourself, because the ORM is really good for that.
One place this comes up very frequently: when you’re writing an application, you often have some kind of search functionality that you present to the user, and it’s going to search on name, search on title, search on this, search on that. Building SQL queries textually is horrible; don’t do that. It’s much, much better to start with a query set and tack on filter, exclude, filter, exclude, because you can do that programmatically much more easily, and then let Django generate the SQL for you, as in the sketch below. You may want to look at the SQL it’s generating: tail the log and see what’s coming out, because sometimes it will be that kind of non-optimal SQL. But for interfaces where you’re building queries programmatically, the ORM is great. So use it.
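A sketch of that incremental style, with hypothetical field names:

```python
def search_orders(name=None, title=None, include_discontinued=True):
    # Start broad, then tack on conditions only if the user supplied
    # them; Django generates the SQL once the queryset is evaluated.
    qs = Order.objects.all()
    if name:
        qs = qs.filter(customer_name__icontains=name)
    if title:
        qs = qs.filter(lines__item__title__icontains=title)
    if not include_discontinued:
        qs = qs.exclude(lines__item__discontinued=True)
    return qs
```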
So, quickly summarizing: don’t be afraid to step outside the ORM. SQL isn’t a bug, it’s a feature. If you’re using a SQL database, you should learn SQL. I realize there’s a kind of [inaudible 0:16:07] reaction some people have, because it’s like dad’s technology; it’s like corsets and [inaudible 0:16:12] tapes. But it is still what runs your application. It is a fundamental part of your data chain, so get familiar with it; you can get a lot of good stuff out of it. Do use the ORM for operations where it’s easier, like those incrementally built queries. But don’t hesitate to step outside of it, because there’s a lot of cool stuff in PostgreSQL that you can only get at through raw queries or by using execute directly. And that’s my talk. Thank you.
That’s us. And if you want to learn more about this wonderful SQL thing, come to the meetup. Okay, questions. Yes, sir, way in the back.
AM: [Inaudible 0:16:56]
Christophe: You can, yeah. Q expressions tend to not be as bad, because you’re a little closer to the underlying predicates with a Q expression; they tend to map into predicates more directly than filter and exclude do. But always look at the SQL that’s coming out. Occasionally I’ve written a Q expression where I thought, how could it possibly misinterpret this? I’m basically telling it what to put in the WHERE clause. And then I’ll look at it and it’s come up with something wacky, like one of those IN sub-select things where there’s really no need for one, or stuff like that. Sometimes you can also get Q expressions that just don’t work properly, like they return the same object multiple times, because the join doesn’t come up right. I mean, it’s debatable; I’m not sure I would say it’s a bug. It’s just not obvious that that’s the way Q expressions work. Yes, sir.
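For reference, a small sketch of the kind of Q expression he means (the fields are hypothetical), plus how to inspect the generated SQL:

```python
from django.db.models import Q

# Q objects compose with & (AND), | (OR), and ~ (NOT), mapping fairly
# directly onto WHERE-clause predicates.
orders = Order.objects.filter(
    Q(status="open") | (Q(status="held") & ~Q(priority=0))
)
print(orders.query)  # always check what SQL actually comes out
```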
AM: [Inaudible 0:18:01]
Christophe: Well, there are two schools of thought that I run into: the manager school and the model-method school. It’s really a matter of taste; just pick one or the other and do it consistently. Some people like to encapsulate this in managers. I personally like to encapsulate it in model methods, because it’s more closely associated with the model. That being said, by putting it in a model method, you’re implicitly saying it has the most to do with the model it’s in, as a cognitive-load thing. So there are times you want to move it out into a utility function, if it doesn’t map closely to a specific model, especially for one of these [inaudible 0:19:00] SQL things. But I personally like putting them in model methods. It’s also not the most evil thing in the world to put it in a view function. For example, that graph thing I showed: there’s no place else in the code besides that one page that needs to draw that. Sure, I could roll it into a function, but at that point I’m basically introducing a function call just to prove a point. In that case I don’t really think I’m being evil by having it execute directly in the view function.
AM: [Inaudible 0:19:37]
Christophe: Basically, that second query runs and runs and runs. You can think of it like a long join chain; that’s an easy way to think of it. It runs the query, then takes the results of that query and applies them to the next iteration, then takes the results of that iteration and applies them to the next one. Eventually, one of those iterations comes up with no results, and that’s how it knows to terminate. It’s perfectly possible to write an infinitely recursive CTE, and you’ll know, because it never terminates. The database doesn’t catch it, because catching that would be a formula-solver-level kind of thing.
AM: [Inaudible 0:20:20]
Christophe: In the sense of… Let me see if I understand the question. The question is: how do you keep your raw SQL up to date when you change the database schema? If you’re only using model objects, the query language kind of takes care of that through migrations; if you rename a field, you’ll [inaudible 0:21:00] get a Python-level error rather than a database-level error. With raw SQL, you’re pushing the problem into the database, and you’ll only hear about it when you run the query and the database complains; [inaudible 0:21:12] will not suddenly turn something red and say, by the way, you’re using a field that you renamed here. That is an issue. How bad an issue it is depends on how often you rename fields. This is another good reason to make sure that all the code that directly munges a particular model lives inside that model; at least it’s close, and there’s only one place you have to look. If the code that touches a particular model using raw SQL is distributed throughout the application, then yeah, you’re on a search-and-replace hunt looking for changes. Does that answer your question?
AM: [Inaudible 0:21:28]
Christophe: Well, it’s just code; you can do whatever you want. I prefer to put it in the model it’s most closely associated with, so that you know where to look for it. That does create a problem, of course, since most of these queries have joins: which model are we talking about here? I generally put it in the model that the query returns. You can create a situation where, if you rename or change something and that causes the query to break, you won’t hear about it until test time. But the answer to that is: that’s what continuous integration is for. Yes, sir.
AM: [Inaudible 0:22:30]
Christophe: Well, don’t build queries out of raw text; use parameter substitution, and the problem goes away. If you use the parameter substitution facility, you don’t have to worry about SQL injection attacks, because it will never interpret a parameter as a structural part of the SQL; it will only interpret it as a value. Sometimes you do want to build things dynamically. If you’re like me and have no sense of shame, you do build things that include table names and stuff like that, and the only way to do that is with string substitution. Then you just have to make sure you have a really clear idea where those inputs are coming from, and follow the general rule, which is: all external input is hostile and wants to kill you, all the time. But for most of this kind of stuff, just to make it concrete: in this query, you could drop anything you wanted in there and it would not do a SQL injection attack, because the only thing it would ever be interpreted as is a value. Send it a “1; DELETE FROM” kind of thing and it would just be comparing the column with that huge string, and you’d be okay.
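A concrete sketch of the difference (the table and column names are hypothetical):

```python
from django.db import connection

user_input = "1; DELETE FROM store_order"  # hostile-looking input

with connection.cursor() as cursor:
    # Safe: the driver sends user_input as a value, never as SQL
    # structure, so it can only ever be compared against the column.
    cursor.execute(
        "SELECT * FROM store_order WHERE customer_name = %s",
        [user_input],
    )

    # Unsafe: string formatting splices the text into the statement
    # itself. Never do this with external input.
    # cursor.execute(
    #     "SELECT * FROM store_order WHERE customer_name = '%s'" % user_input
    # )
```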
It’s unfortunate that the placeholder syntax is exactly the same as Python’s string-formatting syntax. And unfortunately the % sign also means something inside SQL strings: it’s the wildcard character for LIKE. Well, that’s 1970s technology for you.
AM: [Inaudible 0:24:28]
Christophe: If you’re using [inaudible 0:24:37], write your own queries, for sure. If you’re doing really basic GIS stuff, like a simple contains query, you’re probably okay, because how many different ways is it going to write that? But the moment you’re doing anything complicated at all, just write your own SQL, because your ability to blow up the system and do horrible things is multiplied by at least an order of magnitude with GIS. Not because [inaudible 0:25:04] is bad; it’s actually an amazing package. But the number of bad ways to write a GIS query is virtually infinite, and the number of good ways is much smaller. If you write a naïve (that’s the best word) query that just does basic relational stuff, usually PostgreSQL will take a crack at optimizing it in the right places. A GIS query written in the most naïve way, you’ll be there for the rest of your life waiting for it to complete.
So I strongly recommend using either raw() or execute directly. Also, frequently in GIS you’re not handing the models back; you’re getting protocol-level stuff that you hand up to a mapping package, and things like that, so the model framework isn’t doing you any good.
Anything else? Okay. If you have any questions about PostgreSQL or anything like that, come to the meetup; if not in December, we hold it about once a month. And thank you very, very much for hosting me. It was a pleasure.