5 Novelties in PostgreSQL 10 That Will Make You a Happier Developer
In the past few years, we’ve heard a lot about NoSQL databases: Redis, MongoDB, Cassandra, Memchached, or DynamoDB, to name a few. Even though they gained a lot of deserved hype, the good old relational databases are mostly irreplaceable. PostgreSQL, an object-relational (we will get to it later) database management system, is still gaining new fans.
In this post, I will look at some examples of features introduced in the newest version of Postgres, which was released on 5 October. Throughout this post, I will present most of the queries on Postgres 10 on Ubuntu 16. Only the first part about identity columns will show how it used to be done in “the ol’ days” on Postgres 9.6 (on the left-hand side, as below).
The setup process is rather easy. First, you need to install Docker CE according to your system preferences.
After pulling and starting the images, you should see prompts with the Postgres version you are using:
Now we’re ready to go!
1. Identity columns
After switching from an MS SQL Server, this was something I could not get used to: identity columns. In short, these are the columns that store the IDs of your rows – they should be unique and increment automatically.
The syntax is quite similar on both sides and underneath there still lies a sequence. So what’s the big deal? First of all, the new syntax is SQL-compliant, so it should now be easier to run code on different databases. What is more, you can explicitly decide on the behavior while inserting by choosing either ALWAYS or DEFAULT.
Another convenient thing is changing the next value of a sequence, which I used a lot while migrating databases. In the new version, you don’t have to pay attention to the sequence your table is using, because all the magic is done by PG:
If this still does not convince you, there is also one more thing really exciting that I was not aware of before. Suppose we would like to create a copy of the table foo:
Both operations resulted in creating a new table, but only in pg10, a new sequence is created as well. So if you wanted to insert some new rows into the table bar, in version 9.6 you would get id 1000 (it uses foo_id_seq), and in version 10, you will start from 1, which in my opinion is the desired result. Now, if you wanted to drop foo, you could come across some problems:
You cannot drop foo, since its sequence is being used elsewhere. If you decide to do as the hint suggests, you are in for a hard time inserting values into the bar, though. And what if you created a lot of tables by copying some existing ones? You would end up with a lot of mess…
If you want to see some other monstrosities of the old approach, I encourage you to check out this article.
2. Native partitioning
This is a feature that I’m extremely hyped about because of its power and possible applications. I think that we’re still some way from home, but, in comparison to 9.6, it’s a big step forward for Postgres when it comes to ease of use and performance.
What is partitioning all about? Let me give you an example.
Some time ago, my colleagues at 10Clouds wrote about their attempt to predict smog levels by using machine learning. Consider a table that stores all the measurements. A minimal model table would have a timestamp of a sample’s acquisition, site id and type of pollutant (CO2, PM10, PM2.5, and so on). After some time, the amount of data in the table would grow considerably, and the performance would suffer.
» Will Vue.js Become a Giant Like Angular or React? Read our article to find out!
Depending on the type of calculations you perform most frequently, we could try splitting the data. If we decide that we want to have a separate table, e.g. for measurements from each month, this is where partitioning comes in handy. If the application usually queries the database for the records from the last two weeks, then we would have to scan at most two partitions instead of all the records. Of course, we could specify in the query which tables we should scan, but that would be troublesome and ugly, to be honest.
What we want to achieve is:
- Create another level of abstraction – we want to query only one (master) table
- The data themselves should be dispatched to different child tables based on the sample’s timestamp
Here’s how to accomplish this task in Postgres:
- Create a master table.
- Create as many child tables with datetime constraints as needed.
- Create indices, keys, and other constraints on child tables.
- Create a trigger on the master table that will dispatch rows to proper child tables before insert.
As of Postgres 10, point 4 is no longer needed – DBMS will handle it, and, thus, the syntax becomes greatly simplified. A sample implementation might look like this:
Assuming that we created tables with sites and pollutants, this is how we might feed the measurement table and then query it:
The new syntax provides easier partitioning of data in specific ranges or belonging to specific categories. This means that we could also use a different approach and create a separate table for every measurement site. If one of them happens to be deactivated, we can easily detach this partition from the master table and archive it. What is probably the most important part is that the performance of native partitioning improved greatly. I recommend reading depesz’s blog post, where he explores it in depth.
On the other hand, there are still many issues with partitions: you have to set keys and indices for each child table, you can’t set a unique key across different child tables, etc. But we can see that partitioning is being worked on hard, so I hope to see more in the coming releases.
3. Multicolumn statistics
This feature might be a relief for many of you who know your data and happen to see a lot of underestimated execution plans. A hypothetical example: a business has a class of 1000 children that like counting rhymes. I mean, they really LOVE it. For another crazy reason, they want to store the info about which child said which word. You know your craft, so you don’t even ask why and just get down to coding stuff. Here’s how you might want to store and populate it:
As soon as you report you are ready, the business comes back to you and asks you to retrieve when the child under the id 123 that said ‘miny’… This is what Postgres would normally do:
- Estimate what fraction of total rows (p1) would be returned with the predicate child_id=123
- Estimate what fraction of total rows (p2) would be returned with the predicate word = ‘miny’
- Assume that the total number of rows using both predicates would be equal to:
This way of calculating the returned row number might be accurate only when columns in predicates are not correlated. In our case, they quite obviously are, so the estimate would be way lower than the actual result.
But with the introduction of correlated statistics, things might change for better. Let’s see how it’s done now:
In these examples, I will be showing things in reverse order. First, let’s see what the estimate is with the help of the newly created statistic:
We expect 967 rows to be returned. Ok. How would we do it in the old-fashioned way?
Now, let’s do some math:
returned rows=total rows * p1*p2=1000000*(967/1000000) * (202133/1000000)=195,46
This is what we would expect. And what would Postgres say?
Now there is also one more thing – we are only talking about estimates. What’s the real number of returned rows?
To sum up: this example shows that we might achieve a lot if we know our data well. Ignoring the correlation led to a big underestimation – about one order of magnitude! Here it doesn’t look so serious, but consider joining another table that stores data about the children. With such a plan, a nested loop might be considered to be the cheapest, when, in fact, a hash join should be used. If you are interested in seeing what data are stored when collecting multicolumn statistics, just type:
We will not delve into it any deeper, just note that stxkeys correspond to column numbers (you can specify more than two), and two kinds of statistics are collected: stxdistinct and stxdependencies.
Oh, and by the way, did you notice that I ended the previous query with \gx instead of a semicolon? Yup, that’s also a new feature 😉
4. More parallelism!
The first attempts to introduce parallel queries had been made in Postgres 9.6. Since then you could use parallel sequential scans, hash and nested loop join strategies and aggregates. From now on, you will also be able to use parallel merge joins, bitmap heap scans and maybe, most importantly, index and index-only scans.
That’s why you can find the new settings for parallel query usage configuration in the Postgresql.conf file:
- min_parallel_table_scan_size – the minimal size of a table for which parallelism can be triggered, by default 8MB
- min_parallel_index_scan_size – same as the above but applied to indices, by default 512 kB
- max_parallel_workers – the maximum number of parallel workers to be used, defaults to 8. Note that in the previous version of Postgres, max_parallel_workers_per_gather was introduced.
What are these settings for? Well, everything comes at a cost. Setting up a parallel worker sometimes might not be worth anything, because it would use more resources than simply running the query without any parallel workers. Again, a query by default might be considered by a planner to be executed with more than one worker, if the estimated cost is higher than 1000, which corresponds to the setting parallel_setup_cost.
From my short experience with Postgres 10, I have a feeling that the new parallel workers are rather shy creatures. Let’s see them in action. It isn’t really an interesting example, but let’s store values of trigonometric functions, just like that:
First, some aggregations that were introduced in version 9.6:
Easy. The estimated cost is quite big, so the planner decided to activate two additional workers.
Now let’s try to use a parallel index scan.
Nope. What if you helped it a little bit by lowering the cost of adding another worker?
Still nothing, which is a bit surprising for me. The estimated cost is high, the size of the table and the index is well above the threshold… Let’s try a different query, then:
Finally, there you are! Now you can see the Parallel Index Scan in all its glory. I ran this query a few of times and the results were retrieved in 162 ms on average. Now let’s play some more. Let’s try to forcefully use the parallelism and limit additional workers to 0.
First of all, you see that you can execute this query. I want to emphasize it here, because it wasn’t that obvious to me that the number of launched workers might be equal to 0, which means that there are no additional workers.
Secondly, if you limit the number of available processes, the planner might want to use more than it actually can. Finally, the results are faster when you use a parallel index scan, because the results for the above query were returned on average after 260ms. All in all, that’s awesome, even though many of us might not even be aware that this feature exists.
5. Full-text search for JSON and JSONB
Last but not least, full-text search support was added for JSON and JSONB type columns. If you’re familiar with how it works on text columns, this won’t be a breakthrough, but it still is a nice feature that comes out-of-the-box in Postgres 10.
Imagine that you are using an external system for payment management. You store most of the data in normal columns but, just to be safe, you keep responses from that system also in JSON columns. This is what it might look like:
This is a small example, but if the table grows considerably, it might be a good idea to create an index for the JSON column. Here’s how we do it:
Now that we have some data in the table, let’s take a look how we can query it, based on, for instance, insufficient funds response message. Do we use our new index?
Cool, really cool! Note that we needed to convert the word ‘insufficient’ to lexeme ‘insuffici’ to query efficiently.
Ugh… Is this it?
Nope, not even close. As I wrote at the beginning, there is a big hype for Postgres10, as this version introduced a lot of new features. The five novelties above are just the tip of the iceberg, but I chose them, because I believe that I will be using them most often (consciously or not).
So what did I skip?
- logical replication
- quorum commit for synchronous replication
- XML tables
- SCRAM authentication
- renaming of some functions
- \if \elif \else statements in psql
… and many, many more!