PostgreSQL – Explaining the EXPLAIN command part 2

The previous article described the PostgreSQL planner, the EXPLAIN command, and how to read its output. In this article, we will discuss various “nodes” that you can stumble upon when working with the EXPLAIN command.

SCANS

There are multiple ways in which a table can be scanned for data. Depending on the metadata, statistics, the presence or absence of indexes, and other factors, the query planner decides which type of scan is likely to be the cheapest for a query.

Seq Scan/Index Scan

Since those have already been covered in the previous article, I’m not going to explain what they are again.

However, since parallel query execution has been enabled by default since PostgreSQL 10, it’s worth mentioning that these types of table scans don’t return rows in any particular order unless the user explicitly adds an ordering/sorting clause.

When only one worker (a backend process, not a thread) executes the query, the result set will (most likely) be ordered the same way on every execution. However, if the planner chooses to allocate more than one worker for the query execution, the order of the result becomes unpredictable.

I mentioned parallel execution in this section, but we’re not going to cover it in this article, as parallel queries are complicated and completely out of scope here. They deserve a separate article of their own.

Index Scan Backward

This is the same as an Index Scan, except that, as the name suggests, it walks the index in reverse order. The planner usually chooses this type of scan when the query contains an order by ... desc clause.

Take a look at the following plan as an example:

 Limit  (cost=0.14..4.04 rows=10 width=206) (actual time=0.011..0.023 rows=10 loops=1)
   ->  Index Scan Backward using pg_class_oid_index on pg_class  (cost=0.14..37.54 rows=97 width=206) (actual time=0.009..0.022 rows=10 loops=1)
         Index Cond: (oid < 1247::oid)
 Total runtime: 0.121 ms
(4 rows)

Index Only Scan

This type of scan happens when the query selects data from indexed columns only. The idea behind it is that PostgreSQL can extract everything that is required straight from the index, without touching the table data at all, which is a huge performance improvement and makes this scan noticeably faster than a regular Index Scan. However, for this type of scan to work, the visibility information for the table has to be up to date, which in practice means that the table must be well vacuumed. Index Only Scan was a major new feature in PostgreSQL 9.2.

Take a look at the following table:

CREATE TABLE test (id serial PRIMARY KEY, i int4);
INSERT INTO test (i) SELECT random() * 1000000 FROM generate_series(1, 100000);

A simple table with only 2 columns, one of which is the primary key (which is indexed). Now, let’s try a simple query:

EXPLAIN analyze SELECT id FROM test ORDER BY id LIMIT 10;
                                        QUERY PLAN                                                             
---------------------------------------------------------------------------------------------
 Limit  (cost=0.28..0.57 rows=10 width=4) (actual time=0.038..0.041 rows=10 loops=1)
   ->  Index Only Scan using test_pkey on test  (cost=0.27..2603.31 rows=100000 width=4) (actual time=0.033..0.038 rows=10 loops=1)
         Heap Fetches: 0
 Total runtime: 0.091 ms
(4 rows)

The planner has determined that all the data that we want to extract is available in the index, and chose to execute an Index Only Scan, which returned the data straight from the index. And with the autovacuum feature enabled, we don’t have to worry about the index data being “fresh”.

Bitmap Index Scan

Bitmap Index Scan is an interesting one. The planner chooses this type of scan when the query needs to extract many rows from a table, but not all of them, and those rows are scattered across many pages (for example, with a WHERE ... OR ... condition). It is mainly designed to improve performance on old spinning disks by reducing the number of random I/O operations (SSDs gain a minor improvement too, but it’s not as significant).

You see, Index Scans normally cause random I/O. Why? Well, let’s assume that we have a table with millions of rows. The default page size in PostgreSQL is 8192 bytes, which means that our rows are split across pages. Let’s also assume that our imaginary table occupies 100000 pages (approximately 781.25 MB).

An Index Scan would fetch those pages from disk in whatever order the index entries point at them, which is essentially random order, and random page reads are slow on old spinning disks.

The Bitmap Scan approach is different. It always comes with (at least) 2 operations: Bitmap Heap Scan (the higher-level operation), and Bitmap Index Scan (the lower-level operation).

Going back to our imaginary table with 100000 pages, the Bitmap Index Scan would create a bitmap with one bit for every page in the table, resulting in a memory block of 100000 bits, or roughly 12.5 kB. Initially, all those bits are set to 0. The Bitmap Index Scan then sets to 1 the bits of the pages that could potentially contain the rows that need to be returned.

It’s worth mentioning that bitmap creation doesn’t involve table data. It only works with index data (which is the reason why it may be inaccurate).

Once the bitmap is finished, it is passed to the upper node, the Bitmap Heap Scan. This node then reads the marked pages in ascending physical order, which on spinning disks is much faster than fetching the same pages in random order.
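The two-step Bitmap Index Scan / Bitmap Heap Scan pipeline can be sketched in a few lines of Python. This is a toy model, not PostgreSQL’s actual implementation: the heap layout, the index, and the page counts are all made up for illustration.

```python
# Toy model of a bitmap index scan feeding a bitmap heap scan.
NUM_PAGES = 10
ROWS_PER_PAGE = 4

# Fake "heap": page number -> values stored on that page.
heap = {p: [p * ROWS_PER_PAGE + r for r in range(ROWS_PER_PAGE)]
        for p in range(NUM_PAGES)}

# Fake "index": (value, page the value lives on) pairs.
index = [(v, p) for p, values in heap.items() for v in values]

def bitmap_index_scan(predicate):
    """Set one bit per page that may hold a matching row (uses index data only)."""
    bitmap = [0] * NUM_PAGES
    for value, page in index:
        if predicate(value):
            bitmap[page] = 1
    return bitmap

def bitmap_heap_scan(bitmap, predicate):
    """Read marked pages in ascending (sequential) order, rechecking the condition."""
    rows = []
    for page in range(NUM_PAGES):          # sequential page order
        if bitmap[page]:
            rows.extend(v for v in heap[page] if predicate(v))  # Recheck Cond
    return rows

predicate = lambda v: v < 10
bitmap = bitmap_index_scan(predicate)       # lower-level node
rows = bitmap_heap_scan(bitmap, predicate)  # higher-level node
```

Note how the heap scan visits pages strictly in ascending order, regardless of the order in which the index marked them; that is where the sequential-I/O win comes from.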

Now that the explanation is over, let’s finally take a look at what a Bitmap Index Scan query plan looks like:

EXPLAIN analyze SELECT * FROM test WHERE i < 100000;
                                        QUERY PLAN                                                             
---------------------------------------------------------------------------------------------

Bitmap Heap Scan on test  (cost=4.37..39.99 rows=10 width=8) (actual time=0.023..0.043 rows=10 loops=1)
  Recheck Cond: (i < 100000)
  Heap Blocks: exact=9
  ->  Bitmap Index Scan on i1  (cost=0.00..4.37 rows=10 width=0) (actual time=0.011..0.011 rows=10 loops=1)
        Index Cond: (i < 100000)
Planning Time: 0.182 ms
Execution Time: 0.070 ms

There are a few elements here that you may not have seen in a query plan before. Let’s go over them from bottom to top:

Index Cond: This shows the condition the bitmap is built from.

Heap Blocks: This shows how many bits in the bitmap were set to 1. Here, exact=9 means there are 9 pages that could potentially contain the data we’re looking for.

Recheck Cond: Indicates that the Bitmap Heap Scan node is going to check the condition again. The Bitmap Index Scan marked some pages for reading, but don’t forget that it did so based only on the index, which may be outdated. It is also possible that a marked page contains only some matching rows, so the Bitmap Heap Scan must re-check the condition before returning the final result.

There is another interesting feature of the Bitmap Index Scan: the bitmaps produced by several index scans can be combined. Since a bitmap is just a memory block filled with zeros and ones, it is remarkably simple to execute logical operations on it (AND and OR).
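Since each bitmap is just an array of bits, combining two of them is a plain element-wise bitwise operation. Here is a minimal Python sketch; the page numbers and the conditions each index matched are invented for the example:

```python
NUM_PAGES = 8

def make_bitmap(matching_pages):
    """One bit per table page; 1 = the page may hold a matching row."""
    bits = [0] * NUM_PAGES
    for page in matching_pages:
        bits[page] = 1
    return bits

# Pages each (hypothetical) index scan marked for its condition.
bm_low  = make_bitmap([0, 1, 5])   # e.g. i < 5000000
bm_high = make_bitmap([5, 6, 7])   # e.g. i > 900000000

# BitmapOr: read a page if either condition may match there.
bm_or  = [a | b for a, b in zip(bm_low, bm_high)]
# BitmapAnd: read a page only if both conditions may match there.
bm_and = [a & b for a, b in zip(bm_low, bm_high)]
```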

Let’s have a look at the following example:

EXPLAIN analyze SELECT * FROM test WHERE i < 5000000 OR i > 900000000;
                                        QUERY PLAN                                                             
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on test  (cost=204.65..805.31 rows=10459 width=8) (actual time=1.117..2.929 rows=10522 loops=1)
  Recheck Cond: ((i < 5000000) OR (i > 900000000))
  Heap Blocks: exact=443
  ->  BitmapOr  (cost=204.65..204.65 rows=10511 width=0) (actual time=1.048..1.049 rows=0 loops=1)
        ->  Bitmap Index Scan on i1  (cost=0.00..12.16 rows=515 width=0) (actual time=0.087..0.087 rows=470 loops=1)
              Index Cond: (i < 5000000)
        ->  Bitmap Index Scan on i1  (cost=0.00..187.26 rows=9996 width=0) (actual time=0.960..0.961 rows=10052 loops=1)
              Index Cond: (i > 900000000)
Planning Time: 0.164 ms
Execution Time: 3.306 ms

You can see that for this query, a new node has appeared in the plan: BitmapOr. As the name suggests, this node executes an OR operation on the 2 bitmaps generated by the Bitmap Index Scans, resulting in a single bitmap that marks every page that may satisfy either of the conditions.

The same logic applies to the AND operation. Here is an example involving BitmapAnd:

EXPLAIN analyze SELECT * FROM test WHERE j < 50000000 AND i < 50000000 AND h > 950000000;
                                       QUERY PLAN                                                       
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=280.76..323.61 rows=21 width=16) (actual time=2.295..2.352 rows=20 loops=1)
   Recheck Cond: ((h > 950000000) AND (j < 50000000) AND (i < 50000000))
   ->  BitmapAnd  (cost=280.76..280.76 rows=12 width=0) (actual time=2.278..2.278 rows=0 loops=1)
         ->  Bitmap Index Scan on i3  (cost=0.00..92.53 rows=4832 width=0) (actual time=0.546..0.546 rows=4938 loops=1)
               Index Cond: (h > 950000000)
         ->  Bitmap Index Scan on i2  (cost=0.00..93.76 rows=4996 width=0) (actual time=0.783..0.783 rows=5021 loops=1)
               Index Cond: (j < 50000000)
         ->  Bitmap Index Scan on i1  (cost=0.00..93.96 rows=5022 width=0) (actual time=0.798..0.798 rows=4998 loops=1)
               Index Cond: (i < 50000000)
 Total runtime: 2.428 ms
(10 rows)

In the example above, we have 3 Index Conds, all joined together by the BitmapAnd node.

If you’ve been reading those query plans carefully, you might be wondering why the BitmapAnd and BitmapOr nodes report rows=0 in their actual statistics. The answer is simple: those nodes only execute bitwise operations and don’t return any rows. Their sole purpose is to feed the Bitmap Heap Scan node with the proper bitmap, which the Bitmap Heap Scan then uses to extract the necessary rows from the table data.

If you try to follow the examples yourself, you might get different query plans. Don’t worry, that’s normal. The query planner differs from version to version, and its decisions depend on many factors. For example, I executed the query above on a different computer, with different specifications and a different PostgreSQL version, and got the following plan:

EXPLAIN analyze SELECT * FROM test WHERE j < 50000000 AND i < 50000000 AND h > 950000000;
                                       QUERY PLAN                                                       
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on test  (cost=98.68..728.93 rows=567 width=16) (actual time=0.662..2.360 rows=20 loops=1)
  Recheck Cond: (i < 50000000)
  Filter: ((j < 50000000) AND (h > 950000000))
  Rows Removed by Filter: 5019
  Heap Blocks: exact=540
  ->  Bitmap Index Scan on i1  (cost=0.00..98.54 rows=5100 width=0) (actual time=0.468..0.468 rows=5039 loops=1)
        Index Cond: (i < 50000000)
Planning Time: 0.198 ms
Execution Time: 2.403 ms

As you can see, this time the planner decided not to use BitmapAnd: it determined that a single bitmap plus a combined filter would be cheaper. Same query, but a completely different plan!

Function Scan

Let’s take a quick break and look at something simpler. Function Scan is about as simple as it gets: it runs a function that returns a result set. The only important detail about this type of scan is that it appears only when the called function can (potentially) return multiple rows. So, it is not going to appear when we call scalar functions like lower() or upper().

Here is an example:

explain analyze select * from generate_series(1,10);

Function Scan on generate_series  (cost=0.00..0.10 rows=10 width=4) (actual time=0.006..0.007 rows=10 loops=1)
Planning Time: 0.038 ms
Execution Time: 0.024 ms

The only additional logic a Function Scan can have is filtering. Let’s take a look at the following example:

explain analyze select * from generate_series(1,10) i where i < 5;

Function Scan on generate_series i  (cost=0.00..0.13 rows=3 width=4) (actual time=0.007..0.008 rows=4 loops=1)
  Filter: (i < 5)
  Rows Removed by Filter: 6
Planning Time: 0.035 ms
Execution Time: 0.021 ms
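Conceptually, a Function Scan with a Filter just iterates over the function’s result set, discards the rows that fail the condition, and counts them. A rough Python analogy of the plan above (not how PostgreSQL executes it internally):

```python
def generate_series(start, stop):
    """Stand-in for PostgreSQL's generate_series() set-returning function."""
    yield from range(start, stop + 1)

result = []
rows_removed = 0
for i in generate_series(1, 10):
    if i < 5:                  # Filter: (i < 5)
        result.append(i)
    else:
        rows_removed += 1      # shows up as "Rows Removed by Filter"
```

This matches the plan: 4 rows returned, 6 removed by the filter.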

SORT

Sorting as a process is pretty straightforward and self-explanatory: the result set of a query is re-ordered according to specific criteria. However, when we look at what PostgreSQL does behind the scenes to make this process efficient, we quickly realize that sorting is not as simple as it may appear at first glance.

Before we dive into the logic of the PostgreSQL sorting, here is a simple example of a sorting operation:

explain analyze select * from pg_class order by relname;

Sort  (cost=33.44..34.41 rows=386 width=265) (actual time=0.403..0.421 rows=386 loops=1)
  Sort Key: relname
  Sort Method: quicksort  Memory: 128kB
  ->  Seq Scan on pg_class  (cost=0.00..16.86 rows=386 width=265) (actual time=0.012..0.080 rows=386 loops=1)
Planning Time: 0.450 ms
Execution Time: 0.470 ms

There are 2 important elements in a sorting plan:

Sort Key: Shows the column(s) the sorting is based on, i.e. the columns specified after the order by clause in the query.

Sort Method: Shows the sorting method (memory-based/disk-based), the chosen sorting algorithm and the memory footprint of the entire sorting operation.

There are 2 ways in which PostgreSQL can sort a result-set: memory-based and disk-based. The choice between the two is based on the work_mem setting from postgresql.conf. Note that these 2 sorting methods are not sorting algorithms; the actual sorting algorithm is chosen based on several factors, including whether the result-set is sorted in memory or on disk.

You can read more about work_mem in the PostgreSQL documentation, but in short: memory-based sorting is the default (since it’s faster), and if the memory needed for sorting exceeds the threshold defined by work_mem, PostgreSQL switches to disk-based sorting.

Here is an example of a disk-based sorting plan:

explain analyze select random() as x from generate_series(1,150000) i order by x;

Sort  (cost=16821.95..17196.95 rows=150000 width=8) (actual time=93.491..103.761 rows=150000 loops=1)
  Sort Key: (random())
  Sort Method: external merge  Disk: 2656kB
  ->  Function Scan on generate_series i  (cost=0.00..1875.00 rows=150000 width=8) (actual time=23.798..37.205 rows=150000 loops=1)
Planning Time: 0.089 ms
Execution Time: 108.149 ms

Note that the sort method has changed, and that the space needed for the sort is much bigger than in the previous example: 2656kB, on disk this time.

For disk-based sorting, PostgreSQL stores temporary files in the $PGDATA/base/pgsql_tmp/ directory. Of course, once the files are no longer needed, they will be removed.
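The “external merge” method from the plan above is a classic external merge sort: sort chunks that fit in memory, write each sorted run out, then merge the runs. A toy Python sketch, with in-memory lists standing in for the temporary files and a made-up row budget playing the role of work_mem:

```python
import heapq

def external_sort(rows, budget):
    """Sort more rows than fit in 'memory' by sorting runs, then merging them."""
    runs = []
    for start in range(0, len(rows), budget):
        run = sorted(rows[start:start + budget])  # in-memory sort of one chunk
        runs.append(run)                          # "spill" the sorted run to disk
    return list(heapq.merge(*runs))               # merge all sorted runs

data = [9, 1, 8, 3, 7, 2, 6, 4, 5, 0]
out = external_sort(data, budget=3)
```

heapq.merge only ever looks at the head of each run, which is why the merge phase needs very little memory no matter how large the input is.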

Limit

The limit keyword is used to specify how many rows (at most) we wish for the result-set to contain. The limit operation works from top to bottom and takes the first N rows.

The limit operation itself can be used in any query, however, when used in combination with an order by clause, interesting things happen in the background.

Most of the time when sorting, you will see PostgreSQL use the quicksort algorithm. The sorting algorithm can, of course, change depending on certain conditions, and one such condition is the limit clause.

When using both order by and limit keywords in a query, the plan becomes similar to this:

explain analyze select * from pg_class order by relfilenode limit 5;

Limit  (cost=23.27..23.28 rows=5 width=265) (actual time=0.172..0.173 rows=5 loops=1)
  ->  Sort  (cost=23.27..24.24 rows=386 width=265) (actual time=0.171..0.171 rows=5 loops=1)
        Sort Key: relfilenode
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Seq Scan on pg_class  (cost=0.00..16.86 rows=386 width=265) (actual time=0.010..0.076 rows=386 loops=1)
Planning Time: 0.379 ms
Execution Time: 0.227 ms

Notice here that the sorting algorithm, in this case, is top-N heapsort.

PostgreSQL is smart enough to know that it doesn’t need to sort the entire result-set, since we only need a portion of it.

In terms of Big O notation, a full sort (usually) has O(m * log(m)) complexity, while top-N has O(m * log(n)), where m is the number of rows in the table and n is the number of rows returned. More importantly, top-N uses much less memory, and is therefore far more likely to stick to the memory-based sorting method instead of the slower disk-based one.
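The top-N idea is easy to reproduce with Python’s heapq module: instead of sorting all m rows, keep a bounded heap of the best n seen so far. A small sketch:

```python
import heapq
import random

rows = [random.random() for _ in range(100_000)]

# Full sort: O(m log m), materializes the entire ordered set.
full_sort_top5 = sorted(rows)[:5]

# Top-N: O(m log n), only ever keeps n candidates in its heap.
top5 = heapq.nsmallest(5, rows)
```

Both produce the same five rows, but nsmallest never holds more than a handful of candidates at a time, which mirrors the tiny 26kB footprint of the top-N heapsort in the plan above.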


That’s it for this article. Please note, that if you try to execute the exact same queries from this article, you may get slightly different plans. This is normal since the plan is affected by many factors, including things like hardware, config, and of course the version of PostgreSQL itself. The most important thing is knowing the meaning of different nodes and their values.

Hope this helps, good luck!


Mihail Curchi
Java Developer