Now, if you’ve been running a PostgreSQL DB in production for a little while, you probably need to do some spring cleaning. This will save you storage space and might even improve your DB’s performance. Marie Kondo said it best.
Let’s try to break down different kinds of clutter in your DB.
Unused Tables
You may have some unused tables lying around. Maybe you ran some scripts that created some tables or maybe these are remnants of a backend that is no longer with us (R.I.P backend service).
Now, if your DB has 100+ tables, it becomes very difficult to know what’s being used and what’s not. Fear not! Postgres comes with a handy view called pg_stat_all_tables that tells you how often your tables have been used. These stats typically accumulate from the beginning of time (or since the last reset), so you should run select pg_stat_reset(), wait 7 days of production usage, and then look at the stats. Running pg_stat_reset() zeroes all the counters so counting begins afresh.
select * from pg_stat_all_tables
where
schemaname not in ('pg_catalog', 'pg_toast')
and seq_scan = 0
and idx_scan = 0;
You’ve got your list of potentially unused tables: every table that has not seen a sequential scan or an index scan since the stats were reset. Review this list closely before dropping anything!
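As a concrete recap of the reset-and-wait step: pg_stat_reset() is a function, so it’s invoked via select, and you can confirm when the counters were last zeroed by checking the stats_reset column of pg_stat_database.

```sql
-- Zero out the statistics counters, then wait ~7 days of production traffic
select pg_stat_reset();

-- Later: confirm when stats were last reset for the current database
select stats_reset
from pg_stat_database
where datname = current_database();
```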
Unused Indexes
Indexes are cool. They allow you to quickly identify rows you are interested in without reading other parts of the table. It is possible to go overboard with indexes, though. The PostgreSQL planner may not actually be using these indexes during query execution. There might even be duplicate indexes on the same set of columns. Yes - it happens :(
Indexes are not free. Every index takes up some space. Also, every additional index makes it a little more expensive to add or edit rows in the table because the index needs to be updated as well.
Just like pg_stat_all_tables, there’s a helpful view called pg_stat_all_indexes which gives you usage stats on every index. As with the table stats, run select pg_stat_reset(), wait 7 days of production usage to build up fresh stats, and then run this statement.
with t0 as (
select
s.schemaname,
s.relname as tablename,
s.indexrelname as indexname,
i.indkey as keys,
i.indisunique as is_unique,
exists (select 1 from pg_catalog.pg_constraint c where c.conindid = i.indexrelid) as is_constraint,
s.idx_scan as num_scans,
pg_relation_size(s.indexrelid) as size
from
pg_catalog.pg_stat_user_indexes s
join pg_catalog.pg_index i on (s.indexrelid = i.indexrelid)
)
select * from t0
where
num_scans = 0 -- (1)
and 0 <> ALL (keys) -- (2)
and not is_unique -- (3)
and not is_constraint -- (4)
and schemaname not in ('pg_catalog', 'pg_toast')
order by size desc
;
The above SQL may look a little scary, so let’s deconstruct it a bit, starting with the final filters marked as numbered comments above:
1. Indexes that haven’t seen any scan operations
2. Indexes that aren’t based on an expression - this isn’t strictly required, but we’ll play it a little safe. keys is an int2vector that lists the table columns the index references; an expression shows up as a 0 entry (see this link for a more detailed explanation)
3. Indexes that don’t enforce a UNIQUE constraint
4. Indexes that don’t enforce any other constraints
That’s it, you now have a list of potentially useless indexes. Please review the list carefully before dropping them.
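If you do decide to drop one, dropping it concurrently avoids blocking writes to the table while the index is removed. The index name below is a made-up example, not one from the queries above.

```sql
-- Note: CONCURRENTLY cannot run inside a transaction block
drop index concurrently if exists idx_orders_legacy_status;
```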
Table Bloats
Finally, your tables may be getting bloated. What is bloating, you ask?
PostgreSQL famously uses MVCC (multi-version concurrency control). This is a pretty cool thing and I’ll do a separate post on the topic if there’s interest. However, what happens is that there are multiple versions of the same “row” of data. Most of these versions become inaccessible and useless leading to “dead rows”. This is clutter.
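Before running a full bloat-estimation script, you can get a rough feel for this clutter straight from the stats views: n_dead_tup in pg_stat_user_tables counts row versions that are dead but not yet vacuumed away (the 10-row limit is just for readability).

```sql
-- Tables with the most dead row versions (a rough indicator, not a size estimate)
select relname, n_live_tup, n_dead_tup
from pg_stat_user_tables
order by n_dead_tup desc
limit 10;
```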
Estimating bloat is a little more complicated. There is a nice script you can find here, courtesy of the nice folks at pgexperts. Running this script will give you an idea of tables that have a lot of bloat or dead rows. The output looks something like this:
 databasename | schemaname | tablename            | can_estimate | est_rows | pct_bloat | mb_bloat | table_mb
--------------+------------+----------------------+--------------+----------+-----------+----------+----------
 prodperf     | public     | advance_properties   | t            |    21900 |        68 |    18.56 |   27.359
 prodperf     | public     | org_user_properties  | t            |    86939 |        66 |    56.98 |   86.523
mb_bloat is the most interesting column: it gives you the estimated bloat in MB. You can now run a vacuum full on these tables to reclaim this space.
Vacuum full is an expensive operation. It takes an exclusive lock on the table, blocking all reads and writes, so please run it only during a maintenance window. This is a tricky operation and you should learn about vacuum full behavior before embarking on this.
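A minimal sketch of the maintenance-window step, using one of the bloated tables from the sample output above:

```sql
-- Rewrites the whole table under an ACCESS EXCLUSIVE lock;
-- all reads and writes are blocked until it finishes
vacuum (full, verbose, analyze) advance_properties;
```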
Now Do This Regularly
Clutter routinely builds up in your DB, so make sure you spend at least one day a quarter doing some decluttering. If you have any other tips for decluttering the DB, please add them in the comments.