Table Bloat

Created at: 2024-12-19

A table is bloated when it's actual size on disk is higher than the actual size of the data.

Any UPDATEs and DELETEs will create "dead tuples" on the tables where these operations are performed.

Dead tuples are tuples that can't be seen in the current database table version but are still stored in disk.

Due to Postgres' MVCC architecture (Multiversion Concurrency Control), tuples aren't necessarily deleted or updated on disk as soon as they are affected by an operation.

Such tuples have the problem of occupying space on disk. This space may be reclaimed by a VACUUM. In certain situations, not even a VACUUM can reclaim these dead tuples. This is the case when active transactions holding those dead tuples exist and the xmin horizon is behind - think about long-running transactions here.

Performance Implications Of Bloated Tables

The same query needs to read through more buffers to satisfy its results, because tuples are sparsely allocated on disk.
These tuples take more room in memory, thus not allowing efficient caching of pages.

Vacuuming

While VACUUM may reclaim dead rows, it won't fix fragmented pages. That is, VACUUM won't reorder the tuples in their stored pages to defragment them, nor will it delete pages that aren't used by the table.

This is only achieved when VACUUM FULL is performed. This operation will effectively copy all tables and create new ones with a compact size on disk.

It's rare the database that can afford a VACUUM FULL these days, as it takes an AccessExclusive lock. Instead, there are alternatives like pg_repack to be used instead.

Indexes can also become defragmented for the same reasons. The pgstattuple.pgstatindex function is useful for checking index fragmentation. Focus on the avg_leaf_density value, anything below 20% isn't great. source.

Bloat

To find the bloat of a table, Postgres' wiki recommends:

To obtain more accurate information about database bloat, please refer to the pgstattuple or pg_freespacemap contrib modules. source

Example

CREATE DATABASE bloat_test;

CREATE EXTENSION IF NOT EXISTS pgstattuple;

DROP TABLE IF EXISTS not_bloated;
CREATE TABLE not_bloated (id SERIAL PRIMARY KEY, val INT);
ALTER TABLE not_bloated SET (autovacuum_enabled = false);
CREATE INDEX not_bloated_idx on not_bloated (val);
INSERT INTO not_bloated (val) SELECT generate_series(1, 100000);

DROP TABLE IF EXISTS bloated;
CREATE TABLE bloated (id SERIAL PRIMARY KEY, val INT);
ALTER TABLE bloated SET (autovacuum_enabled = false);
CREATE INDEX bloated_idx on bloated (val);
INSERT INTO bloated (val) SELECT generate_series(1, 100000);

VACUUM FULL;

-- Delete every second row to fragment the table.
DELETE FROM bloated WHERE val % 2 = 0;
VACUUM ANALYSE bloated;

SELECT * FROM pgstattuple('not_bloated');
-- table_len          | 3629056 (in bytes)
-- tuple_count        | 100000
-- tuple_len          | 3200000
-- tuple_percent      | 88.18
-- dead_tuple_count   | 0
-- dead_tuple_len     | 0
-- dead_tuple_percent | 0
-- free_space         | 16652
-- free_percent       | 0.46

-- The table_len is the same, even though the bloated table has about half of
-- the ammount of duples as in the not_bloated table!
SELECT * FROM pgstattuple('bloated');
-- table_len          | 3629056
-- tuple_count        | 50000
-- tuple_len          | 1600000
-- tuple_percent      | 44.09
-- dead_tuple_count   | 50000
-- dead_tuple_len     | 1600000
-- dead_tuple_percent | 44.09
-- free_space         | 1618424
-- free_percent       | 44.6

SELECT * FROM pgstatindex('not_bloated_idx');
-- version            | 4
-- tree_level         | 1
-- index_size         | 2260992
-- root_block_no      | 3
-- internal_pages     | 1
-- leaf_pages         | 274
-- empty_pages        | 0
-- deleted_pages      | 0
-- avg_leaf_density   | 89.83
-- leaf_fragmentation | 0

-- The leaf density is much lower
SELECT * FROM pgstatindex('bloated_idx');
-- version            | 4
-- tree_level         | 1
-- index_size         | 2260992
-- root_block_no      | 3
-- internal_pages     | 1
-- leaf_pages         | 274
-- empty_pages        | 0
-- deleted_pages      | 0
-- avg_leaf_density   | 45.06
-- leaf_fragmentation | 0

-- Run these EXPLAINs a couple of times
EXPLAIN (ANALYZE, BUFFERS, COSTS off)
SELECT * FROM not_bloated;
--                               QUERY PLAN
---------------------------------------------------------------------------
-- Seq Scan on not_bloated (actual time=0.035..16.198 rows=100000 loops=1)
--   Buffers: shared hit=443
-- Planning Time: 0.142 ms
-- Execution Time: 25.619 ms

-- The bloated table hits the same amount of buffers as the not_bloated table
-- even though it has half the amount of tuples.
EXPLAIN (ANALYZE, BUFFERS, COSTS off)
SELECT * FROM bloated;
--                              QUERY PLAN
-- --------------------------------------------------------------------
--  Seq Scan on bloated (actual time=0.069..15.524 rows=50000 loops=1)
--    Buffers: shared hit=443
--  Planning Time: 0.191 ms
--  Execution Time: 23.297 ms

-- Now using the index instead of seq scan
EXPLAIN (ANALYZE, BUFFERS, COSTS off)
SELECT * FROM not_bloated WHERE val > 1000 and val < 40000;
--                                          QUERY PLAN
-- --------------------------------------------------------------------------------------------
--  Bitmap Heap Scan on not_bloated (actual time=17.933..25.978 rows=38999 loops=1)
--    Recheck Cond: ((val > 1000) AND (val < 40000))
--    Heap Blocks: exact=173
--    Buffers: shared hit=282
--    ->  Bitmap Index Scan on not_bloated_idx (actual time=17.829..17.830 rows=38999 loops=1)
--          Index Cond: ((val > 1000) AND (val < 40000))
--          Buffers: shared hit=109
--  Planning Time: 0.320 ms
--  Execution Time: 29.565 ms

-- Uses the same amount of buffers but the table is half the size!!!
EXPLAIN (ANALYZE, BUFFERS, COSTS off)
SELECT * FROM bloated WHERE val > 1000 and val < 40000;
-- Bitmap Heap Scan on bloated (actual time=17.238..25.120 rows=38999 loops=1)
--   Recheck Cond: ((val > 1000) AND (val < 40000))
--   Heap Blocks: exact=173
--   Buffers: shared hit=282
--   ->  Bitmap Index Scan on bloated_idx (actual time=17.110..17.110 rows=38999 loops=1)
--         Index Cond: ((val > 1000) AND (val < 40000))
--         Buffers: shared hit=109
-- Planning Time: 0.324 ms
-- Execution Time: 28.723 ms

Note from the Postgres documentation:

The table_len will always be greater than the sum of the tuple_len, dead_tuple_len and free_space. The difference is accounted for by fixed page overhead, the per-page table of pointers to tuples, and padding to ensure that tuples are correctly aligned.

We can define table bloat as:

table_bloat = (bloat_size / table_size) * 100

Where

bloat_size = table_len - tuple_len
table_size = table_len

For our example, table bloat in the bloated table is ~56%.

Find The Top 10 Bloated Tables in The Database

SELECT relname, (pgstattuple(oid)).*
FROM pg_class WHERE relkind = 'r'
ORDER BY dead_tuple_percent DESC
LIMIT 10;

Sources

Outro

Since I am using DataDog at the moment, it's worth mentioning that DataDog calculates index bloat using a special query created based on:

The actual query can be found here

The same is true for table bloat, but the query is slightly different source.

Stack overflow question:

How Postgres bloat is dangerous?