=========================
== Morten Ankerstjerne ==
=========================
Swords and SQL

Stop auto-shrinking your staging database

Tags: Data Warehouse, ETL, Microsoft SQL Server

Background

If you’ve worked with data warehouse development on Microsoft SQL Server, you have probably had a run-in with your SAN-admin at some point: they complain about the size of your databases and ask if you can do some cleanup to free up disk space.

What often ends up happening is that you take a look at the database files and realize there’s a lot of free space that could be released, especially if you have a database dedicated to staging transformed data before loading it into the data marts.

“Sure”, you think, “let’s run DBCC SHRINKFILE to get some of that space back”.
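For reference, the shrink itself is a one-liner. The database name, logical file name and target size below are placeholders, not values from any real system - check sys.database_files for your actual logical file names:

```sql
USE Staging;  -- hypothetical staging database name
GO

-- Shrink the (hypothetical) data file down to a 1024 MB target
DBCC SHRINKFILE (N'Staging_Data', 1024);
```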

This works, in the sense that the disk now has more space available, your SAN-admin gets off your back, and everyone’s happy.

Until the next day, after your batch jobs have run overnight and all of the released space has been reclaimed by SQL Server.

The problem

The reason this happens is, of course, that every night we truncate all the staging tables, briefly load lots and lots of transformed data into them, and then send the data on to its final resting place in the data marts.

After a few cycles back and forth with the SAN-admins, shrinking and growing our data files, someone inevitably either enables autoshrink on the staging database or sets up a SQL Server Agent job that periodically runs DBCC SHRINKFILE. Both options accomplish the same thing, and both are bad.
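If you want to check whether someone has already flipped that switch on one of your databases, the commands are standard T-SQL; only the database name here is a placeholder:

```sql
-- List databases that currently have autoshrink enabled
SELECT name, is_auto_shrink_on
FROM sys.databases
WHERE is_auto_shrink_on = 1;

-- Turn it off for the (hypothetical) staging database
ALTER DATABASE Staging SET AUTO_SHRINK OFF;
```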

Now I could go off on a tangent about index fragmentation and all the performance horrors of shrinking your database files and rebuilding indexes, but let’s be honest: index fragmentation isn’t really that big of a deal unless you’re running with low RAM on spinning disks. On top of that, our staging tables are all heaps without any nonclustered indexes anyway, because we truncate and reload them every single night, and all we ever do is read all the data from them when loading the data marts.

The problem I would like to illustrate is far more subtle: automatically shrinking your database can eventually run you out of disk space, because the shrinking hides how much space your workload actually requires.

Let’s say you run your daily ETL job and the disk fills up like this:

Diagram showing a data disk filled from the left with 10 % staging file (green), 20 % with free space in the staging file (hatched green), 40 % unused disk space (no colour) and 30 % other database files (grey)

Right now, the staging database file takes up 30 % of the total disk size, with 2/3 of the staging file being free space.
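You can measure this on your own system instead of eyeballing diagrams. Run this inside the staging database; sizes in sys.database_files are reported in 8 KB pages, hence the division by 128 to get MB:

```sql
-- Allocated vs. actually used space per file in the current database
SELECT name,
       size / 128 AS allocated_mb,
       CAST(FILEPROPERTY(name, 'SpaceUsed') AS int) / 128 AS used_mb,
       (size - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)) / 128 AS free_mb
FROM sys.database_files;
```

Capture the output right after the nightly load finishes, before anything shrinks the file, and you get the real peak requirement.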

If we shrink the data file, it now appears that our staging database only requires 10 % of the total disk size, leaving us with about 60 % of the disk as unused space - nice, let’s do that automatically and forget about it.

Diagram showing a data disk filled from the left with 10 % staging file (green), 60 % unused disk space (no colour) and 30 % other database files (grey)

But what happens when our data requirements grow? Let’s say our solution has been running steadily for a few years, and the total amount of data we are staging each night has doubled:

Diagram showing a data disk filled from the left with 20 % staging file (green), 40 % with free space in the staging file (hatched green) and 40 % other database files (grey)

If we shrink our data file regularly, it would appear we still have about 40 % unused space on the disk, but the disk is actually filling up every night! Sure, we clean it up afterwards, but it won’t take long for the nightly load to fail due to lack of disk space.

Diagram showing a data disk filled from the left with 20 % staging file (green), 45 % with free space in the staging file (hatched green) and 40 % other database files (grey), with 5 % overlay from the free space staging file (opaque red)

The solution

Now, in a perfect world I would tell you to just load less data - but we all know that, in practice, one of the reasons we even have this problem in the first place is the Groundhog Day loading pattern: truncate everything and reload it all from scratch, every single night.

But nobody has time to rewrite all of their old ETL code to work with deltas - we want to do work that actually provides value to our end-users.

An alternative is to update your orchestration slightly, so you reuse the free space in the staging data file as much as possible.

Check your ETL workload: ensure staging tables are truncated right after the data mart tables are loaded, and check whether everything really needs to run at the same time; maybe your SQL Server is already starved for resources because you’re transforming all your huge dimension and fact tables in parallel. Try to identify tables that can be loaded vertically - taken all the way from staging into the mart, and truncated, before the next one starts - ETL doesn’t have to be one horizontal step at a time.
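As a sketch of the vertical idea - all the schema, table and column names here are made up - the point is simply to truncate each staging table as soon as its mart table is loaded, so the next transformation can reuse the freed pages instead of growing the file:

```sql
-- Load one subject area end-to-end, then free its staging space
INSERT INTO mart.FactSales (DateKey, ProductKey, Amount)
SELECT DateKey, ProductKey, Amount
FROM staging.Sales;

TRUNCATE TABLE staging.Sales;  -- pages are deallocated and immediately reusable

-- Only now start the next subject area, which can reuse the space above
INSERT INTO mart.FactInventory (DateKey, ProductKey, Quantity)
SELECT DateKey, ProductKey, Quantity
FROM staging.Inventory;

TRUNCATE TABLE staging.Inventory;
```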

And make sure you talk to your SAN-admin - try to understand why getting more disk space is even an issue. Maybe those years of data growth just finally caught up with your data warehouse solution and you either need to provision more disk for your 20 years of data, or get end users to accept that we can’t keep hoarding data forever (like that’s ever going to happen).