The PostgreSQL Locking Trap That Killed Our Production API (and How We Fixed It)
TL;DR
A simple `ALTER TABLE` statement can bring down your production infrastructure if you’re not careful.
This is the story of me waking up to production alerts while working on totally unrelated tasks (building slides for some business stuff), getting bamboozled, instinctively blaming the most recent production infrastructure change, and eventually figuring out how deep the rabbit hole goes.
Here is the chain of events if you are skimming through:
- Google Cloud monitoring policy triggered on database error threshold breach
- Production APIs showing high latency and intermittent timeouts
- Initially blamed the recently deployed database read replicas as the root cause
- Stopped replication, restarted database instance for service restoration
- Quickly figured out that the issue reappeared under high load
- Manually deleted PostgreSQL replication slot from primary database instance
- Manually deleted read replica instances (non-critical), hoping against hope
- Raised a support ticket with Google Cloud
- Slow query analysis showed multiple pending `ALTER TABLE ..` statements queued up
- Shocked to realize that we had run an `ALTER TABLE ..` on a very large and read-heavy table, even with multiple guardrails in place
- Killed all `ALTER TABLE ..` queries and blocked all schema migrations in production till a permanent fix was implemented
- All services restored. FINALLY!
- Identified the root cause as a lock contention issue across multiple background job workers, the schema migrator and long running transactions
- Isolated application-level locking from business logic tables to a common locks table, to avoid lock contention with `ALTER TABLE`, which locks the entire table
- All services restored, including internal release related services
The Incident: When Everything Just… Stopped
It started with Google Cloud monitoring alerts triggered by the database error threshold being breached. We saw a whole bunch of error logs with `Context cancelled by user` messages.
The immediate reaction (panic) was to blame the recently provisioned read replica for the production database instance. The replica had been added so that internal analytical queries from Metabase, a non-critical internal service, could run safely. My response was to:
- Stop the replication in the replica instance using Google Cloud console
- Restart the primary database instance hoping that any performance issues due to replication would be resolved
This temporarily restored the services, but the errors quickly returned. The timing also correlated with the time of day when our usage is at its peak, which led to a new hypothesis: we had introduced a query with a missing index, leading to a table scan. But I could not validate this hypothesis by looking at the CPU, memory and IO metrics of the primary database instance.
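In hindsight, a quick way to sanity-check a missing-index hypothesis is to look at the most expensive queries via the `pg_stat_statements` extension, assuming it is enabled. A rough sketch (column names vary by PostgreSQL version):

```sql
-- Requires the pg_stat_statements extension.
-- On PostgreSQL 13+ the columns are total_exec_time / mean_exec_time;
-- older versions use total_time / mean_time.
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```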
In fact, we did not see any replication lag during the last 24 hours. This made me question my hypothesis that the issue was caused by the read replicas, even though that was the most recent infrastructure change I could confirm by looking at our Terraform repository commit history. I finally decided it was time to dig deep, given the situation so far:
- No CPU / memory / IO wait metric anomaly for primary database instance
- No replication lag in read replicas
- Binary log (WAL) size (storage) started increasing on the primary when replication was stopped. This is expected: replication slots on the primary retain WAL until it is consumed or the slot is dropped (see the query below).
- Manually deleted PostgreSQL replication slots from primary database instance
- Even deleted the replicas assuming some weirdness with Google CloudSQL and its internal high availability configuration
Find the slots used by replicas:
SELECT * FROM pg_replication_slots;
Delete the slot:
SELECT pg_drop_replication_slot('$slot_name');
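Before dropping a slot, it also helps to confirm that it is the one retaining WAL. Something along these lines (PostgreSQL 10+) shows whether each slot is still being consumed and roughly how much WAL it is holding back:

```sql
-- Inactive slots retain WAL on the primary until they are dropped
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```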
The Real Culprit: Lock Contention Hell
It was clear that reverting recent changes was not going to fix the issue. It was time to dig deeper. Retrospectively, I think the plan was to do the following:
- Identify long running or expensive queries
- Identify the root cause of the expensive queries
- Optimize them by rewriting them or using appropriate indexes
The following query was used to list all active queries and sort them by query age. This gives a view of the long running queries, along with the PID and other information needed to kill a query if necessary.
SELECT datname,
usename,
application_name as appname,
pid,
client_addr,
state,
now() - backend_start as conn_age,
now() - xact_start as xact_age,
now() - query_start as query_age,
now() - state_change as last_activity_age,
wait_event_type,
wait_event,
query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_age DESC
LIMIT 100;
The output showed:
- Multiple `SELECT .. FOR UPDATE` style locks waiting to be acquired
- Multiple `ALTER TABLE .. ADD COLUMN` statements queued up
- Multiple `INSERT .. ON CONFLICT DO NOTHING` statements queued up
The most interesting bit: all these queries were waiting to acquire a lock on the `malware_analyses` table. This is the table where we store OSS package scanning jobs powering SafeDep vet, updated by background job workers and queried by a user facing API. At this point, I was fairly sure it was a lock contention issue, but I wanted to confirm it before taking any action, especially since I had already exhausted myself reverting infrastructure changes as a panic response.
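A quick way to confirm who is blocking whom is `pg_blocking_pids()` (available since PostgreSQL 9.6); a sketch:

```sql
-- For every session waiting on a lock, show the sessions blocking it
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```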
The Perfect Storm: How It All Went Wrong
Let’s have a quick look at the components that act on the `malware_analyses` table and how they interact with each other. The system consists of the following logical components:
- Submission API that is idempotent and can be retried without creating duplicate jobs
- Background job workers that actually execute an OSS package analysis job with a timeout of 15 minutes
- Query API that is used to fetch the status and results of a job by its job identifier
We also have a schema migrator built as part of our application development framework, which internally executes GORM migrations with additional safety checks that guarantee timeouts, global locks, audit logs and more.
Submission API
Our API framework is built to execute business logic (the service layer) in a transaction by default for consistency, unless explicitly opted out by the service specification. We generally opt out of transactions only for read-only operations. In the case of the submission API, the service logic does the following in a transaction:
- Check if the job already exists in the database
- Create a new job record in the database if it does not exist
- Create a background job to execute the OSS package analysis job (transactional consistency)
- Return the job identifier to the client
Note: The submission API is idempotent, which requires attempting to read a row from the `malware_analyses` table to check if the job already exists before performing an `INSERT` operation. The table uses unique index constraints to guarantee idempotency even when there is a race condition. This means `INSERT ..` statements cannot always proceed concurrently:
INSERT into tables that lack unique indexes will not be blocked by concurrent activity. Tables with unique indexes might block if concurrent sessions perform actions that lock or modify rows matching the unique index values. Ref
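As an illustration (table columns and values here are simplified and hypothetical), the idempotent write boils down to something like this; with the unique index in place, two concurrent inserts for the same key serialize on the index instead of creating duplicates:

```sql
-- Hypothetical, simplified version of the idempotent submission write
CREATE UNIQUE INDEX IF NOT EXISTS idx_malware_analyses_external_id
    ON malware_analyses (external_id);

-- If another uncommitted transaction has already inserted the same external_id,
-- this statement waits for it to commit or roll back before doing nothing.
INSERT INTO malware_analyses (external_id, status)
VALUES ('pkg:npm/example@1.0.0', 'queued')
ON CONFLICT (external_id) DO NOTHING;
```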
Package Analysis Business Logic
The background job workers execute the OSS package analysis job with a timeout of 15 minutes. The job is executed in a transaction and the following steps are performed:
- Acquire a row-level lock on the `malware_analyses` record to prevent multiple workers from executing the same job concurrently
- Execute a long running RPC call to an external service that takes up to 5 minutes to complete
- Update the job record in the database
Example code snippet from the background job worker:
// This code was holding row-level locks for ~60 seconds
func (j *malysisAnalyzePackageJobExecutor) Execute(ctx ectx.C, job MalysisAnalyzePackageJob) error {
    tx := ctx.DB()

    // This acquires a row-level lock with SELECT ... FOR UPDATE
    var analysis models.MalwareAnalysis
    err := tx.Clauses(clause.Locking{
        Strength: "UPDATE", // Row-level lock
        Options:  "NOWAIT",
    }).Where("external_id = ?", job.AnalysisID).First(&analysis).Error
    if err != nil {
        return err
    }

    // This RPC call typically takes 30-60 seconds
    err = async.RpcInvokeWithNamespace(ctx, j.client, j.config.ClientNamespace,
        procedureName, request, &response, async.RpcCallOptions{
            Timeout: 5 * time.Minute,
        })
    if err != nil {
        return err
    }

    // Still holding the lock while we update
    return tx.Save(&analysis).Error
}
Schema Migration
The schema migrator is a tool that we use to safely apply schema changes to the database. In this particular case, the following was the schema change that triggered the issue:
// MalwareAnalysis is the model for serving Malysis v1 API.
type MalwareAnalysis struct {
GlobalModel
@@ -66,6 +87,12 @@ type MalwareAnalysis struct {
// Set time for when the is_malware field was set to false
WithdrawnAt *time.Time `gorm:"index"`
+
+ // Package publishers
+ Publishers datatypes.JSONSlice[PackagePublisher] `gorm:"type:jsonb;default:'[]'::json;index:,type:gin"`
+
+ // Package download statistics
+ Downloads datatypes.JSONType[PackageDownloads] `gorm:"type:jsonb;default:'{}'::json;serializer:json;index:,type:gin"`
}
This schema change adds two new columns to the `malware_analyses` table, along with indexes on them.
This translates to the following SQL statement:
ALTER TABLE malware_analyses ADD COLUMN publishers JSONB;
ALTER TABLE malware_analyses ADD COLUMN downloads JSONB;
CREATE INDEX idx_malware_analyses_publishers ON malware_analyses USING gin (publishers);
CREATE INDEX idx_malware_analyses_downloads ON malware_analyses USING gin (downloads);
Things to note here:
- The `ALTER TABLE .. ADD COLUMN` statement needs an `AccessExclusiveLock` on the table, even though it’s just a table metadata update
- The index creation is a separate operation that, by default, takes a `SHARE` lock on the table, which blocks writes (but not reads) until the index is built, unless `CONCURRENTLY` is explicitly specified
PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. Ref
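For reference, the concurrent variant of the index creation would look roughly like this; note that `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so a migration tool typically needs to opt out of its default transaction for this statement:

```sql
-- Builds the index without blocking concurrent writes.
-- Must be run outside an explicit transaction block.
CREATE INDEX CONCURRENTLY idx_malware_analyses_publishers
    ON malware_analyses USING gin (publishers);
```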
The Lock Hierarchy That Kills
Here’s what PostgreSQL’s documentation tells us about `AccessExclusiveLock`:
“Conflicts with ALL other lock modes. Guarantees that the holder is the only transaction accessing the table in any way.”
With this lock hierarchy in mind, we can see that there were multiple contenders that contributed to the incident, either by holding locks or by trying to lock the entire table by requesting an `AccessExclusiveLock`.
The root cause of the issue is the following sequence of events:
- Background jobs start - acquire row-level locks on `malware_analyses`
- Schema migration starts - waits for an `AccessExclusiveLock` on `malware_analyses`
- New Submission API requests arrive - queue behind the `ALTER TABLE`, waiting for table access
- Cascading failure - all API (app) server Go routines blocked on database calls
- System death - API becomes unresponsive, not just for impacted tables but for all API endpoints
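This queue is visible in `pg_locks`: the waiting `AccessExclusiveLock` sits behind the granted row-level locks, and everything that arrives later queues behind it. A rough way to see it:

```sql
-- Granted and waiting locks on the affected table
SELECT l.pid, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'malware_analyses'::regclass
ORDER BY l.granted DESC, l.pid;
```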
The Fix: Isolate Locks
Two things were clear so far:
- Long running transactions are toxic - The best practices are right!
- Schema migrations on large and busy tables are VERY VERY risky
Let’s set aside the fact that we tried creating an index on a large table; we can mitigate that by updating our schema migration process to use maintenance windows and the concurrency primitives offered by PostgreSQL. But this incident can repeat even for a harmless `ALTER TABLE .. ADD COLUMN` statement that only modifies table metadata. This is because the `ALTER TABLE ..` will wait for an `AccessExclusiveLock` on the table even though it’s just a table metadata update. Row-level locks held by background job workers make the table slow and risky for schema changes.
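One guardrail worth baking into the migration process is a `lock_timeout` on DDL sessions, so the `ALTER TABLE` fails fast instead of sitting in the lock queue and blocking everything that arrives after it. A minimal sketch:

```sql
-- Fail the DDL quickly if the AccessExclusiveLock cannot be acquired,
-- instead of queueing behind long-held row locks and blocking new queries.
SET lock_timeout = '5s';
ALTER TABLE malware_analyses ADD COLUMN publishers JSONB;
```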
Locking, however, is an application primitive that we need. We have two options:
- Use an external service like Redis for locking
- Isolate locking from business logic tables to a common locks table
While blogs and other common wisdom point to option [1], we decided against it because it introduces the additional complexity of serializing locks and data access across multiple systems, i.e. PostgreSQL and Redis. We decided to continue leveraging PostgreSQL for resource (row) level locks, but isolate them from business logic tables that may be changed as and when required. Instead, we decided to introduce a common `resource_locks` table that will rarely require schema changes because it offers a single primitive, i.e. lock a row by `resource_name` and `resource_id`.
Before: Direct Table Locking
The service-level, ad-hoc locking pattern before the incident:
// DON'T DO THIS - locks business logic table
var analysis models.MalwareAnalysis
err := tx.Clauses(clause.Locking{
Strength: "UPDATE",
}).Where("external_id = ?", analysisID).First(&analysis).Error
After: Common Lock Table
A common resource locking API for use by the service layer. While it currently uses a PostgreSQL backend for locking, the API offers an abstraction that keeps our options open for the future.
// Create a dedicated locking table
type ResourceLock struct {
GlobalModel
ResourceName string `gorm:"not null;uniqueIndex:idx_resource_locks_resource_name_id"`
ResourceID string `gorm:"not null;uniqueIndex:idx_resource_locks_resource_name_id"`
// .. Other metadata
}
// Use it for serialization
func WithResourceLock(db *gorm.DB, resource LockableResource, heldBy string, fn func(*gorm.DB) error) error {
return db.Transaction(func(tx *gorm.DB) error {
// Create a lock in the resource_locks table
// Acquire a row level lock on the resource_locks table
// Execute the business logic
return fn(tx)
})
}
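Under the hood, this is a sketch of the idea (not the exact implementation, and the values are illustrative): acquiring a resource lock amounts to upserting a row into `resource_locks` and holding a row-level lock on it for the duration of the transaction:

```sql
-- Inside the business transaction: ensure the lock row exists, then lock it.
INSERT INTO resource_locks (resource_name, resource_id)
VALUES ('malware_analysis', 'some-external-id')
ON CONFLICT (resource_name, resource_id) DO NOTHING;

SELECT * FROM resource_locks
WHERE resource_name = 'malware_analysis'
  AND resource_id = 'some-external-id'
FOR UPDATE;

-- The row lock is held on resource_locks, not on malware_analyses,
-- and is released automatically at COMMIT or ROLLBACK.
```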
The New Pattern
Finally, the business logic can be executed without holding any locks on the business logic table, using the common resource locking API instead.
// Now business logic locks are isolated
return models.WithResourceLock(ctx.DB(), analysisJob, "analyze_package_job",
    func(tx *gorm.DB) error {
        // Business logic here - no locks on malware_analyses table
        var analysis models.MalwareAnalysis
        err := ctx.DB().Where("external_id = ?", job.AnalysisID).First(&analysis).Error
        if err != nil {
            return err
        }

        // Long-running RPC call
        err = async.RpcInvokeWithNamespace(...)
        if err != nil {
            return err
        }

        // Update without row locks
        return tx.Save(&analysis).Error
    })
Deploying the Fix
So far so good. We had a fix in place. But there were challenges with deployment because it required schema changes and a restart of the background job workers. We were reluctant to execute schema changes in production without a proper maintenance window, but we also needed to unblock releases that required schema changes. So we roughly took the following steps:
- Paused all background jobs for the queue that was running the OSS package analysis jobs, to avoid any row-level locks on the `malware_analyses` table
- Waited for all existing jobs to complete and verified there were no queued queries in the database (see the check below)
- Deployed the fix to production
- Applied the schema changes for the common locks table
- Restarted all background jobs for the queue
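A simple way to verify that nothing is queued behind a lock before running the DDL is to look for sessions waiting on locks; a query like this should return no rows:

```sql
-- Should return zero rows before applying the schema change
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```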
Aha! That’s how it looks after the fix, with the usual schema migrations and background job workers running again.
The Key Insight: Lock Isolation
The breakthrough was realizing that serialization requirements and business data storage are orthogonal concerns.
You don’t need to lock the `users` table to ensure only one password reset email goes out. You don’t need to lock the `orders` table to prevent duplicate payment processing. You need a coordination mechanism that’s separate from your data storage. This ensures that neither application queries nor schema management operations are blocked by the long-held locks that the business logic (service layer) may require.
The Real Lesson
This isn’t about PostgreSQL being bad, or Go being bad, or our architecture being bad. This is about the fundamental tension between consistency requirements and availability requirements in distributed systems. The CAP theorem suddenly feels a lot more real when it shows up unexpectedly in your production database, right when you are trying to add a column and your API dies.