Back to work

Fault-Tolerant Distributed Scheduler

Role

Lead Backend Engineer

Timeline

2024

Team

4 Engineers

Tech Stack

Go, Redis, Kubernetes, gRPC

Fault-Tolerant Distributed Scheduler

Overview

High-throughput task scheduling engine

A distributed system designed to handle high-frequency task scheduling across multiple regions. Built with Go and Redis, it processes over 100k tasks per second with sub-millisecond latency, ensuring reliability for critical financial operations.

System Overview Dashboard

The Problem

Scaling bottlenecks in legacy systems

The previous RabbitMQ-based system struggled with throughput limitations during peak trading hours. We faced increasing latency and occasional message loss, which was unacceptable for financial transaction processing. The system needed to scale horizontally while maintaining strict ordering guarantees.

The Solution

Redis Streams for persistence and specific ordering

We re-architected the solution using Redis Streams to leverage its low-latency persistence and consumer group features. We implemented a custom partitioner in Go to ensure even distribution of tasks across worker nodes, and deployed the system on Kubernetes for elastic scaling.

The Flow

Architectural System Flow

The system follows a producer-consumer pattern where the API Gateway ingests tasks and pushes them to Redis Streams. Worker nodes consume these tasks, process them, and update the state in PostgreSQL. The entire flow is monitored via Prometheus and Grafana.

graph TD Client[Client Application] --> API[gRPC API Gateway] API --> Manager[Task Manager] Manager --> Store[(Redis Cluster)] Manager --> Worker1[Worker Node A] Manager --> Worker2[Worker Node B] Worker1 --> DB[(PostgreSQL)] Worker2 --> DB

System Architecture

High-level view of the distributed components.

erDiagram TASKS ||--o{ ATTEMPTS : has TASKS { string id PK string payload timestamp scheduled_at string status } ATTEMPTS { string id PK string task_id FK timestamp started_at string error_log }

Database Schema

Entity relationship diagram for task persistence.

Reflection

Trade-offs in consistency vs availability

Choosing Redis Streams provided the necessary speed but required implementing application-level acknowledgement logic to ensure at-least-once delivery. In hindsight, this complexity was worth the performance gains, but it highlighted the importance of robust monitoring for edge cases.