Fault-Tolerant Distributed Scheduler
Role
Lead Backend Engineer
Timeline
2024
Team
4 Engineers
Tech Stack
Go, Redis, Kubernetes, gRPC
Overview
High-throughput task scheduling engine
A distributed system designed to handle high-frequency task scheduling across multiple regions. Built with Go and Redis, it processes over 100k tasks per second with sub-millisecond latency, ensuring reliability for critical financial operations.
The Problem
Scaling bottlenecks in legacy systems
The previous RabbitMQ-based system struggled with throughput limitations during peak trading hours. We faced increasing latency and occasional message loss, which was unacceptable for financial transaction processing. The system needed to scale horizontally while maintaining strict ordering guarantees.
The Solution
Redis Streams for persistence and specific ordering
We re-architected the solution using Redis Streams to leverage its low-latency persistence and consumer group features. We implemented a custom partitioner in Go to ensure even distribution of tasks across worker nodes, and deployed the system on Kubernetes for elastic scaling.
The Flow
Architectural System Flow
The system follows a producer-consumer pattern where the API Gateway ingests tasks and pushes them to Redis Streams. Worker nodes consume these tasks, process them, and update the state in PostgreSQL. The entire flow is monitored via Prometheus and Grafana.
System Architecture
High-level view of the distributed components.
Database Schema
Entity relationship diagram for task persistence.
Reflection
Trade-offs in consistency vs availability
Choosing Redis Streams provided the necessary speed but required implementing application-level acknowledgement logic to ensure at-least-once delivery. In hindsight, this complexity was worth the performance gains, but it highlighted the importance of robust monitoring for edge cases.