
Posts

Fractional CTO / Chief Architect

I help organisations design, build, stabilize, and scale mission-critical distributed data systems, including data platforms, Data Lake and Lakehouse architectures, streaming systems, IoT ingestion, and AI infrastructure.

I work with engineering and platform teams to drive clear architecture decisions, reduce systemic risk, and restore delivery momentum without adding full-time leadership overhead.

See consulting services

Recent articles:

Bitsight Security Ratings in Production Decision Fabrics

Summary: Bitsight delivers daily updated security ratings and detailed findings from external scanning across many risk vectors. This article shows how to turn that data into events in a streaming Decision Fabric. It defines the Decision Fabric as the Kafka-native substrate where events drive agent decisions with shared graph memory and explains the role of KafSIEM for provenance-linked analysis. Concrete implementation examples use event schemas and brain tool calls. The piece covers honest trade-offs on API limits, query latency and observability cost plus the operational shifts that result in faster risk reduction for engineering teams.

Bottom Line: Bitsight security ratings provide an objective outside-in measurement of cyber risk that updates every day. The practical way to get value from them is to treat rating changes, risk vector details, and associated findings as immutable events on a Kafka stream. Those events feed both human analysts and autonomous agents that c...
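As a rough illustration of "rating changes as immutable events", here is a minimal Python sketch. The `RatingChanged` type, its field names, and the `to_record` helper are hypothetical and are not taken from the Bitsight API or from the article's own schemas; the sketch only shows how a frozen record can be turned into a keyed, serialized message ready for any Kafka producer.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class RatingChanged:
    """Hypothetical event shape; field names are illustrative assumptions."""
    company_id: str
    rating: int
    previous_rating: int
    rating_date: str  # ISO 8601 date string

    @property
    def delta(self) -> int:
        # Negative values mean the rating dropped.
        return self.rating - self.previous_rating


def to_record(event: RatingChanged) -> tuple[bytes, bytes]:
    """Build a (key, value) pair for a producer.

    Keying by company keeps all events for one company in one partition,
    which preserves their order for downstream consumers.
    """
    key = event.company_id.encode("utf-8")
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return key, value


event = RatingChanged("acme", rating=710, previous_rating=740,
                      rating_date="2024-01-02")
key, value = to_record(event)
```

The resulting `key` and `value` bytes can be handed to whichever Kafka client is in use. A correction to a rating would be published as a new event, never as an edit to an existing one.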

Production CDC Architecture: Debezium Scaling Lessons

Production CDC architecture breaks under load long before most teams expect it. With Debezium, Kafka Connect, and Postgres, the failure patterns are consistent: WAL pressure builds up, connector lag drifts unnoticed, and snapshot phases exhaust memory under bursty traffic. This is based on running these pipelines across high throughput systems, including workloads above 10k TPS. The difference between a system that works and one that holds under pressure comes down to observability, WAL discipline, and how connector scaling is handled.

[Figure: Production Debezium CDC Architecture. Operational reality vs. tutorial defaults under real load (10k+ TPS). Panel "The Default 'Tutorial' Setup": assumes low throughput and stable networks; fails under pressure. Source: Postgres. Single WAL Slot: shared slot coupling multiple connectors. Default WAL retention settin...]
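One way to make "connector lag drifts unnoticed" measurable is to compare the server's current WAL position with the position a replication slot has confirmed. The sketch below assumes the two LSN strings have already been fetched, for example from `pg_current_wal_lsn()` and the `confirmed_flush_lsn` column of `pg_replication_slots`. The function names and the alert threshold are illustrative assumptions, not settings from the article.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the slot's consumer has not yet confirmed."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


def lag_exceeds(current_wal_lsn: str, confirmed_flush_lsn: str,
                threshold_bytes: int) -> bool:
    """True when retained WAL for the slot is above the chosen threshold."""
    return slot_lag_bytes(current_wal_lsn, confirmed_flush_lsn) > threshold_bytes


# Hypothetical values: the slot is 0x60 = 96 bytes behind.
lag = slot_lag_bytes("0/3000060", "0/3000000")
alert = lag_exceeds("0/3000060", "0/3000000", threshold_bytes=1_000_000)
```

Postgres can compute the same difference server-side with `pg_wal_lsn_diff`. Doing it in the monitoring code is only a design choice that keeps the query itself trivial.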

BACnet => MQTT in Production: The Real Cost of Bridging BACnet to MQTT at Scale

bacnet2mqtt looks simple in a README and expensive in production. Once BACnet polling, reconnection behavior, stale state, and MQTT publishing collide, teams discover they are not deploying a lightweight adapter but operating infrastructure. This article breaks down where bacnet2mqtt works, where it becomes a bottleneck, and which production patterns reduce the operational damage before incidents, backlogs, and silent data loss turn a building integration into a long-running engineering problem. I inherited a building controls integration problem 18 months ago. Three office floors. 217 BACnet sensors covering temperature, occupancy, and HVAC actuators. The data was trapped inside the building automation network while the business wanted analytics, reporting, and compliance visibility in the data platform. The obvious answer looked easy enough: deploy bacnet2mqtt, bridge BACnet into MQTT, and push the stream into the lakehouse stack. The repository made it sound like a w...

Agent Observability for Multi-Agent Systems: How to Trace Agent Workflows in Production

Agent observability breaks down when teams try to force long-lived, stateful workflows into dashboards built for stateless microservices. In production, the real challenge is not collecting more logs. It is reconstructing what the agent saw, what state changed, which tool response altered the workflow, and why the system kept going. This article explains why replayable event streams are a better foundation for multi-agent tracing, how a Kafka-first design makes session replay practical, and where conventional tracing still helps but falls short on its own.

The production failure that changed how we instrument agents

I still remember the first time one of our production agent systems failed without actually crashing. An invoice-processing agent entered a recursive reasoning loop and burned through hundreds of dollars in API credits over a weekend because it kept insisting a validation error existed when it did not. We had logs. We had metrics. We had distributed traces. None of...
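The idea of session replay can be sketched as a fold over an ordered event log: applying events up to a chosen offset rebuilds what the agent had seen and stored at that point. The event types (`observation`, `tool_response`, `state_change`) and the `SessionState` shape below are hypothetical and are not the article's schema. In a Kafka-first design the list would be read from a topic partition instead of held in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SessionState:
    observations: list = field(default_factory=list)  # what the agent saw
    memory: dict = field(default_factory=dict)        # what it stored
    steps: int = 0                                    # events applied so far


def apply(state: SessionState, event: dict[str, Any]) -> None:
    """Apply one event to the state (event types are assumptions)."""
    kind = event["type"]
    if kind == "observation":
        state.observations.append(event["payload"])
    elif kind == "tool_response":
        state.memory[event["tool"]] = event["payload"]
    elif kind == "state_change":
        state.memory.update(event["payload"])
    state.steps += 1


def replay(events: list[dict[str, Any]],
           upto: Optional[int] = None) -> SessionState:
    """Rebuild state from offset 0 through `upto` inclusive (all if None)."""
    state = SessionState()
    for offset, event in enumerate(events):
        if upto is not None and offset > upto:
            break
        apply(state, event)
    return state


log = [
    {"type": "observation", "payload": "invoice 42 received"},
    {"type": "tool_response", "tool": "validator", "payload": {"ok": True}},
    {"type": "state_change", "payload": {"status": "done"}},
]
at_tool_call = replay(log, upto=1)  # state right after the tool response
final = replay(log)
```

Replaying to successive offsets and comparing the resulting states is one way to locate the event after which a workflow's state stopped changing.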

Why We Built Our Multi-Agent System on Kafka (And What We Learned)

The 3:47am Incident That Changed Our Architecture

At 3:47am on a Tuesday, our monitoring dashboard lit up. Three different teams had just published the same article about agent observability. Marketing wrote it for the corporate blog. Sales adapted it for a prospect deck. Content produced it for our technical newsletter. All three versions were good. All three were complete. All three appeared within 20 minutes of each other. The problem? Nobody knew the other teams were working on it. This wasn’t a coordination failure. It was an architecture failure. Our multi-agent system had no shared truth. Each department’s agent operated independently, pulling from the same source material, generating similar content, with zero awareness of parallel work. That morning, we rebuilt our agent communication architecture on Kafka. Here’s why, what we learned, and the patterns that emerged from six months in production.

The Architecture That Failed

Our original multi-agent system looked clean...