
Posts

Fractional CTO / Chief Architect

I help organisations design, build, stabilize, and scale mission-critical distributed data systems, including data platforms, Data Lake and Lakehouse architectures, streaming systems, IoT ingestion, and AI infrastructure.

I work with engineering and platform teams to drive clear architecture decisions, reduce systemic risk, and restore delivery momentum without adding full-time leadership overhead.

See consulting services

Recent articles:

Agent Observability for Multi-Agent Systems: How to Trace Agent Workflows in Production

Agent observability breaks down when teams try to force long-lived, stateful workflows into dashboards built for stateless microservices. In production, the real challenge is not collecting more logs. It is reconstructing what the agent saw, what state changed, which tool response altered the workflow, and why the system kept going. This article explains why replayable event streams are a better foundation for multi-agent tracing, how a Kafka-first design makes session replay practical, and where conventional tracing still helps but falls short on its own.

The production failure that changed how we instrument agents

I still remember the first time one of our production agent systems failed without actually crashing. An invoice-processing agent entered a recursive reasoning loop and burned through hundreds of dollars in API credits over a weekend because it kept insisting a validation error existed when it did not. We had logs. We had metrics. We had distributed traces. None of...
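To make the idea of session replay concrete, here is a minimal sketch, assuming an in-memory list stands in for a Kafka topic. The `AgentEvent` fields, the event kinds, and `replay_session` are hypothetical names chosen for illustration, not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEvent:
    """One immutable record in the (hypothetical) agent event stream."""
    session_id: str
    seq: int       # position of the event within its session
    kind: str      # assumed kinds: "observation", "tool_response", "state_change"
    payload: dict


def replay_session(events, session_id):
    """Rebuild one session's state step by step from the event log.

    Returns a list of (seq, kind, state_after_event) tuples, so you can see
    what the state looked like at every point in the workflow.
    """
    state = {}
    timeline = []
    session_events = (e for e in events if e.session_id == session_id)
    for ev in sorted(session_events, key=lambda e: e.seq):
        if ev.kind == "state_change":
            # Only state_change events mutate state in this sketch.
            state = {**state, **ev.payload}
        timeline.append((ev.seq, ev.kind, dict(state)))
    return timeline
```

Because the log is append-only and ordered per session, the same replay can be run again after an incident to see which event preceded each state change.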

Why We Built Our Multi-Agent System on Kafka (And What We Learned)

The 3:47am Incident That Changed Our Architecture

At 3:47am on a Tuesday, our monitoring dashboard lit up. Three different teams had just published the same article about agent observability. Marketing wrote it for the corporate blog. Sales adapted it for a prospect deck. Content produced it for our technical newsletter. All three versions were good. All three were complete. All three appeared within 20 minutes of each other. The problem? Nobody knew the other teams were working on it. This wasn’t a coordination failure. It was an architecture failure. Our multi-agent system had no shared truth. Each department’s agent operated independently, pulling from the same source material, generating similar content, with zero awareness of parallel work. That morning, we rebuilt our agent communication architecture on Kafka. Here’s why, what we learned, and the patterns that emerged from six months in production.

The Architecture That Failed

Our original multi-agent system looked clean...

Building a Model-Agnostic Multi-Agent System with OpenClaw

Over one week we rebuilt our AI stack around OpenClaw’s multi-agent architecture to avoid provider lock-in and stop wasting premium tokens. By aligning models to tasks, diversifying fallbacks across providers, enforcing minimal tool access, and switching to memory-first workflows with ephemeral sessions, we reduced token usage per task by about 70% and cut our monthly bill by 77% while improving operational resilience.

How We Achieved 77% Cost Reduction and Provider Independence

Over the past week, we rebuilt our AI infrastructure around OpenClaw’s multi-agent architecture. The result was a 77% cost reduction, provider independence, and a delegation system that routes work to the most cost-effective model for each job. Below is the technical journey of optimizing a 7-agent squad with OpenClaw.

The Challenge: Model Provider Lock-In

We started with a simple problem: our entire squad defaulted to a single model provider. This created three issues: Cost inefficiency beca...
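The routing idea can be sketched as a preference list per task type with fallbacks across providers. This is an illustration only: the task types, model names, and `pick_model` are hypothetical and do not reflect OpenClaw's configuration or API.

```python
# Hypothetical routing table: each task type lists models in order of
# preference, cheapest suitable model first, with fallbacks on other providers.
ROUTES = {
    "summarize": ["provider-a/small", "provider-b/small"],
    "plan": ["provider-a/large", "provider-c/large"],
}


def pick_model(task_type, available):
    """Return the first preferred model for task_type that is currently available.

    `available` is the set of model names that are reachable right now.
    Raises LookupError when no configured model can serve the task.
    """
    for model in ROUTES.get(task_type, []):
        if model in available:
            return model
    raise LookupError(f"no available model for task type {task_type!r}")
```

If the first-choice provider is down, the same call falls through to the next entry, which is the property that removes the single-provider dependency.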

How I built a secure, high-performance AI agent squad with OpenClaw

The short version: We run PaxMachina like an Airflow-style DAG, separating heavy lifting from reasoning to save tokens. We replaced generic vector stores with a specialized Query-Memory-Document (QMD) backend for high-velocity state. We treat Telegram channels as immutable event logs, not watercoolers. And we added a task ledger protocol that prevents the runaway loops plaguing other agent frameworks.

AI agents are like Airflow for intelligence

I used to think the bottleneck in agent systems was model intelligence. I was wrong. The bottleneck is context hygiene. If you treat an agent like a chatty intern, you burn tokens on coordination and lose state in the noise. The shift that made our system (PaxMachina) work was treating it like an ops pipeline. Specifically, like Airflow DAGs. We separated the "muscle" (gathering data) from the "brain" (reasoning), and we locked down how they talk to each other. If you've followed the recent OpenClaw ...

Connect BACnet to the Cloud with bacnet-mqtt-gateway

The bacnet-mqtt-gateway project is an open source protocol bridge that translates BACnet building automation traffic into MQTT messages for cloud and IoT systems. It provides discovery, polling, bidirectional writes, APIs, security, and easy deployment via Docker. Many enterprises struggle to unify BACnet with modern data pipelines and cloud platforms because BACnet is local-network only and not cloud ready. This gateway provides a scalable, secure, production-ready adapter for MQTT ecosystems and smart building integrations.

The Problem with BACnet

Building automation runs on BACnet. HVAC controllers, lighting systems, metering equipment: they all speak ASHRAE 135. The protocol handles local control loops well. It fails at cloud ingress. BACnet relies on UDP broadcasts. These do not route over the internet or into VPCs. Your chiller controller cannot talk to AWS IoT Core. Your VAV box cannot publish to an MQTT broker. The air gap between operational technology and modern cl...
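As a rough illustration of what a BACnet-to-MQTT translation step involves, the sketch below maps one polled BACnet reading to an MQTT topic and JSON payload. The topic layout, payload fields, and `to_mqtt_message` are assumptions made for this example; they are not the gateway's actual scheme.

```python
import json


def to_mqtt_message(device_id, object_type, instance, value, units=None):
    """Map one BACnet present-value reading to an (MQTT topic, JSON payload) pair.

    The topic layout "bacnet/<device>/<object-type>/<instance>/present-value"
    is a hypothetical convention, not the one bacnet-mqtt-gateway uses.
    """
    topic = f"bacnet/{device_id}/{object_type}/{instance}/present-value"
    payload = json.dumps({"value": value, "units": units}, sort_keys=True)
    return topic, payload
```

A reading from analog input 3 on device 1001 would then be published under `bacnet/1001/analog-input/3/present-value`, which lets cloud consumers subscribe with ordinary MQTT wildcards such as `bacnet/1001/#`.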