
Real-time anomaly detection processing 2.4M events/day with 70% fewer false positives

How we built a real-time anomaly detection pipeline processing 2.4M events/day using Kafka, Isolation Forest, and foundation models. False positive rate reduced from 68% to under 20%.

Bottom Line

ML ensemble on Kafka reduced false positives from 68% to under 20% at 2.4M events/day. Trade-off accepted: 340ms added latency per flagged event for foundation model scoring.

// system_metrics
events_day: 2.4M
detection_latency: <200ms
facilities_connected: 14
alert_accuracy_improvement: 73%

The Problem

Batch processing missed anomalies by 6-8 hours

The existing anomaly detection system ran nightly batch jobs against a PostgreSQL data warehouse. By the time an alert fired, the billing irregularity or access violation had been in production for 6-8 hours — long enough for cascading damage.

The false positive rate was the bigger problem. At 68% false positives, the compliance team had stopped trusting the system entirely. They were manually reviewing every alert, which meant the real anomalies were buried in noise.

  • 6-8 hour detection lag: batch processing ran overnight, alerts arrived the next morning
  • 68% false positive rate: compliance team ignored most alerts
  • No cross-facility correlation: each facility’s data was siloed in separate databases
  • Static thresholds: hand-tuned rules that hadn’t been updated in 18 months
  • Zero contextual understanding: no way to distinguish seasonal patterns from real anomalies

Our Approach

Event-driven pipeline with behavioral baselines

We replaced the batch architecture with an event-driven pipeline built on Apache Kafka. Every transaction, access log entry, and prescription event streams through a unified topic structure. The detection engine processes each event within 200ms of ingestion.

The core insight: static thresholds fail because “normal” changes. A physician prescribing 40 opioid prescriptions per month might be anomalous in a rural clinic but expected in a pain management center. We built behavioral baselines per entity (physician, facility, department) using Isolation Forest, then layered foundation model reasoning for contextual interpretation.

The Architecture

Fig 1 — Streaming anomaly detection with FM reasoning: data sources flow through Kafka CDC ingestion, FAISS enrichment, Isolation Forest scoring, and FM reasoning to the alert dashboard.

Three-layer detection with foundation model reasoning

Layer 1: Streaming ingestion and enrichment

Kafka Connect pulls from 14 EMR systems via CDC (Change Data Capture). A Kafka Streams application handles deduplication, schema normalization, and entity enrichment — joining transaction events with physician profiles, facility metadata, and historical baselines from Redis.
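In production this enrichment runs as a Kafka Streams application on the JVM. As a minimal Python sketch of the same join, assuming hypothetical topic names, Redis key schema, and field names (confluent-kafka and redis-py stand in for the Streams API):

# Sketch of the Layer 1 enrichment join: consume a transaction event,
# attach the entity's behavioral baseline from Redis, and forward it.
# Topic names, Redis keys, and field names are illustrative, not the
# production schema.
import json

import redis
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enrichment",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
baselines = redis.Redis(decode_responses=True)

consumer.subscribe(["events.facility.0001"])  # hypothetical topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Sub-10ms in-memory lookup of the per-entity baseline (see Key Learnings).
    baseline = baselines.hgetall(f"baseline:physician:{event['physician_id']}")
    event["baseline"] = baseline or None  # missing baseline -> cold-start path
    producer.produce("events.enriched", value=json.dumps(event))
    producer.flush()  # simplified; production code batches instead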

Layer 2: Isolation Forest anomaly scoring

Each enriched event passes through an Isolation Forest model trained on 90 days of facility-specific data. The model produces an anomaly score (0-1) based on 23 features including transaction amount deviation, prescription frequency, access time patterns, and cross-facility velocity checks.

Events scoring above 0.7 are flagged for contextual review. Events above 0.9 trigger immediate alerts regardless of context.
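A minimal sketch of this scoring path, assuming one scikit-learn IsolationForest per facility trained offline on 90 days of 23-feature vectors. The 0.7/0.9 thresholds come from the text above; the training data and feature handling are placeholders:

# Sketch of Layer 2: per-facility Isolation Forest scoring with the
# 0.7 (contextual review) and 0.9 (immediate alert) routing thresholds.
import numpy as np
from sklearn.ensemble import IsolationForest

REVIEW_THRESHOLD = 0.7  # 0.7-0.9: forwarded to FM contextual reasoning
ALERT_THRESHOLD = 0.9   # >0.9: immediate alert, no FM pass

# One model per facility, trained offline on 90 days of 23-feature vectors.
# Random data stands in for the real training set here.
rng = np.random.default_rng(0)
training_features = rng.normal(size=(10_000, 23))
model = IsolationForest(n_estimators=100, random_state=0).fit(training_features)

def score_event(features: np.ndarray) -> float:
    """Map an enriched event's feature vector to a 0-1 anomaly score.

    sklearn's score_samples returns the *negated* anomaly score from the
    original Isolation Forest paper, so negating it recovers the paper's
    0-1 score, where higher means more anomalous.
    """
    return float(-model.score_samples(features.reshape(1, -1))[0])

def route(score: float) -> str:
    if score > ALERT_THRESHOLD:
        return "immediate_alert"
    if score > REVIEW_THRESHOLD:
        return "fm_contextual_review"  # the ~3% band noted in Key Learnings
    return "pass"

event_features = rng.normal(size=23)
print(route(score_event(event_features)))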

Layer 3: Foundation model contextual reasoning

Flagged events (score 0.7-0.9) pass to a foundation model that receives the anomaly score, the entity’s behavioral baseline, and a structured context window of recent activity. The model determines whether the anomaly is expected variance (flu season spike, new physician onboarding) or genuine concern (credential sharing, prescription splitting).

This layer reduced false positives from 68% to under 20% compared to threshold-only detection. The model doesn’t make final decisions — it enriches the alert with reasoning that the compliance team reviews.
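For concreteness, the context window handed to the model can be pictured as a structured payload like the sketch below. The field names, prompt wording, and the call_foundation_model stub are assumptions, not the production schema:

# Sketch of the Layer 3 context assembly for a flagged (0.7-0.9) event.
# Field names and the model call are placeholders.
import json

def build_fm_context(event: dict, score: float, baseline: dict,
                     recent_activity: list[dict]) -> str:
    """Assemble the structured context window the FM reasons over."""
    payload = {
        "anomaly_score": score,
        "entity": {"id": event["physician_id"], "type": "physician"},
        "behavioral_baseline": baseline,           # e.g. 90-day per-entity stats
        "recent_activity": recent_activity[-50:],  # bounded context window
        "event": event,
    }
    return (
        "Given this flagged event and the entity's behavioral baseline, "
        "classify the anomaly as EXPECTED_VARIANCE or GENUINE_CONCERN and "
        "explain your reasoning for the compliance reviewer.\n\n"
        + json.dumps(payload, indent=2)
    )

# reasoning = call_foundation_model(build_fm_context(...))  # hypothetical stub
# The output enriches the alert; the compliance team makes the final call.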

Results

Before and after comparison

Before:

  • Nightly batch processing (6-8 hour delay)
  • 68% false positive rate on alerts
  • Siloed databases per facility (14 separate systems)
  • Static thresholds hand-tuned 18 months ago
  • No cross-facility correlation or velocity checks
  • Compliance team manually reviewing every alert
  • Zero contextual reasoning on flagged events

After:

  • Real-time streaming (<200ms detection latency)
  • False positive rate reduced from 68% to under 20%
  • Unified event stream across all 14 facilities
  • Behavioral baselines that adapt per entity
  • Cross-facility velocity detection in real-time
  • Priority-ranked alerts with FM reasoning context
  • 73% improvement in alert accuracy

Architecture Trade-offs

Gain

False positives dropped from 68% to under 20%. Foundation model contextual reasoning distinguishes seasonal variance from genuine anomalies — compliance team trusts alerts again.

Cost

340ms added latency per flagged event. Foundation model inference adds processing time on the 0.7-0.9 anomaly band. Accepted because sub-second latency is sufficient for compliance review workflows — this is not a trading system.

Gain

Cross-facility velocity detection in real-time. Unified Kafka topic structure enables credential-sharing and prescription-splitting detection across all 14 facilities simultaneously.

Cost

Redis memory footprint: 12 GB for behavioral baselines. Per-entity baselines across 14 facilities require dedicated Redis cluster. Worth it: in-memory lookup keeps the p99 under 200ms.
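For illustration, one plausible layout for those baselines is a Redis hash per entity, refreshed by the offline training job. The key schema and fields below are assumptions:

# Sketch of the write side of the baseline store: one Redis hash per
# entity. Key schema and field names are illustrative.
import redis

r = redis.Redis(decode_responses=True)

def store_baseline(entity_type: str, entity_id: str, stats: dict) -> None:
    key = f"baseline:{entity_type}:{entity_id}"
    r.hset(key, mapping=stats)     # e.g. per-feature means/stds, counts
    r.expire(key, 90 * 24 * 3600)  # drop baselines no job has refreshed

store_baseline("physician", "p-123", {
    "rx_per_month_mean": "31.2",
    "rx_per_month_std": "6.4",
    "avg_txn_amount": "184.75",
})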

Key Learnings

Engineering decisions that shaped the outcome

  • Isolation Forest over autoencoders: 10x faster inference at a marginal accuracy cost. At 2.4M events/day, inference latency was the constraint, not model sophistication
  • Foundation models for reasoning, not detection: using LLMs for scoring is cost-prohibitive at this volume. We use them only for the ~3% of events that need contextual interpretation
  • Per-entity baselines over global models: a single facility-wide threshold produced the 68% false positive rate. Entity-level baselines cut it to under 20%
  • Redis feature store over batch lookups: the enrichment step requires sub-10ms entity profile retrieval. PostgreSQL couldn’t keep up under load
  • Kafka topic-per-facility, not topic-per-event-type: enables facility-level scaling and isolation without consumer group complexity (see the sketch below)
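To make the last point concrete, producer-side routing under a topic-per-facility layout reduces to choosing the topic from the event. The naming scheme here is hypothetical:

# Sketch of producer-side routing under the topic-per-facility layout.
# Topic naming scheme is illustrative; partitioning and serialization omitted.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(event: dict) -> None:
    # One topic per facility: consumers scale and fail per facility,
    # and a noisy facility cannot stall the other thirteen.
    topic = f"events.facility.{event['facility_id']}"
    producer.produce(topic, key=event["physician_id"],
                     value=json.dumps(event).encode())

publish({"facility_id": "0007", "physician_id": "p-123", "amount": 412.50})
producer.flush()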


Technology Stack

What we built with

Kafka · Isolation Forest · Foundation Models · Real-Time Data · scikit-learn · Redis · PostgreSQL · Python 3.11 · FastAPI · Docker · Kubernetes · Prometheus · Grafana · PagerDuty

Similar challenge?

Deploy this architecture

Submit your requirements. We'll review your constraints, identify bottlenecks, and scope the path to production.

[ SUBMIT SPECS ]

No SDRs. A Principal Engineer reviews every submission.

From the team behind Production-Ready AI Agents (Amazon, 2025)