RAG Architecture with dbt, LangChain, and the Modern Data Stack

A dangerous divide is emerging in the enterprise data landscape. On one side, you have the established Modern Data Stack (MDS), centered around tools like dbt and the warehouse. On the other side, a parallel RAG stack is forming around LangChain, vector databases, and LLM applications. Many teams treat this as a dbt-versus-LangChain problem, but that is the wrong framing.

Why this matters: the better question is how dbt, LangChain, and the rest of the modern data stack should work together inside one RAG architecture. When organizations build two disconnected ecosystems, they create silos, duplicate effort, and limit both analytics and AI. A unified stack treats the warehouse and transformation layer as the governed source of truth for retrieval systems, instead of building a separate AI data pipeline from scratch.

The Great Divide: Two Stacks, One Goal

To understand the solution, we must first appreciate the problem. Let’s look at the “standard” composition of these two stacks as they are often built today.

Component	The Modern Data Stack (MDS)	The “AI Stack” (RAG)
Primary Goal	Answer known questions with historical, structured data (BI Dashboards).	Answer novel questions with unstructured, semantic data (Conversational AI).
Central Storage	Cloud Data Warehouse (Snowflake, BigQuery, Redshift).	Vector Database (Pinecone, Chroma, Weaviate).
Transformation	dbt (for SQL-based, version-controlled transformations).	Python scripts, LlamaIndex/LangChain document loaders.
Data “Shape”	Tables, rows, and columns. Highly structured.	Text chunks and high-dimensional vectors. Unstructured.
Primary User	Data Analyst, Business Stakeholder.	AI Agent, End User via chatbot.

Viewing them this way, it’s easy to see why teams build them separately. But the most powerful insights lie at their intersection.

The Bridge: A Unified Architecture for Data-Aware RAG

A truly intelligent RAG system needs more than just semantic similarity search. It needs to filter results by structured metadata (“find me documents related to ‘Project X’ created last quarter for customer Y”). This structured data already lives in your data warehouse. The logical conclusion is that the MDS should not be a competitor to the AI Stack; it should be its primary, trusted data source.
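To make the idea concrete, here is a minimal Python sketch of metadata-filtered similarity search. Everything in it is illustrative: the `Doc` class, the `search` function, the field names, and the three-dimensional toy vectors are assumptions, not the API of any particular vector database. It keeps only documents whose structured metadata matches, then ranks the survivors by cosine similarity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(docs, query_vector, filters, top_k=2):
    """Keep docs whose metadata matches every filter, then rank by similarity."""
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(d.vector, query_vector),
        reverse=True,
    )
    return [d.doc_id for d in ranked[:top_k]]


# Toy data: t2 is semantically close to the query but belongs to another customer.
docs = [
    Doc("t1", [0.9, 0.1, 0.0], {"customer": "Y", "project": "X"}),
    Doc("t2", [0.8, 0.2, 0.0], {"customer": "Z", "project": "X"}),
    Doc("t3", [0.1, 0.9, 0.0], {"customer": "Y", "project": "X"}),
]

result = search(docs, [1.0, 0.0, 0.0], {"customer": "Y", "project": "X"})
print(result)
```

The point of the sketch is that `t2` never reaches the ranking step: the structured filter removes it even though its vector is close to the query. The metadata values are exactly the kind of fields a warehouse already holds.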

Diagram 1: A unified architecture where the MDS (dbt + Warehouse) is the source of truth for the AI Stack.

The Role of dbt: “T” in ELT for Your AI Stack

This is the most powerful and overlooked concept in the unified architecture. Data teams already use dbt to transform raw data into clean, reliable models for analytics. The exact same process and tooling should be used to prepare data for your RAG system.

Instead of one-off Python scripts, your data engineers can create dbt models that:

Read raw data that has already been loaded into the warehouse from sources like Salesforce, Zendesk, or internal databases (dbt transforms data; the load itself is handled by an ingestion tool).
Join tables to create a rich, contextual document (e.g., combine a support ticket with the customer’s subscription level and recent activity).
Clean and format the text, and optionally split it into chunks, so it is ready for embedding.
Materialize these prepared documents as a clean table (e.g., docs_to_embed) in the data warehouse.

This approach means your AI’s data source is now version-controlled, testable, and documented right alongside your core business analytics models.
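As a rough illustration of what "testable" means here, the Python sketch below emulates two checks against a stand-in docs_to_embed table in an in-memory SQLite database. In a dbt project these checks would normally be declared with dbt's built-in `unique` and `not_null` tests rather than written by hand; the table contents, the `failing_rows` helper, and the extra empty-string check are assumptions made for the example.

```python
import sqlite3

# Stand-in for the warehouse table that the dbt model would materialize.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs_to_embed (document_id TEXT, source TEXT, text_content TEXT);
    INSERT INTO docs_to_embed VALUES
        ('101', 'zendesk_ticket', 'Ticket Subject: Login fails'),
        ('102', 'zendesk_ticket', 'Ticket Subject: Invoice missing');
""")


def failing_rows(connection, sql):
    """Return rows that violate a check; an empty list means the check passes."""
    return connection.execute(sql).fetchall()


# Same idea as dbt's `unique` test: no document_id may appear more than once.
duplicates = failing_rows(conn, """
    SELECT document_id FROM docs_to_embed
    GROUP BY document_id HAVING COUNT(*) > 1
""")

# Same idea as dbt's `not_null` test, extended here to reject empty strings,
# because a document without text cannot be embedded.
missing_text = failing_rows(conn, """
    SELECT document_id FROM docs_to_embed
    WHERE text_content IS NULL OR TRIM(text_content) = ''
""")

print("duplicates:", duplicates)
print("missing text:", missing_text)
```

Both queries follow the convention dbt tests use: a check passes when its query returns no rows.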

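-- Example dbt model: models/prep/docs_to_embed.sql

{{
  config(
    materialized='incremental',
    unique_key='document_id'
  )
}}

SELECT
    s.ticket_id AS document_id,
    'zendesk_ticket' AS source,
    c.customer_name,
    c.subscription_tier,
    s.created_at,
    -- Kept in the model so the incremental filter below can read it from the existing table
    s.updated_at,
    -- Combine multiple fields into a single text block for embedding.
    -- COALESCE prevents a NULL subject or description from turning the whole string into NULL.
    -- Note: whether '\n' is read as a newline depends on the warehouse's string-literal rules.
    'Ticket Subject: ' || COALESCE(s.subject, '') || '\n\n' ||
    'Ticket Description: ' || COALESCE(s.description, '') AS text_content
FROM
    {{ source('zendesk', 'support_tickets') }} s
JOIN
    {{ ref('dim_customers') }} c ON s.customer_id = c.customer_id

{% if is_incremental() %}
  -- Only process tickets changed since the last run, which avoids re-embedding old documents
  WHERE s.updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}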
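
The incremental filter in the model above is a high-watermark pattern. The short Python sketch below mimics that logic on plain dictionaries so the behavior is easy to see; the `rows_to_process` function and the sample rows are hypothetical and only stand in for what the warehouse does when dbt runs the model.

```python
from datetime import datetime


def rows_to_process(source_rows, already_loaded):
    """Return source rows newer than the latest updated_at already loaded."""
    if not already_loaded:
        # First run: nothing has been loaded yet, so process everything.
        return list(source_rows)
    watermark = max(row["updated_at"] for row in already_loaded)
    return [row for row in source_rows if row["updated_at"] > watermark]


source_rows = [
    {"ticket_id": 1, "updated_at": datetime(2024, 1, 5)},
    {"ticket_id": 2, "updated_at": datetime(2024, 1, 9)},
    {"ticket_id": 3, "updated_at": datetime(2024, 1, 12)},
]
already_loaded = [{"ticket_id": 1, "updated_at": datetime(2024, 1, 5)}]

new_rows = rows_to_process(source_rows, already_loaded)
print([row["ticket_id"] for row in new_rows])
```

One design consequence worth noting: because the comparison is strictly greater than, a row that shares the exact watermark timestamp but arrives late would be skipped. Some teams subtract a small lookback window from the watermark to guard against that.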

The RAG Application: Leveraging the Unified Stack

With this foundation, the RAG application becomes much more powerful. When a user asks the LangChain application, “What were the main issues for our enterprise customers last month?”, the process is:

The application splits the query into a semantic part (“main issues”) and structured filters (subscription_tier = 'enterprise', created_at within the last month). In LangChain this step is typically delegated to an LLM, for example through a self-querying retriever.
It queries Pinecone with the embedding for “main issues” and passes the structured criteria as a Pinecone metadata filter.
Pinecone returns only the most similar documents that also match the requested customer tier and time period.
The retrieved documents are passed to the LLM, which summarizes them. Because documents from other tiers and time periods are excluded before ranking, this hybrid of metadata filtering and vector search is generally more precise than a pure vector search.
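
The structured half of that process can be sketched in a few lines of Python. The example below builds the metadata filter for “enterprise customers last month”. The operator names (`$and`, `$eq`, `$gte`, `$lt`) follow Pinecone's metadata filter syntax; everything else is an assumption made for illustration, including the helper names, the hard-coded tier and date (which an LLM would extract from the question in a real application), and the choice to store created_at as UTC epoch seconds.

```python
from datetime import date, datetime, timezone


def previous_month_bounds(today):
    """Return (first day of last month, first day of this month)."""
    end = today.replace(day=1)
    if end.month == 1:
        start = date(end.year - 1, 12, 1)
    else:
        start = date(end.year, end.month - 1, 1)
    return start, end


def to_epoch(d):
    """Convert a date to UTC epoch seconds so range operators compare numbers."""
    return int(datetime(d.year, d.month, d.day, tzinfo=timezone.utc).timestamp())


def build_filter(tier, today):
    """Build a metadata filter for one subscription tier and the previous month."""
    start, end = previous_month_bounds(today)
    return {
        "$and": [
            {"subscription_tier": {"$eq": tier}},
            {"created_at": {"$gte": to_epoch(start)}},
            {"created_at": {"$lt": to_epoch(end)}},
        ]
    }


metadata_filter = build_filter("enterprise", date(2024, 3, 15))
print(metadata_filter)
```

A dictionary like this would then be passed alongside the query embedding when querying the index, so that similarity ranking only considers matching documents. If created_at is stored as a timestamp string instead of a number, the pipeline would need to convert it before upserting for the range comparison to work.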

Expert Insight: The Embedding Pipeline as a Production System

The “Embedding Pipeline” in the diagram is a mission-critical component. It needs to be a robust, observable, and scalable system. In production, this is often an orchestrated workflow (e.g., using Airflow, Prefect, or Dagster) that triggers on a schedule or when the dbt models are updated. It reads the docs_to_embed table, calls an embedding model API (like OpenAI or a self-hosted model), and upserts the vectors and metadata to Pinecone. Treating this pipeline with the same engineering rigor as your core data pipelines is essential for keeping your AI’s knowledge up-to-date.
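
A minimal Python sketch of the read, embed, and upsert loop is shown below. It is deliberately self-contained, so the parts that would be real services in production are replaced by hypothetical stand-ins: `fake_embed` produces deterministic but meaningless vectors in place of an embedding API, and `InMemoryVectorStore` replaces the vector database client. The record shape (`id`, `values`, `metadata`) mirrors the one Pinecone uses for upserts; the row contents and batch size are assumptions.

```python
import hashlib


def fake_embed(text, dims=8):
    """Hypothetical stand-in for an embedding API call: deterministic, not semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database client."""

    def __init__(self):
        self.records = {}

    def upsert(self, batch):
        # Keyed by id, so re-running the pipeline overwrites instead of duplicating.
        for record in batch:
            self.records[record["id"]] = record


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_pipeline(rows, store, batch_size=2):
    """Embed each docs_to_embed row and upsert vector plus metadata in batches."""
    written = 0
    for batch in batched(rows, batch_size):
        records = [
            {
                "id": row["document_id"],
                "values": fake_embed(row["text_content"]),
                "metadata": {
                    key: value for key, value in row.items()
                    if key not in ("document_id", "text_content")
                },
            }
            for row in batch
        ]
        store.upsert(records)
        written += len(records)
    return written


# Rows shaped like the docs_to_embed table.
rows = [
    {"document_id": "101", "source": "zendesk_ticket",
     "subscription_tier": "enterprise", "text_content": "Ticket Subject: Login fails"},
    {"document_id": "102", "source": "zendesk_ticket",
     "subscription_tier": "free", "text_content": "Ticket Subject: Invoice missing"},
    {"document_id": "103", "source": "zendesk_ticket",
     "subscription_tier": "enterprise", "text_content": "Ticket Subject: Export is slow"},
]

store = InMemoryVectorStore()
first_run = run_pipeline(rows, store)
second_run = run_pipeline(rows, store)  # simulates a retry or a scheduled re-run
print(first_run, second_run, len(store.records))
```

Keying the upsert on document_id is what makes the job safe to retry: a second run rewrites the same three records rather than creating duplicates, which matters when an orchestrator re-executes a failed task.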

The ActiveWizards Advantage: Engineering Your Unified Data & AI Strategy

The separation between the Modern Data Stack and the AI Stack is artificial and detrimental. The future of enterprise intelligence lies in their unification. Achieving this requires a deep, integrated understanding of both worlds: the discipline of data modeling, transformation, and governance from the MDS, and the complexities of vector databases, embedding models, and agentic workflows from the AI Stack.

At ActiveWizards, this is our native territory. We architect and build these unified platforms, ensuring your AI is not an isolated experiment but a fully integrated, data-aware component of your core business strategy.

Unify Your Data and AI Stacks

Stop building data silos. Let’s design a unified architecture that maximizes your investment in the Modern Data Stack to power a new generation of intelligent, data-aware RAG applications. Contact our experts to get started.


Igor Bobriakov
