RAG-Powered Local LLM Assistant

Your own private AI assistant with retrieval-augmented memory, running entirely on your own hardware so your data never leaves your network

Published on Jan 03, 2025

Reading time: 5 minutes.


Built with: Docker Compose, Ollama, Open WebUI, Chroma, PostgreSQL, Prometheus, and Grafana

What makes this special?

  • Complete Privacy: Your AI assistant runs entirely on your local infrastructure - no data leaves your network
  • Persistent Memory: Chroma vector database remembers everything you teach it, creating a truly personalized AI
  • Local LLM Power: Ollama runs powerful language models locally, so no internet connection is needed for AI responses
  • Your Data, Your Control: All documents, conversations, and embeddings stored securely on your own storage
  • Zero Subscription Costs: Open-source stack with no ongoing fees or usage limits
  • Self-Hosted Excellence: Full control over your AI assistant’s capabilities and data retention

How does it work?

  • Smart Document Processing: Upload your documents and the stack chunks, embeds, and indexes them in Chroma for semantic search
  • Intelligent Memory: Your AI remembers past conversations and can reference your uploaded documents
  • Local Processing: Everything runs on your hardware - documents, conversations, and AI responses stay private
  • Easy Setup: One-command deployment with Docker Compose for hassle-free installation
  • Model Flexibility: Choose from various open-source LLM models that run entirely offline
  • Real-time Monitoring: Track your AI’s performance and resource usage with built-in dashboards

What you need

  • Docker & Docker Compose installed
  • 8GB+ RAM (for running AI models locally)
  • Basic familiarity with Docker commands
  • Your own documents to create a personalized AI knowledge base

Source Code

https://github.com/Lforlinux/Opensource-LLM-RAG-Stack

How to deploy the infrastructure

Quick Start Deployment

# Clone the repository
git clone https://github.com/Lforlinux/Opensource-LLM-RAG-Stack.git
cd Opensource-LLM-RAG-Stack

# Quick start (includes model setup)
./start.sh

# Or manual setup:
# Start all services (includes Ollama)
docker-compose up -d

# Set up Ollama with a model
./scripts/setup-ollama.sh
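
Once the services are up, a quick sanity check confirms everything is responding; the ports follow the compose file shown below, so adjust if you have remapped them:

# Confirm all containers are running
docker-compose ps

# Ollama should list any pulled models
curl http://localhost:11434/api/tags

# Chroma heartbeat (API path varies by Chroma release)
curl -s http://localhost:8000/api/v2/heartbeat

# Open WebUI should answer on port 3000
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000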

Docker Compose Architecture

version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      - OLLAMA_API_BASE_URL=http://ollama:11434
      - VECTOR_DB=chroma
      - DATABASE_URL=postgresql://user:password@postgres:5432/chatdb

  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    ports: ["8000:8000"]
    environment:
      - CHROMA_DB_IMPL=duckdb+parquet

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: chatdb

# Named volume referenced by the ollama service above; without this
# top-level declaration, docker-compose rejects the file.
# (Prometheus and Grafana services are omitted from this excerpt.)
volumes:
  ollama-data:

Architecture

(Diagram: RAG LLM Stack architecture)

RAG Stack Components

  • Open WebUI: User interface for chat interactions and document management
  • Ollama: Containerized LLM inference engine with model management
  • Chroma: Vector database for semantic search and embeddings storage
  • PostgreSQL: Relational database for chat history and document metadata
  • Prometheus: Metrics collection and monitoring
  • Grafana: Visualization and dashboard management

Key Features

RAG Implementation

  • Document Processing: Automatic chunking and embedding generation (see the sketch below)
  • Semantic Search: Vector similarity search for relevant context retrieval
  • Context Augmentation: Dynamic prompt enhancement with retrieved information
  • Chat History: Persistent conversation management with PostgreSQL
  • Model Management: Easy model switching and versioning with Docker volumes
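
As a rough illustration of the embedding step, you can exercise Ollama's embeddings endpoint directly; nomic-embed-text is an example model choice here, not something the stack requires:

# Pull an embedding model (example) and generate a test embedding
docker-compose exec ollama ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is retrieval-augmented generation?"}'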

Monitoring & Observability

  • Service Health: Real-time monitoring of all stack components
  • Performance Metrics: Request rates, response times, and resource usage
  • Database Monitoring: PostgreSQL performance and query optimization
  • Vector DB Metrics: Chroma collection health and search performance
  • Grafana Dashboards: Pre-configured dashboards for comprehensive monitoring

Database Schema & Architecture

PostgreSQL Schema

-- Enable UUID generation used by the DEFAULT clauses below
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Chat Sessions Management
CREATE TABLE chat_sessions (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    user_id VARCHAR(255) NOT NULL,
    session_name VARCHAR(255),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Message Storage
CREATE TABLE chat_messages (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    session_id UUID REFERENCES chat_sessions(id),
    role VARCHAR(50) CHECK (role IN ('user', 'assistant', 'system')),
    content TEXT NOT NULL,
    token_count INTEGER DEFAULT 0
);

-- RAG Document Storage
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    title VARCHAR(500),
    content TEXT NOT NULL,
    source VARCHAR(500),
    embedding_id VARCHAR(255), -- Chroma reference
    metadata JSONB DEFAULT '{}'::jsonb
);

-- Performance Indexes
CREATE INDEX idx_documents_content_gin ON documents 
USING gin(to_tsvector('english', content));
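
With the GIN index in place, keyword search over stored documents can be tested directly; a minimal sketch using the compose service and credentials shown above:

# Full-text search against the documents table
docker-compose exec postgres psql -U user -d chatdb -c \
  "SELECT title FROM documents WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'vector database');"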

RAG Implementation Guide

Document Upload & Processing

  1. Access Open WebUI: Navigate to http://localhost:3000
  2. Upload Documents: Support for PDF, TXT, and other formats
  3. Automatic Processing: System chunks documents and generates embeddings
  4. Vector Storage: Embeddings stored in Chroma for semantic search (verified below)
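
To verify that uploads actually landed in the vector store, you can query Chroma over HTTP. The exact path differs across Chroma releases (newer builds nest collections under tenant and database routes), so treat this v1-style call as an assumption:

# List Chroma collections after an upload (v1-style path; adjust per version)
curl -s http://localhost:8000/api/v1/collections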

Query with RAG

  1. User Query: Ask questions in Open WebUI interface
  2. Context Retrieval: System retrieves relevant chunks from Chroma
  3. Prompt Augmentation: Retrieved context enhances user prompts
  4. LLM Generation: Ollama generates responses using the augmented context (sketched below)
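
Open WebUI drives this pipeline for you, but the final generation step can be reproduced against Ollama's API; the model name and prompt below are placeholders:

# Generate a response from a context-augmented prompt (model name is an example)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Context: <retrieved chunks go here>\n\nQuestion: What does the document say about backups?",
  "stream": false
}'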

Monitoring & Observability

Prometheus Metrics

  • Service Health: up{job=~"prometheus|postgres_exporter"} (queried below)
  • Database Performance: PostgreSQL exporter metrics
  • Request Rates: HTTP request monitoring
  • Resource Usage: Container and system metrics
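
These metrics can also be pulled through Prometheus's HTTP API; this assumes Prometheus is exposed on its default port 9090, so check your compose mapping:

# Query service health via the Prometheus HTTP API (-g disables curl URL globbing)
curl -sg 'http://localhost:9090/api/v1/query?query=up{job=~"prometheus|postgres_exporter"}'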

Grafana Dashboards

(Screenshot: Grafana monitoring dashboard)

  • RAG Stack Overview: Service health and performance
  • Database Metrics: PostgreSQL performance monitoring
  • System Resources: CPU, memory, and disk usage
  • Request Analytics: API call patterns and response times

Production Deployment

Environment Configuration

# Production environment variables
export POSTGRES_PASSWORD=secure_password
export GRAFANA_ADMIN_PASSWORD=secure_admin_password
export OLLAMA_API_BASE_URL=https://your-ollama-instance.com
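
Note that these exports only take effect if the compose file references them via ${VAR} substitution; the excerpt earlier hardcodes credentials, so this is an assumption about the production variant. You can verify the rendered configuration:

# Confirm that variable substitution resolved as expected
docker-compose config | grep -i password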

Scaling Considerations

  • Horizontal Scaling: Multiple Ollama instances behind load balancer
  • Database Scaling: PostgreSQL read replicas for query performance
  • Vector DB Scaling: Chroma clustering for high availability
  • Monitoring: Prometheus federation for multi-instance monitoring

Security Best Practices

Infrastructure Security

  • Network Isolation: Container network security and service isolation
  • Environment Configuration: Secure environment variable management (example below)
  • Data Encryption: Encryption at rest and in transit
  • Access Control: Proper authentication and authorization
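
As a small, concrete example of the environment-variable point above, credentials can be generated rather than hand-typed (assumes openssl is available):

# Generate a random database password instead of committing one to the repo
export POSTGRES_PASSWORD="$(openssl rand -base64 24)"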

Data Protection

  • Backup Strategy: Automated backup for all persistent data
  • Data Privacy: No sensitive data logging in production
  • Secure Communication: HTTPS/TLS for all service communications
  • Container Security: Regular image updates and vulnerability scanning

Troubleshooting

Common Issues

  1. RAG Not Working - Document Upload Issues

    # Check Chroma connection
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/api/v2/heartbeat
    
  2. Database Connection Issues

    # Check PostgreSQL status
    docker-compose logs postgres
    
    # Verify database initialization
    docker-compose exec postgres psql -U user -d chatdb -c "\dt"
    
  3. Model Loading Problems

    # Check Ollama service
    docker-compose logs ollama
    
    # Verify model availability
    curl http://localhost:11434/api/tags
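
    # If no models are listed, pull one manually (model name is just an example)
    docker-compose exec ollama ollama pull llama3.2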
    

Future Enhancements

Planned Features

  • Multi-Model Support: Support for multiple LLM providers
  • Advanced RAG: Hybrid search with keyword and semantic matching
  • API Integration: RESTful API for external system integration
  • Multi-Tenant Support: Isolated environments for different users

Technical Improvements

  • High Availability: Multi-instance deployment with load balancing
  • Performance Optimization: Query optimization and caching strategies
  • Security Hardening: Enhanced authentication and authorization
  • Monitoring Enhancement: Advanced alerting and anomaly detection

Contributing

Development Setup

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/your-feature
  3. Make changes and test locally
  4. Commit changes: git commit -m "Add your feature"
  5. Push to branch: git push origin feature/your-feature
  6. Create Pull Request

Code Standards

  • Docker: Container optimization and security best practices
  • Database: PostgreSQL performance and schema optimization
  • Monitoring: Prometheus metrics and Grafana dashboard standards
  • Documentation: Clear setup and troubleshooting guides

Conclusion

This OpenSource LLM RAG Stack project demonstrates enterprise-grade AI infrastructure practices, showcasing:

  • Production-Ready RAG System with comprehensive monitoring
  • Containerized Microservices architecture for scalability
  • Vector Database Integration for semantic search capabilities
  • Observability with Prometheus and Grafana monitoring
  • Enterprise DevOps practices with Infrastructure as Code

The project serves as both a functional RAG system and a comprehensive example of modern AI infrastructure, making it an excellent addition to any AI/ML engineer’s portfolio.

Source Code: https://github.com/Lforlinux/Opensource-LLM-RAG-Stack