Version: 0.1.0

Build a RAG Chatbot in Lab

In the Lab, upload documents to a Volume, index them with an embedding server and a Vector DB, then build a RAG backend (FastAPI) and a Streamlit UI as images and deploy them as NuFi Servings. The final end-to-end check is performed in the externally exposed Streamlit UI.

Why don't we use the Playground?

APIs exposed externally by NuFi Serving only accept the OpenAI-compatible spec. This tutorial's RAG backend is written with a custom spec (POST /query) that freely carries sources/scores, so it can't be opened directly in the Playground. Instead, keep the RAG backend for internal calls only and use the Streamlit UI Serving as the external exposure point.

Composition

Volume: tutorial-volume
  └─ /data/docs            # Documents uploaded from the Lab

Lab (tutorial-volume mounted at /data)
  └─ Document chunking, embedding, Vector DB loading

Serving: tutorial-embedding
  └─ Converts documents and questions into vectors (OpenAI-compatible /v1/embeddings)

Vector DB
  └─ collection: nufi-rag-tutorial

Serving: tutorial-llm
  └─ vLLM-based LLM API (OpenAI-compatible /v1/chat/completions)

Serving: tutorial-rag-backend            ← Internal calls only
  └─ POST /query
  └─ Vector DB search → prompt construction → LLM call → returns answer + sources

Serving: tutorial-rag-ui                 ← External exposure point
  └─ Streamlit
  └─ Calls tutorial-rag-backend → renders answer + sources

Prerequisites

Create the Volume tutorial-volume for this tutorial by following Volumes
Create a Lab with a GPU or NPU by following Lab (mount the above Volume at /data)
Deploy a vLLM-based LLM Serving by following one of the E2E model serving scenarios. This tutorial refers to that LLM Serving as tutorial-llm. The actual Serving name is auto-generated in the form <model-name>-<version>-<artifact> at Quick Deploy time (e.g., qwen-instruct-tutorial-v1-original) and is visible in the Development > Serving list. Replace the host portion of LLM_BASE_URL in the environment variables with the actual name in your environment.
Vector DB connection info received from the administrator (also viewable on the Vector DB deployment page)
An image build environment (Docker/Podman) and a registry you can push to

1. Upload documents to the Lab

In the Volume detail page, click the Files tab, where you'll find an Upload button. Use it to upload the documents you want to use for RAG to tutorial-volume.

Volume Files Upload

Supported document formats

The indexing code at the bottom of this tutorial does not support parsing PDF/DOCX. Use text files only (.md, .txt, etc.).

2. Deploy the Embedding Serving

Go to Development > Serving > Create and deploy the embedding server.

Step 1 — Basic information

Field	Value
Service Name	`tutorial-embedding`
Description	`RAG embedding server`
Service Template	`Custom`

Step 2 — Detailed settings

Field	Value
Image	`ghcr.io/huggingface/text-embeddings-inference:cpu-1.7`
CPU	`2`
Memory	`16`
Accelerator Type	`None`
Replicas	`1`
Inference Port	`80`

Command / Args

--model-id
intfloat/multilingual-e5-base

After deployment, in the Development > Serving list, verify that tutorial-embedding is in the Running state.

Internal call URL

When calling from another Lab or Serving, use port 80 of the Service that NuFi created.

http://tutorial-embedding.<project-namespace>.svc:80

This URL is used as OPENAI_API_BASE (EMBEDDING_BASE_URL) in the indexing script in step 3. Load documents into the Vector DB from the Lab below.

3. Load documents into the Vector DB from the Lab

Get the Vector DB connection info (URL, API Key, Collection, etc.) from the administrator. Examples in this tutorial use Qdrant.

Install the required packages in the Lab terminal.

pip install langchain langchain-community langchain-openai langchain-qdrant qdrant-client

Use the script below to perform document loading, chunking, embedding, and Vector DB loading with LangChain. The actual question-answering is handled by the RAG backend Serving deployed later.

Change PROJECT_NAMESPACE and VECTOR_DB_URL to match your environment. If the Vector DB requires an API key, add api_key="..." to QdrantClient.

from pathlib import Path

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

PROJECT_NAMESPACE = "<project-namespace>"  # Change to your project namespace (visible in the project namespace area of the top header)
DOCS_DIR = Path("/data/docs")

EMBEDDING_BASE_URL = f"http://tutorial-embedding.{PROJECT_NAMESPACE}.svc:80/v1"
EMBEDDING_MODEL = "intfloat/multilingual-e5-base"

VECTOR_DB_URL = "http://qdrant.vector-db.svc:6333"
COLLECTION = "nufi-rag-tutorial"
VECTOR_SIZE = 768

qdrant = QdrantClient(url=VECTOR_DB_URL)

if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )

loader = DirectoryLoader(
    str(DOCS_DIR),
    glob="**/*",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)
docs = loader.load()

chunks = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=80,
).split_documents(docs)

embeddings = OpenAIEmbeddings(
    model=EMBEDDING_MODEL,
    base_url=EMBEDDING_BASE_URL,
    api_key="unused",
)

vector_store = QdrantVectorStore(
    client=qdrant,
    collection_name=COLLECTION,
    embedding=embeddings,
)
vector_store.add_documents(chunks)

print(f"indexed {len(chunks)} chunks into {COLLECTION}")

You can find the PROJECT_NAMESPACE value in the project namespace area of the top header.

Project Namespace location

On a successful run, you'll see the message indexed ... chunks into nufi-rag-tutorial.

4. Inspect RAG backend and UI code (optional)

The RAG backend is a FastAPI server that takes the user's question, searches the Vector DB, and calls the vLLM Serving with the results inserted into the prompt. Since it is internal-call only inside the cluster, it doesn't follow the OpenAI spec and uses a custom spec.

The Streamlit UI calls the RAG backend and renders the answer and sources. Because the RAG backend is not exposed externally, the UI serves as the user touch point.

Pre-built images provided

This tutorial provides pre-built images. You can proceed to the next step without building yourself.

registry.dudaji.com/nufi/tutorial-rag-backend:0.1.0
registry.dudaji.com/nufi/tutorial-rag-ui:0.1.0

API spec

POST /query

Request: { "question": str, "top_k": int(optional, default 4) }

Response:

{
  "answer": "...",
  "sources": [
    { "title": "company-guide.md", "score": 0.82, "snippet": "...", "uri": "/data/docs/company-guide.md" }
  ],
  "latency_ms": 1234
}

GET /healthz — health check

If you want to inspect the image's source or build and test it yourself, expand below.

🐍 Full RAG backend (FastAPI) code

Directory layout

/data/backend/
  ├─ app.py
  ├─ requirements.txt
  └─ Dockerfile

app.py

import os
import time
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from openai import OpenAI
from qdrant_client import QdrantClient

LLM_BASE_URL = os.environ["LLM_BASE_URL"]
LLM_MODEL = os.environ["LLM_MODEL"]
EMBEDDING_BASE_URL = os.environ["EMBEDDING_BASE_URL"]
EMBEDDING_MODEL = os.environ["EMBEDDING_MODEL"]
VECTOR_DB_URL = os.environ["VECTOR_DB_URL"]
VECTOR_DB_API_KEY = os.environ.get("VECTOR_DB_API_KEY", "")
COLLECTION = os.environ["VECTOR_COLLECTION"]
DEFAULT_TOP_K = int(os.environ.get("TOP_K", "4"))

llm = OpenAI(base_url=LLM_BASE_URL, api_key="unused")
embedder = OpenAI(base_url=EMBEDDING_BASE_URL, api_key="unused")
qdrant = QdrantClient(url=VECTOR_DB_URL, api_key=VECTOR_DB_API_KEY or None)

SYSTEM_PROMPT = """You are an assistant that answers based on internal company documents.
Use only the given documents to answer, and if there is no basis, say "Not found in the documents." """


class QueryRequest(BaseModel):
    question: str
    top_k: int | None = None


class Source(BaseModel):
    title: str
    score: float
    snippet: str
    uri: str | None = None


class QueryResponse(BaseModel):
    answer: str
    sources: List[Source]
    latency_ms: int


app = FastAPI()


@app.get("/healthz")
def healthz():
    return {"ok": True}


@app.post("/query", response_model=QueryResponse)
def query(req: QueryRequest):
    started = time.time()
    top_k = req.top_k or DEFAULT_TOP_K

    emb = embedder.embeddings.create(model=EMBEDDING_MODEL, input=req.question)
    qvec = emb.data[0].embedding

    hits = qdrant.search(
        collection_name=COLLECTION,
        query_vector=qvec,
        limit=top_k,
        with_payload=True,
    )

    sources: List[Source] = []
    context_blocks: List[str] = []
    for h in hits:
        payload = h.payload or {}
        text = payload.get("page_content", "")
        meta = payload.get("metadata", {}) or {}
        title = meta.get("source", "unknown")
        sources.append(
            Source(
                title=str(title).split("/")[-1],
                score=float(h.score),
                snippet=text[:240],
                uri=str(title),
            )
        )
        context_blocks.append(f"[{title}]\n{text}")

    context = "\n\n---\n\n".join(context_blocks) if context_blocks else "(no search results)"
    user_prompt = f"Documents:\n{context}\n\nQuestion: {req.question}"

    chat = llm.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
    )

    answer = chat.choices[0].message.content or ""
    latency_ms = int((time.time() - started) * 1000)
    return QueryResponse(answer=answer, sources=sources, latency_ms=latency_ms)

requirements.txt

fastapi==0.115.0
uvicorn[standard]==0.30.6
openai==1.51.0
httpx==0.27.2
qdrant-client==1.11.3
pydantic==2.9.2

Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build and push directly

cd /data/backend
docker build -t <registry>/tutorial-rag-backend:0.1.0 .
docker push <registry>/tutorial-rag-backend:0.1.0

🐍 Full Streamlit UI code

Directory layout

/data/ui/
  ├─ app.py
  ├─ requirements.txt
  └─ Dockerfile

app.py

import os
import requests
import streamlit as st

BACKEND_URL = os.environ["BACKEND_URL"]  # e.g., http://tutorial-rag-backend.<ns>.svc:80

st.set_page_config(page_title="NuFi RAG Chatbot", page_icon="📚")
st.title("📚 NuFi RAG Chatbot")

if "history" not in st.session_state:
    st.session_state.history = []

for turn in st.session_state.history:
    with st.chat_message(turn["role"]):
        st.markdown(turn["content"])
        if turn["role"] == "assistant" and turn.get("sources"):
            with st.expander("Sources"):
                for s in turn["sources"]:
                    st.markdown(f"**{s['title']}**  (score: {s['score']:.3f})")
                    st.caption(s["snippet"])

question = st.chat_input("Ask a question about the documents")
if question:
    st.session_state.history.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    with st.chat_message("assistant"):
        with st.spinner("Searching and generating response..."):
            resp = requests.post(
                f"{BACKEND_URL}/query",
                json={"question": question},
                timeout=120,
            )
            resp.raise_for_status()
            data = resp.json()

        st.markdown(data["answer"])
        if data.get("sources"):
            with st.expander("Sources"):
                for s in data["sources"]:
                    st.markdown(f"**{s['title']}**  (score: {s['score']:.3f})")
                    st.caption(s["snippet"])

    st.session_state.history.append({
        "role": "assistant",
        "content": data["answer"],
        "sources": data.get("sources", []),
    })

requirements.txt

streamlit==1.39.0
requests==2.32.3

Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .

EXPOSE 8501
CMD ["streamlit", "run", "app.py", \
     "--server.port=8501", \
     "--server.address=0.0.0.0", \
     "--server.enableCORS=false", \
     "--server.enableXsrfProtection=false", \
     "--server.enableWebsocketCompression=false", \
     "--browser.gatherUsageStats=false"]

Build and push directly

cd /data/ui
docker build -t <registry>/tutorial-rag-ui:0.1.0 .
docker push <registry>/tutorial-rag-ui:0.1.0

5. Deploy the RAG backend Serving

Go to Development > Serving > Create.

Step 1 — Basic information

Field	Value
Service Name	`tutorial-rag-backend`
Description	`RAG backend (internal)`
Service Template	`Custom`

Step 2 — Detailed settings

Field	Value
Image	`registry.dudaji.com/nufi/tutorial-rag-backend:0.1.0`
CPU	`1`
Memory	`2`
Accelerator Type	`None`
Replicas	`1`
Inference Port	`8000`

Environment Variables

LLM_BASE_URL=http://qwen-instruct-tutorial-v1-original.rag.svc:80/v1
LLM_MODEL=/mnt/rag-volume/Qwen2.5-0.5B-Instruct
EMBEDDING_BASE_URL=http://tutorial-embedding.rag.svc:80/v1
EMBEDDING_MODEL=intfloat/multilingual-e5-base
VECTOR_DB_URL=http://qdrant.vector-db.svc:6333
VECTOR_COLLECTION=nufi-rag-tutorial
TOP_K=4

LLM_BASE_URL — form http://<llm-service-name>.<project-namespace>.svc:80/v1. Internal call URL for the LLM Serving in use. Replace <llm-service-name> with the LLM Serving name and <project-namespace> with the current project namespace.
LLM_MODEL — the model ID passed to vLLM via --model. Copy the value after --model from Command/Args on the Serving detail page (when --served-model-name is not used).
EMBEDDING_BASE_URL — form http://tutorial-embedding.<project-namespace>.svc:80/v1. Internal call URL of the embedding Serving deployed in step 2. Replace <project-namespace> with the current project namespace.
EMBEDDING_MODEL — the model ID used by the embedding Serving.
VECTOR_DB_URL — Vector DB connection URL. Adjust namespace/port if different. If an API key is needed, add VECTOR_DB_API_KEY.
VECTOR_COLLECTION — must match the collection name created by the indexing script in step 3.
TOP_K — number of top chunks to retrieve.

After deployment, wait until tutorial-rag-backend reaches Running. The external URL is not used — calls happen from within the same namespace only.

6. Deploy the Streamlit UI Serving

Go to Development > Serving > Create.

Step 1 — Basic information

Field	Value
Service Name	`tutorial-rag-ui`
Description	`RAG chatbot UI`
Service Template	`Custom`

Step 2 — Detailed settings

Field	Value
Image	`registry.dudaji.com/nufi/tutorial-rag-ui:0.1.0`
CPU	`1`
Memory	`2`
Accelerator Type	`None`
Replicas	`1`
Inference Port	`8501`

Environment Variables

BACKEND_URL=http://tutorial-rag-backend.rag.svc:80

BACKEND_URL — form http://tutorial-rag-backend.<project-namespace>.svc:80. Internal call URL of the RAG backend Serving deployed in step 5. Replace <project-namespace> with the same project namespace as the backend.

After deployment, when it reaches Running, check the Connect URL on the Serving detail page.

7. Check the RAG response

In the Serving list, find the tutorial-rag-ui Serving and press the Connect button.

RAG Serving list

When the chat screen opens, ask a question related to the documents you uploaded in step 1 to get an answer and the sources.

RAG chat screen

When working, the response is shown along with the Sources panel listing the uploaded document name (e.g., company-guide.md — the document name you uploaded earlier), score, and snippet.

Response quality

The LLM deployed in this tutorial is Qwen2.5-0.5B-Instruct. As a 0.5B-class small model, even when grounded on RAG-retrieved documents the answers may be inaccurate or inconsistent. For more accurate responses, swap the LLM Serving to a larger model.

Composition​

Prerequisites​

1. Upload documents to the Lab​

2. Deploy the Embedding Serving​

Step 1 — Basic information​

Step 2 — Detailed settings​

3. Load documents into the Vector DB from the Lab​

4. Inspect RAG backend and UI code (optional)​

API spec​

5. Deploy the RAG backend Serving​

Step 1 — Basic information​

Step 2 — Detailed settings​

Environment Variables​

6. Deploy the Streamlit UI Serving​

Step 1 — Basic information​

Step 2 — Detailed settings​

Environment Variables​

7. Check the RAG response​

Composition

Prerequisites

1. Upload documents to the Lab

2. Deploy the Embedding Serving

Step 1 — Basic information

Step 2 — Detailed settings

3. Load documents into the Vector DB from the Lab

4. Inspect RAG backend and UI code (optional)

API spec

5. Deploy the RAG backend Serving

Step 1 — Basic information

Step 2 — Detailed settings

Environment Variables

6. Deploy the Streamlit UI Serving

Step 1 — Basic information

Step 2 — Detailed settings

Environment Variables

7. Check the RAG response