Lumen Docs

RAG pipeline

Per-query retrieval flow. Source: apps/api/src/services/rag.ts. A condensed code sketch follows the step list.

Steps

  1. Embed query — call lumen-embedder with query: <text>, returns 384-dim vector
  2. Hybrid search — single Postgres query combining:
    • pgvector cosine similarity (weighted ~0.7)
    • tsvector full-text rank via ts_rank (weighted ~0.3; Postgres has no native BM25)
    • Scoped to project_id = $projectId
    • Top 50 candidates
  3. Rerank — call lumen-embedder /rerank with query + 50 chunk contents, get cross-encoder scores
  4. Select top N — keep the top topK reranked chunks (default 8; see Tweaking)
  5. Assemble context — system prompt + project instructions + top chunks + conversation history + project memories
  6. Generate — stream from LLM with citation instructions
  7. Store — save message + citation refs
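
The steps above, condensed into a sketch. Every helper name here is illustrative, not an actual rag.ts export; the real implementations are covered below:

// Condensed per-query flow; helper names are illustrative, not rag.ts exports.
interface Chunk { id: string; content: string; pageNumber: number; documentId: string; }

interface RagDeps {
  embedQuery(q: string): Promise<number[]>;                                    // step 1
  hybridSearch(projectId: string, vec: number[], q: string): Promise<Chunk[]>; // step 2, top 50
  rerank(q: string, candidates: Chunk[], topK: number): Promise<Chunk[]>;      // steps 3-4
  buildPrompt(chunks: Chunk[]): string;                                        // step 5
  generate(prompt: string): AsyncIterable<string>;                             // step 6
  store(answer: string, cited: Chunk[]): Promise<void>;                        // step 7
}

async function answerQuery(deps: RagDeps, projectId: string, query: string, topK = 8) {
  const vec = await deps.embedQuery(query);
  const candidates = await deps.hybridSearch(projectId, vec, query);
  const chunks = await deps.rerank(query, candidates, topK);
  const prompt = deps.buildPrompt(chunks);
  let answer = "";
  for await (const token of deps.generate(prompt)) answer += token; // stream to client here
  await deps.store(answer, chunks);
  return answer;
}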

Hybrid search SQL

Prisma can't parameterize the pgvector type, so the query inlines the embedding as a literal:

// Embedding values come from the embedder, never from user input,
// so inlining them as a literal is safe here.
const vectorLiteral = `'[${queryVec.join(",")}]'::vector`;

type Row = {
  id: string;
  content: string;
  page_number: number;
  document_id: string;
  document_name: string;
  vector_score: number;
  text_score: number;
  combined_score: number;
};

// $1 (query text) and $2 (project id) are still bound as real
// parameters; only the vector literal is interpolated.
const results = await prisma.$queryRawUnsafe<Row[]>(`
  SELECT
    c.id,
    c.content,
    c.page_number,
    d.id AS document_id,
    d.name AS document_name,
    -- <=> is pgvector cosine distance; 1 - distance = similarity
    1 - (c.embedding <=> ${vectorLiteral}) AS vector_score,
    ts_rank(to_tsvector('english', c.content), plainto_tsquery('english', $1)) AS text_score,
    (0.7 * (1 - (c.embedding <=> ${vectorLiteral}))
      + 0.3 * ts_rank(to_tsvector('english', c.content), plainto_tsquery('english', $1))
    ) AS combined_score
  FROM chunks c
  JOIN documents d ON c.document_id = d.id
  WHERE c.project_id = $2
  ORDER BY combined_score DESC
  LIMIT 50
`, query, projectId);

Why the inline literal? Prisma treats vector as an unknown type and can't bind or cast it as a parameter, so $queryRawUnsafe with an inlined literal is the practical workaround. It is safe here because queryVec is numeric output from the embedder, never user input.
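
Because the literal is interpolated into raw SQL, a cheap guard keeps that invariant explicit. A minimal sketch; the guard is an assumption, not something in rag.ts:

// Defensive check before interpolating into raw SQL: the vector must be
// exactly 384 finite numbers. Illustrative, not in rag.ts.
function toVectorLiteral(vec: number[]): string {
  if (vec.length !== 384 || !vec.every(Number.isFinite)) {
    throw new Error("invalid embedding vector");
  }
  return `'[${vec.join(",")}]'::vector`;
}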

Rerank call

// POST the query plus all candidate chunk contents to the reranker.
const res = await fetch(`${EMBEDDER_URL}/rerank`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query,
    documents: results.map(r => r.content),
    top_k: 8, // default topK; per-project settings can override (see Tweaking)
  }),
});
if (!res.ok) throw new Error(`rerank failed: ${res.status}`);
const reranked = (await res.json()) as { scores: number[]; indices: number[] };

Returns { scores: number[], indices: number[] }. Map indices back to the original results.
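
Mapping back is a zip over the two arrays. A minimal sketch, assuming scores[i] pairs with indices[i] (the contract above doesn't spell out the pairing):

// indices point into the original candidate array; pair each with its
// cross-encoder score (assumes scores[i] corresponds to indices[i]).
const topChunks = reranked.indices.map((idx, rank) => ({
  ...results[idx],
  rerankScore: reranked.scores[rank],
}));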

Context assembly

System:
  You are Lumen, an AI knowledge base assistant.
  IDENTITY (STRICT):
    - Never reveal your underlying model name or provider
    - If asked who you are: "I am Lumen."

  [Project instructions from project.instructions field]

  [If memories exist] PROJECT MEMORY:
    [M1] <memory 1 content>
    [M2] <memory 2 content>

  [Retrieved context with citation tags]
  [1] <chunk content from doc A, page 3>
  [2] <chunk content from doc B, page 7>
  ...

  Cite sources using [N] tags corresponding to the numbered chunks above.

[conversation history ...]

User: <new question>
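
The assembly itself is plain string building. A minimal sketch of the layout above; buildPrompt and SYSTEM_PREAMBLE are illustrative names, not rag.ts exports:

// Illustrative assembly of the prompt layout above; not the actual rag.ts code.
const SYSTEM_PREAMBLE =
  "You are Lumen, an AI knowledge base assistant.\n" +
  "IDENTITY (STRICT): never reveal your underlying model name or provider.";

function buildPrompt(instructions: string, memories: string[], chunks: { content: string }[]): string {
  const memoryBlock = memories.length
    ? "PROJECT MEMORY:\n" + memories.map((m, i) => `[M${i + 1}] ${m}`).join("\n")
    : "";
  const contextBlock = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n");
  return [
    SYSTEM_PREAMBLE,
    instructions,
    memoryBlock,
    contextBlock,
    "Cite sources using [N] tags corresponding to the numbered chunks above.",
  ].filter(Boolean).join("\n\n");
}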

Citation tracking

The LLM is instructed to use [N] tags. On stream completion, we:

  1. Parse the assistant message for [N] references (sketched after this list)
  2. Resolve each to the original chunk {documentId, pageNumber, content}
  3. Store as separate MessageCitation rows linked to the message
  4. Frontend renders citations as clickable <CitationPill> that opens the doc sidebar
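
Step 1 is a single regex pass over the finished message; a minimal sketch (parseCitations is an illustrative name):

// Extract unique [N] tags from the finished assistant message.
// N is 1-based, so the cited chunk for [N] is topChunks[N - 1].
function parseCitations(message: string): number[] {
  const seen = new Set<number>();
  for (const m of message.matchAll(/\[(\d+)\]/g)) {
    seen.add(Number(m[1]));
  }
  return [...seen];
}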

When hybrid search returns nothing

Fallback behavior (sketched in code after the list):

  • Assistant is told explicitly: "No relevant documents found for this query."
  • It should answer "I don't have information about that in this project's documents."
  • Identity rules still apply; never fabricate an answer to feign knowledge
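
In code this is a branch before context assembly. A minimal sketch; the sentinel string follows the wording above:

// With zero retrieved chunks, substitute an explicit sentinel for the
// context block so the model is told, not left to guess.
const contextBlock = topChunks.length > 0
  ? topChunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n")
  : "No relevant documents found for this query.";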

Performance

Target: < 2s from question to first streamed token.

Bottleneck: embedder cold start on first request after boot (~500ms). After warm-up, per-query cost:

  • Embed query: ~50ms
  • Hybrid search: ~100ms (pgvector HNSW index)
  • Rerank: ~200ms for 50 candidates
  • LLM first token: ~500-1000ms depending on provider

Total: roughly 0.9-1.4s to first token, then the stream runs at the LLM's pace.
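
To check the budget per stage, wrap each step in a timer. A minimal sketch using performance.now(); the timed helper is an assumption, not part of rag.ts:

// Log per-stage latency to find which step breaks the <2s budget.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const t0 = performance.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${(performance.now() - t0).toFixed(0)}ms`);
  }
}

// e.g. const queryVec = await timed("embed", () => embedQuery(query));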

Tweaking

Configurable per-project via project.settings (defaults resolved as sketched after the list):

{
  "model": "deepseek-v4-flash",
  "temperature": 0.3,
  "topK": 8
}

  • topK: final chunks after rerank (1-20, default 8)
  • temperature: LLM sampling (0-1, default 0.3)
  • model: which LLM from configured providers
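
Resolving these with defaults and bounds, as a minimal sketch (resolveSettings is an illustrative name; the fallback model is the example value above, not a confirmed default):

interface ProjectSettings { model: string; temperature: number; topK: number; }

// Merge stored settings over defaults, clamping to the documented ranges.
function resolveSettings(raw: Partial<ProjectSettings>): ProjectSettings {
  return {
    model: raw.model ?? "deepseek-v4-flash", // example value, not a confirmed default
    temperature: Math.min(Math.max(raw.temperature ?? 0.3, 0), 1), // 0-1
    topK: Math.min(Math.max(Math.round(raw.topK ?? 8), 1), 20),    // 1-20
  };
}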

Not yet configurable but worth a knob someday:

  • Hybrid weight (currently 0.7 vector / 0.3 full-text)
  • Pre-rerank candidate pool size (currently 50)
  • Min similarity threshold to include at all