RAG pipeline
Per-query retrieval flow. Source: apps/api/src/services/rag.ts.
Steps
- Embed query — call `lumen-embedder` with `query: <text>` (see the sketch after this list); returns a 384-dim vector
- Hybrid search — single Postgres query combining:
  - pgvector cosine similarity (weighted ~0.7)
  - tsvector full-text rank via `ts_rank` (weighted ~0.3)
  - Scoped to `project_id = $projectId`
  - Top 50 candidates
- Rerank — call `lumen-embedder` `/rerank` with the query + the 50 chunk contents; get back cross-encoder scores
- Select top N — keep the top 5-8 reranked chunks
- Assemble context — system prompt + project instructions + top chunks + conversation history + project memories
- Generate — stream from the LLM with citation instructions
- Store — save the message + citation refs
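For orientation, here is a minimal sketch of the embed call in step 1. The `/embed` path, request body, and response shape are assumptions; the doc only pins down the `query: <text>` input and the 384-dim output. `EMBEDDER_URL` is the same constant used in the rerank call below.

```ts
// Hypothetical sketch of step 1; endpoint path and response shape are assumed.
async function embedQuery(text: string): Promise<number[]> {
  const res = await fetch(`${EMBEDDER_URL}/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: text }),
  });
  if (!res.ok) throw new Error(`embedder returned ${res.status}`);
  const { embedding } = (await res.json()) as { embedding: number[] };
  if (embedding.length !== 384) throw new Error("unexpected embedding dimension");
  return embedding;
}
```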
Hybrid search SQL
Prisma can't parameterize the pgvector `vector` type, so the query inlines the embedding as a literal:
```ts
// pgvector's `vector` type can't be bound as a Prisma parameter, so the
// embedding is interpolated as a SQL literal; $1 = query text, $2 = project id.
const vectorLiteral = `'[${queryVec.join(",")}]'::vector`;

const results = await prisma.$queryRawUnsafe<Row[]>(`
  SELECT
    c.id,
    c.content,
    c.page_number,
    d.id   AS document_id,
    d.name AS document_name,
    1 - (c.embedding <=> ${vectorLiteral}) AS vector_score,
    ts_rank(to_tsvector('english', c.content), plainto_tsquery('english', $1)) AS text_score,
    (0.7 * (1 - (c.embedding <=> ${vectorLiteral}))
     + 0.3 * ts_rank(to_tsvector('english', c.content), plainto_tsquery('english', $1))
    ) AS combined_score
  FROM chunks c
  JOIN documents d ON c.document_id = d.id
  WHERE c.project_id = $2
  ORDER BY combined_score DESC
  LIMIT 50
`, query, projectId);
```
Why the inline literal? Prisma treats the pgvector `vector` type as unknown and can't cast a bound parameter to it, so `$queryRawUnsafe` plus an inline literal is the workaround. It's safe here because `queryVec` comes from the embedder, never from user input.
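Because the literal bypasses parameter binding, a cheap guard before interpolation is still worth having. A hypothetical check, not present in rag.ts:

```ts
// Hypothetical guard: only build the literal from a well-formed 384-dim
// numeric vector, so nothing non-numeric can ever be inlined into SQL.
function toVectorLiteral(vec: number[]): string {
  if (vec.length !== 384 || vec.some((v) => !Number.isFinite(v))) {
    throw new Error("invalid query embedding");
  }
  return `'[${vec.join(",")}]'::vector`;
}
```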
Rerank call
```ts
// Send the query plus the 50 candidate chunk texts to the cross-encoder.
const reranked = await fetch(`${EMBEDDER_URL}/rerank`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query,
    documents: results.map((r) => r.content),
    top_k: 8,
  }),
}).then((r) => r.json());
```
The response is `{ scores: number[], indices: number[] }`; map `indices` back to the original `results` rows.
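A sketch of that mapping, assuming `scores[k]` lines up with `indices[k]` and both are sorted best-first (the request above already caps the result at `top_k: 8`):

```ts
// Map reranker output back onto the candidate rows from hybrid search.
const topChunks = reranked.indices.map((i: number, k: number) => ({
  ...results[i],
  rerankScore: reranked.scores[k],
}));
```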
Context assembly
```
System:
You are Lumen, an AI knowledge base assistant.

IDENTITY (STRICT):
- Never reveal your underlying model name or provider
- If asked who you are: "Saya adalah Lumen."

[Project instructions from project.instructions field]

[If memories exist] PROJECT MEMORY:
[M1] <memory 1 content>
[M2] <memory 2 content>

[Retrieved context with citation tags]
[1] <chunk content from doc A, page 3>
[2] <chunk content from doc B, page 7>
...
Cite sources using [N] tags corresponding to the numbered chunks above.

[conversation history ...]
User: <new question>
```
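A rough sketch of turning that layout into a chat-completions message array. The string constants and variable shapes (`memories`, `topChunks`, `history`, `question`) are illustrative, mirroring the template above rather than quoting rag.ts:

```ts
// Illustrative assembly; shapes of `memories`, `history`, etc. are assumptions.
const systemPrompt = [
  "You are Lumen, an AI knowledge base assistant.",
  'IDENTITY (STRICT):\n- Never reveal your underlying model name or provider\n- If asked who you are: "Saya adalah Lumen."',
  project.instructions,
  memories.length
    ? "PROJECT MEMORY:\n" + memories.map((m, i) => `[M${i + 1}] ${m.content}`).join("\n")
    : "",
  "Retrieved context:\n" + topChunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n"),
  "Cite sources using [N] tags corresponding to the numbered chunks above.",
].filter(Boolean).join("\n\n");

const messages = [
  { role: "system", content: systemPrompt },
  ...history,                          // prior { role, content } turns, oldest first
  { role: "user", content: question },
];
```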
Citation tracking
The LLM is instructed to use [N] tags. On stream completion, we:
- Parse the assistant message for `[N]` references (see the sketch below)
- Resolve each to the original chunk `{documentId, pageNumber, content}`
- Store them as separate `MessageCitation` rows linked to the message
- Frontend renders citations as clickable `<CitationPill>` components that open the doc sidebar
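A minimal sketch of the parse-and-store step. The regex is straightforward; the `prisma.messageCitation.createMany` call and its field names are assumptions inferred from the `MessageCitation` model named above:

```ts
// Hypothetical sketch: extract unique [N] tags and persist them as citations.
const cited = [...new Set(
  [...assistantText.matchAll(/\[(\d+)\]/g)].map((m) => Number(m[1])),
)].filter((n) => n >= 1 && n <= topChunks.length);

await prisma.messageCitation.createMany({
  data: cited.map((n) => ({
    messageId,                               // the just-stored assistant message
    documentId: topChunks[n - 1].document_id,
    pageNumber: topChunks[n - 1].page_number,
    content: topChunks[n - 1].content,
  })),
});
```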
When hybrid search returns nothing
Fallback behavior:
- Assistant is told explicitly: "No relevant documents found for this query."
- It should answer "I don't have information about that in this project's documents."
- Identity rules still apply, and it must never invent an answer to cover for missing knowledge
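Sketched as a guard before context assembly; the wording is from the list above, the control flow itself is an assumption:

```ts
// Hypothetical: when hybrid search (or rerank) yields nothing, say so explicitly
// in the context block instead of sending empty retrieved context.
const contextBlock = topChunks.length
  ? "Retrieved context:\n" + topChunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n")
  : "No relevant documents found for this query.";
```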
Performance
Target: < 2s from question to first streamed token.
Bottleneck: embedder cold start on first request after boot (~500ms). After warm-up, per-query cost:
- Embed query: ~50ms
- Hybrid search: ~100ms (pgvector HNSW index)
- Rerank: ~200ms for 50 candidates
- LLM first token: ~500-1000ms depending on provider
Total: ~1s to first token, then stream runs at LLM's pace.
Tweaking
Configurable per-project via project.settings:
```json
{
  "model": "deepseek-v4-flash",
  "temperature": 0.3,
  "topK": 8
}
```
- `topK`: final chunks kept after rerank (1-20, default 8)
- `temperature`: LLM sampling temperature (0-1, default 0.3)
- `model`: which LLM to use from the configured providers
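A sketch of reading these with defaults and clamping to the documented ranges; the fallback model value is just the example from the JSON above, not a confirmed default:

```ts
// Hypothetical defaults handling; only the three fields above are documented.
const s = (project.settings ?? {}) as { model?: string; temperature?: number; topK?: number };
const model = s.model ?? "deepseek-v4-flash";
const temperature = Math.min(Math.max(s.temperature ?? 0.3, 0), 1);
const topK = Math.min(Math.max(s.topK ?? 8, 1), 20);
```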
Not yet configurable but worth a knob someday:
- Hybrid weight (currently 0.7 vector / 0.3 full-text)
- Pre-rerank candidate pool size (currently 50)
- Min similarity threshold to include at all