Runbooks

Common ops procedures. Each is copy-paste ready — run top to bottom.

Wipe DB (fresh-start testing)

When you want to reset to a clean state for testing:

# 1. Backup first (always)
ssh jaeger "docker exec lumen-postgres pg_dump -U lumen -d lumen \
  | gzip > /tmp/lumen-backup-$(date +%Y%m%d-%H%M%S).sql.gz"

# 2. Wipe data (keep schema + llm_providers config)
ssh jaeger "docker exec -i lumen-postgres psql -U lumen -d lumen" <<'SQL'
BEGIN;
TRUNCATE TABLE
  audit_log, chunks, conversations, messages,
  documents, group_members, groups, project_grants,
  project_memories, project_members, projects,
  share_requests, share_tokens, users, departments
RESTART IDENTITY CASCADE;
UPDATE bootstrap_state SET is_initialized = false;
COMMIT;
SQL

# 3. Clear uploads
ssh jaeger "docker exec lumen-api sh -c 'rm -rf /data/uploads/*'"

# 4. Verify
curl https://lumen-api.zenmail.my.id/bootstrap/status
# Expect: {"isInitialized":false}

Then visit https://ai-kb.zenmail.my.id; the middleware should redirect to /onboarding/first-login.
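Step 2 cannot be undone, so it can be worth confirming that the backup from step 1 is usable before running it. Because the dump is piped through gzip, a failed pg_dump still leaves a small, valid, but empty archive behind. A minimal sketch, assuming it runs wherever the backup file lives (on jaeger, or locally after copying the file down); check_backup is a hypothetical helper name, not something in the repo.

```shell
# Hypothetical helper: succeed only if the file exists, is a valid gzip
# stream, and decompresses to at least one non-empty line.
check_backup() {
  [ -s "$1" ] || { echo "missing or empty file: $1" >&2; return 1; }
  gzip -t "$1" 2>/dev/null || { echo "not a valid gzip stream: $1" >&2; return 1; }
  gzip -dc "$1" | grep -q . || { echo "dump decompresses to nothing: $1" >&2; return 1; }
  echo "backup looks usable: $1"
}
```

Call it with the path of the /tmp/lumen-backup-*.sql.gz file written in step 1, and only continue to step 2 if it succeeds.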

Fresh deploy from scratch

# 1. Ensure repo is clean
cd ~/ai-knowledge-base
git status
git pull origin master

# 2. Trigger redeploy via Dokploy API
# (assumes a valid Dokploy session cookie at /tmp/dk_cookie on jaeger)
ssh jaeger "curl -s -b /tmp/dk_cookie -X POST \
  http://localhost:3000/api/trpc/compose.redeploy \
  -H 'Content-Type: application/json' \
  -d '{\"json\":{\"composeId\":\"-lHrFrWxG8415d10zZ3j0\"}}'"

# 3. Watch
watch -n 5 "ssh jaeger 'docker ps --filter name=lumen --format \"{{.Names}}\\t{{.Status}}\"'"

Expect every service to go from Up X seconds to Up X minutes and keep counting. A status that keeps resetting to a few seconds means the container is restart-looping.
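The watch step can also be scripted so that it exits once everything is up. A rough sketch: all_up reads the same Names/Status lines as the watch command and succeeds only if there is at least one container and every status starts with "Up". The function names and the 3-minute default are assumptions for illustration; docker ps -a is used so that exited containers show up too.

```shell
# Read "name<TAB>status" lines on stdin; succeed only if there is at least
# one container and every status begins with "Up".
all_up() {
  awk -F '\t' '
    { seen = 1 }
    $2 !~ /^Up / { bad = 1; print "not up: " $0 > "/dev/stderr" }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Poll every 5 seconds until all lumen containers are Up or tries run out.
wait_for_stack() {
  tries=${1:-36}   # 36 x 5 s = 3 minutes (arbitrary default)
  while [ "$tries" -gt 0 ]; do
    if ssh -n jaeger "docker ps -a --filter name=lumen --format '{{.Names}}\t{{.Status}}'" | all_up; then
      echo "all lumen containers are Up"
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  echo "timed out waiting for containers" >&2
  return 1
}
```

Note that "Up" right after triggering the redeploy may still describe the old containers, so check that the uptime has reset before trusting the result.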

Rotate JWT secret

Warning: invalidates every access + refresh token. Every user will have to re-login.

# Generate new secret and print it so it can be pasted into Dokploy
NEW_SECRET=$(openssl rand -base64 32)
echo "$NEW_SECRET"

# Update in Dokploy env via web UI:
#   Apps → lumen-api → Environment → JWT_SECRET = <new value>

# Trigger redeploy so the API picks up the new secret
# (same composeId as the full redeploy above, so the whole stack restarts)
ssh jaeger "curl -s -b /tmp/dk_cookie -X POST \
  http://localhost:3000/api/trpc/compose.redeploy \
  -H 'Content-Type: application/json' \
  -d '{\"json\":{\"composeId\":\"-lHrFrWxG8415d10zZ3j0\"}}'"
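Once the API is back up, a token issued before the rotation should no longer be accepted. A minimal check, assuming the API answers 401 to a token signed with the old secret (not confirmed on this page) and reusing the /projects endpoint from the smoke test. OLD_TOKEN stands for an access token captured before the rotation; the function names are made up for this sketch.

```shell
API=https://lumen-api.zenmail.my.id

# Print only the HTTP status code returned for a given bearer token.
status_for_token() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $1" "$API/projects"
}

# Interpret the status: 401 means the old token was rejected, as intended.
rotation_verdict() {
  case "$1" in
    401) echo "old token rejected: rotation took effect" ;;
    2??) echo "old token still accepted: API may not be running with the new secret"
         return 1 ;;
    *)   echo "unexpected status $1: check docker logs lumen-api"
         return 1 ;;
  esac
}

# Usage, with OLD_TOKEN captured before the rotation:
#   rotation_verdict "$(status_for_token "$OLD_TOKEN")"
```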

Revert a bad commit

# Find the bad commit
git log --oneline -10

# Create a revert
git revert <sha> --no-edit

# Push — Dokploy auto-deploys
git push origin master

If the frontend is completely broken, users will keep seeing the cached Next.js output until Dokploy finishes rebuilding (about 3-6 min).

Clear Redis queue

ssh jaeger "docker exec lumen-redis redis-cli FLUSHDB"

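Any in-flight document processing jobs will be lost, and the affected documents will need to be re-uploaded. FLUSHDB deletes every key in the selected Redis database, not only the queue. Do NOT run this while users are uploading.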
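Since the flush is destructive, a small guard that shows the key count and asks for typed confirmation can help avoid running it by accident. This is a sketch, not an existing script: the function names are hypothetical, and DBSIZE is used only to show how many keys would be deleted.

```shell
# Succeed only if the line read from stdin is exactly "FLUSHDB".
typed_confirmation() {
  IFS= read -r reply || return 1
  [ "$reply" = "FLUSHDB" ]
}

# Show how many keys would be deleted, then flush only after confirmation.
flush_redis_guarded() {
  # ssh -n keeps ssh from swallowing the confirmation typed below.
  keys=$(ssh -n jaeger "docker exec lumen-redis redis-cli DBSIZE") || return 1
  echo "key count in lumen-redis: $keys"
  printf 'Type FLUSHDB to delete them all: '
  if typed_confirmation; then
    ssh -n jaeger "docker exec lumen-redis redis-cli FLUSHDB"
  else
    echo "aborted, nothing deleted" >&2
    return 1
  fi
}
```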

Force re-embed all documents

When the embedder model changes (e.g. upgrade from all-MiniLM-L6-v2 to multilingual-e5-small):

# 1. Delete existing chunks
ssh jaeger "docker exec -i lumen-postgres psql -U lumen -d lumen" <<'SQL'
BEGIN;
TRUNCATE TABLE chunks RESTART IDENTITY;
UPDATE documents SET status = 'pending' WHERE status = 'indexed';
COMMIT;
SQL

# 2. Re-enqueue all documents via the script
cd ~/ai-knowledge-base/scripts
python embed_chunks.py --all
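To follow progress, the documents table can be grouped by status. The sketch below assumes re-embedding moves each document from 'pending' back to 'indexed' (the two statuses used in step 1); any intermediate statuses the worker may use are not known here and are simply printed. The function names are hypothetical.

```shell
# Read "status|count" rows (psql -At output) on stdin, print them, and
# succeed only when no documents are left in 'pending'.
reembed_done() {
  awk -F '|' '
    { print "  " $1 ": " $2 }
    $1 == "pending" && $2 > 0 { left = $2 }
    END { exit (left ? 1 : 0) }
  '
}

# Query the live database and summarize re-embedding progress.
reembed_status() {
  ssh -n jaeger "docker exec lumen-postgres psql -U lumen -d lumen -At -c \
    \"SELECT status, count(*) FROM documents GROUP BY status ORDER BY status;\"" \
    | reembed_done
}
```

Running reembed_status every minute or so, until it succeeds, gives a rough completion signal.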

Check a user's access

ssh jaeger "docker exec lumen-postgres psql -U lumen -d lumen -c \
  \"SELECT id, email, platform_role, org_position, department_id \
    FROM users WHERE email = 'foo@bar.com';\""

To compute their full access tier on a project, look at:

Their platform_role (admin/engineer/superadmin → "full" on everything)
Their org_position (ceo → "use" on everything)
projects.owner_id = user.id → "full"
Rows in project_grants matching the user, one of their groups, or their department
projects.is_private: if false, everyone gets "use" as a baseline

See Resolver priority.
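Two of the inputs above, ownership and is_private, can be read in one query. The sketch below feeds the SQL on stdin, as the wipe procedure does, which avoids the nested quoting of the -c form. It assumes projects has an id column (not confirmed on this page) and leaves out project_grants because its column names are not shown here. sql_quote and project_inputs_for are hypothetical helper names.

```shell
# Double any single quotes so the value is safe inside a SQL string literal.
sql_quote() {
  printf '%s' "$1" | sed "s/'/''/g"
}

# For every project, print whether the user owns it and whether it is private.
# Assumption: projects.id exists; owner_id and is_private come from the list above.
project_inputs_for() {
  email=$(sql_quote "$1")
  ssh jaeger "docker exec -i lumen-postgres psql -U lumen -d lumen" <<SQL
SELECT p.id, (p.owner_id = u.id) AS is_owner, p.is_private
FROM projects p CROSS JOIN users u
WHERE u.email = '$email'
ORDER BY is_owner DESC;
SQL
}

# Usage:
#   project_inputs_for 'foo@bar.com'
```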

Tail logs during a deploy

# API logs
ssh jaeger "docker logs -f --tail 50 lumen-api"

# Web logs
ssh jaeger "docker logs -f --tail 50 lumen-web"

# Worker logs (document processing)
ssh jaeger "docker logs -f --tail 50 lumen-worker"

# Embedder (model loading)
ssh jaeger "docker logs -f --tail 50 lumen-embedder"
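When following four streams at once is too noisy, a one-shot scan of recent logs for error-looking lines can be easier to read. A sketch under assumptions: the error|fatal|exception pattern is a guess at what the services log, and the function names are invented here. docker logs --since limits output to a recent time window.

```shell
# Keep only lines that look like errors (case-insensitive) and tag them
# with the service name given as $1.
errors_only() {
  grep -iE 'error|fatal|exception' | sed "s/^/[$1] /"
}

# Scan the last N minutes of all four containers (default: 10m).
scan_logs() {
  since=${1:-10m}
  for svc in api web worker embedder; do
    ssh -n jaeger "docker logs --since $since lumen-$svc" 2>&1 | errors_only "$svc"
  done
  return 0
}
```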

"Authentication Fails" on chat

Classic symptom: chat returns "Error: Failed to generate response" and API logs show:

[LLM] API error: 401 Authentication Fails (auth header format should be Bearer sk-...)
[LLM] resolveProvider -> url=... keyPrefix=(empty) keyLen=0

The resolver couldn't find an API key. Causes:

No LLM provider configured at all → add one at /engineer/providers
Provider not marked is_default = true AND chat sent a bare model name
Bash escaping corrupted an API key in a DB UPDATE (happened during issue #3)

Fix: check llm_providers.api_key actually starts with sk- (or similar prefix). If corrupted, re-save via the Edit provider modal.

ssh jaeger "docker exec lumen-postgres psql -U lumen -d lumen -c \
  \"SELECT name, substring(api_key, 1, 10) AS prefix, length(api_key) AS len \
    FROM llm_providers;\""

Good: prefix=sk-abc1234..., len=51. Bad: an empty prefix with len=0, or a short, odd-looking prefix.
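The same check can be turned into a pass/fail filter over psql's unaligned output. This is only a sketch: the 20-character minimum is an arbitrary threshold chosen for illustration, not a rule of any provider, and the function names are hypothetical.

```shell
# Read "name|prefix|len" rows (psql -At output) and report suspicious keys.
# MIN_LEN defaults to 20, an arbitrary threshold for this sketch.
flag_bad_keys() {
  awk -F '|' -v min="${MIN_LEN:-20}" '
    $3 + 0 == 0  { print "EMPTY " $1; bad = 1; next }
    $3 + 0 < min { print "SHORT " $1 " (len=" $3 ")"; bad = 1; next }
                 { print "ok " $1 " (prefix=" $2 ", len=" $3 ")" }
    END { exit (bad ? 1 : 0) }
  '
}

# Run the query from above in unaligned mode and flag bad rows.
check_provider_keys() {
  ssh -n jaeger "docker exec lumen-postgres psql -U lumen -d lumen -At -c \
    \"SELECT name, substring(api_key, 1, 10), length(api_key) FROM llm_providers;\"" \
    | flag_bad_keys
}
```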

Smoke test the whole stack

# Run from your laptop (prod)
TOKEN=$(curl -s -X POST "https://lumen-api.zenmail.my.id/auth/login" \
  -H 'Content-Type: application/json' \
  -d '{"email":"...","password":"..."}' | jq -r .accessToken)

echo "=== bootstrap ==="
curl -s "https://lumen-api.zenmail.my.id/bootstrap/status"

echo "=== projects ==="
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://lumen-api.zenmail.my.id/projects" | jq '.projects | length'

echo "=== providers ==="
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://lumen-api.zenmail.my.id/providers" | jq '.providers | length'

echo "=== chat config ==="
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://lumen-api.zenmail.my.id/chat/config"

All four should return non-error JSON. If any returns {error: ...} or a 5xx status, start with docker logs lumen-api.
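For a quick pass/fail view, the same four endpoints can be checked by status code alone. A sketch, assuming TOKEN is set as in the login step and that sending the Authorization header to /bootstrap/status is harmless; is_2xx and smoke are hypothetical names. A 200 response whose body is {error: ...} would still pass here, so this complements reading the JSON rather than replacing it.

```shell
API=https://lumen-api.zenmail.my.id

# Succeed for any 2xx status code.
is_2xx() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Request each endpoint and print PASS/FAIL with its status code.
# curl prints 000 when the connection itself fails.
smoke() {
  failed=0
  for path in /bootstrap/status /projects /providers /chat/config; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $TOKEN" "$API$path")
    if is_2xx "$code"; then
      echo "PASS $code $path"
    else
      echo "FAIL $code $path"
      failed=1
    fi
  done
  return "$failed"
}
```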