MindRouter Documentation
MindRouter is a production-ready LLM inference load balancer and translation layer that fronts a heterogeneous cluster of Ollama and vLLM inference backends. It provides a unified OpenAI-compatible API surface with native Ollama compatibility, fair-share scheduling, per-user quotas, full audit logging, and real-time GPU telemetry.
Developed by Luke Sheneman, Research Computing and Data Services (RCDS), Institute for Interdisciplinary Data Sciences (IIDS), University of Idaho.
Overview
MindRouter sits between API consumers and GPU inference servers, providing:
- Unified API Gateway — OpenAI-compatible `/v1/*`, Ollama-compatible `/api/*`, and Anthropic-compatible `/anthropic/v1/*` endpoints, all backed by the same pool of inference servers.
- Cross-Engine Routing — A request arriving in OpenAI, Ollama, or Anthropic format can be served by any backend. The translation layer handles all protocol conversion transparently.
- Fair-Share Scheduling — Weighted Deficit Round Robin (WDRR) ensures equitable GPU access across users in different groups with configurable priorities.
- Multi-Modal Support — Text chat, text completion, embeddings, multimodal models, structured JSON outputs, and tool calling (function calling).
- Per-User Quotas — Token budgets, requests-per-minute limits, and concurrent request caps, with defaults inherited from the user's group.
- Full Audit Logging — Every prompt, response, and token count is recorded for compliance and review.
- Real-Time GPU Telemetry — Per-GPU utilization, memory, temperature, and power metrics via lightweight sidecar agents.
- Web Dashboards — Public status page, user self-service dashboard, admin control panel, and built-in chat interface.
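The fair-share discipline named above, Weighted Deficit Round Robin, can be illustrated with a minimal sketch. This is not MindRouter's actual scheduler code; the function, queue layout, and quantum value are assumptions chosen to show how per-user deficit counters and group weights interact:

```python
from collections import deque

def wdrr_pick(queues, deficits, weights, cost, quantum=100):
    """One WDRR pass: each user's deficit grows by weight * quantum,
    and a queued request is dispatched only once the user's deficit
    covers its cost (illustrative sketch, not MindRouter internals)."""
    dispatched = []
    for user, q in queues.items():
        deficits[user] += weights[user] * quantum
        while q and cost(q[0]) <= deficits[user]:
            req = q.popleft()
            deficits[user] -= cost(req)
            dispatched.append((user, req))
    return dispatched

# Two users with the seed-data weights (Admin 10 vs Student 1)
# competing with equal-cost requests: the heavier weight drains first,
# but the lighter user accumulates deficit and is served eventually.
queues = {"admin": deque(["a1", "a2"]), "student1": deque(["s1", "s2"])}
deficits = {"admin": 0, "student1": 0}
weights = {"admin": 10, "student1": 1}
order = wdrr_pick(queues, deficits, weights, cost=lambda r: 500)
```

Because deficits persist between passes, a low-weight user is never starved — it just waits more rounds before its counter covers a request.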
Who It's For
- Research computing centers managing shared GPU clusters for multiple user groups
- Universities providing LLM access to students, staff, and faculty with differentiated quotas
- Organizations needing a unified API gateway across mixed Ollama/vLLM infrastructure
Architecture
MindRouter follows a layered architecture:
Client Request (OpenAI, Ollama, or Anthropic format)
|
v
+-----------------------------+
| API Gateway Layer | ← /v1/*, /api/*, /anthropic/*, /api/admin/*
+-----------------------------+
| Authentication & Quotas | ← API key verification, rate limiting
+-----------------------------+
| Translation Layer | ← OpenAI/Ollama/Anthropic ↔ Canonical ↔ Ollama/vLLM
+-----------------------------+
| Fair-Share Scheduler | ← WDRR with per-user deficit counters
+-----------------------------+
| Backend Registry | ← Health monitoring, model tracking
+-----------------------------+
|
v
+---------------+-------------+
| GPU Node 1 | GPU Node 2 | ...
| +---------+ | +--------+ |
| | Sidecar | | |Sidecar | | ← Per-node GPU metrics agent
| +---------+ | +--------+ |
| | Ollama | | | vLLM | | ← Inference engines
| +---------+ | +--------+ |
+---------------+-------------+
Key concepts:
- A Node is a physical GPU server running a sidecar agent.
- A Backend is an inference endpoint (Ollama or vLLM instance) running on a node. Multiple backends can share a node, each assigned specific GPUs via `gpu_indices`.
Getting Started
Prerequisites
- Docker and Docker Compose
- Python 3.11+ (for local development)
Quickstart with Docker Compose
# 1. Clone and configure
git clone <repository-url>
cd mindrouter
cp .env.example .env
nano .env # Set DATABASE_URL, SECRET_KEY, etc.
# 2. Start all services
docker compose up --build
# 3. Run database migrations
docker compose exec app alembic upgrade head
# 4. Seed development data (creates users, quotas, API keys)
docker compose exec app python scripts/seed_dev_data.py
Default Development Credentials
After running the seed script:
| Username | Password | Group | Scheduler Weight |
|---|---|---|---|
| admin | admin123 | Admin | 10 |
| faculty1 | faculty123 | Faculty | 3 |
| staff1 | staff123 | Staff | 2 |
| student1 | student123 | Student | 1 |
Accessing the Application
| URL | Description |
|---|---|
| http://localhost:8000/ | Public status page |
| http://localhost:8000/dashboard | User dashboard (login required) |
| http://localhost:8000/admin | Admin dashboard (admin group required) |
| http://localhost:8000/chat | Chat interface (login required) |
| http://localhost:8000/docs | Interactive API docs (Swagger UI) |
| http://localhost:8000/redoc | API reference (ReDoc) |
API Reference
Interactive API Documentation
MindRouter includes built-in interactive API documentation powered by FastAPI:
- Swagger UI at `/docs` — Interactive API explorer where you can try endpoints directly from your browser. Supports authentication via the "Authorize" button (enter your API key as a Bearer token).
- ReDoc at `/redoc` — Clean, readable API reference with request/response schemas and examples.
Both are auto-generated from the application's route definitions and Pydantic models, so they always reflect the current API surface.
Authentication
All inference and admin endpoints require authentication. MindRouter supports two methods:
API Key (Bearer Token):
curl -H "Authorization: Bearer mr2_your-api-key" http://localhost:8000/v1/models
API Key (Header):
curl -H "X-API-Key: mr2_your-api-key" http://localhost:8000/v1/models
Session Cookie (dashboard/admin AJAX only): Browser-based dashboard calls authenticate via the mindrouter_session cookie set at login. This is used internally by the web UI and is not intended for programmatic access.
Error Responses
All error responses follow a consistent format:
{
"detail": "Human-readable error message"
}
Common HTTP status codes:
| Code | Meaning |
|---|---|
| 400 | Invalid request body or parameters |
| 401 | Missing or invalid API key |
| 403 | Insufficient permissions (e.g., non-admin accessing admin endpoint) |
| 404 | Resource not found |
| 409 | Conflict (duplicate name, URL, etc.) |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
Error Response Formats by API Style
Error response formats differ depending on which API style the client is using:
- OpenAI (`/v1/*`) — Returns a nested error object: `{"error": {"message": "...", "type": "...", "code": ...}}`
- Ollama (`/api/*`) — Returns a plain detail string: `{"detail": "..."}`
- Anthropic (`/anthropic/v1/*`) — Returns a typed error object: `{"type": "error", "error": {"type": "...", "message": "..."}}`
Model Name Matching
Model names in requests are matched exactly against the model catalog. There is no prefix matching, alias resolution, or fuzzy matching. The model field must match a model name as listed by /v1/models or /api/tags.
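Because matching is exact, a client can fail fast before sending a request by checking the name against the catalog returned by `/v1/models`. The helper below is hypothetical, not part of any SDK:

```python
def resolve_model(requested, catalog):
    """Exact-match lookup mirroring the documented behavior: no prefix
    matching, no aliases, no fuzzy matching (hypothetical client-side
    helper)."""
    if requested not in catalog:
        raise LookupError(f"model not found: {requested!r}")
    return requested

catalog = {"llama3.2", "llama3.3:70b"}
resolve_model("llama3.3:70b", catalog)  # exact name, including the tag: ok
# resolve_model("llama3.3", catalog)    # would raise: the tag is part of the name
```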
OpenAI-Compatible Endpoints
These endpoints accept and return data in the OpenAI API format. Any OpenAI-compatible client or SDK can be pointed at MindRouter by changing the base URL.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /v1/chat/completions | API Key | Chat completions (streaming and non-streaming) |
| POST | /v1/completions | API Key | Text completions (legacy) |
| POST | /v1/embeddings | API Key | Generate embeddings |
| POST | /v1/rerank | API Key | Rerank documents against a query |
| POST | /v1/score | API Key | Score similarity between text pairs |
| GET | /v1/models | API Key | List available models |
Chat Completions
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 500,
"stream": false
}'
Response:
{
"id": "chatcmpl-abc123...",
"object": "chat.completion",
"created": 1700000000,
"model": "llama3.2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 10,
"total_tokens": 35
}
}
Streaming — Set "stream": true to receive Server-Sent Events (SSE):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
Thinking/Reasoning Mode:
MindRouter supports multiple formats for controlling thinking/reasoning on models that support it (qwen3.5, qwen3, gpt-oss):
// gpt-oss: control reasoning depth
{
"model": "openai/gpt-oss-120b",
"messages": [{"role": "user", "content": "Solve this step by step"}],
"reasoning_effort": "high",
"max_completion_tokens": 16384
}
// Qwen-style: toggle thinking on/off
{
"model": "qwen/qwen3.5-400b",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"chat_template_kwargs": {"enable_thinking": true},
"max_completion_tokens": 16384
}
When thinking is enabled, the response includes reasoning_content alongside content.
Use `max_completion_tokens` (or `max_tokens`) to set an adequate budget — 16384 is recommended for qwen3.5-400b with thinking enabled.
Output Token Limits:
MindRouter accepts both max_completion_tokens (preferred, current OpenAI standard) and max_tokens (legacy). If both are provided, max_completion_tokens takes priority.
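The precedence rule can be sketched as a small resolver (illustrative only, not MindRouter's actual code):

```python
def output_token_limit(body, default=None):
    """Resolve the output token cap as documented: max_completion_tokens
    (current OpenAI standard) wins over legacy max_tokens."""
    if body.get("max_completion_tokens") is not None:
        return body["max_completion_tokens"]
    if body.get("max_tokens") is not None:
        return body["max_tokens"]
    return default

output_token_limit({"max_completion_tokens": 16384, "max_tokens": 500})  # 16384
output_token_limit({"max_tokens": 500})                                  # 500
```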
Structured Output (JSON Mode):
{
"model": "llama3.2",
"messages": [{"role": "user", "content": "List 3 colors as JSON"}],
"response_format": {"type": "json_object"}
}
Structured Output (JSON Schema):
{
"model": "llama3.2",
"messages": [{"role": "user", "content": "List 3 colors"}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "colors",
"schema": {
"type": "object",
"properties": {
"colors": {"type": "array", "items": {"type": "string"}}
}
}
}
}
}
Vision (Multimodal):
{
"model": "llava",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
]
}
Tool Calling (Function Calling):
{
"model": "llama3.2",
"messages": [{"role": "user", "content": "What's the weather in Seattle?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}],
"tool_choice": "auto"
}
When the model decides to call a tool, the response includes tool_calls with finish_reason: "tool_calls". Submit the tool result back as a role: "tool" message with the matching tool_call_id.
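The round trip is plain message bookkeeping in the OpenAI chat format. The helper below is a hypothetical sketch — MindRouter does not provide it — showing how the `tool` message is matched to the assistant's `tool_call_id`:

```python
import json

def append_tool_result(messages, assistant_msg, results):
    """Append the assistant's tool_calls message, then one role:"tool"
    message per call, matched by tool_call_id (OpenAI chat format)."""
    messages = messages + [assistant_msg]
    for call in assistant_msg["tool_calls"]:
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(results[call["function"]["name"]]),
        })
    return messages

assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Seattle"}'},
    }],
}
history = [{"role": "user", "content": "What's the weather in Seattle?"}]
history = append_tool_result(history, assistant_msg, {"get_weather": {"temp_f": 57}})
# Send the updated history back to /v1/chat/completions for the final answer.
```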
Embeddings
curl -X POST http://localhost:8000/v1/embeddings \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "nomic-embed-text", "input": "Hello world"}'
Reranking
curl -X POST http://localhost:8000/v1/rerank \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Reranker-8B",
"query": "What is machine learning?",
"documents": [
"Machine learning is a subset of AI.",
"The weather is sunny today.",
"Deep learning uses neural networks."
],
"top_n": 2
}'
Scoring
curl -X POST http://localhost:8000/v1/score \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Reranker-8B",
"text_1": "What is machine learning?",
"text_2": ["Machine learning is a subset of AI.", "The weather is sunny today."]
}'
List Models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer mr2_your-api-key"
Response:
{
"object": "list",
"data": [
{
"id": "llama3.3:70b",
"object": "model",
"created": 1700000000,
"owned_by": "mindrouter",
"capabilities": {
"multimodal": false,
"embeddings": false,
"structured_output": true,
"thinking": false
},
"backends": ["node1-gpu0", "node1-gpu2"],
"context_length": 32768,
"model_max_context": 131072,
"parameter_count": "70B",
"quantization": "Q4_K_M",
"family": "llama"
}
]
}
Fields include:
- `context_length` — effective context window (`num_ctx` injected per request)
- `model_max_context` — architectural maximum context the model supports
- `parameter_count` — model size (e.g. "7B", "70B")
- `quantization` — quantization level (e.g. "Q4_K_M", "FP16")
- `family` — model family (e.g. "llama", "qwen2")
- `capabilities.thinking` — whether the model supports thinking/reasoning mode
Ollama-Compatible Endpoints
These endpoints accept and return data in Ollama's native format. Ollama clients can be pointed at MindRouter as a drop-in replacement.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/chat | API Key | Ollama chat (streaming by default) |
| POST | /api/generate | API Key | Ollama text generation |
| POST | /api/embeddings | API Key | Ollama embeddings |
| GET | /api/tags | API Key | List models (Ollama format) |
List Models (Ollama Format)
curl http://localhost:8000/api/tags \
-H "Authorization: Bearer mr2_your-api-key"
Response:
{
"models": [
{
"name": "llama3.3:70b",
"model": "llama3.3:70b",
"modified_at": "2026-02-28T12:00:00",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"parameter_size": "70B",
"quantization_level": "Q4_K_M"
},
"context_length": 32768,
"model_max_context": 131072
}
]
}
Ollama Chat
curl -X POST http://localhost:8000/api/chat \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
Note: Ollama defaults to stream: true. Set "stream": false explicitly for non-streaming responses.
Thinking/Reasoning Mode (Ollama):
Use the think field at the top level:
// Qwen-style: boolean toggle
{"model": "qwen3-32k:32b", "messages": [...], "think": true, "stream": false}
// gpt-oss: string effort level
{"model": "gpt-oss-32k:120b", "messages": [...], "think": "high", "stream": false}
The response includes a thinking field in the message. For /api/generate, thinking content appears as a top-level thinking field alongside response.
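The two accepted shapes of `think` (boolean toggle vs. effort string) can be normalized as sketched below. This is an assumption about the mapping, written as a standalone helper rather than MindRouter's translation-layer code:

```python
def normalize_think(think):
    """Normalize Ollama's top-level `think` field as described above:
    booleans toggle thinking (Qwen-style), strings select a gpt-oss
    effort level (sketch, not MindRouter internals)."""
    if isinstance(think, bool):
        return {"enabled": think, "effort": None}
    if think in ("low", "medium", "high"):
        return {"enabled": True, "effort": think}
    raise ValueError(f"unsupported think value: {think!r}")

normalize_think(True)    # Qwen-style toggle
normalize_think("high")  # gpt-oss effort level
```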
Ollama Generate
curl -X POST http://localhost:8000/api/generate \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'
Anthropic-Compatible Endpoint
This endpoint accepts and returns data in the Anthropic Messages API format. Anthropic SDK clients (Python, TypeScript) can be pointed at MindRouter by setting base_url.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /anthropic/v1/messages | API Key | Anthropic Messages API (streaming and non-streaming) |
Messages
curl -X POST http://localhost:8000/anthropic/v1/messages \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"max_tokens": 500,
"messages": [{"role": "user", "content": "Hello!"}]
}'
Response:
{
"id": "msg_abc123...",
"type": "message",
"role": "assistant",
"model": "llama3.2",
"content": [
{"type": "text", "text": "Hello! How can I help you today?"}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 10,
"output_tokens": 12
}
}
Streaming — Set "stream": true to receive Anthropic SSE events (message_start, content_block_delta, message_stop, etc.):
curl -X POST http://localhost:8000/anthropic/v1/messages \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "max_tokens": 500, "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
System Prompt:
{
"model": "llama3.2",
"max_tokens": 500,
"system": "You are a helpful assistant.",
"messages": [{"role": "user", "content": "Hello!"}]
}
SDK Usage (Python)
import anthropic
client = anthropic.Anthropic(
base_url="http://localhost:8000/anthropic",
api_key="mr2_your-api-key",
)
message = client.messages.create(
model="llama3.2",
max_tokens=500,
messages=[{"role": "user", "content": "Hello!"}],
)
Supported features:
- System prompts (string or content block array)
- Multimodal inputs (base64 and URL images)
- Tool calling — `tools` with `input_schema`, `tool_choice` (auto/any/tool), `tool_use`/`tool_result` content blocks, streaming tool use with `input_json_delta`
- Thinking/reasoning mode (`thinking.type`: `enabled`, `adaptive`, `disabled`)
- Structured output via `output_config.format` with `type: "json_schema"`
- Parameters: `max_tokens`, `temperature`, `top_p`, `top_k`, `stop_sequences`, `stream`
- `metadata.user_id` mapping
Note: This is inbound-only — there are no Anthropic backends. Requests are translated to canonical format and routed to Ollama/vLLM backends like any other request.
Health & Metrics Endpoints
These endpoints are unauthenticated and intended for monitoring infrastructure.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /healthz | None | Liveness probe (always 200 if app is running) |
| GET | /readyz | None | Readiness probe (checks DB + healthy backends) |
| GET | /metrics | None | Prometheus metrics (text format) |
| GET | /status | None | Cluster status summary (JSON) |
| GET | /api/cluster/total-tokens | None | Total tokens ever served (cached 10s) |
| GET | /api/cluster/trends | None | Token and active-user trends (?range=hour|day|week|month|year) |
| GET | /api/cluster/throughput | None | Token throughput, requests/min, active requests |
Prometheus Metrics
The /metrics endpoint exposes the following Prometheus metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| mindrouter_requests_total | Counter | endpoint, status | Total requests processed |
| mindrouter_request_latency_seconds | Histogram | endpoint | Request latency |
| mindrouter_queue_size | Gauge | — | Current scheduler queue depth |
| mindrouter_active_backends | Gauge | — | Number of healthy backends |
| mindrouter_tokens_total | Counter | type (prompt/completion) | Total tokens processed |
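For quick scripting against `/metrics`, the Prometheus text format can be read with a few lines of parsing. A minimal sketch (real clients should prefer a proper Prometheus library; the sample values below are made up):

```python
def parse_prometheus(text):
    """Parse simple `name{labels} value` lines of the Prometheus text
    exposition format into a dict, skipping HELP/TYPE comments.
    Minimal sketch only -- no histogram/summary handling."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels] = float(value)
    return metrics

sample = """\
# TYPE mindrouter_queue_size gauge
mindrouter_queue_size 3
mindrouter_requests_total{endpoint="/v1/chat/completions",status="200"} 1042
"""
parsed = parse_prometheus(sample)
```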
Admin API Endpoints
All admin endpoints require membership in a group with is_admin = true. They are mounted under /api/admin/.
Backend Management
| Method | Path | Description |
|---|---|---|
| POST | /api/admin/backends/register | Register a new inference backend |
| GET | /api/admin/backends | List all backends |
| PATCH | /api/admin/backends/{id} | Update backend properties |
| POST | /api/admin/backends/{id}/disable | Disable a backend |
| POST | /api/admin/backends/{id}/enable | Enable a disabled backend |
| POST | /admin/backends/{id}/drain | Initiate graceful drain (stops new requests, waits for in-flight to complete, then disables). Dashboard route only, not available via API. |
| POST | /api/admin/backends/{id}/refresh | Force-refresh capabilities and models |
| POST | /api/admin/backends/{id}/ollama/pull | Start pulling a model to an Ollama backend |
| GET | /api/admin/backends/{id}/ollama/pull/{job_id} | Poll progress of an Ollama model pull |
| POST | /api/admin/backends/{id}/ollama/delete | Delete a model from an Ollama backend |
Node Management
| Method | Path | Description |
|---|---|---|
| POST | /api/admin/nodes/register | Register a new GPU node |
| GET | /api/admin/nodes | List all nodes |
| PATCH | /api/admin/nodes/{id} | Update node properties |
| DELETE | /api/admin/nodes/{id} | Delete a node (fails if backends reference it) |
| POST | /api/admin/nodes/{id}/refresh | Force-refresh sidecar data |
User Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/users | List users (filterable by group, searchable by name/email) |
| POST | /api/admin/users | Create a new user with group-based quota defaults |
| GET | /api/admin/users/{id} | User detail with usage stats, API keys, monthly usage |
| PATCH | /api/admin/users/{id} | Update user profile, group, and quota overrides |
| DELETE | /api/admin/users/{id} | Hard-delete a user and all associated data |
| POST | /api/admin/users/{id}/api-keys | Create an API key for a user |
Group Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/groups | List all groups with user counts |
| POST | /api/admin/groups | Create a new group with quota defaults |
| PATCH | /api/admin/groups/{id} | Update group defaults (token budget, RPM, weight, etc.) |
| DELETE | /api/admin/groups/{id} | Delete a group (fails if users are assigned) |
API Key Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/api-keys | List all API keys across users (searchable, filterable by status) |
Quota Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/quota-requests | List pending quota increase requests |
| POST | /api/admin/quota-requests/{id}/review | Approve or deny a quota request |
Conversations
| Method | Path | Description |
|---|---|---|
| GET | /admin/conversations/export | Export conversations as CSV or JSON (form-based, filterable) |
| GET | /api/admin/conversations/export | Bulk API export (JSON with content, for programmatic access) |
Queue & Audit
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/queue | Scheduler queue statistics |
| GET | /api/admin/audit/search | Search audit logs (filter by user, model, status, date, text) |
| GET | /api/admin/audit/{id} | Full audit detail including prompt and response content |
Telemetry
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/telemetry/overview | Cluster-wide telemetry (nodes, backends, GPUs) |
| GET | /api/admin/telemetry/latest | Lightweight polling endpoint for dashboard |
| GET | /api/admin/telemetry/backends/{id}/history | Time-series telemetry for a backend |
| GET | /api/admin/telemetry/gpus/{id}/history | Time-series telemetry for a GPU device |
| GET | /api/admin/telemetry/nodes/{id}/history | Aggregated time-series for a node (all GPUs) |
| GET | /api/admin/telemetry/export | Export raw telemetry as JSON or CSV |
Web Dashboard
MindRouter includes a full web dashboard built with Bootstrap 5. All pages extend a common base template with navigation and accessibility features (WCAG 2.1 Level AA).
Public Pages
| Page | URL | Description |
|---|---|---|
| Cluster Status | / | Shows healthy backend count, available models, queue size, and overall cluster status |
| Login | /login | Username/password authentication (Azure AD SSO when configured) |
| Blog | /blog | Public blog listing with published posts |
| Blog Post | /blog/{slug} | Individual blog post rendered from Markdown |
User Dashboard
| Page | URL | Description |
|---|---|---|
| Dashboard | /dashboard | Token usage progress bar, active API keys, quota usage history, change password |
| Change Password | POST /dashboard/change-password | Change password for local (non-SSO) accounts. Requires current password, new password (min 8 chars), and confirmation. Not available for Azure AD SSO users. |
| Request Quota | /dashboard/request-quota | Submit a quota increase request with justification |
| Key Created | (after creation) | Displays the full API key once (copy-to-clipboard) |
The user dashboard includes:
- Dark mode toggle — saved to browser localStorage, persists across sessions
- Live token usage — usage statistics poll every 1 second for real-time feedback without page refresh
- Lifetime vs rolling usage — displays both Lifetime Token Usage (all-time total, never resets) and Current Period Usage (resets when the budget period rolls over)
- Quota details — current RPM limit and max concurrent requests shown in the Quota Details card
- API key expiration warnings — keys within 7 days of expiration show a yellow countdown; expired keys display an “Expired” badge; “Last Used” column shows last authentication time
Admin Dashboard
The admin dashboard has a persistent sidebar with links to all admin pages. Access requires membership in a group with is_admin = true.
| Page | URL | Description |
|---|---|---|
| Overview | /admin | System metrics overview, health alert banner for unhealthy backends/offline nodes, pending request badges, system force offline/online toggle |
| Backends | /admin/backends | Backend health, models, enable/disable/drain controls |
| Nodes | /admin/nodes | GPU node management, sidecar status, hardware specs, take offline/bring online/force drain controls |
| Models | /admin/models | Model management: multimodal and thinking overrides, full metadata overrides (family, parameters, quantization, context length, etc.), Ollama pull/delete |
| GPU Metrics | /admin/metrics | Real-time GPU utilization, memory, temperature, power charts with time range controls |
| Users | /admin/users | User accounts with search, group filter, and pagination |
| User Detail | /admin/users/{id} | Individual user profile, usage stats, API keys, monthly usage chart, quota overrides, masquerade (view dashboard as user) |
| Groups | /admin/groups | Group management: create, edit, delete groups with quota defaults and scheduler weights |
| API Keys | /admin/api-keys | All API keys across users with search, status filter, and pagination |
| Requests | /admin/requests | Pending API key and quota increase requests, approve/deny |
| Audit Log | /admin/audit | Inference request history with filtering, search, and CSV/JSON export |
| Conversations | /admin/conversations | Browse and export all user conversations |
| Chat Configuration | /admin/chat-config | Configure chat UI defaults: core models, default model, system prompt, max tokens, temperature, thinking mode |
| Blog Management | /admin/blog | Blog CMS: create, edit, publish/unpublish, and delete blog posts |
| Site Settings | /admin/settings | Global settings: display timezone, Ollama context length enforcement |
System Force Offline/Online
Administrators can force the entire MindRouter system offline from the admin overview page (/admin). This is useful for planned maintenance windows or emergency situations.
- Force Offline (`POST /admin/system/toggle-online`) — Stops the backend polling loop, marks all backends as unhealthy in the database, and sets an internal `_force_offline` flag. While offline, no health checks or discovery run, and no inference requests can be served.
- Force Online — Clears the offline flag, closes and reloads all backend adapters and sidecar clients from the database, restarts the polling loop, and immediately runs a full poll cycle to restore backend health status.
The toggle is available as a button on the admin overview page. The system status is visible to all users on the public status page.
Node Lifecycle Management
Beyond basic node registration and editing, administrators can manage node operational state from the Nodes page (/admin/nodes):
| Action | Route | Description |
|---|---|---|
| Take Offline | POST /admin/nodes/{id}/take-offline | Disables all backends on the node and marks the node status as OFFLINE. New requests will not be routed to any backend on this node. |
| Bring Online | POST /admin/nodes/{id}/bring-online | Re-enables all backends on the node and marks the node status as ONLINE. Backends will begin receiving requests again after the next health poll. |
| View Active Requests | GET /admin/nodes/{id}/active-requests | Returns a JSON count of in-flight requests across all backends on the node. Useful for monitoring drain progress before taking a node offline. |
| Force Drain | POST /admin/nodes/{id}/force-drain | Force-cancels all active (in-flight) requests on the node's backends. Returns the number of cancelled requests. Use this when you need to take a node offline immediately without waiting for requests to complete naturally. |
Recommended workflow for node maintenance:
- Take the node offline to stop new requests from being routed there.
- Monitor active requests using the active requests endpoint until the count reaches zero.
- If requests are stuck, use force drain to cancel them.
- Perform maintenance (upgrade vLLM, restart Ollama, etc.).
- Bring the node back online.
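The "monitor until zero" step can be sketched as a polling loop. Here `get_active_count` is a hypothetical callable standing in for a wrapper around `GET /admin/nodes/{id}/active-requests`:

```python
import time

def wait_for_drain(get_active_count, timeout_s=300, poll_s=5.0):
    """Poll an active-request counter until it reaches zero or the
    timeout expires. Returns True when drained; on False the caller
    can fall back to POST /admin/nodes/{id}/force-drain."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_active_count() == 0:
            return True
        time.sleep(poll_s)
    return False

# Simulated counter: requests finish over three polls.
counts = iter([2, 1, 0])
drained = wait_for_drain(lambda: next(counts), timeout_s=5, poll_s=0)
```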
Admin Models Page
The Models page at /admin/models provides comprehensive model management. Models are grouped by name (since the same model may appear on multiple backends) and presented with their current metadata and override status.
Capability Overrides
MindRouter auto-detects model capabilities during discovery, but admins can override the detected values:
- Multimodal override — Toggle or reset the `supports_multimodal` flag for all instances of a model. Use this when auto-detection incorrectly classifies a model (e.g., a vision model whose name does not match the usual patterns).
- Thinking override — Toggle or reset the `supports_thinking` flag for all instances. Use this for models that support thinking/reasoning mode but are not detected automatically.
- Resetting an override returns the field to auto-detected values from the next discovery cycle.
Metadata Overrides
Admins can override any model metadata field across all instances of a model via the metadata edit form. Overridable fields:
| Field | Type | Description |
|---|---|---|
| family | string | Model family (e.g., llama, qwen2) |
| parameter_count | string | Model size (e.g., 7B, 70B) |
| quantization | string | Quantization level (e.g., Q4_K_M, FP16) |
| context_length | int | Effective context window (injected as num_ctx). Overrides the auto-discovered value. |
| embedding_length | int | Embedding dimension length |
| head_count | int | Number of attention heads |
| layer_count | int | Number of transformer layers |
| feed_forward_length | int | Feed-forward network dimension |
| model_format | string | Model format (e.g., gguf, safetensors) |
| parent_model | string | Parent model identifier |
| capabilities | string | Comma-separated list of capabilities (stored as JSON array) |
| description | string | Human-readable model description |
| model_url | string | URL to model documentation or repository |
Clearing a field (setting it to blank) removes the override and reverts to the auto-discovered value. Admins can also reset all overrides at once via the "Reset All Overrides" action.
Ollama Model Pull and Delete
The Models page lists all Ollama backends and provides UI controls to:
- Pull a model — Trigger a model download on a specific Ollama backend via the admin API (`POST /api/admin/backends/{id}/ollama/pull`). Progress can be polled until completion.
- Delete a model — Remove a model from a specific Ollama backend via the admin API (`POST /api/admin/backends/{id}/ollama/delete`).
Admin Masquerade
Administrators can "masquerade" as another user to view the user dashboard from that user's perspective. This is useful for troubleshooting user-reported issues or verifying quota and usage display.
- Start masquerade (`POST /admin/masquerade/{target_user_id}`) — Sets a signed cookie (`mindrouter_masquerade`) containing the target user ID. The cookie is signed using `itsdangerous.URLSafeTimedSerializer` with the application's `SECRET_KEY` and a `"masquerade"` salt. The cookie expires after 24 hours.
- During masquerade — The user dashboard (`/dashboard`) shows the target user's data (usage stats, API keys, quota, conversations) instead of the admin's own data. The masquerade applies only to read-only dashboard views — admin routes and actions are never affected.
- Stop masquerade (`POST /admin/masquerade/stop`) — Deletes the masquerade cookie and redirects back to the admin users page.
- Security — Only users in an admin group can start a masquerade. The target user ID is verified to exist before the cookie is set. The signed cookie prevents tampering.
Masquerade is initiated from the user detail page (/admin/users/{id}) via a "View as User" button.
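The signing scheme can be reproduced directly with itsdangerous (assuming the package is installed; the secret and user ID below are illustrative, not MindRouter's values):

```python
from itsdangerous import URLSafeTimedSerializer

SECRET_KEY = "change-me"  # illustrative; MindRouter uses its configured SECRET_KEY

serializer = URLSafeTimedSerializer(SECRET_KEY, salt="masquerade")

# Issue the cookie value for a target user ID...
cookie = serializer.dumps(42)

# ...and later verify it, rejecting cookies older than 24 hours.
user_id = serializer.loads(cookie, max_age=24 * 3600)
```

Any modification to the cookie value makes `loads` raise `BadSignature`, which is what prevents a non-admin from forging a masquerade cookie.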
Audit Log Export
Administrators can export audit log data as CSV or JSON via GET /admin/audit/export. The export supports the same filters as the audit log UI:
| Parameter | Type | Description |
|---|---|---|
format | string | Output format: csv (default) or json |
search | string | Free-text search across request fields |
user_id_filter | int | Filter by user ID |
model_filter | string | Filter by model name |
status_filter | string | Filter by request status |
start_date | string | Start date (ISO 8601 format) |
end_date | string | End date (ISO 8601 format) |
include_content | bool | Include prompt/response content in export (default: false) |
Export fields: `request_uuid`, `created_at`, `user_id`, `model`, `endpoint`, `status`, `prompt_tokens`, `completion_tokens`, `total_tokens`, `total_time_ms`, `error_message`. When `include_content=true`, additional fields are included: `messages`, `prompt`, `parameters`, `response_content`, `finish_reason`.
Exports are limited to 10,000 records. The file is downloaded as an attachment (`audit_log.csv` or `audit_log.json`).
Admin Chat Configuration
The Chat Configuration page at /admin/chat-config allows administrators to control default settings for the built-in chat interface. All settings are stored in the AppConfig database and take effect immediately.
| Setting | Description | Details |
|---|---|---|
| Core Models | Subset of models shown in the chat model selector | Select from all available (non-embedding) models on healthy backends. When set, only these models appear in the chat UI dropdown. Empty means show all models. |
| Default Model | Pre-selected model in the chat UI | The model automatically selected when a user starts a new conversation. Blank means no default. |
| System Prompt | Global system prompt for all chat conversations | Prepended to every chat conversation. Blank removes the override and reverts to the built-in default. Can also be explicitly reset via a "Reset" button. |
| Max Tokens | Default max tokens for chat requests | Range: 256–131072. Default: 16384. Clamped to the allowed range on save. |
| Temperature | Default temperature for chat responses | Range: 0.0–2.0. Blank means use the model's default temperature. |
| Thinking Mode | Default thinking/reasoning mode | Options: true, false, low, medium, high, or blank (no default). Controls whether thinking is enabled by default for models that support it. |
Chat Interface
| Page | URL | Description |
|---|---|---|
| Chat | /chat | Full-featured chat UI with model selection, streaming, file upload, multimodal support |
The chat interface supports:
- Collapsible conversation sidebar
- Model and backend selection
- Real-time streaming responses
- File upload via button or drag-and-drop anywhere in the chat window (images, PDFs, DOCX, XLSX, CSV, JSON, Markdown, etc.)
- Vision model support with automatic image handling
- Web search toggle — when enabled, queries the Brave Search API and injects results as context into the system prompt (requires `BRAVE_SEARCH_API_KEY`)
- Code syntax highlighting
- LaTeX rendering
- Message editing and deletion
- Dark mode toggle (stored in browser localStorage)
- Advanced models toggle — show or hide non-core models in the model selector
- Per-request thinking controls — enable/disable thinking mode and set reasoning effort for supported models
- Collapsible thinking blocks in assistant responses
- Keyboard shortcuts — Shift+Enter for newline, Enter to send
- Sidebar collapse for a wider chat area
- Copy buttons on code blocks and assistant messages
- Image lightbox for viewing uploaded images full-size
- Auto-titling of conversations based on the first message
Users, Groups & Quotas
Group System
MindRouter uses a database-driven group system for authorization, quota defaults, and scheduler weights. Each user belongs to exactly one group. Groups are fully manageable via the admin UI at `/admin/groups`.
Admin privileges are determined by the group's `is_admin` flag. Users in an admin group can access the admin dashboard and admin API endpoints.
Default Groups
| Group | Token Budget | RPM | Max Concurrent | Sched. Weight | Admin |
|---|---|---|---|---|---|
| students | 100,000 | 30 | 2 | 1 | No |
| staff | 500,000 | 60 | 4 | 2 | No |
| faculty | 1,000,000 | 120 | 8 | 3 | No |
| researchers | 1,000,000 | 120 | 8 | 3 | No |
| admin | 10,000,000 | 1,000 | 50 | 10 | Yes |
| nerds | 500,000 | 60 | 4 | 2 | No |
| other | 100,000 | 30 | 2 | 1 | No |
These 7 groups are created automatically by the database migration. Admins can create additional groups, edit defaults, or delete empty groups via the admin UI or API.
User Profiles
Each user has the following profile fields (editable by admins at /admin/users/{id}):
- Username and Email — unique identifiers
- Full Name — display name
- Group — determines default quotas, scheduler weight, and admin access
- College and Department — organizational affiliation
- Intended Use — free-text description of how the user plans to use the service
Quota Inheritance & Overrides
When a user is created, their quota defaults are inherited from their group. Admins can override any quota value per-user:
- Token budget — inherited from group, overridable per user
- RPM limit — inherited from group, overridable per user
- Max concurrent — inherited from group, overridable per user
- Weight override — if set, overrides the group's scheduler weight for this user
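The inheritance rule amounts to "use the per-user override if set, otherwise fall back to the group default." A minimal sketch (the dataclass and field names are illustrative, not the actual ORM models):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    token_budget: int
    rpm_limit: int
    max_concurrent: int
    scheduler_weight: int


@dataclass
class User:
    group: Group
    token_budget: Optional[int] = None     # per-user overrides; None means inherit
    rpm_limit: Optional[int] = None
    max_concurrent: Optional[int] = None
    weight_override: Optional[int] = None


def effective_quota(user: User) -> dict:
    """Resolve the quota actually enforced for this user."""
    g = user.group
    return {
        "token_budget": user.token_budget if user.token_budget is not None else g.token_budget,
        "rpm_limit": user.rpm_limit if user.rpm_limit is not None else g.rpm_limit,
        "max_concurrent": user.max_concurrent if user.max_concurrent is not None else g.max_concurrent,
        "weight": user.weight_override if user.weight_override is not None else g.scheduler_weight,
    }
```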
API Key Lifecycle
- Generation — Keys use the format `mr2_<random_urlsafe_base64>` (48+ characters total).
- Storage — The raw key is shown once at creation. Only the Argon2 hash and a prefix (`mr2_<first 8 chars>`) are stored in the database.
- Verification — Lookup by prefix (fast), then full Argon2 hash verification.
- Expiration — Optional `expires_at` timestamp.
- Revocation — Keys can be revoked (soft-delete) without deleting the audit trail.
- Usage tracking — `last_used_at` and `usage_count` updated atomically on each request.
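The generate/store/verify flow can be sketched as follows. This is a simplified illustration: `hashlib.sha256` stands in for the Argon2 hash the real implementation uses, and the function names are assumptions.

```python
import hashlib
import secrets


def generate_api_key() -> tuple[str, str, str]:
    """Return (raw_key, prefix, key_hash). Only prefix and hash are persisted."""
    raw_key = "mr2_" + secrets.token_urlsafe(36)   # 48+ characters total
    prefix = raw_key[:12]                          # "mr2_" plus first 8 chars
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    return raw_key, prefix, key_hash


def verify_api_key(raw_key: str, stored: dict) -> bool:
    # A prefix lookup narrows candidates quickly in the DB; then the full hash check.
    if raw_key[:12] != stored["prefix"]:
        return False
    return hashlib.sha256(raw_key.encode()).hexdigest() == stored["key_hash"]
```

The raw key exists only at creation time; a database leak exposes hashes and prefixes, not usable keys.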
Quota System
Each user has a quota record with:
- Token budget — Total tokens allowed per period. Deducted on each completed request (prompt + completion tokens).
- RPM limit — Maximum requests per minute.
- Max concurrent — Maximum simultaneous in-flight requests.
When a quota is exceeded, the request is rejected with HTTP 429.
Rolling Budget Period
Token budgets use rolling 30-day windows per user, not calendar months. Usage is tracked continuously and the oldest usage falls off as the window advances.
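In other words, the budget check sums only the usage recorded in the 30 days preceding the request. A sketch (the event-list representation is illustrative):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def tokens_used_in_window(usage_events: list, now: datetime) -> int:
    """Sum token counts for (timestamp, tokens) events inside the rolling window."""
    cutoff = now - WINDOW
    return sum(tokens for ts, tokens in usage_events if ts >= cutoff)


def budget_exceeded(usage_events: list, now: datetime, token_budget: int) -> bool:
    return tokens_used_in_window(usage_events, now) >= token_budget
```

Usage from 31 days ago no longer counts against the budget, so heavy use early in a window "falls off" continuously rather than resetting on the first of the month.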
Quota Increase Requests
Users can submit quota increase requests via the dashboard (/dashboard/request-quota). The request includes:
- Desired token budget
- Written justification
Admins review requests at /admin/requests or via POST /api/admin/quota-requests/{id}/review, which can approve (with a custom granted token amount) or deny the request.
Backend Management
Node/Backend Model
MindRouter separates the concept of physical GPU servers (Nodes) from inference endpoints (Backends):
- A Node represents a physical server with GPUs and a sidecar agent.
- A Backend is an Ollama or vLLM instance running on a node.
- One node can host multiple backends, each assigned specific GPUs via `gpu_indices`.
- Backends without a `node_id` work as standalone endpoints (no GPU telemetry).
```
Node: gpu-server-1 (4x A100-80GB, sidecar at :8007)
+-- Backend: vllm-large  (gpu_indices: [0, 1])  ← uses GPUs 0-1
+-- Backend: vllm-small  (gpu_indices: [2])     ← uses GPU 2
+-- Backend: ollama-misc (gpu_indices: [3])     ← uses GPU 3
```
Supported Engines
| Engine | Health Check | Model Discovery | Telemetry Source |
|---|---|---|---|
| Ollama | `GET /api/tags` | `GET /api/tags` + `POST /api/ps` (loaded models) | Sidecar agent |
| vLLM | `GET /health` (fallback: `GET /v1/models`) | `GET /v1/models` | `GET /metrics` (Prometheus format) |
Registration
Register a node:
```bash
curl -X POST http://localhost:8000/api/admin/nodes/register \
  -H "Authorization: Bearer admin-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpu-server-1",
    "hostname": "gpu1.example.com",
    "sidecar_url": "http://gpu1.example.com:8007",
    "sidecar_key": "your-sidecar-secret-key"
  }'
```
Register a backend on that node:
```bash
curl -X POST http://localhost:8000/api/admin/backends/register \
  -H "Authorization: Bearer admin-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ollama-gpu1",
    "url": "http://gpu1.example.com:11434",
    "engine": "ollama",
    "max_concurrent": 4,
    "node_id": 1,
    "gpu_indices": [0, 1]
  }'
```
Enable/Disable/Drain/Refresh
- Disable a backend to take it out of rotation without deleting it: `POST /api/admin/backends/{id}/disable`
- Enable to bring it back: `POST /api/admin/backends/{id}/enable`
- Drain for graceful offline maintenance: `POST /admin/backends/{id}/drain` (dashboard-only route)
- Refresh to force re-discovery of models and capabilities: `POST /api/admin/backends/{id}/refresh`
Drain Mode
Drain mode provides graceful backend shutdown for maintenance. When a backend is set to draining:
- The scheduler stops routing new requests to the backend.
- All in-flight requests are allowed to complete normally.
- When the backend's queue depth reaches 0, it automatically transitions to disabled.
This avoids abruptly killing active requests when you need to restart an inference engine, upgrade models, or perform node maintenance. Use it before upgrading vLLM, restarting Ollama, or taking a GPU node offline.
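The drain lifecycle is a small state machine. A sketch with illustrative names (the real states and fields may differ):

```python
from enum import Enum


class BackendState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    DISABLED = "disabled"


def should_route(state: BackendState) -> bool:
    # The scheduler sends new requests only to active backends.
    return state is BackendState.ACTIVE


def on_request_finished(state: BackendState, queue_depth: int) -> BackendState:
    # A draining backend flips to disabled once its last in-flight request completes.
    if state is BackendState.DRAINING and queue_depth == 0:
        return BackendState.DISABLED
    return state
```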
Concurrency Alignment
The `max_concurrent` value registered in MindRouter must match the concurrency limit on the inference engine:
| Engine | Engine Setting | MindRouter Setting |
|---|---|---|
| vLLM | `--max-num-seqs N` | `"max_concurrent": N` |
| Ollama | `OLLAMA_NUM_PARALLEL=N` | `"max_concurrent": N` |
Why this matters: MindRouter uses `max_concurrent` to decide how many requests to route to a backend. If MindRouter thinks a backend can handle 8 concurrent requests but vLLM is configured with `--max-num-seqs 4`, the extra requests queue silently inside vLLM. MindRouter cannot see this hidden queue, so it continues routing requests there instead of spreading load to other backends. The result is uneven load distribution and unpredictable latency — the fair-share scheduler is effectively bypassed for those excess requests.
Context Length (num_ctx)
MindRouter automatically manages context length for Ollama backends:
- Auto-discovery — During model discovery, `context_length` is set to `min(model_max_context, 32768)` to prevent small models from consuming excessive VRAM. The architectural maximum is stored separately as `model_max_context`.
- num_ctx injection — For every Ollama request, MindRouter injects `num_ctx` matching the model's configured `context_length`.
- Enforcement toggle — By default, MindRouter overrides any user-supplied `num_ctx` to prevent GPU memory oversubscription. Admins can toggle this in Site Settings (`/admin/settings`) to allow users to set their own `num_ctx`.
- Manual override — Admins can set `context_length_override` per model via the admin UI to use a custom value instead of the auto-discovered one.
- Ollama 0.17+ — Ollama automatically adjusts `num_ctx` downward if the requested value doesn't fit in GPU memory.
vLLM backends handle context length natively via `--max-model-len` and do not need `num_ctx` injection.
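The discovery and injection rules above can be sketched as two small functions (names are illustrative, not the actual implementation):

```python
AUTO_CAP = 32768  # cap applied during auto-discovery


def discovered_context_length(model_max_context: int, override: int = None) -> int:
    # An admin-set context_length_override wins; otherwise cap the architectural max.
    if override is not None:
        return override
    return min(model_max_context, AUTO_CAP)


def inject_num_ctx(options: dict, context_length: int, enforce: bool = True) -> dict:
    # With enforcement on (the default), any user-supplied num_ctx is overridden.
    out = dict(options)
    if enforce or "num_ctx" not in out:
        out["num_ctx"] = context_length
    return out
```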
Retry & Failover
MindRouter automatically retries failed inference requests to improve reliability across the backend cluster.
- Automatic retries — Up to 3 attempts on 5xx errors, timeouts, and connection failures. Configurable via the `BACKEND_RETRY_MAX_ATTEMPTS` environment variable.
- Fast fail on 4xx — Client errors (400, 401, 404, etc.) are never retried and return immediately.
- Streaming constraints — Retries can only occur before the first chunk is sent to the client. Once streaming has begun, a backend failure is terminal and the stream ends with an error.
- Backend rotation — Each retry attempt selects a different backend when multiple backends are available for the requested model, maximizing the chance of success.
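The retry policy can be sketched as a loop that rotates backends and distinguishes retryable from terminal errors. Names are illustrative; the real attempt count comes from `BACKEND_RETRY_MAX_ATTEMPTS`:

```python
MAX_ATTEMPTS = 3


class BackendError(Exception):
    def __init__(self, status=None):
        self.status = status  # None models a timeout or connection failure


def send_with_retry(backends: list, attempt_fn):
    """Try up to MAX_ATTEMPTS, rotating across backends; never retry 4xx."""
    last_error = None
    for i in range(MAX_ATTEMPTS):
        backend = backends[i % len(backends)]  # rotate on each attempt
        try:
            return attempt_fn(backend)
        except BackendError as e:
            if e.status is not None and 400 <= e.status < 500:
                raise  # client errors fail fast, no retry
            last_error = e
    raise last_error
```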
Mid-Stream Error Behavior
Once streaming has begun and the first chunk has been sent to the client, a backend failure terminates the SSE stream immediately with no error event. The client receives an abrupt end-of-stream. Retries are not possible after the first chunk because the response format has already been committed.
Circuit Breaker
MindRouter uses a per-backend circuit breaker to avoid routing requests to backends that are experiencing persistent failures.
- Trip threshold — After 3 consecutive failures (configurable), the backend is marked as "open" and excluded from routing for 30 seconds (configurable).
- Integration with retry — The circuit breaker works alongside retry. Backends with an open circuit are skipped during failover selection, so retries are directed to healthier backends.
- Automatic recovery — After the exclusion window expires, the backend is eligible for routing again. A successful request resets the failure counter.
Rate Limiting
Note: RPM and concurrent request rate limiting are defined in the codebase but not currently enforced. The rate limiter middleware is not registered in the application. Only token quota (monthly budget) enforcement is active, returning HTTP 429 when the token budget is exceeded.
- Requests per minute (RPM) — Configurable per group but not yet enforced at runtime.
- Concurrent request cap — Configurable per group but not yet enforced at runtime.
- Token quota (active) — Monthly token budget is enforced. Returns HTTP 429 when exceeded. Clients should implement exponential backoff.
Tool Calling
MindRouter supports tool calling (function calling) across all three API formats, with transparent translation between them.
- Tool definitions — Pass an OpenAI-style `tools` array describing available functions, with optional `tool_choice` (`"auto"`, `"none"`, or a specific function name).
- Tool results — Submit results back via `role: "tool"` messages with a matching `tool_call_id`.
- Cross-format support — Tool calls work whether the request arrives in OpenAI, Ollama, or Anthropic format. MindRouter translates tool definitions, tool calls, and tool results between all formats automatically.
- Backend requirement — The selected backend must support tool calling. For vLLM, this requires `--enable-auto-tool-choice` and the appropriate `--tool-call-parser` flag. For Ollama, tool calling is supported natively on compatible models.
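A round trip looks like this in OpenAI format. The function name, model, and `tool_call_id` below are hypothetical examples, not values from MindRouter:

```python
# Tool definition sent with the initial chat request
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "llama3.1",  # any backend model that supports tool calling
    "messages": [{"role": "user", "content": "Weather in Boise?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# After the model responds with a tool_call, the result goes back
# as a role:"tool" message whose tool_call_id matches the call's id.
tool_result_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": '{"temp_f": 72, "conditions": "sunny"}',
}
```

The same payload could equally be sent in Ollama or Anthropic format; the translation layer converts the definitions and results between formats.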
Request ID
Every inference response includes a unique request ID for tracing and debugging.
- Auto-generated IDs — MindRouter generates request IDs with format-specific prefixes: `chatcmpl-*` for chat completions, `cmpl-*` for text completions, `embd-*` for embeddings, etc.
- Custom IDs — You can provide your own request ID by setting the `X-Request-ID` header. When provided, MindRouter uses your ID instead of generating one, making it easier to correlate requests across your systems.
Translation Layer
MindRouter's translation layer enables cross-engine routing: a request arriving in OpenAI, Ollama, or Anthropic format can be served by any Ollama or vLLM backend. All translation passes through a canonical internal schema.
Request Flow
```
OpenAI Request    --> OpenAIInTranslator    --> CanonicalChatRequest
Ollama Request    --> OllamaInTranslator    --> CanonicalChatRequest
Anthropic Request --> AnthropicInTranslator --> CanonicalChatRequest
                                                        |
                                                        v
                                          [Scheduler selects backend]
                                                        |
                                   +--------------------+--------------------+
                                   v                                         v
                          OllamaOutTranslator                        VLLMOutTranslator
                           (Ollama backend)                  (vLLM backend, OpenAI format)
```
Canonical Schemas
The canonical internal representation (`backend/app/core/canonical_schemas.py`) includes:
- `CanonicalChatRequest` — model, messages, temperature, top_p, max_tokens, stream, tools, tool_choice, response_format, think (bool or string), reasoning_effort, etc.
- `CanonicalMessage` — role (system/user/assistant/tool), content (text or multimodal content blocks, nullable), tool_calls, tool_call_id
- `ContentBlock` — `TextContent`, `ImageUrlContent`, or `ImageBase64Content`
- `CanonicalToolCall` / `CanonicalFunctionCall` — tool call with id, function name, and arguments (JSON string)
- `CanonicalToolDefinition` — tool definition with function name, description, and parameters schema
- `CanonicalEmbeddingRequest` — model, input, encoding_format, dimensions
- `CanonicalChatResponse` / `CanonicalStreamChunk` — response and streaming types (including tool call deltas)
Key Translation Mappings
| Concept | OpenAI | Ollama | Anthropic | Canonical |
|---|---|---|---|---|
| Max tokens | max_completion_tokens or max_tokens | options.num_predict | max_tokens (required) | max_tokens |
| Stream default | false | true | false | — |
| System prompt | messages with role: system | messages with role: system | Top-level system field | CanonicalMessage(role=SYSTEM) |
| Stop sequences | stop | options.stop | stop_sequences | stop |
| JSON schema | response_format | format: {schema} | output_config.format | response_format |
| Parameters | Top-level fields | options dict | Top-level fields | Top-level fields |
| Images | image_url block | images array (base64) | image block with source | ImageBase64Content / ImageUrlContent |
| Tool definitions | tools | tools | tools (with input_schema) | tools (CanonicalToolDefinition) |
| Tool choice | tool_choice | — | tool_choice (auto/any/tool) | tool_choice |
| Tool calls | tool_calls (JSON string args) | tool_calls (dict args) | tool_use content blocks | CanonicalToolCall (JSON string args) |
| Tool results | role: "tool" + tool_call_id | — | tool_result content blocks | CanonicalMessage(role=TOOL, tool_call_id) |
| Thinking mode | think, thinking.type, chat_template_kwargs, reasoning_effort | think (bool or "low"/"medium"/"high") | thinking.type (enabled/disabled) | think (bool or string) |
| User ID | user | — | metadata.user_id | user |
| Stream format | SSE (data: {...}) | NDJSON | SSE (Anthropic events) | CanonicalStreamChunk |
Translators
| Translator | Direction | Purpose |
|---|---|---|
| `OpenAIInTranslator` | API → Canonical | Translates incoming OpenAI-format requests |
| `OllamaInTranslator` | API → Canonical | Translates incoming Ollama-format requests |
| `AnthropicInTranslator` | API → Canonical | Translates incoming Anthropic Messages API requests; also formats responses and SSE stream events back to Anthropic format |
| `OllamaOutTranslator` | Canonical → Backend | Translates outgoing requests to Ollama backends |
| `VLLMOutTranslator` | Canonical → Backend | Translates outgoing requests to vLLM backends |
All translators use static methods — no instantiation needed.
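As an illustration of the pattern, here is a simplified incoming-translator sketch that maps a few Ollama fields onto canonical names. The class is a stand-in, not the real `OllamaInTranslator`, and it covers only a fraction of the mapping table above:

```python
class OllamaInTranslatorSketch:
    """Simplified stand-in for OllamaInTranslator: all static methods, no state."""

    @staticmethod
    def to_canonical(req: dict) -> dict:
        options = req.get("options", {})
        return {
            "model": req["model"],
            "messages": req["messages"],
            # Ollama nests sampling parameters under "options"
            "max_tokens": options.get("num_predict"),
            "stop": options.get("stop"),
            # Ollama streams by default; OpenAI and Anthropic default to false
            "stream": req.get("stream", True),
        }
```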
Model-Specific Behaviors
- Qwen3-32B on vLLM — This model embeds `<think>...</think>` tags directly in the content field instead of using the `reasoning_content` field. MindRouter automatically extracts these tags and moves the reasoning text to the canonical `reasoning` field for both streaming and non-streaming responses.
- Qwen3.5 on vLLM with thinking disabled — When thinking is disabled but the model still returns reasoning content with an empty content field, MindRouter promotes the reasoning content to the content field so the response is not blank.
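The non-streaming extraction step can be sketched with a regex (a simplified illustration; the real handling also covers streaming deltas, where a tag may span chunks):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(content: str):
    """Split embedded <think>...</think> text out of the content field.

    Returns (cleaned_content, reasoning) where reasoning is None if no tag found.
    """
    match = THINK_RE.search(content)
    if not match:
        return content, None
    reasoning = match.group(1).strip()
    cleaned = THINK_RE.sub("", content, count=1).strip()
    return cleaned, reasoning
```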
Telemetry & Monitoring
GPU Sidecar Agent
Each GPU node runs a lightweight FastAPI sidecar agent (`sidecar/gpu_agent.py`) that exposes per-GPU hardware metrics:
Collected metrics per GPU:
- Utilization (GPU % and memory %)
- Memory (used/free/total GB)
- Temperature (GPU and memory)
- Power draw and limit (watts)
- Fan speed, SM/memory clocks
- Running processes (PID + memory)
- Device identity (name, UUID, compute capability)
- Driver and CUDA versions
Authentication: Requires the `SIDECAR_SECRET_KEY` env var. All requests must include an `X-Sidecar-Key` header, verified with a constant-time comparison.
Prerequisites: Each GPU node must have NVIDIA drivers and the NVIDIA Container Toolkit installed so Docker can access GPUs via `--gpus all`:
```bash
# RHEL/Rocky Linux
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit

# Debian/Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
Deployment options:
- Docker Compose — `docker compose --profile gpu up gpu-sidecar`
- Standalone Docker — Build from `sidecar/Dockerfile.sidecar`, run with `--gpus all`
- Direct Python — `pip install fastapi uvicorn nvidia-ml-py && python sidecar/gpu_agent.py`
Health Polling
The Backend Registry runs an adaptive polling loop:
- Startup fast polls: On container start, two immediate full poll cycles run (with a 5-second gap) so backends and nodes are marked healthy within seconds of a restart instead of waiting for the normal 30-second cycle.
- Normal interval: 30 seconds (configurable via `BACKEND_POLL_INTERVAL`)
- Fast interval: 10 seconds after a backend becomes unhealthy (configurable)
- Fast duration: 120 seconds before returning to normal polling
Each poll cycle has two phases:
- Poll sidecar agents (one per physical node) for GPU snapshots
- Poll each backend adapter for health, models, and engine-specific telemetry
Health Alerts
The admin dashboard (/admin) displays a prominent warning banner when any backend is unhealthy/unknown or any node is offline/unknown. The alert includes counts and names of affected items with direct links to the backends or nodes management pages. Intentionally disabled backends are excluded from the alert — only unexpected health issues are flagged.
Circuit Breaker
Per-backend circuit breaker protects against cascading failures:
- Threshold: 3 consecutive failures before opening (configurable via `BACKEND_CIRCUIT_BREAKER_THRESHOLD`)
- Recovery: 30 seconds before allowing a probe request (`BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS`)
- States: Closed (healthy) → Open (failing) → Half-Open (probe) → Closed (recovered)
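The state machine can be sketched as a small class (illustrative names; the thresholds map to the env vars above):

```python
import time


class CircuitBreaker:
    """Closed -> Open after N consecutive failures; probe allowed after recovery window."""

    def __init__(self, threshold: int = 3, recovery_seconds: float = 30.0):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True  # closed: route normally
        # half-open: allow a probe once the recovery window has elapsed
        return now - self.opened_at >= self.recovery_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self, now: float = None) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic() if now is None else now
```

During failover selection, backends whose `allow_request` returns `False` are simply skipped.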
Latency Tracking
Exponential Moving Average (EMA) tracks per-backend latency:
- Alpha: 0.3 (30% current observation, 70% history)
- Metrics: Total latency EMA and TTFT (time-to-first-token) EMA
- Throughput score: `1.0 / (1.0 + latency_ms / 5000.0)` — used in backend scoring
- Persistence: EMAs are periodically saved to the database for recovery after restart
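The EMA update and scoring reduce to two one-liners (function names are illustrative):

```python
ALPHA = 0.3  # 30% weight on the current observation, 70% on history


def update_ema(prev_ema, observation_ms: float) -> float:
    if prev_ema is None:
        return observation_ms  # first observation seeds the EMA
    return ALPHA * observation_ms + (1 - ALPHA) * prev_ema


def throughput_score(latency_ms: float) -> float:
    # Higher score for lower latency; a 5000 ms EMA scores exactly 0.5.
    return 1.0 / (1.0 + latency_ms / 5000.0)
```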
Prometheus Metrics
Scrape `/metrics` for Prometheus-compatible metrics. See the Health & Metrics Endpoints section for the full list.
Telemetry API
Admin users can access detailed telemetry via the API:
- Cluster overview — All nodes, backends, GPUs with current metrics
- Historical data — Time-series with configurable resolution (1m, 5m, 15m, 1h, 6h, 1d)
- Per-GPU history — Individual GPU device telemetry over time
- Export — Download telemetry data as JSON or CSV
Redis (Optional)
Redis is an optional dependency configured via `REDIS_URL`. When available, it serves three roles:
- Inflight streaming token counting — Atomically tracks tokens currently being streamed across all workers, enabling accurate real-time throughput metrics.
- Per-user quota token caching — Caches per-user token counters (`quota:tokens:{user_id}`) for fast atomic increments without hitting the database on every request.
- Graceful degradation — All features that depend on Redis continue to work without it. Quota enforcement falls back to database queries, and inflight token counts default to zero. No functionality is lost; only caching benefits are reduced.
Chat System
MindRouter includes a built-in chat interface at /chat with full conversation management.
Conversations
- Each user has their own conversation history
- Conversations store: title, selected model, creation/update timestamps
- Users can rename, switch models, or delete conversations
- Up to 50 conversations shown in the sidebar (most recent first)
Messages
- Messages include role (user/assistant/system) and content
- Assistant messages are streamed in real-time
- Messages can be edited or deleted after creation
- Attachments are linked to individual messages
File Upload
Supported file types and processing:
| Category | Extensions | Processing |
|---|---|---|
| Images | .jpg, .jpeg, .png, .gif, .webp | Resized to max 1536px, compressed JPEG q85, thumbnail generated |
| Documents | .pdf | Text extracted from all pages, first-page thumbnail generated |
| Documents | .docx | Text extracted from all paragraphs |
| Spreadsheets | .xlsx | All sheets read, formatted as tab-separated text |
| Text files | .txt, .md, .csv, .json, .html, .htm, .log | Read as-is |
Limits:
- Max upload size: 10 MB (configurable via `CHAT_UPLOAD_MAX_SIZE_MB`)
- Artifact storage path: `/data/artifacts` (configurable via `ARTIFACT_STORAGE_PATH`)
- Artifact max size: 50 MB (configurable via `ARTIFACT_MAX_SIZE_MB`)
- Artifact retention: 365 days
Storage layout:
```
/artifacts/YYYY/MM/DD/<sha256_prefix>/<full_sha256>_<uuid>.<ext>
```
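Path construction for that layout can be sketched as follows. This is an assumption-laden illustration: the prefix-bucket length (two hex characters here) and the function name are not taken from the actual implementation.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def artifact_path(data: bytes, ext: str, now: datetime = None) -> str:
    """Build a content-addressed storage path: date bucket, sha prefix, full sha + uuid."""
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(data).hexdigest()
    return (
        f"/artifacts/{now:%Y/%m/%d}/{digest[:2]}/"   # assumed 2-char prefix bucket
        f"{digest}_{uuid.uuid4().hex}.{ext}"
    )
```

Hashing the content gives deduplication-friendly names, while the trailing UUID keeps simultaneous uploads of identical files from colliding.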
Multimodal Model Support
- Models with multimodal capability are automatically detected by name patterns (e.g., `llava`, `-vl-`, `vision`)
- When images are sent to a multimodal model, they are included as base64-encoded content blocks
- When images are sent to a non-multimodal model, they are replaced with a placeholder: `[Image omitted -- model does not support multimodal input: filename]`
- Admins can override multimodal detection per model via the Models admin page
Streaming
Chat responses are streamed in real-time:
- Backend streaming uses NDJSON (Ollama) or SSE (vLLM/OpenAI)
- The chat UI renders tokens as they arrive
- TTFT (time-to-first-token) is tracked for latency monitoring
- If the client disconnects, the backend request is not cancelled (to prevent DB corruption)
Conversation Retention
Conversations older than `CONVERSATION_RETENTION_DAYS` (default 730 days / 2 years) are automatically purged by a background task that runs every `CONVERSATION_CLEANUP_INTERVAL` seconds (default 86400 / once per day). Associated messages and attachments are deleted along with the conversation.
Blog / CMS
MindRouter includes a built-in blog system with public viewing and admin content management. Blog posts are written in Markdown and rendered to HTML with syntax highlighting, tables, fenced code blocks, and table of contents support.
Public Blog
- Blog listing at `/blog` — displays all published posts, most recent first
- Individual posts at `/blog/{slug}` — renders the post's Markdown content as styled HTML
- Posts are accessible without authentication
Admin Blog Management
Admin users can manage blog posts at /admin/blog. The admin interface provides a full CMS workflow:
| Action | Route | Description |
|---|---|---|
| List all posts | GET /admin/blog | View all posts (published, draft, and soft-deleted) with status indicators |
| New post | GET /admin/blog/new | Form to create a new blog post |
| Create post | POST /admin/blog/new | Submit new post with title, slug, Markdown content, excerpt, and publish toggle |
| Edit post | GET /admin/blog/{id}/edit | Edit form for an existing post |
| Update post | POST /admin/blog/{id}/edit | Save changes to a post |
| Toggle publish | POST /admin/blog/{id}/publish | Publish a draft or unpublish a live post |
| Delete post | POST /admin/blog/{id}/delete | Soft-delete a post (not permanently removed from the database) |
Post Fields
- Title — Display title of the post
- Slug — URL-safe identifier (auto-generated from title or manually set). Used in the public URL as `/blog/{slug}`
- Content — Full post body written in Markdown. Supports fenced code blocks with syntax highlighting, tables, and table of contents
- Excerpt — Optional short summary shown in the blog listing
- Published — Toggle for publish state. The `published_at` timestamp is set automatically when a post is first published
- Author — Automatically set to the admin user who creates the post
Soft Delete
Deleting a post is a soft delete — the post is marked as deleted but remains in the database. Soft-deleted posts are hidden from the public blog listing but still visible in the admin list.
AppConfig System
MindRouter stores runtime configuration in a key-value AppConfig database table. Values are JSON-encoded and can be read or written at any time without restarting the application. This powers several features that need runtime-adjustable settings.
API
Configuration is accessed via two async CRUD functions in `backend/app/db/crud.py`:
- `get_config_json(db, key, default)` — Read a config value, JSON-decoded. Returns `default` if the key does not exist.
- `set_config(db, key, value, description=None)` — Upsert a config value (JSON-encoded). Creates the key if it does not exist, updates it otherwise.
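Against an in-memory stand-in for the AppConfig table, the two functions behave like the synchronous sketch below (the real functions are async and take a DB session; the dict is only an illustration):

```python
import json

_table = {}  # stands in for the AppConfig key-value table


def get_config_json(key: str, default=None):
    """Read a config value, JSON-decoded; return default if the key is absent."""
    raw = _table.get(key)
    return default if raw is None else json.loads(raw)


def set_config(key: str, value, description=None) -> None:
    # Upsert: JSON-encode and create-or-update the row.
    _table[key] = json.dumps(value)
```

Because values are JSON-encoded, a single table stores strings, numbers, booleans, and lists uniformly.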
Known Configuration Keys
| Key | Type | Default | Description | Managed Via |
|---|---|---|---|---|
| `app.timezone` | string | None | Display timezone for the dashboard (e.g. "America/Los_Angeles") | Site Settings |
| `ollama.enforce_num_ctx` | bool | true | Whether MindRouter overrides user-supplied `num_ctx` values | Site Settings |
| `chat.core_models` | list | `[]` | Subset of models shown in the chat interface model selector. Empty means show all. | Chat Configuration |
| `chat.default_model` | string | null | Pre-selected model in the chat interface | Chat Configuration |
| `chat.system_prompt` | string | null | Global system prompt prepended to all chat conversations. Blank removes the override. | Chat Configuration |
| `chat.max_tokens` | int | 16384 | Default max tokens for chat requests (range: 256–131072) | Chat Configuration |
| `chat.temperature` | float | null | Default temperature for chat requests (range: 0.0–2.0). Null means use model default. | Chat Configuration |
| `chat.think` | bool/string | null | Default thinking mode for chat: true, false, "low", "medium", "high", or null (no default) | Chat Configuration |
All AppConfig values take effect immediately upon save — no application restart is required. These settings are managed via the Admin Dashboard under Site Settings and Chat Configuration.
Configuration Reference
All settings are loaded from environment variables or .env / .env.prod files. Variable names are case-insensitive.
Application
| Variable | Type | Default | Description |
|---|---|---|---|
| `APP_NAME` | str | MindRouter | Application name |
| `APP_VERSION` | str | (from pyproject.toml) | Application version (read dynamically at startup) |
| `DEBUG` | bool | false | Enable debug mode |
| `RELOAD` | bool | false | Auto-reload on code changes (development) |
Database
| Variable | Type | Default | Description |
|---|---|---|---|
| `DATABASE_URL` | str | `mysql+pymysql://...` | MariaDB/MySQL connection string |
| `DATABASE_POOL_SIZE` | int | 30 | Connection pool size |
| `DATABASE_MAX_OVERFLOW` | int | 20 | Max overflow connections beyond pool |
| `DATABASE_ECHO` | bool | false | Log SQL queries |
Cache
| Variable | Type | Default | Description |
|---|---|---|---|
| `REDIS_URL` | str | None | Redis connection string (optional) |
Security
| Variable | Type | Default | Description |
|---|---|---|---|
| `SECRET_KEY` | str | `dev-secret-key-...` | JWT/session signing key (change in production) |
| `JWT_ALGORITHM` | str | HS256 | JWT signing algorithm |
| `JWT_EXPIRATION_HOURS` | int | 24 | JWT token lifetime |
| `SESSION_COOKIE_NAME` | str | mindrouter_session | Session cookie name |
| `SESSION_COOKIE_SECURE` | bool | false | HTTPS-only cookies |
| `SESSION_COOKIE_HTTPONLY` | bool | true | JavaScript-inaccessible cookies |
| `SESSION_COOKIE_SAMESITE` | str | lax | SameSite cookie policy |
| `API_KEY_HASH_ALGORITHM` | str | argon2 | API key hashing algorithm |
Azure AD SSO (Optional)
| Variable | Type | Default | Description |
|---|---|---|---|
| `AZURE_AD_CLIENT_ID` | str | None | Azure AD application (client) ID |
| `AZURE_AD_CLIENT_SECRET` | str | None | Azure AD client secret |
| `AZURE_AD_TENANT_ID` | str | None | Azure AD tenant ID |
| `AZURE_AD_REDIRECT_URI` | str | `https://<host>/login/azure/authorized` | OAuth2 redirect URI |
| `AZURE_AD_DEFAULT_GROUP` | str | other | Default group for new Azure AD users |
When `AZURE_AD_CLIENT_ID` and `AZURE_AD_TENANT_ID` are set, a "Sign in with Microsoft" button appears on the login page. Users are JIT-provisioned on first login — their `jobTitle` from Microsoft Graph determines group assignment (student/faculty/staff/other). Pre-existing accounts are linked by email.
Azure AD Group Mapping
Group assignment uses substring matching on the user's `jobTitle` field from Microsoft Graph:
- `jobTitle` contains "student" → `students` group
- `jobTitle` contains "faculty" or "professor" → `faculty` group
- `jobTitle` contains "staff" → `staff` group
- No match → falls back to `AZURE_AD_DEFAULT_GROUP` (default: `other`)
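The mapping can be sketched as a small function. The check order for titles matching multiple categories is an assumption here, not documented behavior:

```python
DEFAULT_GROUP = "other"  # stands in for AZURE_AD_DEFAULT_GROUP


def map_job_title_to_group(job_title) -> str:
    """Case-insensitive substring match on the Microsoft Graph jobTitle field."""
    title = (job_title or "").lower()
    if "student" in title:
        return "students"
    if "faculty" in title or "professor" in title:
        return "faculty"
    if "staff" in title:
        return "staff"
    return DEFAULT_GROUP
```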
Web Search (Optional)
| Variable | Type | Default | Description |
|---|---|---|---|
BRAVE_SEARCH_API_KEY | str | None | Brave Search API key (enables web search in chat) |
BRAVE_SEARCH_MAX_RESULTS | int | 5 | Maximum search results to inject as context |
When configured, a search toggle appears in the chat input area. Enabling it queries the Brave Search API with the user's message and injects results into the system prompt as additional context.
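The injection step can be pictured as follows. The prompt wording and the result-dict keys (`title`, `url`, `snippet`) are assumptions for illustration; MindRouter's actual prompt format may differ:

```python
def build_search_context(system_prompt: str,
                         results: list[dict],
                         max_results: int = 5) -> str:
    """Fold already-fetched web search results into the system prompt.

    `max_results` mirrors BRAVE_SEARCH_MAX_RESULTS; each result is assumed
    to carry `title`, `url`, and `snippet` keys.
    """
    lines = [f"- {r['title']} ({r['url']}): {r['snippet']}"
             for r in results[:max_results]]
    if not lines:
        return system_prompt
    return system_prompt + "\n\nWeb search results:\n" + "\n".join(lines)
```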
Conversation Retention
| Variable | Type | Default | Description |
|---|---|---|---|
CONVERSATION_RETENTION_DAYS | int | 730 | Conversation retention period (2 years) |
CONVERSATION_CLEANUP_INTERVAL | int | 86400 | Cleanup interval in seconds (24 hours) |
Artifact Storage
| Variable | Type | Default | Description |
|---|---|---|---|
ARTIFACT_STORAGE_PATH | str | /data/artifacts | File storage directory |
ARTIFACT_MAX_SIZE_MB | int | 50 | Max artifact file size |
ARTIFACT_RETENTION_DAYS | int | 365 | Artifact retention period |
Quotas
Quota defaults are now managed per-group in the database via /admin/groups. The environment variables below are deprecated (used only for initial migration seeding) and will be removed in a future release.
| Variable | Type | Default | Description |
|---|---|---|---|
DEFAULT_TOKEN_BUDGET_STUDENT | int | 100000 | Deprecated — use group defaults |
DEFAULT_TOKEN_BUDGET_STAFF | int | 500000 | Deprecated — use group defaults |
DEFAULT_TOKEN_BUDGET_FACULTY | int | 1000000 | Deprecated — use group defaults |
DEFAULT_TOKEN_BUDGET_ADMIN | int | 10000000 | Deprecated — use group defaults |
DEFAULT_RPM_STUDENT | int | 30 | Deprecated — use group defaults |
DEFAULT_RPM_STAFF | int | 60 | Deprecated — use group defaults |
DEFAULT_RPM_FACULTY | int | 120 | Deprecated — use group defaults |
DEFAULT_RPM_ADMIN | int | 1000 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_STUDENT | int | 2 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_STAFF | int | 4 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_FACULTY | int | 8 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_ADMIN | int | 50 | Deprecated — use group defaults |
Scheduler
Scheduler weights are now managed per-group in the database. The per-role weight variables below are deprecated and will be removed in a future release.
| Variable | Type | Default | Description |
|---|---|---|---|
SCHEDULER_WEIGHT_STUDENT | int | 1 | Deprecated — use group scheduler_weight |
SCHEDULER_WEIGHT_STAFF | int | 2 | Deprecated — use group scheduler_weight |
SCHEDULER_WEIGHT_FACULTY | int | 3 | Deprecated — use group scheduler_weight |
SCHEDULER_WEIGHT_ADMIN | int | 10 | Deprecated — use group scheduler_weight |
SCHEDULER_FAIRNESS_WINDOW | int | 300 | Fairness tracking window (seconds) |
SCHEDULER_DEPRIORITIZE_THRESHOLD | float | 0.5 | Usage threshold for deprioritization |
SCHEDULER_SCORE_MODEL_LOADED | int | 100 | Score bonus for pre-loaded model |
SCHEDULER_SCORE_LOW_UTILIZATION | int | 50 | Score bonus for low GPU utilization |
SCHEDULER_SCORE_LATENCY | int | 40 | Score factor for low latency |
SCHEDULER_SCORE_SHORT_QUEUE | int | 30 | Score factor for short queue |
SCHEDULER_SCORE_HIGH_THROUGHPUT | int | 20 | Score factor for high throughput |
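A simplified view of how the score factors above might combine when ranking backends. The normalization of inputs and the exact weighting formula are assumptions; the real logic lives in the scheduler:

```python
def score_backend(model_loaded: bool, gpu_utilization: float,
                  latency_norm: float, queue_norm: float,
                  throughput_norm: float) -> float:
    """Higher score = more attractive backend. The *_norm inputs are
    assumed to be normalized into [0, 1]; constants mirror the
    SCHEDULER_SCORE_* defaults in the table above."""
    score = 0.0
    if model_loaded:
        score += 100                       # SCHEDULER_SCORE_MODEL_LOADED
    if gpu_utilization < 0.5:
        score += 50                        # SCHEDULER_SCORE_LOW_UTILIZATION
    score += 40 * (1 - latency_norm)       # SCHEDULER_SCORE_LATENCY
    score += 30 * (1 - queue_norm)         # SCHEDULER_SCORE_SHORT_QUEUE
    score += 20 * throughput_norm          # SCHEDULER_SCORE_HIGH_THROUGHPUT
    return score
```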
Latency Tracking
| Variable | Type | Default | Description |
|---|---|---|---|
LATENCY_EMA_ALPHA | float | 0.3 | EMA smoothing factor |
LATENCY_EMA_PERSIST_INTERVAL | int | 30 | EMA persistence interval (seconds) |
Backend Registry
| Variable | Type | Default | Description |
|---|---|---|---|
BACKEND_POLL_INTERVAL | int | 30 | Health check interval (seconds) |
BACKEND_HEALTH_TIMEOUT | int | 5 | Health check timeout (seconds) |
BACKEND_UNHEALTHY_THRESHOLD | int | 3 | Failed checks before marking unhealthy |
BACKEND_CIRCUIT_BREAKER_THRESHOLD | int | 3 | Failures before circuit opens |
BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS | int | 30 | Circuit breaker recovery time |
BACKEND_ADAPTIVE_POLL_FAST_INTERVAL | int | 10 | Fast poll interval after unhealthy |
BACKEND_ADAPTIVE_POLL_FAST_DURATION | int | 120 | Duration of fast polling (seconds) |
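The circuit-breaker settings above follow the classic pattern: open after `BACKEND_CIRCUIT_BREAKER_THRESHOLD` consecutive failures, then allow a probe request after `BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS`. An illustrative sketch, not MindRouter's internal implementation:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; permit a probe after recovery."""

    def __init__(self, threshold: int = 3, recovery_seconds: float = 30.0):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the recovery window has elapsed.
        return time.monotonic() - self.opened_at >= self.recovery_seconds
```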
Request Handling
| Variable | Type | Default | Description |
|---|---|---|---|
MAX_REQUEST_SIZE | int | 52428800 | Max HTTP request body (50 MB) |
BACKEND_REQUEST_TIMEOUT | int | 300 | Total request timeout (seconds) |
BACKEND_REQUEST_TIMEOUT_PER_ATTEMPT | int | 180 | Per-attempt timeout (seconds, high for large model prefills) |
BACKEND_RETRY_MAX_ATTEMPTS | int | 3 | Max total retry attempts |
BACKEND_RETRY_ATTEMPTS | int | 2 | Default retry attempts (deprecated — not currently used by retry logic) |
BACKEND_RETRY_BACKOFF | float | 1.0 | Retry backoff multiplier (deprecated — not currently used by retry logic) |
STRUCTURED_OUTPUT_RETRY_ON_INVALID | bool | true | Retry on invalid structured output |
Logging
| Variable | Type | Default | Description |
|---|---|---|---|
LOG_LEVEL | str | INFO | Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
LOG_FORMAT | str | json | Log format (json or text) |
LOG_FILE | str | None | Log file path (optional, stdout if not set) |
Audit Logging
| Variable | Type | Default | Description |
|---|---|---|---|
AUDIT_LOG_ENABLED | bool | true | Enable audit logging |
AUDIT_LOG_PROMPTS | bool | true | Log user prompts |
AUDIT_LOG_RESPONSES | bool | true | Log LLM responses |
Telemetry & GPU
| Variable | Type | Default | Description |
|---|---|---|---|
TELEMETRY_RETENTION_DAYS | int | 30 | Telemetry data retention period |
TELEMETRY_CLEANUP_INTERVAL | int | 3600 | Cleanup interval (seconds) |
SIDECAR_TIMEOUT | int | 5 | Sidecar HTTP call timeout (seconds) |
Observability
| Variable | Type | Default | Description |
|---|---|---|---|
METRICS_ENABLED | bool | true | Enable Prometheus metrics |
METRICS_PREFIX | str | mindrouter | Metrics name prefix |
OTEL_ENABLED | bool | false | Enable OpenTelemetry |
OTEL_EXPORTER_OTLP_ENDPOINT | str | None | OpenTelemetry exporter endpoint |
CORS
| Variable | Type | Default | Description |
|---|---|---|---|
CORS_ORIGINS | list | ["http://localhost:3000", "http://localhost:8000"] | Allowed origins (JSON array or comma-separated) |
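Since `CORS_ORIGINS` accepts either a JSON array or a comma-separated string, a parser handling both forms might look like this (a sketch of the dual format described above, not the actual settings loader):

```python
import json

def parse_cors_origins(raw: str) -> list[str]:
    """Accept a JSON array ('["https://a"]') or a comma-separated string."""
    raw = raw.strip()
    if raw.startswith("["):
        return json.loads(raw)
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```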
Chat UI
| Variable | Type | Default | Description |
|---|---|---|---|
CHAT_FILES_PATH | str | /data/chat_files | Chat file upload directory |
CHAT_UPLOAD_MAX_SIZE_MB | int | 10 | Max upload file size (MB) |
CHAT_UPLOAD_ALLOWED_EXTENSIONS | list | See below | Allowed upload file extensions |
Default allowed extensions: .txt, .md, .csv, .json, .html, .htm, .log, .docx, .xlsx, .pdf, .jpg, .jpeg, .png, .gif, .webp
Tokenizer
| Variable | Type | Default | Description |
|---|---|---|---|
DEFAULT_TOKENIZER | str | cl100k_base | Default tokenizer encoding |
Status Enums Reference
MindRouter uses the following status enumerations throughout the system:
BackendStatus
- `HEALTHY` — Backend is online and accepting requests
- `UNHEALTHY` — Backend is failing health checks
- `DISABLED` — Backend has been manually disabled by an admin
- `DRAINING` — Backend is finishing in-flight requests but not accepting new ones
- `UNKNOWN` — Backend status has not yet been determined
NodeStatus
- `ONLINE` — Node sidecar is reachable and reporting metrics
- `OFFLINE` — Node sidecar is unreachable
- `UNKNOWN` — Node status has not yet been determined
RequestStatus
- `QUEUED` — Request is waiting for a backend slot
- `PROCESSING` — Request is being handled by a backend
- `COMPLETED` — Request finished successfully
- `FAILED` — Request encountered an error
- `CANCELLED` — Request was cancelled (e.g., client disconnect)
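In code, these enumerations could be represented as string-valued `Enum` classes — a sketch; the actual class and value spellings in MindRouter may differ:

```python
from enum import Enum

class BackendStatus(str, Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DISABLED = "disabled"
    DRAINING = "draining"
    UNKNOWN = "unknown"

class RequestStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
```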
Production Security Hardening
When deploying MindRouter to production, apply the following security measures:
- **Secure session cookies** — Set `SESSION_COOKIE_SECURE=true` to ensure session cookies are only sent over HTTPS.
- **Security headers** — Add `Strict-Transport-Security` (HSTS), `X-Frame-Options`, and `Content-Security-Policy` (CSP) headers at the reverse proxy layer (e.g., nginx).
- **Restrict CORS origins** — Set `CORS_ORIGINS` to only the specific domains that need API access. Do not use `*` in production.
Deployment
MindRouter is designed for deployment on Linux servers with NVIDIA GPUs. The full deployment guide covers:
- Rocky Linux 8 prerequisites and dependency installation
- SSL/TLS configuration (self-signed and Let's Encrypt)
- Apache reverse proxy setup
- Firewall and SELinux configuration
- Docker Compose production stack
- Database migrations
- GPU sidecar agent deployment
- Node and backend registration
- Verification and ongoing operations
GPU Sidecar Deployment
The GPU sidecar agent runs on each inference node to expose per-GPU hardware metrics and enable auto-discovery of inference endpoints. Build and deploy directly from GitHub — no need to clone the repository.
Create the env file (once per node)
```bash
sudo mkdir -p /etc/mindrouter
python3 -c "import secrets; print('SIDECAR_SECRET_KEY=' + secrets.token_hex(32))" | sudo tee /etc/mindrouter/sidecar.env
sudo chmod 600 /etc/mindrouter/sidecar.env
```
Build a specific version
```bash
docker build -t mindrouter-sidecar:v2.0.0 \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
```
Build latest from master
```bash
docker build -t mindrouter-sidecar:latest \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git:sidecar
```
Run the sidecar
```bash
# Run bound to localhost only (nginx will proxy external traffic)
docker run -d --name gpu-sidecar \
  --gpus all \
  -p 127.0.0.1:18007:8007 \
  --env-file /etc/mindrouter/sidecar.env \
  --restart unless-stopped \
  mindrouter-sidecar:v2.0.0
```
Upgrade to a new version
```bash
docker build -t mindrouter-sidecar:v2.0.0 \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
docker stop gpu-sidecar && docker rm gpu-sidecar
docker run -d --name gpu-sidecar \
  --gpus all \
  -p 127.0.0.1:18007:8007 \
  --env-file /etc/mindrouter/sidecar.env \
  --restart unless-stopped \
  mindrouter-sidecar:v2.0.0
```
In production, bind the sidecar to localhost only (as shown above) and use an nginx reverse proxy on port 8007 to handle external traffic.
The env file at /etc/mindrouter/sidecar.env persists the secret key across upgrades — generate it once per node and it's reused automatically.
Testing
MindRouter has a comprehensive test suite covering unit, integration, end-to-end, smoke, stress, and accessibility tests.
Quick Reference
| Command | Description |
|---|---|
make test-unit | Run unit tests (525+ tests) |
make test-int | Integration tests (requires live backends) |
make test-e2e | End-to-end tests |
make test-smoke | Smoke tests (full API surface) |
make test-stress | Load/stress tests |
make test-a11y | WCAG 2.1 accessibility tests |
make test-sidecar | GPU sidecar agent tests |
make test-all | Run all test suites |