MindRouter Documentation
MindRouter is a production-ready LLM inference load balancer and translation layer that fronts a heterogeneous cluster of Ollama and vLLM inference backends. It provides a unified OpenAI-compatible API surface with native Ollama compatibility, fair-share scheduling, per-user quotas, full audit logging, and real-time GPU telemetry.
Developed by Luke Sheneman, Research Computing and Data Services (RCDS), Institute for Interdisciplinary Data Sciences (IIDS), University of Idaho.
Overview
MindRouter sits between API consumers and GPU inference servers, providing:
- Unified API Gateway — OpenAI-compatible `/v1/*`, Ollama-compatible `/api/*`, and Anthropic-compatible `/anthropic/v1/*` endpoints, all backed by the same pool of inference servers.
- Cross-Engine Routing — A request arriving in OpenAI, Ollama, or Anthropic format can be served by any backend. The translation layer handles all protocol conversion transparently.
- Fair-Share Scheduling — Weighted Deficit Round Robin (WDRR) ensures equitable GPU access across users in different groups with configurable priorities.
- Multi-Modal Support — Text chat, text completion, embeddings, multimodal models, structured JSON outputs, and tool calling (function calling).
- Per-User Quotas — Token budgets, requests-per-minute limits, and concurrent request caps, with defaults inherited from the user's group.
- Full Audit Logging — Every prompt, response, and token count is recorded for compliance and review.
- Real-Time GPU Telemetry — Per-GPU utilization, memory, temperature, and power metrics via lightweight sidecar agents.
- Web Dashboards — Public status page, user self-service dashboard, admin control panel, and built-in chat interface.
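The fair-share discipline named above, Weighted Deficit Round Robin, can be illustrated with a minimal sketch. This is not MindRouter's actual scheduler code; the function, queue layout, and quantum value are assumptions chosen to show how per-user deficit counters and group weights interact:

```python
from collections import deque

def wdrr_pick(queues, deficits, weights, cost, quantum=100):
    """One WDRR pass: each user's deficit grows by weight * quantum,
    and a queued request is dispatched only once the user's deficit
    covers its cost (illustrative sketch, not MindRouter internals)."""
    dispatched = []
    for user, q in queues.items():
        deficits[user] += weights[user] * quantum
        while q and cost(q[0]) <= deficits[user]:
            req = q.popleft()
            deficits[user] -= cost(req)
            dispatched.append((user, req))
    return dispatched

# Two users with the seed-data weights (Admin 10 vs Student 1)
# competing with equal-cost requests: the heavier weight drains first,
# but the lighter user accumulates deficit and is served eventually.
queues = {"admin": deque(["a1", "a2"]), "student1": deque(["s1", "s2"])}
deficits = {"admin": 0, "student1": 0}
weights = {"admin": 10, "student1": 1}
order = wdrr_pick(queues, deficits, weights, cost=lambda r: 500)
```

Because deficits persist between passes, a low-weight user is never starved — it just waits more rounds before its counter covers a request.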
Who It's For
- Research computing centers managing shared GPU clusters for multiple user groups
- Universities providing LLM access to students, staff, and faculty with differentiated quotas
- Organizations needing a unified API gateway across mixed Ollama/vLLM infrastructure
Architecture
MindRouter follows a layered architecture:
Client Request (OpenAI, Ollama, or Anthropic format)
|
v
+-----------------------------+
| API Gateway Layer | ← /v1/*, /api/*, /anthropic/*, /api/admin/*
+-----------------------------+
| Authentication & Quotas | ← API key verification, rate limiting
+-----------------------------+
| Translation Layer | ← OpenAI/Ollama/Anthropic ↔ Canonical ↔ Ollama/vLLM
+-----------------------------+
| Fair-Share Scheduler | ← WDRR with per-user deficit counters
+-----------------------------+
| Backend Registry | ← Health monitoring, model tracking
+-----------------------------+
|
v
+---------------+-------------+
| GPU Node 1 | GPU Node 2 | ...
| +---------+ | +--------+ |
| | Sidecar | | |Sidecar | | ← Per-node GPU metrics agent
| +---------+ | +--------+ |
| | Ollama | | | vLLM | | ← Inference engines
| +---------+ | +--------+ |
+---------------+-------------+
Key concepts:
- A Node is a physical GPU server running a sidecar agent.
- A Backend is an inference endpoint (Ollama or vLLM instance) running on a node. Multiple backends can share a node, each assigned specific GPUs via `gpu_indices`.
Getting Started
Prerequisites
- Docker and Docker Compose
- Python 3.11+ (for local development)
Quickstart with Docker Compose
# 1. Clone and configure
git clone <repository-url>
cd mindrouter
cp .env.example .env
nano .env # Set DATABASE_URL, SECRET_KEY, etc.
# 2. Start all services
docker compose up --build
# 3. Run database migrations
docker compose exec app alembic upgrade head
# 4. Seed development data (creates users, quotas, API keys)
docker compose exec app python scripts/seed_dev_data.py
Default Development Credentials
After running the seed script:
| Username | Password | Group | Scheduler Weight |
|---|---|---|---|
| admin | admin123 | Admin | 10 |
| faculty1 | faculty123 | Faculty | 3 |
| staff1 | staff123 | Staff | 2 |
| student1 | student123 | Student | 1 |
Accessing the Application
| URL | Description |
|---|---|
| http://localhost:8000/ | Public status page |
| http://localhost:8000/dashboard | User dashboard (login required) |
| http://localhost:8000/admin | Admin dashboard (admin group required) |
| http://localhost:8000/chat | Chat interface (login required) |
| http://localhost:8000/docs | Interactive API docs (Swagger UI) |
| http://localhost:8000/redoc | API reference (ReDoc) |
API Reference
Interactive API Documentation
MindRouter includes built-in interactive API documentation powered by FastAPI:
- Swagger UI at `/docs` — Interactive API explorer where you can try endpoints directly from your browser. Supports authentication via the "Authorize" button (enter your API key as a Bearer token).
- ReDoc at `/redoc` — Clean, readable API reference with request/response schemas and examples.
Both are auto-generated from the application's route definitions and Pydantic models, so they always reflect the current API surface.
Authentication
All inference and admin endpoints require authentication. MindRouter supports two methods:
API Key (Bearer Token):
curl -H "Authorization: Bearer mr2_your-api-key" http://localhost:8000/v1/models
API Key (Header):
curl -H "X-API-Key: mr2_your-api-key" http://localhost:8000/v1/models
Session Cookie (dashboard/admin AJAX only): Browser-based dashboard calls authenticate via the mindrouter_session cookie set at login. This is used internally by the web UI and is not intended for programmatic access.
Error Responses
All error responses follow a consistent format:
{
"detail": "Human-readable error message"
}
Common HTTP status codes:
| Code | Meaning |
|---|---|
| 400 | Invalid request body or parameters |
| 401 | Missing or invalid API key |
| 403 | Insufficient permissions (e.g., non-admin accessing admin endpoint) |
| 404 | Resource not found |
| 409 | Conflict (duplicate name, URL, etc.) |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
Error Response Formats by API Style
Error response formats differ depending on which API style the client is using:
- OpenAI (`/v1/*`) — Returns a nested error object: `{"error": {"message": "...", "type": "...", "code": ...}}`
- Ollama (`/api/*`) — Returns a plain detail string: `{"detail": "..."}`
- Anthropic (`/anthropic/v1/*`) — Returns a typed error object: `{"type": "error", "error": {"type": "...", "message": "..."}}`
Model Name Matching
Model names in requests are matched exactly against the model catalog. There is no prefix matching, alias resolution, or fuzzy matching. The model field must match a model name as listed by /v1/models or /api/tags.
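Because matching is exact, a client can fail fast before sending a request by checking the name against the catalog returned by `/v1/models`. The helper below is hypothetical, not part of any SDK:

```python
def resolve_model(requested, catalog):
    """Exact-match lookup mirroring the documented behavior: no prefix
    matching, no aliases, no fuzzy matching (hypothetical client-side
    helper)."""
    if requested not in catalog:
        raise LookupError(f"model not found: {requested!r}")
    return requested

catalog = {"llama3.2", "llama3.3:70b"}
resolve_model("llama3.3:70b", catalog)  # exact name, including the tag: ok
# resolve_model("llama3.3", catalog)    # would raise: the tag is part of the name
```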
OpenAI-Compatible Endpoints
These endpoints accept and return data in the OpenAI API format. Any OpenAI-compatible client or SDK can be pointed at MindRouter by changing the base URL.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /v1/chat/completions | API Key | Chat completions (streaming and non-streaming) |
| POST | /v1/completions | API Key | Text completions (legacy) |
| POST | /v1/embeddings | API Key | Generate embeddings |
| POST | /v1/rerank | API Key | Rerank documents against a query |
| POST | /v1/score | API Key | Score similarity between text pairs |
| GET | /v1/models | API Key | List available models |
Chat Completions
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 500,
"stream": false
}'
Response:
{
"id": "chatcmpl-abc123...",
"object": "chat.completion",
"created": 1700000000,
"model": "llama3.2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 10,
"total_tokens": 35
}
}
Streaming — Set "stream": true to receive Server-Sent Events (SSE):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
Thinking/Reasoning Mode:
MindRouter supports multiple formats for controlling thinking/reasoning on models that support it (qwen3.5, qwen3, gpt-oss):
// gpt-oss: control reasoning depth
{
"model": "openai/gpt-oss-120b",
"messages": [{"role": "user", "content": "Solve this step by step"}],
"reasoning_effort": "high",
"max_completion_tokens": 16384
}
// Qwen-style: toggle thinking on/off
{
"model": "qwen/qwen3.5-400b",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"chat_template_kwargs": {"enable_thinking": true},
"max_completion_tokens": 16384
}
When thinking is enabled, the response includes reasoning_content alongside content.
Use `max_completion_tokens` (or `max_tokens`) to set an adequate budget — 16384 is recommended for qwen3.5-400b with thinking enabled.
Output Token Limits:
MindRouter accepts both max_completion_tokens (preferred, current OpenAI standard) and max_tokens (legacy). If both are provided, max_completion_tokens takes priority.
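The precedence rule can be sketched as a small resolver (illustrative only, not MindRouter's actual code):

```python
def output_token_limit(body, default=None):
    """Resolve the output token cap as documented: max_completion_tokens
    (current OpenAI standard) wins over legacy max_tokens."""
    if body.get("max_completion_tokens") is not None:
        return body["max_completion_tokens"]
    if body.get("max_tokens") is not None:
        return body["max_tokens"]
    return default

output_token_limit({"max_completion_tokens": 16384, "max_tokens": 500})  # 16384
output_token_limit({"max_tokens": 500})                                  # 500
```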
Structured Output (JSON Mode):
{
"model": "llama3.2",
"messages": [{"role": "user", "content": "List 3 colors as JSON"}],
"response_format": {"type": "json_object"}
}
Structured Output (JSON Schema):
{
"model": "llama3.2",
"messages": [{"role": "user", "content": "List 3 colors"}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "colors",
"schema": {
"type": "object",
"properties": {
"colors": {"type": "array", "items": {"type": "string"}}
}
}
}
}
}
Vision (Multimodal):
{
"model": "llava",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
]
}
Tool Calling (Function Calling):
{
"model": "llama3.2",
"messages": [{"role": "user", "content": "What's the weather in Seattle?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}],
"tool_choice": "auto"
}
When the model decides to call a tool, the response includes tool_calls with finish_reason: "tool_calls". Submit the tool result back as a role: "tool" message with the matching tool_call_id.
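The round trip is plain message bookkeeping in the OpenAI chat format. The helper below is a hypothetical sketch — MindRouter does not provide it — showing how the `tool` message is matched to the assistant's `tool_call_id`:

```python
import json

def append_tool_result(messages, assistant_msg, results):
    """Append the assistant's tool_calls message, then one role:"tool"
    message per call, matched by tool_call_id (OpenAI chat format)."""
    messages = messages + [assistant_msg]
    for call in assistant_msg["tool_calls"]:
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(results[call["function"]["name"]]),
        })
    return messages

assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Seattle"}'},
    }],
}
history = [{"role": "user", "content": "What's the weather in Seattle?"}]
history = append_tool_result(history, assistant_msg, {"get_weather": {"temp_f": 57}})
# Send the updated history back to /v1/chat/completions for the final answer.
```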
Embeddings
curl -X POST http://localhost:8000/v1/embeddings \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "nomic-embed-text", "input": "Hello world"}'
Reranking
curl -X POST http://localhost:8000/v1/rerank \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Reranker-8B",
"query": "What is machine learning?",
"documents": [
"Machine learning is a subset of AI.",
"The weather is sunny today.",
"Deep learning uses neural networks."
],
"top_n": 2
}'
Scoring
curl -X POST http://localhost:8000/v1/score \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Reranker-8B",
"text_1": "What is machine learning?",
"text_2": ["Machine learning is a subset of AI.", "The weather is sunny today."]
}'
List Models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer mr2_your-api-key"
Response:
{
"object": "list",
"data": [
{
"id": "llama3.3:70b",
"object": "model",
"created": 1700000000,
"owned_by": "mindrouter",
"capabilities": {
"multimodal": false,
"embeddings": false,
"structured_output": true,
"thinking": false
},
"backends": ["node1-gpu0", "node1-gpu2"],
"context_length": 32768,
"model_max_context": 131072,
"parameter_count": "70B",
"quantization": "Q4_K_M",
"family": "llama"
}
]
}
Fields include:
- `context_length` — effective context window (`num_ctx` injected per request)
- `model_max_context` — architectural maximum context the model supports
- `parameter_count` — model size (e.g. "7B", "70B")
- `quantization` — quantization level (e.g. "Q4_K_M", "FP16")
- `family` — model family (e.g. "llama", "qwen2")
- `capabilities.thinking` — whether the model supports thinking/reasoning mode
Ollama-Compatible Endpoints
These endpoints accept and return data in Ollama's native format. Ollama clients can be pointed at MindRouter as a drop-in replacement.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/chat | API Key | Ollama chat (streaming by default) |
| POST | /api/generate | API Key | Ollama text generation |
| POST | /api/embeddings | API Key | Ollama embeddings |
| GET | /api/tags | API Key | List models (Ollama format) |
List Models (Ollama Format)
curl http://localhost:8000/api/tags \
-H "Authorization: Bearer mr2_your-api-key"
Response:
{
"models": [
{
"name": "llama3.3:70b",
"model": "llama3.3:70b",
"modified_at": "2026-02-28T12:00:00",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"parameter_size": "70B",
"quantization_level": "Q4_K_M"
},
"context_length": 32768,
"model_max_context": 131072
}
]
}
Ollama Chat
curl -X POST http://localhost:8000/api/chat \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
Note: Ollama defaults to stream: true. Set "stream": false explicitly for non-streaming responses.
Thinking/Reasoning Mode (Ollama):
Use the think field at the top level:
// Qwen-style: boolean toggle
{"model": "qwen3-32k:32b", "messages": [...], "think": true, "stream": false}
// gpt-oss: string effort level
{"model": "gpt-oss-32k:120b", "messages": [...], "think": "high", "stream": false}
The response includes a thinking field in the message. For /api/generate, thinking content appears as a top-level thinking field alongside response.
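The two accepted shapes of `think` (boolean toggle vs. effort string) can be normalized as sketched below. This is an assumption about the mapping, written as a standalone helper rather than MindRouter's translation-layer code:

```python
def normalize_think(think):
    """Normalize Ollama's top-level `think` field as described above:
    booleans toggle thinking (Qwen-style), strings select a gpt-oss
    effort level (sketch, not MindRouter internals)."""
    if isinstance(think, bool):
        return {"enabled": think, "effort": None}
    if think in ("low", "medium", "high"):
        return {"enabled": True, "effort": think}
    raise ValueError(f"unsupported think value: {think!r}")

normalize_think(True)    # Qwen-style toggle
normalize_think("high")  # gpt-oss effort level
```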
Ollama Generate
curl -X POST http://localhost:8000/api/generate \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'
Anthropic-Compatible Endpoint
This endpoint accepts and returns data in the Anthropic Messages API format. Anthropic SDK clients (Python, TypeScript) can be pointed at MindRouter by setting base_url.
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /anthropic/v1/messages | API Key | Anthropic Messages API (streaming and non-streaming) |
Messages
curl -X POST http://localhost:8000/anthropic/v1/messages \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"max_tokens": 500,
"messages": [{"role": "user", "content": "Hello!"}]
}'
Response:
{
"id": "msg_abc123...",
"type": "message",
"role": "assistant",
"model": "llama3.2",
"content": [
{"type": "text", "text": "Hello! How can I help you today?"}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 10,
"output_tokens": 12
}
}
Streaming — Set "stream": true to receive Anthropic SSE events (message_start, content_block_delta, message_stop, etc.):
curl -X POST http://localhost:8000/anthropic/v1/messages \
-H "Authorization: Bearer mr2_your-api-key" \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "max_tokens": 500, "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
System Prompt:
{
"model": "llama3.2",
"max_tokens": 500,
"system": "You are a helpful assistant.",
"messages": [{"role": "user", "content": "Hello!"}]
}
SDK Usage (Python)
import anthropic
client = anthropic.Anthropic(
base_url="http://localhost:8000/anthropic",
api_key="mr2_your-api-key",
)
message = client.messages.create(
model="llama3.2",
max_tokens=500,
messages=[{"role": "user", "content": "Hello!"}],
)
Supported features:
- System prompts (string or content block array)
- Multimodal inputs (base64 and URL images)
- Tool calling — `tools` with `input_schema`, `tool_choice` (auto/any/tool), `tool_use`/`tool_result` content blocks, streaming tool use with `input_json_delta`
- Thinking/reasoning mode (`thinking.type`: `enabled`, `adaptive`, `disabled`)
- Structured output via `output_config.format` with `type: "json_schema"`
- Parameters: `max_tokens`, `temperature`, `top_p`, `top_k`, `stop_sequences`, `stream`
- `metadata.user_id` mapping
Note: This is inbound-only — there are no Anthropic backends. Requests are translated to canonical format and routed to Ollama/vLLM backends like any other request.
Health & Metrics Endpoints
These endpoints are unauthenticated and intended for monitoring infrastructure.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /healthz | None | Liveness probe (always 200 if app is running) |
| GET | /readyz | None | Readiness probe (checks DB + healthy backends) |
| GET | /metrics | None | Prometheus metrics (text format) |
| GET | /status | None | Cluster status summary (JSON) |
| GET | /api/cluster/total-tokens | None | Total tokens ever served (cached 10s) |
| GET | /api/cluster/trends | None | Token and active-user trends (?range=hour|day|week|month|year) |
| GET | /api/cluster/throughput | None | Token throughput, requests/min, active requests |
Prometheus Metrics
The /metrics endpoint exposes the following Prometheus metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| mindrouter_requests_total | Counter | endpoint, status | Total requests processed |
| mindrouter_request_latency_seconds | Histogram | endpoint | Request latency |
| mindrouter_queue_size | Gauge | — | Current scheduler queue depth |
| mindrouter_active_backends | Gauge | — | Number of healthy backends |
| mindrouter_tokens_total | Counter | type (prompt/completion) | Total tokens processed |
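For quick scripting against `/metrics`, the Prometheus text format can be read with a few lines of parsing. A minimal sketch (real clients should prefer a proper Prometheus library; the sample values below are made up):

```python
def parse_prometheus(text):
    """Parse simple `name{labels} value` lines of the Prometheus text
    exposition format into a dict, skipping HELP/TYPE comments.
    Minimal sketch only -- no histogram/summary handling."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels] = float(value)
    return metrics

sample = """\
# TYPE mindrouter_queue_size gauge
mindrouter_queue_size 3
mindrouter_requests_total{endpoint="/v1/chat/completions",status="200"} 1042
"""
parsed = parse_prometheus(sample)
```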
Admin API Endpoints
All admin endpoints require membership in a group with is_admin = true. They are mounted under /api/admin/.
Backend Management
| Method | Path | Description |
|---|---|---|
| POST | /api/admin/backends/register | Register a new inference backend |
| GET | /api/admin/backends | List all backends |
| PATCH | /api/admin/backends/{id} | Update backend properties |
| POST | /api/admin/backends/{id}/disable | Disable a backend |
| POST | /api/admin/backends/{id}/enable | Enable a disabled backend |
| POST | /admin/backends/{id}/drain | Initiate graceful drain (stops new requests, waits for in-flight to complete, then disables). Dashboard route only, not available via API. |
| POST | /api/admin/backends/{id}/refresh | Force-refresh capabilities and models |
| POST | /api/admin/backends/{id}/ollama/pull | Start pulling a model to an Ollama backend |
| GET | /api/admin/backends/{id}/ollama/pull/{job_id} | Poll progress of an Ollama model pull |
| POST | /api/admin/backends/{id}/ollama/delete | Delete a model from an Ollama backend |
Node Management
| Method | Path | Description |
|---|---|---|
| POST | /api/admin/nodes/register | Register a new GPU node |
| GET | /api/admin/nodes | List all nodes |
| PATCH | /api/admin/nodes/{id} | Update node properties |
| DELETE | /api/admin/nodes/{id} | Delete a node (fails if backends reference it) |
| POST | /api/admin/nodes/{id}/refresh | Force-refresh sidecar data |
User Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/users | List users (filterable by group, searchable by name/email) |
| POST | /api/admin/users | Create a new user with group-based quota defaults |
| GET | /api/admin/users/{id} | User detail with usage stats, API keys, monthly usage |
| PATCH | /api/admin/users/{id} | Update user profile, group, and quota overrides |
| DELETE | /api/admin/users/{id} | Hard-delete a user and all associated data |
| POST | /api/admin/users/{id}/api-keys | Create an API key for a user |
Group Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/groups | List all groups with user counts |
| POST | /api/admin/groups | Create a new group with quota defaults |
| PATCH | /api/admin/groups/{id} | Update group defaults (token budget, RPM, weight, etc.) |
| DELETE | /api/admin/groups/{id} | Delete a group (fails if users are assigned) |
API Key Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/api-keys | List all API keys across users (searchable, filterable by status) |
Quota Management
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/quota-requests | List pending quota increase requests |
| POST | /api/admin/quota-requests/{id}/review | Approve or deny a quota request |
Conversations
| Method | Path | Description |
|---|---|---|
| GET | /admin/conversations/export | Export conversations as CSV or JSON (form-based, filterable) |
| GET | /api/admin/conversations/export | Bulk API export (JSON with content, for programmatic access) |
Queue & Audit
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/queue | Scheduler queue statistics |
| GET | /api/admin/audit/search | Search audit logs (filter by user, model, status, date, text) |
| GET | /api/admin/audit/{id} | Full audit detail including prompt and response content |
Telemetry
| Method | Path | Description |
|---|---|---|
| GET | /api/admin/telemetry/overview | Cluster-wide telemetry (nodes, backends, GPUs) |
| GET | /api/admin/telemetry/latest | Lightweight polling endpoint for dashboard |
| GET | /api/admin/telemetry/backends/{id}/history | Time-series telemetry for a backend |
| GET | /api/admin/telemetry/gpus/{id}/history | Time-series telemetry for a GPU device |
| GET | /api/admin/telemetry/nodes/{id}/history | Aggregated time-series for a node (all GPUs) |
| GET | /api/admin/telemetry/export | Export raw telemetry as JSON or CSV |
Web Dashboard
MindRouter includes a full web dashboard built with Bootstrap 5. All pages extend a common base template with navigation and accessibility features (WCAG 2.1 Level AA).
Public Pages
| Page | URL | Description |
|---|---|---|
| Cluster Status | / | Shows healthy backend count, available models, queue size, and overall cluster status |
| Login | /login | Username/password authentication (Azure AD SSO when configured) |
| Blog | /blog | Public blog listing with published posts |
| Blog Post | /blog/{slug} | Individual blog post rendered from Markdown |
User Dashboard
| Page | URL | Description |
|---|---|---|
| Dashboard | /dashboard | Token usage progress bar, active API keys, quota usage history, change password |
| Change Password | POST /dashboard/change-password | Change password for local (non-SSO) accounts. Requires current password, new password (min 8 chars), and confirmation. Not available for Azure AD SSO users. |
| Request Quota | /dashboard/request-quota | Submit a quota increase request with justification |
| Key Created | (after creation) | Displays the full API key once (copy-to-clipboard) |
The user dashboard includes:
- Dark mode toggle — saved to browser localStorage, persists across sessions
- Live token usage — usage statistics poll every 1 second for real-time feedback without page refresh
- Lifetime vs rolling usage — displays both Lifetime Token Usage (all-time total, never resets) and Current Period Usage (resets when the budget period rolls over)
- Quota details — current RPM limit and max concurrent requests shown in the Quota Details card
- API key expiration warnings — keys within 7 days of expiration show a yellow countdown; expired keys display an “Expired” badge; “Last Used” column shows last authentication time
Admin Dashboard
The admin dashboard has a persistent sidebar with links to all admin pages. Access requires membership in a group with is_admin = true.
| Page | URL | Description |
|---|---|---|
| Overview | /admin | System metrics overview, health alert banner for unhealthy backends/offline nodes, pending request badges, system force offline/online toggle |
| Backends | /admin/backends | Backend health, models, enable/disable/drain controls |
| Nodes | /admin/nodes | GPU node management, sidecar status, hardware specs, take offline/bring online/force drain controls |
| Models | /admin/models | Model management: multimodal and thinking overrides, full metadata overrides (family, parameters, quantization, context length, etc.), Ollama pull/delete |
| GPU Metrics | /admin/metrics | Real-time GPU utilization, memory, temperature, power charts with time range controls |
| Users | /admin/users | User accounts with search, group filter, and pagination |
| User Detail | /admin/users/{id} | Individual user profile, usage stats, API keys, monthly usage chart, quota overrides, masquerade (view dashboard as user) |
| Groups | /admin/groups | Group management: create, edit, delete groups with quota defaults and scheduler weights |
| API Keys | /admin/api-keys | All API keys across users with search, status filter, and pagination |
| Requests | /admin/requests | Pending API key and quota increase requests, approve/deny |
| Audit Log | /admin/audit | Inference request history with filtering, search, and CSV/JSON export |
| Conversations | /admin/conversations | Browse and export all user conversations |
| Chat Configuration | /admin/chat-config | Configure chat UI defaults: core models, default model, system prompt, max tokens, temperature, thinking mode |
| Blog Management | /admin/blog | Blog CMS: create, edit, publish/unpublish, and delete blog posts |
| Site Settings | /admin/settings | Global settings: display timezone, Ollama context length enforcement |
System Force Offline/Online
Administrators can force the entire MindRouter system offline from the admin overview page (/admin). This is useful for planned maintenance windows or emergency situations.
- Force Offline (`POST /admin/system/toggle-online`) — Stops the backend polling loop, marks all backends as unhealthy in the database, and sets an internal `_force_offline` flag. While offline, no health checks or discovery run, and no inference requests can be served.
- Force Online — Clears the offline flag, closes and reloads all backend adapters and sidecar clients from the database, restarts the polling loop, and immediately runs a full poll cycle to restore backend health status.
The toggle is available as a button on the admin overview page. The system status is visible to all users on the public status page.
Node Lifecycle Management
Beyond basic node registration and editing, administrators can manage node operational state from the Nodes page (/admin/nodes):
| Action | Route | Description |
|---|---|---|
| Take Offline | POST /admin/nodes/{id}/take-offline | Disables all backends on the node and marks the node status as OFFLINE. New requests will not be routed to any backend on this node. |
| Bring Online | POST /admin/nodes/{id}/bring-online | Re-enables all backends on the node and marks the node status as ONLINE. Backends will begin receiving requests again after the next health poll. |
| View Active Requests | GET /admin/nodes/{id}/active-requests | Returns a JSON count of in-flight requests across all backends on the node. Useful for monitoring drain progress before taking a node offline. |
| Force Drain | POST /admin/nodes/{id}/force-drain | Force-cancels all active (in-flight) requests on the node's backends. Returns the number of cancelled requests. Use this when you need to take a node offline immediately without waiting for requests to complete naturally. |
Recommended workflow for node maintenance:
- Take the node offline to stop new requests from being routed there.
- Monitor active requests using the active requests endpoint until the count reaches zero.
- If requests are stuck, use force drain to cancel them.
- Perform maintenance (upgrade vLLM, restart Ollama, etc.).
- Bring the node back online.
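The "monitor until zero" step can be sketched as a polling loop. Here `get_active_count` is a hypothetical callable standing in for a wrapper around `GET /admin/nodes/{id}/active-requests`:

```python
import time

def wait_for_drain(get_active_count, timeout_s=300, poll_s=5.0):
    """Poll an active-request counter until it reaches zero or the
    timeout expires. Returns True when drained; on False the caller
    can fall back to POST /admin/nodes/{id}/force-drain."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_active_count() == 0:
            return True
        time.sleep(poll_s)
    return False

# Simulated counter: requests finish over three polls.
counts = iter([2, 1, 0])
drained = wait_for_drain(lambda: next(counts), timeout_s=5, poll_s=0)
```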
Admin Models Page
The Models page at /admin/models provides comprehensive model management. Models are grouped by name (since the same model may appear on multiple backends) and presented with their current metadata and override status.
Capability Overrides
MindRouter auto-detects model capabilities during discovery, but admins can override the detected values:
- Multimodal override — Toggle or reset the `supports_multimodal` flag for all instances of a model. Use this when auto-detection incorrectly classifies a model (e.g., a vision model whose name does not match the usual patterns).
- Thinking override — Toggle or reset the `supports_thinking` flag for all instances. Use this for models that support thinking/reasoning mode but are not detected automatically.
- Resetting an override returns the field to auto-detected values from the next discovery cycle.
Metadata Overrides
Admins can override any model metadata field across all instances of a model via the metadata edit form. Overridable fields:
| Field | Type | Description |
|---|---|---|
| family | string | Model family (e.g., llama, qwen2) |
| parameter_count | string | Model size (e.g., 7B, 70B) |
| quantization | string | Quantization level (e.g., Q4_K_M, FP16) |
| context_length | int | Effective context window (injected as num_ctx). Overrides the auto-discovered value. |
| embedding_length | int | Embedding dimension length |
| head_count | int | Number of attention heads |
| layer_count | int | Number of transformer layers |
| feed_forward_length | int | Feed-forward network dimension |
| model_format | string | Model format (e.g., gguf, safetensors) |
| parent_model | string | Parent model identifier |
| capabilities | string | Comma-separated list of capabilities (stored as JSON array) |
| description | string | Human-readable model description |
| model_url | string | URL to model documentation or repository |
Clearing a field (setting it to blank) removes the override and reverts to the auto-discovered value. Admins can also reset all overrides at once via the "Reset All Overrides" action.
Ollama Model Pull and Delete
The Models page lists all Ollama backends and provides UI controls to:
- Pull a model — Trigger a model download on a specific Ollama backend via the admin API (`POST /api/admin/backends/{id}/ollama/pull`). Progress can be polled until completion.
- Delete a model — Remove a model from a specific Ollama backend via the admin API (`POST /api/admin/backends/{id}/ollama/delete`).
Admin Masquerade
Administrators can "masquerade" as another user to view the user dashboard from that user's perspective. This is useful for troubleshooting user-reported issues or verifying quota and usage display.
- Start masquerade (`POST /admin/masquerade/{target_user_id}`) — Sets a signed cookie (`mindrouter_masquerade`) containing the target user ID. The cookie is signed using `itsdangerous.URLSafeTimedSerializer` with the application's `SECRET_KEY` and a `"masquerade"` salt. The cookie expires after 24 hours.
- During masquerade — The user dashboard (`/dashboard`) shows the target user's data (usage stats, API keys, quota, conversations) instead of the admin's own data. The masquerade applies only to read-only dashboard views — admin routes and actions are never affected.
- Stop masquerade (`POST /admin/masquerade/stop`) — Deletes the masquerade cookie and redirects back to the admin users page.
- Security — Only users in an admin group can start a masquerade. The target user ID is verified to exist before the cookie is set. The signed cookie prevents tampering.
Masquerade is initiated from the user detail page (/admin/users/{id}) via a "View as User" button.
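The signing scheme can be reproduced directly with itsdangerous (assuming the package is installed; the secret and user ID below are illustrative, not MindRouter's values):

```python
from itsdangerous import URLSafeTimedSerializer

SECRET_KEY = "change-me"  # illustrative; MindRouter uses its configured SECRET_KEY

serializer = URLSafeTimedSerializer(SECRET_KEY, salt="masquerade")

# Issue the cookie value for a target user ID...
cookie = serializer.dumps(42)

# ...and later verify it, rejecting cookies older than 24 hours.
user_id = serializer.loads(cookie, max_age=24 * 3600)
```

Any modification to the cookie value makes `loads` raise `BadSignature`, which is what prevents a non-admin from forging a masquerade cookie.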
Audit Log Export
Administrators can export audit log data as CSV or JSON via GET /admin/audit/export. The export supports the same filters as the audit log UI:
| Parameter | Type | Description |
|---|---|---|
format | string | Output format: csv (default) or json |
search | string | Free-text search across request fields |
user_id_filter | int | Filter by user ID |
model_filter | string | Filter by model name |
status_filter | string | Filter by request status |
start_date | string | Start date (ISO 8601 format) |
end_date | string | End date (ISO 8601 format) |
include_content | bool | Include prompt/response content in export (default: false) |
Export fields: `request_uuid`, `created_at`, `user_id`, `model`, `endpoint`, `status`, `prompt_tokens`, `completion_tokens`, `total_tokens`, `total_time_ms`, `error_message`. When `include_content=true`, additional fields are included: `messages`, `prompt`, `parameters`, `response_content`, `finish_reason`.
Exports are limited to 10,000 records. The file is downloaded as an attachment (`audit_log.csv` or `audit_log.json`).
Admin Chat Configuration
The Chat Configuration page at /admin/chat-config allows administrators to control default settings for the built-in chat interface. All settings are stored in the AppConfig database and take effect immediately.
| Setting | Description | Details |
|---|---|---|
| Core Models | Subset of models shown in the chat model selector | Select from all available (non-embedding) models on healthy backends. When set, only these models appear in the chat UI dropdown. Empty means show all models. |
| Default Model | Pre-selected model in the chat UI | The model automatically selected when a user starts a new conversation. Blank means no default. |
| System Prompt | Global system prompt for all chat conversations | Prepended to every chat conversation. Blank removes the override and reverts to the built-in default. Can also be explicitly reset via a "Reset" button. |
| Max Tokens | Default max tokens for chat requests | Range: 256–131072. Default: 16384. Clamped to the allowed range on save. |
| Temperature | Default temperature for chat responses | Range: 0.0–2.0. Blank means use the model's default temperature. |
| Thinking Mode | Default thinking/reasoning mode | Options: true, false, low, medium, high, or blank (no default). Controls whether thinking is enabled by default for models that support it. |
Chat Interface
| Page | URL | Description |
|---|---|---|
| Chat | /chat | Full-featured chat UI with model selection, streaming, file upload, multimodal support |
The chat interface supports:
- Collapsible conversation sidebar
- Model and backend selection
- Real-time streaming responses
- File upload via button or drag-and-drop anywhere in the chat window (images, PDFs, DOCX, XLSX, CSV, JSON, Markdown, etc.)
- Vision model support with automatic image handling
- Web search toggle — when enabled, queries the Brave Search API and injects results as context into the system prompt (requires `BRAVE_SEARCH_API_KEY`)
- Code syntax highlighting
- LaTeX rendering
- Message editing and deletion
- Dark mode toggle (stored in browser localStorage)
- Advanced models toggle — show or hide non-core models in the model selector
- Per-request thinking controls — enable/disable thinking mode and set reasoning effort for supported models
- Collapsible thinking blocks in assistant responses
- Keyboard shortcuts — Shift+Enter for newline, Enter to send
- Sidebar collapse for a wider chat area
- Copy buttons on code blocks and assistant messages
- Image lightbox for viewing uploaded images full-size
- Auto-titling of conversations based on the first message
Users, Groups & Quotas
Group System
MindRouter uses a database-driven group system for authorization, quota defaults, and scheduler weights. Each user belongs to exactly one group. Groups are fully manageable via the admin UI at `/admin/groups`.
Admin privileges are determined by the group's `is_admin` flag. Users in an admin group can access the admin dashboard and admin API endpoints.
Default Groups
| Group | Token Budget | RPM | Max Concurrent | Sched. Weight | Admin |
|---|---|---|---|---|---|
| students | 100,000 | 30 | 2 | 1 | No |
| staff | 500,000 | 60 | 4 | 2 | No |
| faculty | 1,000,000 | 120 | 8 | 3 | No |
| researchers | 1,000,000 | 120 | 8 | 3 | No |
| admin | 10,000,000 | 1,000 | 50 | 10 | Yes |
| nerds | 500,000 | 60 | 4 | 2 | No |
| other | 100,000 | 30 | 2 | 1 | No |
These 7 groups are created automatically by the database migration. Admins can create additional groups, edit defaults, or delete empty groups via the admin UI or API.
User Profiles
Each user has the following profile fields (editable by admins at /admin/users/{id}):
- Username and Email — unique identifiers
- Full Name — display name
- Group — determines default quotas, scheduler weight, and admin access
- College and Department — organizational affiliation
- Intended Use — free-text description of how the user plans to use the service
Quota Inheritance & Overrides
When a user is created, their quota defaults are inherited from their group. Admins can override any quota value per-user:
- Token budget — inherited from group, overridable per user
- RPM limit — inherited from group, overridable per user
- Max concurrent — inherited from group, overridable per user
- Weight override — if set, overrides the group's scheduler weight for this user
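The inheritance rule amounts to "use the per-user override if set, otherwise fall back to the group default." A minimal sketch (the dataclass and field names are illustrative, not the actual ORM models):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    token_budget: int
    rpm_limit: int
    max_concurrent: int
    scheduler_weight: int


@dataclass
class User:
    group: Group
    token_budget: Optional[int] = None     # per-user overrides; None means inherit
    rpm_limit: Optional[int] = None
    max_concurrent: Optional[int] = None
    weight_override: Optional[int] = None


def effective_quota(user: User) -> dict:
    """Resolve the quota actually enforced for this user."""
    g = user.group
    return {
        "token_budget": user.token_budget if user.token_budget is not None else g.token_budget,
        "rpm_limit": user.rpm_limit if user.rpm_limit is not None else g.rpm_limit,
        "max_concurrent": user.max_concurrent if user.max_concurrent is not None else g.max_concurrent,
        "weight": user.weight_override if user.weight_override is not None else g.scheduler_weight,
    }
```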
API Key Lifecycle
- Generation — Keys use the format `mr2_<random_urlsafe_base64>` (48+ characters total).
- Storage — The raw key is shown once at creation. Only the Argon2 hash and a prefix (`mr2_<first 8 chars>`) are stored in the database.
- Verification — Lookup by prefix (fast), then full Argon2 hash verification.
- Expiration — Optional `expires_at` timestamp.
- Revocation — Keys can be revoked (soft-delete) without deleting the audit trail.
- Usage tracking — `last_used_at` and `usage_count` updated atomically on each request.
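The generate/store/verify flow can be sketched as follows. This is a simplified illustration: `hashlib.sha256` stands in for the Argon2 hash the real implementation uses, and the function names are assumptions.

```python
import hashlib
import secrets


def generate_api_key() -> tuple[str, str, str]:
    """Return (raw_key, prefix, key_hash). Only prefix and hash are persisted."""
    raw_key = "mr2_" + secrets.token_urlsafe(36)   # 48+ characters total
    prefix = raw_key[:12]                          # "mr2_" plus first 8 chars
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    return raw_key, prefix, key_hash


def verify_api_key(raw_key: str, stored: dict) -> bool:
    # A prefix lookup narrows candidates quickly in the DB; then the full hash check.
    if raw_key[:12] != stored["prefix"]:
        return False
    return hashlib.sha256(raw_key.encode()).hexdigest() == stored["key_hash"]
```

The raw key exists only at creation time; a database leak exposes hashes and prefixes, not usable keys.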
Quota System
Each user has a quota record with:
- Token budget — Total tokens allowed per period. Deducted on each completed request (prompt + completion tokens).
- RPM limit — Maximum requests per minute.
- Max concurrent — Maximum simultaneous in-flight requests.
When a quota is exceeded, the request is rejected with HTTP 429.
Rolling Budget Period
Token budgets use rolling 30-day windows per user, not calendar months. Usage is tracked continuously and the oldest usage falls off as the window advances.
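In other words, the budget check sums only the usage recorded in the 30 days preceding the request. A sketch (the event-list representation is illustrative):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def tokens_used_in_window(usage_events: list, now: datetime) -> int:
    """Sum token counts for (timestamp, tokens) events inside the rolling window."""
    cutoff = now - WINDOW
    return sum(tokens for ts, tokens in usage_events if ts >= cutoff)


def budget_exceeded(usage_events: list, now: datetime, token_budget: int) -> bool:
    return tokens_used_in_window(usage_events, now) >= token_budget
```

Usage from 31 days ago no longer counts against the budget, so heavy use early in a window "falls off" continuously rather than resetting on the first of the month.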
Quota Increase Requests
Users can submit quota increase requests via the dashboard (/dashboard/request-quota). The request includes:
- Desired token budget
- Written justification
Admins review requests at /admin/requests or via POST /api/admin/quota-requests/{id}/review, which can approve (with a custom granted token amount) or deny the request.
Backend Management
Node/Backend Model
MindRouter separates the concept of physical GPU servers (Nodes) from inference endpoints (Backends):
- A Node represents a physical server with GPUs and a sidecar agent.
- A Backend is an Ollama or vLLM instance running on a node.
- One node can host multiple backends, each assigned specific GPUs via `gpu_indices`.
- Backends without a `node_id` work as standalone endpoints (no GPU telemetry).
```
Node: gpu-server-1 (4x A100-80GB, sidecar at :8007)
+-- Backend: vllm-large  (gpu_indices: [0, 1])  ← uses GPUs 0-1
+-- Backend: vllm-small  (gpu_indices: [2])     ← uses GPU 2
+-- Backend: ollama-misc (gpu_indices: [3])     ← uses GPU 3
```
Supported Engines
| Engine | Health Check | Model Discovery | Telemetry Source |
|---|---|---|---|
| Ollama | `GET /api/tags` | `GET /api/tags` + `POST /api/ps` (loaded models) | Sidecar agent |
| vLLM | `GET /health` (fallback: `GET /v1/models`) | `GET /v1/models` | `GET /metrics` (Prometheus format) |
Registration
Register a node:
```bash
curl -X POST http://localhost:8000/api/admin/nodes/register \
  -H "Authorization: Bearer admin-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpu-server-1",
    "hostname": "gpu1.example.com",
    "sidecar_url": "http://gpu1.example.com:8007",
    "sidecar_key": "your-sidecar-secret-key"
  }'
```
Register a backend on that node:
```bash
curl -X POST http://localhost:8000/api/admin/backends/register \
  -H "Authorization: Bearer admin-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ollama-gpu1",
    "url": "http://gpu1.example.com:11434",
    "engine": "ollama",
    "max_concurrent": 4,
    "node_id": 1,
    "gpu_indices": [0, 1]
  }'
```
Enable/Disable/Drain/Refresh
- Disable a backend to take it out of rotation without deleting it: `POST /api/admin/backends/{id}/disable`
- Enable to bring it back: `POST /api/admin/backends/{id}/enable`
- Drain for graceful offline maintenance: `POST /admin/backends/{id}/drain` (dashboard-only route)
- Refresh to force re-discovery of models and capabilities: `POST /api/admin/backends/{id}/refresh`
Drain Mode
Drain mode provides graceful backend shutdown for maintenance. When a backend is set to draining:
- The scheduler stops routing new requests to the backend.
- All in-flight requests are allowed to complete normally.
- When the backend's queue depth reaches 0, it automatically transitions to disabled.
This avoids abruptly killing active requests when you need to restart an inference engine, upgrade models, or perform node maintenance. Use it before upgrading vLLM, restarting Ollama, or taking a GPU node offline.
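The drain lifecycle is a small state machine. A sketch with illustrative names (the real states and fields may differ):

```python
from enum import Enum


class BackendState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    DISABLED = "disabled"


def should_route(state: BackendState) -> bool:
    # The scheduler sends new requests only to active backends.
    return state is BackendState.ACTIVE


def on_request_finished(state: BackendState, queue_depth: int) -> BackendState:
    # A draining backend flips to disabled once its last in-flight request completes.
    if state is BackendState.DRAINING and queue_depth == 0:
        return BackendState.DISABLED
    return state
```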
Concurrency Alignment
The `max_concurrent` value registered in MindRouter must match the concurrency limit on the inference engine:
| Engine | Engine Setting | MindRouter Setting |
|---|---|---|
| vLLM | `--max-num-seqs N` | `"max_concurrent": N` |
| Ollama | `OLLAMA_NUM_PARALLEL=N` | `"max_concurrent": N` |
Why this matters: MindRouter uses `max_concurrent` to decide how many requests to route to a backend. If MindRouter thinks a backend can handle 8 concurrent requests but vLLM is configured with `--max-num-seqs 4`, the extra requests queue silently inside vLLM. MindRouter cannot see this hidden queue, so it continues routing requests there instead of spreading load to other backends. The result is uneven load distribution and unpredictable latency — the fair-share scheduler is effectively bypassed for those excess requests.
Context Length (num_ctx)
MindRouter automatically manages context length for Ollama backends:
- Auto-discovery — During model discovery, `context_length` is set to `min(model_max_context, 32768)` to prevent small models from consuming excessive VRAM. The architectural maximum is stored separately as `model_max_context`.
- num_ctx injection — For every Ollama request, MindRouter injects `num_ctx` matching the model's configured `context_length`.
- Enforcement toggle — By default, MindRouter overrides any user-supplied `num_ctx` to prevent GPU memory oversubscription. Admins can toggle this in Site Settings (`/admin/settings`) to allow users to set their own `num_ctx`.
- Manual override — Admins can set `context_length_override` per model via the admin UI to use a custom value instead of the auto-discovered one.
- Ollama 0.17+ — Ollama automatically adjusts `num_ctx` downward if the requested value doesn't fit in GPU memory.
vLLM backends handle context length natively via `--max-model-len` and do not need `num_ctx` injection.
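The discovery and injection rules above can be sketched as two small functions (names are illustrative, not the actual implementation):

```python
AUTO_CAP = 32768  # cap applied during auto-discovery


def discovered_context_length(model_max_context: int, override: int = None) -> int:
    # An admin-set context_length_override wins; otherwise cap the architectural max.
    if override is not None:
        return override
    return min(model_max_context, AUTO_CAP)


def inject_num_ctx(options: dict, context_length: int, enforce: bool = True) -> dict:
    # With enforcement on (the default), any user-supplied num_ctx is overridden.
    out = dict(options)
    if enforce or "num_ctx" not in out:
        out["num_ctx"] = context_length
    return out
```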
Retry & Failover
MindRouter automatically retries failed inference requests to improve reliability across the backend cluster.
- Automatic retries — Up to 3 attempts on 5xx errors, timeouts, and connection failures. Configurable via the `BACKEND_RETRY_MAX_ATTEMPTS` environment variable.
- Fast fail on 4xx — Client errors (400, 401, 404, etc.) are never retried and return immediately.
- Streaming constraints — Retries can only occur before the first chunk is sent to the client. Once streaming has begun, a backend failure is terminal and the stream ends with an error.
- Backend rotation — Each retry attempt selects a different backend when multiple backends are available for the requested model, maximizing the chance of success.
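The retry policy can be sketched as a loop that rotates backends and distinguishes retryable from terminal errors. Names are illustrative; the real attempt count comes from `BACKEND_RETRY_MAX_ATTEMPTS`:

```python
MAX_ATTEMPTS = 3


class BackendError(Exception):
    def __init__(self, status=None):
        self.status = status  # None models a timeout or connection failure


def send_with_retry(backends: list, attempt_fn):
    """Try up to MAX_ATTEMPTS, rotating across backends; never retry 4xx."""
    last_error = None
    for i in range(MAX_ATTEMPTS):
        backend = backends[i % len(backends)]  # rotate on each attempt
        try:
            return attempt_fn(backend)
        except BackendError as e:
            if e.status is not None and 400 <= e.status < 500:
                raise  # client errors fail fast, no retry
            last_error = e
    raise last_error
```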
Mid-Stream Error Behavior
Once streaming has begun and the first chunk has been sent to the client, a backend failure terminates the SSE stream immediately with no error event. The client receives an abrupt end-of-stream. Retries are not possible after the first chunk because the response format has already been committed.
Circuit Breaker
MindRouter uses a per-backend circuit breaker to avoid routing requests to backends that are experiencing persistent failures.
- Trip threshold — After 3 consecutive failures (configurable), the backend is marked as "open" and excluded from routing for 30 seconds (configurable).
- Integration with retry — The circuit breaker works alongside retry. Backends with an open circuit are skipped during failover selection, so retries are directed to healthier backends.
- Automatic recovery — After the exclusion window expires, the backend is eligible for routing again. A successful request resets the failure counter.
Rate Limiting
Note: RPM and concurrent request rate limiting are defined in the codebase but not currently enforced. The rate limiter middleware is not registered in the application. Only token quota (monthly budget) enforcement is active, returning HTTP 429 when the token budget is exceeded.
- Requests per minute (RPM) — Configurable per group but not yet enforced at runtime.
- Concurrent request cap — Configurable per group but not yet enforced at runtime.
- Token quota (active) — Monthly token budget is enforced. Returns HTTP 429 when exceeded. Clients should implement exponential backoff.
Tool Calling
MindRouter supports tool calling (function calling) across all three API formats, with transparent translation between them.
- Tool definitions — Pass an OpenAI-style `tools` array describing available functions, with optional `tool_choice` (`"auto"`, `"none"`, or a specific function name).
- Tool results — Submit results back via `role: "tool"` messages with a matching `tool_call_id`.
- Cross-format support — Tool calls work whether the request arrives in OpenAI, Ollama, or Anthropic format. MindRouter translates tool definitions, tool calls, and tool results between all formats automatically.
- Backend requirement — The selected backend must support tool calling. For vLLM, this requires `--enable-auto-tool-choice` and the appropriate `--tool-call-parser` flag. For Ollama, tool calling is supported natively on compatible models.
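A round trip looks like this in OpenAI format. The function name, model, and `tool_call_id` below are hypothetical examples, not values from MindRouter:

```python
# Tool definition sent with the initial chat request
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "llama3.1",  # any backend model that supports tool calling
    "messages": [{"role": "user", "content": "Weather in Boise?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# After the model responds with a tool_call, the result goes back
# as a role:"tool" message whose tool_call_id matches the call's id.
tool_result_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": '{"temp_f": 72, "conditions": "sunny"}',
}
```

The same payload could equally be sent in Ollama or Anthropic format; the translation layer converts the definitions and results between formats.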
Request ID
Every inference response includes a unique request ID for tracing and debugging.
- Auto-generated IDs — MindRouter generates request IDs with format-specific prefixes: `chatcmpl-*` for chat completions, `cmpl-*` for text completions, `embd-*` for embeddings, etc.
- Custom IDs — You can provide your own request ID by setting the `X-Request-ID` header. When provided, MindRouter uses your ID instead of generating one, making it easier to correlate requests across your systems.
Translation Layer
MindRouter's translation layer enables cross-engine routing: a request arriving in OpenAI, Ollama, or Anthropic format can be served by any Ollama or vLLM backend. All translation passes through a canonical internal schema.
Request Flow
```
OpenAI Request    --> OpenAIInTranslator    --> CanonicalChatRequest
Ollama Request    --> OllamaInTranslator    --> CanonicalChatRequest
Anthropic Request --> AnthropicInTranslator --> CanonicalChatRequest
                                                        |
                                                        v
                                          [Scheduler selects backend]
                                                        |
                                   +--------------------+--------------------+
                                   v                                         v
                          OllamaOutTranslator                        VLLMOutTranslator
                           (Ollama backend)                  (vLLM backend, OpenAI format)
```
Canonical Schemas
The canonical internal representation (`backend/app/core/canonical_schemas.py`) includes:
- `CanonicalChatRequest` — model, messages, temperature, top_p, max_tokens, stream, tools, tool_choice, response_format, think (bool or string), reasoning_effort, etc.
- `CanonicalMessage` — role (system/user/assistant/tool), content (text or multimodal content blocks, nullable), tool_calls, tool_call_id
- `ContentBlock` — `TextContent`, `ImageUrlContent`, or `ImageBase64Content`
- `CanonicalToolCall` / `CanonicalFunctionCall` — tool call with id, function name, and arguments (JSON string)
- `CanonicalToolDefinition` — tool definition with function name, description, and parameters schema
- `CanonicalEmbeddingRequest` — model, input, encoding_format, dimensions
- `CanonicalChatResponse` / `CanonicalStreamChunk` — response and streaming types (including tool call deltas)
Key Translation Mappings
| Concept | OpenAI | Ollama | Anthropic | Canonical |
|---|---|---|---|---|
| Max tokens | max_completion_tokens or max_tokens | options.num_predict | max_tokens (required) | max_tokens |
| Stream default | false | true | false | — |
| System prompt | messages with role: system | messages with role: system | Top-level system field | CanonicalMessage(role=SYSTEM) |
| Stop sequences | stop | options.stop | stop_sequences | stop |
| JSON schema | response_format | format: {schema} | output_config.format | response_format |
| Parameters | Top-level fields | options dict | Top-level fields | Top-level fields |
| Images | image_url block | images array (base64) | image block with source | ImageBase64Content / ImageUrlContent |
| Tool definitions | tools | tools | tools (with input_schema) | tools (CanonicalToolDefinition) |
| Tool choice | tool_choice | — | tool_choice (auto/any/tool) | tool_choice |
| Tool calls | tool_calls (JSON string args) | tool_calls (dict args) | tool_use content blocks | CanonicalToolCall (JSON string args) |
| Tool results | role: "tool" + tool_call_id | — | tool_result content blocks | CanonicalMessage(role=TOOL, tool_call_id) |
| Thinking mode | think, thinking.type, chat_template_kwargs, reasoning_effort | think (bool or "low"/"medium"/"high") | thinking.type (enabled/disabled) | think (bool or string) |
| User ID | user | — | metadata.user_id | user |
| Stream format | SSE (data: {...}) | NDJSON | SSE (Anthropic events) | CanonicalStreamChunk |
Translators
| Translator | Direction | Purpose |
|---|---|---|
| `OpenAIInTranslator` | API → Canonical | Translates incoming OpenAI-format requests |
| `OllamaInTranslator` | API → Canonical | Translates incoming Ollama-format requests |
| `AnthropicInTranslator` | API → Canonical | Translates incoming Anthropic Messages API requests; also formats responses and SSE stream events back to Anthropic format |
| `OllamaOutTranslator` | Canonical → Backend | Translates outgoing requests to Ollama backends |
| `VLLMOutTranslator` | Canonical → Backend | Translates outgoing requests to vLLM backends |
All translators use static methods — no instantiation needed.
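As an illustration of the pattern, here is a simplified incoming-translator sketch that maps a few Ollama fields onto canonical names. The class is a stand-in, not the real `OllamaInTranslator`, and it covers only a fraction of the mapping table above:

```python
class OllamaInTranslatorSketch:
    """Simplified stand-in for OllamaInTranslator: all static methods, no state."""

    @staticmethod
    def to_canonical(req: dict) -> dict:
        options = req.get("options", {})
        return {
            "model": req["model"],
            "messages": req["messages"],
            # Ollama nests sampling parameters under "options"
            "max_tokens": options.get("num_predict"),
            "stop": options.get("stop"),
            # Ollama streams by default; OpenAI and Anthropic default to false
            "stream": req.get("stream", True),
        }
```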
Model-Specific Behaviors
- Qwen3-32B on vLLM — This model embeds `<think>...</think>` tags directly in the content field instead of using the `reasoning_content` field. MindRouter automatically extracts these tags and moves the reasoning text to the canonical `reasoning` field for both streaming and non-streaming responses.
- Qwen3.5 on vLLM with thinking disabled — When thinking is disabled but the model still returns reasoning content with an empty content field, MindRouter promotes the reasoning content to the content field so the response is not blank.
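The non-streaming extraction step can be sketched with a regex (a simplified illustration; the real handling also covers streaming deltas, where a tag may span chunks):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(content: str):
    """Split embedded <think>...</think> text out of the content field.

    Returns (cleaned_content, reasoning) where reasoning is None if no tag found.
    """
    match = THINK_RE.search(content)
    if not match:
        return content, None
    reasoning = match.group(1).strip()
    cleaned = THINK_RE.sub("", content, count=1).strip()
    return cleaned, reasoning
```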
Telemetry & Monitoring
GPU Sidecar Agent
Each GPU node runs a lightweight FastAPI sidecar agent (`sidecar/gpu_agent.py`) that exposes per-GPU hardware metrics:
Collected metrics per GPU:
- Utilization (GPU % and memory %)
- Memory (used/free/total GB)
- Temperature (GPU and memory)
- Power draw and limit (watts)
- Fan speed, SM/memory clocks
- Running processes (PID + memory)
- Device identity (name, UUID, compute capability)
- Driver and CUDA versions
Authentication: Requires the `SIDECAR_SECRET_KEY` env var. All requests must include an `X-Sidecar-Key` header, verified with a constant-time comparison.
Prerequisites: Each GPU node must have NVIDIA drivers and the NVIDIA Container Toolkit installed so Docker can access GPUs via `--gpus all`:
```bash
# RHEL/Rocky Linux
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit

# Debian/Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
Deployment options:
- Docker Compose — `docker compose --profile gpu up gpu-sidecar`
- Standalone Docker — Build from `sidecar/Dockerfile.sidecar`, run with `--gpus all`
- Direct Python — `pip install fastapi uvicorn nvidia-ml-py && python sidecar/gpu_agent.py`
Health Polling
The Backend Registry runs an adaptive polling loop:
- Startup fast polls: On container start, two immediate full poll cycles run (with a 5-second gap) so backends and nodes are marked healthy within seconds of a restart instead of waiting for the normal 30-second cycle.
- Normal interval: 30 seconds (configurable via `BACKEND_POLL_INTERVAL`)
- Fast interval: 10 seconds after a backend becomes unhealthy (configurable)
- Fast duration: 120 seconds before returning to normal polling
Each poll cycle has two phases:
- Poll sidecar agents (one per physical node) for GPU snapshots
- Poll each backend adapter for health, models, and engine-specific telemetry
Health Alerts
The admin dashboard (/admin) displays a prominent warning banner when any backend is unhealthy/unknown or any node is offline/unknown. The alert includes counts and names of affected items with direct links to the backends or nodes management pages. Intentionally disabled backends are excluded from the alert — only unexpected health issues are flagged.
Circuit Breaker
Per-backend circuit breaker protects against cascading failures:
- Threshold: 3 consecutive failures before opening (configurable via `BACKEND_CIRCUIT_BREAKER_THRESHOLD`)
- Recovery: 30 seconds before allowing a probe request (`BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS`)
- States: Closed (healthy) → Open (failing) → Half-Open (probe) → Closed (recovered)
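The state machine can be sketched as a small class (illustrative names; the thresholds map to the env vars above):

```python
import time


class CircuitBreaker:
    """Closed -> Open after N consecutive failures; probe allowed after recovery window."""

    def __init__(self, threshold: int = 3, recovery_seconds: float = 30.0):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True  # closed: route normally
        # half-open: allow a probe once the recovery window has elapsed
        return now - self.opened_at >= self.recovery_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self, now: float = None) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic() if now is None else now
```

During failover selection, backends whose `allow_request` returns `False` are simply skipped.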
Latency Tracking
Exponential Moving Average (EMA) tracks per-backend latency:
- Alpha: 0.3 (30% current observation, 70% history)
- Metrics: Total latency EMA and TTFT (time-to-first-token) EMA
- Throughput score: `1.0 / (1.0 + latency_ms / 5000.0)` — used in backend scoring
- Persistence: EMAs are periodically saved to the database for recovery after restart
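The EMA update and scoring reduce to two one-liners (function names are illustrative):

```python
ALPHA = 0.3  # 30% weight on the current observation, 70% on history


def update_ema(prev_ema, observation_ms: float) -> float:
    if prev_ema is None:
        return observation_ms  # first observation seeds the EMA
    return ALPHA * observation_ms + (1 - ALPHA) * prev_ema


def throughput_score(latency_ms: float) -> float:
    # Higher score for lower latency; a 5000 ms EMA scores exactly 0.5.
    return 1.0 / (1.0 + latency_ms / 5000.0)
```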
Prometheus Metrics
Scrape `/metrics` for Prometheus-compatible metrics. See the Health & Metrics Endpoints section for the full list.
Telemetry API
Admin users can access detailed telemetry via the API:
- Cluster overview — All nodes, backends, GPUs with current metrics
- Historical data — Time-series with configurable resolution (1m, 5m, 15m, 1h, 6h, 1d)
- Per-GPU history — Individual GPU device telemetry over time
- Export — Download telemetry data as JSON or CSV
Redis (Optional)
Redis is an optional dependency configured via `REDIS_URL`. When available, it serves three roles:
- Inflight streaming token counting — Atomically tracks tokens currently being streamed across all workers, enabling accurate real-time throughput metrics.
- Per-user quota token caching — Caches per-user token counters (`quota:tokens:{user_id}`) for fast atomic increments without hitting the database on every request.
- Graceful degradation — All features that depend on Redis continue to work without it. Quota enforcement falls back to database queries, and inflight token counts default to zero. No functionality is lost; only caching benefits are reduced.
Chat System
MindRouter includes a built-in chat interface at /chat with full conversation management.
Conversations
- Each user has their own conversation history
- Conversations store: title, selected model, creation/update timestamps
- Users can rename, switch models, or delete conversations
- Up to 50 conversations shown in the sidebar (most recent first)
Messages
- Messages include role (user/assistant/system) and content
- Assistant messages are streamed in real-time
- Messages can be edited or deleted after creation
- Attachments are linked to individual messages
File Upload
Supported file types and processing:
| Category | Extensions | Processing |
|---|---|---|
| Images | .jpg, .jpeg, .png, .gif, .webp | Resized to max 1536px, compressed JPEG q85, thumbnail generated |
| Documents | .pdf | Text extracted from all pages, first-page thumbnail generated |
| Documents | .docx | Text extracted from all paragraphs |
| Spreadsheets | .xlsx | All sheets read, formatted as tab-separated text |
| Text files | .txt, .md, .csv, .json, .html, .htm, .log | Read as-is |
Limits:
- Max upload size: 10 MB (configurable via `CHAT_UPLOAD_MAX_SIZE_MB`)
- Artifact storage path: `/data/artifacts` (configurable via `ARTIFACT_STORAGE_PATH`)
- Artifact max size: 50 MB (configurable via `ARTIFACT_MAX_SIZE_MB`)
- Artifact retention: 365 days
Storage layout:
```
/artifacts/YYYY/MM/DD/<sha256_prefix>/<full_sha256>_<uuid>.<ext>
```
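Path construction for that layout can be sketched as follows. This is an assumption-laden illustration: the prefix-bucket length (two hex characters here) and the function name are not taken from the actual implementation.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def artifact_path(data: bytes, ext: str, now: datetime = None) -> str:
    """Build a content-addressed storage path: date bucket, sha prefix, full sha + uuid."""
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(data).hexdigest()
    return (
        f"/artifacts/{now:%Y/%m/%d}/{digest[:2]}/"   # assumed 2-char prefix bucket
        f"{digest}_{uuid.uuid4().hex}.{ext}"
    )
```

Hashing the content gives deduplication-friendly names, while the trailing UUID keeps simultaneous uploads of identical files from colliding.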
Multimodal Model Support
- Models with multimodal capability are automatically detected by name patterns (e.g., `llava`, `-vl-`, `vision`)
- When images are sent to a multimodal model, they are included as base64-encoded content blocks
- When images are sent to a non-multimodal model, they are replaced with a placeholder: `[Image omitted -- model does not support multimodal input: filename]`
- Admins can override multimodal detection per model via the Models admin page
Streaming
Chat responses are streamed in real-time:
- Backend streaming uses NDJSON (Ollama) or SSE (vLLM/OpenAI)
- The chat UI renders tokens as they arrive
- TTFT (time-to-first-token) is tracked for latency monitoring
- If the client disconnects, the backend request is not cancelled (to prevent DB corruption)
Conversation Retention
Conversations older than `CONVERSATION_RETENTION_DAYS` (default 730 days / 2 years) are automatically purged by a background task that runs every `CONVERSATION_CLEANUP_INTERVAL` seconds (default 86400 / once per day). Associated messages and attachments are deleted along with the conversation.
Blog / CMS
MindRouter includes a built-in blog system with public viewing and admin content management. Blog posts are written in Markdown and rendered to HTML with syntax highlighting, tables, fenced code blocks, and table of contents support.
Public Blog
- Blog listing at `/blog` — displays all published posts, most recent first
- Individual posts at `/blog/{slug}` — renders the post's Markdown content as styled HTML
- Posts are accessible without authentication
Admin Blog Management
Admin users can manage blog posts at /admin/blog. The admin interface provides a full CMS workflow:
| Action | Route | Description |
|---|---|---|
| List all posts | GET /admin/blog | View all posts (published, draft, and soft-deleted) with status indicators |
| New post | GET /admin/blog/new | Form to create a new blog post |
| Create post | POST /admin/blog/new | Submit new post with title, slug, Markdown content, excerpt, and publish toggle |
| Edit post | GET /admin/blog/{id}/edit | Edit form for an existing post |
| Update post | POST /admin/blog/{id}/edit | Save changes to a post |
| Toggle publish | POST /admin/blog/{id}/publish | Publish a draft or unpublish a live post |
| Delete post | POST /admin/blog/{id}/delete | Soft-delete a post (not permanently removed from the database) |
Post Fields
- Title — Display title of the post
- Slug — URL-safe identifier (auto-generated from title or manually set). Used in the public URL as `/blog/{slug}`
- Content — Full post body written in Markdown. Supports fenced code blocks with syntax highlighting, tables, and table of contents
- Excerpt — Optional short summary shown in the blog listing
- Published — Toggle for publish state. The `published_at` timestamp is set automatically when a post is first published
- Author — Automatically set to the admin user who creates the post
Soft Delete
Deleting a post is a soft delete — the post is marked as deleted but remains in the database. Soft-deleted posts are hidden from the public blog listing but still visible in the admin list.
AppConfig System
MindRouter stores runtime configuration in a key-value AppConfig database table. Values are JSON-encoded and can be read or written at any time without restarting the application. This powers several features that need runtime-adjustable settings.
API
Configuration is accessed via two async CRUD functions in `backend/app/db/crud.py`:
- `get_config_json(db, key, default)` — Read a config value, JSON-decoded. Returns `default` if the key does not exist.
- `set_config(db, key, value, description=None)` — Upsert a config value (JSON-encoded). Creates the key if it does not exist, updates it otherwise.
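Against an in-memory stand-in for the AppConfig table, the two functions behave like the synchronous sketch below (the real functions are async and take a DB session; the dict is only an illustration):

```python
import json

_table = {}  # stands in for the AppConfig key-value table


def get_config_json(key: str, default=None):
    """Read a config value, JSON-decoded; return default if the key is absent."""
    raw = _table.get(key)
    return default if raw is None else json.loads(raw)


def set_config(key: str, value, description=None) -> None:
    # Upsert: JSON-encode and create-or-update the row.
    _table[key] = json.dumps(value)
```

Because values are JSON-encoded, a single table stores strings, numbers, booleans, and lists uniformly.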
Known Configuration Keys
| Key | Type | Default | Description | Managed Via |
|---|---|---|---|---|
| `app.timezone` | string | None | Display timezone for the dashboard (e.g. "America/Los_Angeles") | Site Settings |
| `ollama.enforce_num_ctx` | bool | true | Whether MindRouter overrides user-supplied `num_ctx` values | Site Settings |
| `chat.core_models` | list | `[]` | Subset of models shown in the chat interface model selector. Empty means show all. | Chat Configuration |
| `chat.default_model` | string | null | Pre-selected model in the chat interface | Chat Configuration |
| `chat.system_prompt` | string | null | Global system prompt prepended to all chat conversations. Blank removes the override. | Chat Configuration |
| `chat.max_tokens` | int | 16384 | Default max tokens for chat requests (range: 256–131072) | Chat Configuration |
| `chat.temperature` | float | null | Default temperature for chat requests (range: 0.0–2.0). Null means use model default. | Chat Configuration |
| `chat.think` | bool/string | null | Default thinking mode for chat: true, false, "low", "medium", "high", or null (no default) | Chat Configuration |
All AppConfig values take effect immediately upon save — no application restart is required. These settings are managed via the Admin Dashboard under Site Settings and Chat Configuration.
Configuration Reference
All settings are loaded from environment variables or .env / .env.prod files. Variable names are case-insensitive.
Application
| Variable | Type | Default | Description |
|---|---|---|---|
| `APP_NAME` | str | MindRouter | Application name |
| `APP_VERSION` | str | (from pyproject.toml) | Application version (read dynamically at startup) |
| `DEBUG` | bool | false | Enable debug mode |
| `RELOAD` | bool | false | Auto-reload on code changes (development) |
Database
| Variable | Type | Default | Description |
|---|---|---|---|
| `DATABASE_URL` | str | `mysql+pymysql://...` | MariaDB/MySQL connection string |
| `DATABASE_POOL_SIZE` | int | 30 | Connection pool size |
| `DATABASE_MAX_OVERFLOW` | int | 20 | Max overflow connections beyond pool |
| `DATABASE_ECHO` | bool | false | Log SQL queries |
Cache
| Variable | Type | Default | Description |
|---|---|---|---|
| `REDIS_URL` | str | None | Redis connection string (optional) |
Security
| Variable | Type | Default | Description |
|---|---|---|---|
| `SECRET_KEY` | str | `dev-secret-key-...` | JWT/session signing key (change in production) |
| `JWT_ALGORITHM` | str | HS256 | JWT signing algorithm |
| `JWT_EXPIRATION_HOURS` | int | 24 | JWT token lifetime |
| `SESSION_COOKIE_NAME` | str | mindrouter_session | Session cookie name |
| `SESSION_COOKIE_SECURE` | bool | false | HTTPS-only cookies |
| `SESSION_COOKIE_HTTPONLY` | bool | true | JavaScript-inaccessible cookies |
| `SESSION_COOKIE_SAMESITE` | str | lax | SameSite cookie policy |
| `API_KEY_HASH_ALGORITHM` | str | argon2 | API key hashing algorithm |
Azure AD SSO (Optional)
| Variable | Type | Default | Description |
|---|---|---|---|
| `AZURE_AD_CLIENT_ID` | str | None | Azure AD application (client) ID |
| `AZURE_AD_CLIENT_SECRET` | str | None | Azure AD client secret |
| `AZURE_AD_TENANT_ID` | str | None | Azure AD tenant ID |
| `AZURE_AD_REDIRECT_URI` | str | `https://<host>/login/azure/authorized` | OAuth2 redirect URI |
| `AZURE_AD_DEFAULT_GROUP` | str | other | Default group for new Azure AD users |
When `AZURE_AD_CLIENT_ID` and `AZURE_AD_TENANT_ID` are set, a "Sign in with Microsoft" button appears on the login page. Users are JIT-provisioned on first login — their `jobTitle` from Microsoft Graph determines group assignment (student/faculty/staff/other). Pre-existing accounts are linked by email.
Azure AD Group Mapping
Group assignment uses substring matching on the user's `jobTitle` field from Microsoft Graph:
- `jobTitle` contains "student" → `students` group
- `jobTitle` contains "faculty" or "professor" → `faculty` group
- `jobTitle` contains "staff" → `staff` group
- No match → falls back to `AZURE_AD_DEFAULT_GROUP` (default: `other`)
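The mapping can be sketched as a small function. The check order for titles matching multiple categories is an assumption here, not documented behavior:

```python
DEFAULT_GROUP = "other"  # stands in for AZURE_AD_DEFAULT_GROUP


def map_job_title_to_group(job_title) -> str:
    """Case-insensitive substring match on the Microsoft Graph jobTitle field."""
    title = (job_title or "").lower()
    if "student" in title:
        return "students"
    if "faculty" in title or "professor" in title:
        return "faculty"
    if "staff" in title:
        return "staff"
    return DEFAULT_GROUP
```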
Web Search (Optional)
| Variable | Type | Default | Description |
|---|---|---|---|
BRAVE_SEARCH_API_KEY | str | None | Brave Search API key (enables web search in chat) |
BRAVE_SEARCH_MAX_RESULTS | int | 5 | Maximum search results to inject as context |
When configured, a search toggle appears in the chat input area. Enabling it queries the Brave Search API with the user's message and injects results into the system prompt as additional context.
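The injection step can be pictured as follows. The prompt wording and the result-dict keys (`title`, `url`, `snippet`) are assumptions for illustration; MindRouter's actual prompt format may differ:

```python
def build_search_context(system_prompt: str,
                         results: list[dict],
                         max_results: int = 5) -> str:
    """Fold already-fetched web search results into the system prompt.

    `max_results` mirrors BRAVE_SEARCH_MAX_RESULTS; each result is assumed
    to carry `title`, `url`, and `snippet` keys.
    """
    lines = [f"- {r['title']} ({r['url']}): {r['snippet']}"
             for r in results[:max_results]]
    if not lines:
        return system_prompt
    return system_prompt + "\n\nWeb search results:\n" + "\n".join(lines)
```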
Conversation Retention
| Variable | Type | Default | Description |
|---|---|---|---|
CONVERSATION_RETENTION_DAYS | int | 730 | Conversation retention period (2 years) |
CONVERSATION_CLEANUP_INTERVAL | int | 86400 | Cleanup interval in seconds (24 hours) |
Artifact Storage
| Variable | Type | Default | Description |
|---|---|---|---|
ARTIFACT_STORAGE_PATH | str | /data/artifacts | File storage directory |
ARTIFACT_MAX_SIZE_MB | int | 50 | Max artifact file size |
ARTIFACT_RETENTION_DAYS | int | 365 | Artifact retention period |
Quotas
Quota defaults are now managed per-group in the database via /admin/groups. The environment variables below are deprecated (used only for initial migration seeding) and will be removed in a future release.
| Variable | Type | Default | Description |
|---|---|---|---|
DEFAULT_TOKEN_BUDGET_STUDENT | int | 100000 | Deprecated — use group defaults |
DEFAULT_TOKEN_BUDGET_STAFF | int | 500000 | Deprecated — use group defaults |
DEFAULT_TOKEN_BUDGET_FACULTY | int | 1000000 | Deprecated — use group defaults |
DEFAULT_TOKEN_BUDGET_ADMIN | int | 10000000 | Deprecated — use group defaults |
DEFAULT_RPM_STUDENT | int | 30 | Deprecated — use group defaults |
DEFAULT_RPM_STAFF | int | 60 | Deprecated — use group defaults |
DEFAULT_RPM_FACULTY | int | 120 | Deprecated — use group defaults |
DEFAULT_RPM_ADMIN | int | 1000 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_STUDENT | int | 2 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_STAFF | int | 4 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_FACULTY | int | 8 | Deprecated — use group defaults |
DEFAULT_MAX_CONCURRENT_ADMIN | int | 50 | Deprecated — use group defaults |
Scheduler
Scheduler weights are now managed per-group in the database. The per-role weight variables below are deprecated and will be removed in a future release.
| Variable | Type | Default | Description |
|---|---|---|---|
SCHEDULER_WEIGHT_STUDENT | int | 1 | Deprecated — use group scheduler_weight |
SCHEDULER_WEIGHT_STAFF | int | 2 | Deprecated — use group scheduler_weight |
SCHEDULER_WEIGHT_FACULTY | int | 3 | Deprecated — use group scheduler_weight |
SCHEDULER_WEIGHT_ADMIN | int | 10 | Deprecated — use group scheduler_weight |
SCHEDULER_FAIRNESS_WINDOW | int | 300 | Fairness tracking window (seconds) |
SCHEDULER_DEPRIORITIZE_THRESHOLD | float | 0.5 | Usage threshold for deprioritization |
SCHEDULER_SCORE_MODEL_LOADED | int | 100 | Score bonus for pre-loaded model |
SCHEDULER_SCORE_LOW_UTILIZATION | int | 50 | Score bonus for low GPU utilization |
SCHEDULER_SCORE_LATENCY | int | 40 | Score factor for low latency |
SCHEDULER_SCORE_SHORT_QUEUE | int | 30 | Score factor for short queue |
SCHEDULER_SCORE_HIGH_THROUGHPUT | int | 20 | Score factor for high throughput |
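A simplified view of how the score factors above might combine when ranking backends. The normalization of inputs and the exact weighting formula are assumptions; the real logic lives in the scheduler:

```python
def score_backend(model_loaded: bool, gpu_utilization: float,
                  latency_norm: float, queue_norm: float,
                  throughput_norm: float) -> float:
    """Higher score = more attractive backend. The *_norm inputs are
    assumed to be normalized into [0, 1]; constants mirror the
    SCHEDULER_SCORE_* defaults in the table above."""
    score = 0.0
    if model_loaded:
        score += 100                       # SCHEDULER_SCORE_MODEL_LOADED
    if gpu_utilization < 0.5:
        score += 50                        # SCHEDULER_SCORE_LOW_UTILIZATION
    score += 40 * (1 - latency_norm)       # SCHEDULER_SCORE_LATENCY
    score += 30 * (1 - queue_norm)         # SCHEDULER_SCORE_SHORT_QUEUE
    score += 20 * throughput_norm          # SCHEDULER_SCORE_HIGH_THROUGHPUT
    return score
```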
Latency Tracking
| Variable | Type | Default | Description |
|---|---|---|---|
LATENCY_EMA_ALPHA | float | 0.3 | EMA smoothing factor |
LATENCY_EMA_PERSIST_INTERVAL | int | 30 | EMA persistence interval (seconds) |
Backend Registry
| Variable | Type | Default | Description |
|---|---|---|---|
BACKEND_POLL_INTERVAL | int | 30 | Health check interval (seconds) |
BACKEND_HEALTH_TIMEOUT | int | 5 | Health check timeout (seconds) |
BACKEND_UNHEALTHY_THRESHOLD | int | 3 | Failed checks before marking unhealthy |
BACKEND_CIRCUIT_BREAKER_THRESHOLD | int | 3 | Failures before circuit opens |
BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS | int | 30 | Circuit breaker recovery time |
BACKEND_ADAPTIVE_POLL_FAST_INTERVAL | int | 10 | Fast poll interval after unhealthy |
BACKEND_ADAPTIVE_POLL_FAST_DURATION | int | 120 | Duration of fast polling (seconds) |
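The circuit-breaker settings above follow the classic pattern: open after `BACKEND_CIRCUIT_BREAKER_THRESHOLD` consecutive failures, then allow a probe request after `BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS`. An illustrative sketch, not MindRouter's internal implementation:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; permit a probe after recovery."""

    def __init__(self, threshold: int = 3, recovery_seconds: float = 30.0):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the recovery window has elapsed.
        return time.monotonic() - self.opened_at >= self.recovery_seconds
```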
Request Handling
| Variable | Type | Default | Description |
|---|---|---|---|
MAX_REQUEST_SIZE | int | 52428800 | Max HTTP request body (50 MB) |
BACKEND_REQUEST_TIMEOUT | int | 300 | Total request timeout (seconds) |
BACKEND_REQUEST_TIMEOUT_PER_ATTEMPT | int | 180 | Per-attempt timeout (seconds, high for large model prefills) |
BACKEND_RETRY_MAX_ATTEMPTS | int | 3 | Max total retry attempts |
BACKEND_RETRY_ATTEMPTS | int | 2 | Default retry attempts (deprecated — not currently used by retry logic) |
BACKEND_RETRY_BACKOFF | float | 1.0 | Retry backoff multiplier (deprecated — not currently used by retry logic) |
STRUCTURED_OUTPUT_RETRY_ON_INVALID | bool | true | Retry on invalid structured output |
Logging
| Variable | Type | Default | Description |
|---|---|---|---|
LOG_LEVEL | str | INFO | Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
LOG_FORMAT | str | json | Log format (json or text) |
LOG_FILE | str | None | Log file path (optional, stdout if not set) |
Audit Logging
| Variable | Type | Default | Description |
|---|---|---|---|
AUDIT_LOG_ENABLED | bool | true | Enable audit logging |
AUDIT_LOG_PROMPTS | bool | true | Log user prompts |
AUDIT_LOG_RESPONSES | bool | true | Log LLM responses |
Telemetry & GPU
| Variable | Type | Default | Description |
|---|---|---|---|
TELEMETRY_RETENTION_DAYS | int | 30 | Telemetry data retention period |
TELEMETRY_CLEANUP_INTERVAL | int | 3600 | Cleanup interval (seconds) |
SIDECAR_TIMEOUT | int | 5 | Sidecar HTTP call timeout (seconds) |
Observability
| Variable | Type | Default | Description |
|---|---|---|---|
METRICS_ENABLED | bool | true | Enable Prometheus metrics |
METRICS_PREFIX | str | mindrouter | Metrics name prefix |
OTEL_ENABLED | bool | false | Enable OpenTelemetry |
OTEL_EXPORTER_OTLP_ENDPOINT | str | None | OpenTelemetry exporter endpoint |
CORS
| Variable | Type | Default | Description |
|---|---|---|---|
CORS_ORIGINS | list | ["http://localhost:3000", "http://localhost:8000"] | Allowed origins (JSON array or comma-separated) |
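Since `CORS_ORIGINS` accepts either a JSON array or a comma-separated string, a parser handling both forms might look like this (a sketch of the dual format described above, not the actual settings loader):

```python
import json

def parse_cors_origins(raw: str) -> list[str]:
    """Accept a JSON array ('["https://a"]') or a comma-separated string."""
    raw = raw.strip()
    if raw.startswith("["):
        return json.loads(raw)
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```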
Chat UI
| Variable | Type | Default | Description |
|---|---|---|---|
CHAT_FILES_PATH | str | /data/chat_files | Chat file upload directory |
CHAT_UPLOAD_MAX_SIZE_MB | int | 10 | Max upload file size (MB) |
CHAT_UPLOAD_ALLOWED_EXTENSIONS | list | See below | Allowed upload file extensions |
Default allowed extensions: .txt, .md, .csv, .json, .html, .htm, .log, .docx, .xlsx, .pdf, .jpg, .jpeg, .png, .gif, .webp
Tokenizer
| Variable | Type | Default | Description |
|---|---|---|---|
DEFAULT_TOKENIZER | str | cl100k_base | Default tokenizer encoding |
Status Enums Reference
MindRouter uses the following status enumerations throughout the system:
BackendStatus
- `HEALTHY` — Backend is online and accepting requests
- `UNHEALTHY` — Backend is failing health checks
- `DISABLED` — Backend has been manually disabled by an admin
- `DRAINING` — Backend is finishing in-flight requests but not accepting new ones
- `UNKNOWN` — Backend status has not yet been determined
NodeStatus
- `ONLINE` — Node sidecar is reachable and reporting metrics
- `OFFLINE` — Node sidecar is unreachable
- `UNKNOWN` — Node status has not yet been determined
RequestStatus
- `QUEUED` — Request is waiting for a backend slot
- `PROCESSING` — Request is being handled by a backend
- `COMPLETED` — Request finished successfully
- `FAILED` — Request encountered an error
- `CANCELLED` — Request was cancelled (e.g., client disconnect)
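In code, these enumerations could be represented as string-valued `Enum` classes — a sketch; the actual class and value spellings in MindRouter may differ:

```python
from enum import Enum

class BackendStatus(str, Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DISABLED = "disabled"
    DRAINING = "draining"
    UNKNOWN = "unknown"

class RequestStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
```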
Production Security Hardening
When deploying MindRouter to production, apply the following security measures:
- **Secure session cookies** — Set `SESSION_COOKIE_SECURE=true` to ensure session cookies are only sent over HTTPS.
- **Security headers** — Add `Strict-Transport-Security` (HSTS), `X-Frame-Options`, and `Content-Security-Policy` (CSP) headers at the reverse proxy layer (e.g., nginx).
- **Restrict CORS origins** — Set `CORS_ORIGINS` to only the specific domains that need API access. Do not use `*` in production.
Deployment
MindRouter is designed for deployment on Linux servers with NVIDIA GPUs. The full deployment guide covers:
- Rocky Linux 8 prerequisites and dependency installation
- SSL/TLS configuration (self-signed and Let's Encrypt)
- Apache reverse proxy setup
- Firewall and SELinux configuration
- Docker Compose production stack
- Database migrations
- GPU sidecar agent deployment
- Node and backend registration
- Verification and ongoing operations
GPU Sidecar Deployment
The GPU sidecar agent runs on each inference node to expose per-GPU hardware metrics and enable auto-discovery of inference endpoints. Build and deploy directly from GitHub — no need to clone the repository.
Create the env file (once per node)
```bash
sudo mkdir -p /etc/mindrouter
python3 -c "import secrets; print('SIDECAR_SECRET_KEY=' + secrets.token_hex(32))" | sudo tee /etc/mindrouter/sidecar.env
sudo chmod 600 /etc/mindrouter/sidecar.env
```
Build a specific version
```bash
docker build -t mindrouter-sidecar:v2.0.0 \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
```
Build latest from master
```bash
docker build -t mindrouter-sidecar:latest \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git:sidecar
```
Run the sidecar
```bash
# Run bound to localhost only (nginx will proxy external traffic)
docker run -d --name gpu-sidecar \
  --gpus all \
  -p 127.0.0.1:18007:8007 \
  --env-file /etc/mindrouter/sidecar.env \
  --restart unless-stopped \
  mindrouter-sidecar:v2.0.0
```
Upgrade to a new version
```bash
docker build -t mindrouter-sidecar:v2.0.0 \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
docker stop gpu-sidecar && docker rm gpu-sidecar
docker run -d --name gpu-sidecar \
  --gpus all \
  -p 127.0.0.1:18007:8007 \
  --env-file /etc/mindrouter/sidecar.env \
  --restart unless-stopped \
  mindrouter-sidecar:v2.0.0
```
In production, bind the sidecar to localhost only (as shown above) and use an nginx reverse proxy on port 8007 to handle external traffic.
The env file at /etc/mindrouter/sidecar.env persists the secret key across upgrades — generate it once per node and it's reused automatically.
Testing
MindRouter has a comprehensive test suite covering unit, integration, end-to-end, smoke, stress, and accessibility tests.
Quick Reference
| Command | Description |
|---|---|
make test-unit | Run unit tests (525+ tests) |
make test-int | Integration tests (requires live backends) |
make test-e2e | End-to-end tests |
make test-smoke | Smoke tests (full API surface) |
make test-stress | Load/stress tests |
make test-a11y | WCAG 2.1 accessibility tests |
make test-sidecar | GPU sidecar agent tests |
make test-all | Run all test suites |