MindRouter Documentation

MindRouter is a production-ready LLM inference load balancer and translation layer that fronts a heterogeneous cluster of Ollama and vLLM inference backends. It provides a unified OpenAI-compatible API surface with native Ollama compatibility, fair-share scheduling, per-user quotas, full audit logging, and real-time GPU telemetry.

Developed by Luke Sheneman, Research Computing and Data Services (RCDS), Institute for Interdisciplinary Data Sciences (IIDS), University of Idaho.


Overview

MindRouter sits between API consumers and GPU inference servers, providing:

  • Unified API Gateway — OpenAI-compatible /v1/*, Ollama-compatible /api/*, and Anthropic-compatible /anthropic/v1/* endpoints, all backed by the same pool of inference servers.
  • Cross-Engine Routing — A request arriving in OpenAI, Ollama, or Anthropic format can be served by any backend. The translation layer handles all protocol conversion transparently.
  • Fair-Share Scheduling — Weighted Deficit Round Robin (WDRR) ensures equitable GPU access across users in different groups with configurable priorities.
  • Multi-Modal Support — Text chat, text completion, embeddings, multimodal models, structured JSON outputs, and tool calling (function calling).
  • Per-User Quotas — Token budgets, requests-per-minute limits, and concurrent request caps, with defaults inherited from the user's group.
  • Full Audit Logging — Every prompt, response, and token count is recorded for compliance and review.
  • Real-Time GPU Telemetry — Per-GPU utilization, memory, temperature, and power metrics via lightweight sidecar agents.
  • Web Dashboards — Public status page, user self-service dashboard, admin control panel, and built-in chat interface.

Who It's For

  • Research computing centers managing shared GPU clusters for multiple user groups
  • Universities providing LLM access to students, staff, and faculty with differentiated quotas
  • Organizations needing a unified API gateway across mixed Ollama/vLLM infrastructure

Architecture

MindRouter follows a layered architecture:

Client Request (OpenAI, Ollama, or Anthropic format)
        |
        v
+-----------------------------+
|     API Gateway Layer       |  ← /v1/*, /api/*, /anthropic/*, /api/admin/*
+-----------------------------+
|  Authentication & Quotas    |  ← API key verification, rate limiting
+-----------------------------+
|    Translation Layer        |  ← OpenAI/Ollama/Anthropic ↔ Canonical ↔ Ollama/vLLM
+-----------------------------+
|   Fair-Share Scheduler      |  ← WDRR with per-user deficit counters
+-----------------------------+
|    Backend Registry         |  ← Health monitoring, model tracking
+-----------------------------+
        |
        v
+---------------+-------------+
|  GPU Node 1   |  GPU Node 2 |  ...
|  +---------+  |  +--------+ |
|  | Sidecar |  |  |Sidecar | |  ← Per-node GPU metrics agent
|  +---------+  |  +--------+ |
|  | Ollama  |  |  |  vLLM  | |  ← Inference engines
|  +---------+  |  +--------+ |
+---------------+-------------+

Key concepts:

  • A Node is a physical GPU server running a sidecar agent.
  • A Backend is an inference endpoint (Ollama or vLLM instance) running on a node. Multiple backends can share a node, each assigned specific GPUs via gpu_indices.
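MindRouter's scheduler itself is not shown in this document, but the deficit-counter idea behind Weighted Deficit Round Robin can be sketched in a few lines of Python. This is a toy illustration, not MindRouter's code: each round, a user's deficit grows by weight × quantum, and queued jobs dispatch while their cost fits within the accumulated deficit.

```python
from collections import deque

def wdrr_schedule(queues, weights, quantum=100, rounds=3):
    """Toy WDRR: dispatch jobs in rounds, favoring higher-weight users.

    queues:  {user: deque of (job_id, cost)}
    weights: {user: scheduler weight}
    """
    deficits = {user: 0 for user in queues}
    order = []
    for _ in range(rounds):
        for user, q in queues.items():
            if not q:
                continue
            # Each visit, the user earns weight * quantum worth of service.
            deficits[user] += weights[user] * quantum
            # Dispatch jobs while their cost fits the accumulated deficit.
            while q and q[0][1] <= deficits[user]:
                job, cost = q.popleft()
                deficits[user] -= cost
                order.append(job)
    return order

queues = {
    "faculty1": deque([("f-job1", 150), ("f-job2", 150)]),
    "student1": deque([("s-job1", 150), ("s-job2", 150)]),
}
weights = {"faculty1": 3, "student1": 1}
print(wdrr_schedule(queues, weights))  # faculty jobs clear first; student jobs follow
```

With weight 3 versus 1, the faculty user's deficit accumulates three times faster, so equal-cost work drains from the faculty queue sooner while the student queue still makes steady progress.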

Getting Started

Prerequisites

  • Docker and Docker Compose
  • Python 3.11+ (for local development)

Quickstart with Docker Compose

# 1. Clone and configure
git clone <repository-url>
cd mindrouter
cp .env.example .env
nano .env  # Set DATABASE_URL, SECRET_KEY, etc.

# 2. Start all services
docker compose up --build

# 3. Run database migrations
docker compose exec app alembic upgrade head

# 4. Seed development data (creates users, quotas, API keys)
docker compose exec app python scripts/seed_dev_data.py

Default Development Credentials

After running the seed script:

Username | Password   | Group   | Scheduler Weight
admin    | admin123   | Admin   | 10
faculty1 | faculty123 | Faculty | 3
staff1   | staff123   | Staff   | 2
student1 | student123 | Student | 1

Accessing the Application

URL                             | Description
http://localhost:8000/          | Public status page
http://localhost:8000/dashboard | User dashboard (login required)
http://localhost:8000/admin     | Admin dashboard (admin group required)
http://localhost:8000/chat      | Chat interface (login required)
http://localhost:8000/docs      | Interactive API docs (Swagger UI)
http://localhost:8000/redoc     | API reference (ReDoc)

API Reference

Interactive API Documentation

MindRouter includes built-in interactive API documentation powered by FastAPI:

  • Swagger UI at /docs — Interactive API explorer where you can try endpoints directly from your browser. Supports authentication via the "Authorize" button (enter your API key as a Bearer token).
  • ReDoc at /redoc — Clean, readable API reference with request/response schemas and examples.

Both are auto-generated from the application's route definitions and Pydantic models, so they always reflect the current API surface.

Authentication

All inference and admin endpoints require authentication. MindRouter supports two methods:

API Key (Bearer Token):

curl -H "Authorization: Bearer mr2_your-api-key" http://localhost:8000/v1/models

API Key (Header):

curl -H "X-API-Key: mr2_your-api-key" http://localhost:8000/v1/models

Session Cookie (dashboard/admin AJAX only): Browser-based dashboard calls authenticate via the mindrouter_session cookie set at login. This is used internally by the web UI and is not intended for programmatic access.

Error Responses

Error responses are returned as JSON. The simplest form, used by admin and dashboard routes, is a plain detail object:

{
  "detail": "Human-readable error message"
}

Common HTTP status codes:

Code | Meaning
400  | Invalid request body or parameters
401  | Missing or invalid API key
403  | Insufficient permissions (e.g., non-admin accessing admin endpoint)
404  | Resource not found
409  | Conflict (duplicate name, URL, etc.)
429  | Rate limit exceeded
500  | Internal server error

Error Response Formats by API Style

Error response formats differ depending on which API style the client is using:

  • OpenAI (/v1/*) — Returns a nested error object: {"error": {"message": "...", "type": "...", "code": ...}}
  • Ollama (/api/*) — Returns a plain detail string: {"detail": "..."}
  • Anthropic (/anthropic/v1/*) — Returns a typed error object: {"type": "error", "error": {"type": "...", "message": "..."}}
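A client that talks to more than one API style can normalize these three shapes into one message string. The helper below is illustrative only, not part of MindRouter:

```python
def extract_error_message(body: dict) -> str:
    """Pull a human-readable message out of any of the three error shapes."""
    # OpenAI style: {"error": {"message": ..., "type": ..., "code": ...}}
    # Anthropic style: {"type": "error", "error": {"type": ..., "message": ...}}
    err = body.get("error")
    if isinstance(err, dict) and "message" in err:
        return err["message"]
    # Ollama style: {"detail": "..."}
    if "detail" in body:
        return body["detail"]
    return "unknown error"

print(extract_error_message({"detail": "Rate limit exceeded"}))  # → Rate limit exceeded
```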

Model Name Matching

Model names in requests are matched exactly against the model catalog. There is no prefix matching, alias resolution, or fuzzy matching. The model field must match a model name as listed by /v1/models or /api/tags.

OpenAI-Compatible Endpoints

These endpoints accept and return data in the OpenAI API format. Any OpenAI-compatible client or SDK can be pointed at MindRouter by changing the base URL.

Method | Path                 | Auth    | Description
POST   | /v1/chat/completions | API Key | Chat completions (streaming and non-streaming)
POST   | /v1/completions      | API Key | Text completions (legacy)
POST   | /v1/embeddings       | API Key | Generate embeddings
POST   | /v1/rerank           | API Key | Rerank documents against a query
POST   | /v1/score            | API Key | Score similarity between text pairs
GET    | /v1/models           | API Key | List available models

Chat Completions

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'

Response:

{
  "id": "chatcmpl-abc123...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama3.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 10,
    "total_tokens": 35
  }
}

Streaming — Set "stream": true to receive Server-Sent Events (SSE):

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
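In the standard OpenAI streaming framing, each event is a data: line carrying a JSON chunk whose choices[0].delta.content holds the next text fragment, and the stream ends with data: [DONE]. A minimal parser over captured lines, shown here without a live server:

```python
import json

def assemble_sse(lines):
    """Assemble assistant text from OpenAI-style SSE 'data:' lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "index": 0}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}, "index": 0}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}, "index": 0}]}',
    "data: [DONE]",
]
print(assemble_sse(sample))  # → Hello!
```

In practice an OpenAI-compatible SDK does this assembly for you; the sketch just shows what arrives on the wire.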

Thinking/Reasoning Mode:

MindRouter supports multiple formats for controlling thinking/reasoning on models that support it (qwen3.5, qwen3, gpt-oss):

// gpt-oss: control reasoning depth
{
  "model": "openai/gpt-oss-120b",
  "messages": [{"role": "user", "content": "Solve this step by step"}],
  "reasoning_effort": "high",
  "max_completion_tokens": 16384
}

// Qwen-style: toggle thinking on/off
{
  "model": "qwen/qwen3.5-400b",
  "messages": [{"role": "user", "content": "Explain quantum computing"}],
  "chat_template_kwargs": {"enable_thinking": true},
  "max_completion_tokens": 16384
}

When thinking is enabled, the response includes reasoning_content alongside content.

Important: Thinking models can consume large numbers of output tokens for reasoning. Use max_completion_tokens (or max_tokens) to set an adequate budget — 16384 is recommended for qwen3.5-400b with thinking enabled.

Output Token Limits:

MindRouter accepts both max_completion_tokens (preferred, current OpenAI standard) and max_tokens (legacy). If both are provided, max_completion_tokens takes priority.
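The precedence rule amounts to one conditional; this hypothetical helper simply mirrors the statement above:

```python
def resolve_output_limit(max_completion_tokens=None, max_tokens=None, default=None):
    """max_completion_tokens wins when both are supplied; max_tokens is the fallback."""
    if max_completion_tokens is not None:
        return max_completion_tokens
    if max_tokens is not None:
        return max_tokens
    return default

print(resolve_output_limit(max_completion_tokens=16384, max_tokens=500))  # → 16384
```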

Structured Output (JSON Mode):

{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "List 3 colors as JSON"}],
  "response_format": {"type": "json_object"}
}

Structured Output (JSON Schema):

{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "List 3 colors"}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "colors",
      "schema": {
        "type": "object",
        "properties": {
          "colors": {"type": "array", "items": {"type": "string"}}
        }
      }
    }
  }
}

Vision (Multimodal):

{
  "model": "llava",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ]
}
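The image_url block accepts a base64 data URL. Building one from raw image bytes can be sketched as follows (the field names match the request above; the helper itself is illustrative):

```python
import base64

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a multimodal user message with an inline base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

msg = image_message("What's in this image?", b"\xff\xd8\xff\xe0fake-jpeg-bytes")
print(msg["content"][1]["image_url"]["url"][:30])  # data:image/jpeg;base64,...
```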

Tool Calling (Function Calling):

{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What's the weather in Seattle?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}

When the model decides to call a tool, the response includes tool_calls with finish_reason: "tool_calls". Submit the tool result back as a role: "tool" message with the matching tool_call_id.
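The round trip can be sketched offline by dispatching a sample tool_calls response to a local function and building the follow-up role: "tool" message. Here get_weather is a stand-in implementation, not anything MindRouter provides:

```python
import json

def get_weather(city: str) -> dict:
    """Stand-in tool implementation for the example."""
    return {"city": city, "forecast": "rain"}

TOOLS = {"get_weather": get_weather}

def handle_tool_calls(message: dict) -> list:
    """Run each requested tool and build role:'tool' follow-up messages."""
    follow_ups = []
    for call in message.get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        follow_ups.append({
            "role": "tool",
            "tool_call_id": call["id"],      # must match the model's call id
            "content": json.dumps(fn(**args)),
        })
    return follow_ups

# A sample assistant message returned with finish_reason "tool_calls":
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Seattle"}'},
    }],
}
print(handle_tool_calls(assistant_msg))
```

The follow-up messages are appended to the conversation (after the assistant message containing tool_calls) and the request is re-sent so the model can produce its final answer.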

Embeddings

curl -X POST http://localhost:8000/v1/embeddings \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "input": "Hello world"}'

Reranking

curl -X POST http://localhost:8000/v1/rerank \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Reranker-8B",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of AI.",
      "The weather is sunny today.",
      "Deep learning uses neural networks."
    ],
    "top_n": 2
  }'

Scoring

curl -X POST http://localhost:8000/v1/score \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Reranker-8B",
    "text_1": "What is machine learning?",
    "text_2": ["Machine learning is a subset of AI.", "The weather is sunny today."]
  }'

List Models

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer mr2_your-api-key"

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama3.3:70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "mindrouter",
      "capabilities": {
        "multimodal": false,
        "embeddings": false,
        "structured_output": true,
        "thinking": false
      },
      "backends": ["node1-gpu0", "node1-gpu2"],
      "context_length": 32768,
      "model_max_context": 131072,
      "parameter_count": "70B",
      "quantization": "Q4_K_M",
      "family": "llama"
    }
  ]
}

Fields include:

  • context_length — effective context window (num_ctx injected per request)
  • model_max_context — architectural maximum context the model supports
  • parameter_count — model size (e.g. "7B", "70B")
  • quantization — quantization level (e.g. "Q4_K_M", "FP16")
  • family — model family (e.g. "llama", "qwen2")
  • capabilities.thinking — whether the model supports thinking/reasoning mode

Ollama-Compatible Endpoints

These endpoints accept and return data in Ollama's native format. Ollama clients can be pointed at MindRouter as a drop-in replacement.

Method | Path            | Auth    | Description
POST   | /api/chat       | API Key | Ollama chat (streaming by default)
POST   | /api/generate   | API Key | Ollama text generation
POST   | /api/embeddings | API Key | Ollama embeddings
GET    | /api/tags       | API Key | List models (Ollama format)

List Models (Ollama Format)

curl http://localhost:8000/api/tags \
  -H "Authorization: Bearer mr2_your-api-key"

Response:

{
  "models": [
    {
      "name": "llama3.3:70b",
      "model": "llama3.3:70b",
      "modified_at": "2026-02-28T12:00:00",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "parameter_size": "70B",
        "quantization_level": "Q4_K_M"
      },
      "context_length": 32768,
      "model_max_context": 131072
    }
  ]
}

Ollama Chat

curl -X POST http://localhost:8000/api/chat \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

Note: Ollama defaults to stream: true. Set "stream": false explicitly for non-streaming responses.
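When streaming, Ollama's format is newline-delimited JSON: each line carries a message.content fragment, and a final object has done: true. A minimal consumer over captured lines (a sketch; no live server needed):

```python
import json

def assemble_ollama_stream(lines):
    """Concatenate message.content fragments from an Ollama NDJSON stream."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        if chunk.get("done"):
            break  # final chunk carries stats, not content
        text.append(chunk.get("message", {}).get("content", ""))
    return "".join(text)

sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"done": true, "eval_count": 10}',
]
print(assemble_ollama_stream(sample))  # → Hello!
```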

Thinking/Reasoning Mode (Ollama):

Use the think field at the top level:

// Qwen-style: boolean toggle
{"model": "qwen3-32k:32b", "messages": [...], "think": true, "stream": false}

// gpt-oss: string effort level
{"model": "gpt-oss-32k:120b", "messages": [...], "think": "high", "stream": false}

The response includes a thinking field in the message. For /api/generate, thinking content appears as a top-level thinking field alongside response.

Ollama Generate

curl -X POST http://localhost:8000/api/generate \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'

Anthropic-Compatible Endpoint

This endpoint accepts and returns data in the Anthropic Messages API format. Anthropic SDK clients (Python, TypeScript) can be pointed at MindRouter by setting base_url.

Method | Path                   | Auth    | Description
POST   | /anthropic/v1/messages | API Key | Anthropic Messages API (streaming and non-streaming)

Messages

curl -X POST http://localhost:8000/anthropic/v1/messages \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "max_tokens": 500,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Response:

{
  "id": "msg_abc123...",
  "type": "message",
  "role": "assistant",
  "model": "llama3.2",
  "content": [
    {"type": "text", "text": "Hello! How can I help you today?"}
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 12
  }
}

Streaming — Set "stream": true to receive Anthropic SSE events (message_start, content_block_delta, message_stop, etc.):

curl -X POST http://localhost:8000/anthropic/v1/messages \
  -H "Authorization: Bearer mr2_your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "max_tokens": 500, "messages": [{"role": "user", "content": "Hi"}], "stream": true}'

System Prompt:

{
  "model": "llama3.2",
  "max_tokens": 500,
  "system": "You are a helpful assistant.",
  "messages": [{"role": "user", "content": "Hello!"}]
}

SDK Usage (Python)

import anthropic
client = anthropic.Anthropic(
    base_url="http://localhost:8000/anthropic",
    api_key="mr2_your-api-key",
)
message = client.messages.create(
    model="llama3.2",
    max_tokens=500,
    messages=[{"role": "user", "content": "Hello!"}],
)

Supported features:

  • System prompts (string or content block array)
  • Multimodal inputs (base64 and URL images)
  • Tool calling — tools with input_schema, tool_choice (auto/any/tool), tool_use/tool_result content blocks, streaming tool use with input_json_delta
  • Thinking/reasoning mode (thinking.type: enabled, adaptive, disabled)
  • Structured output via output_config.format with type: "json_schema"
  • Parameters: max_tokens, temperature, top_p, top_k, stop_sequences, stream
  • metadata.user_id mapping

Note: This is inbound-only — there are no Anthropic backends. Requests are translated to canonical format and routed to Ollama/vLLM backends like any other request.

Health & Metrics Endpoints

These endpoints are unauthenticated and intended for monitoring infrastructure.

Method | Path                      | Auth | Description
GET    | /healthz                  | None | Liveness probe (always 200 if app is running)
GET    | /readyz                   | None | Readiness probe (checks DB + healthy backends)
GET    | /metrics                  | None | Prometheus metrics (text format)
GET    | /status                   | None | Cluster status summary (JSON)
GET    | /api/cluster/total-tokens | None | Total tokens ever served (cached 10s)
GET    | /api/cluster/trends       | None | Token and active-user trends (?range=hour/day/week/month/year)
GET    | /api/cluster/throughput   | None | Token throughput, requests/min, active requests

Prometheus Metrics

The /metrics endpoint exposes the following Prometheus metrics:

Metric                             | Type      | Labels                   | Description
mindrouter_requests_total          | Counter   | endpoint, status         | Total requests processed
mindrouter_request_latency_seconds | Histogram | endpoint                 | Request latency
mindrouter_queue_size              | Gauge     | (none)                   | Current scheduler queue depth
mindrouter_active_backends         | Gauge     | (none)                   | Number of healthy backends
mindrouter_tokens_total            | Counter   | type (prompt/completion) | Total tokens processed
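Prometheus scrapes this endpoint directly, but the text exposition format is also easy to inspect by hand. A tiny parser for simple counter and gauge lines (illustrative; the sample values are made up):

```python
def parse_metrics(text: str) -> dict:
    """Parse simple Prometheus exposition lines into {metric_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")  # value is the last whitespace-separated token
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP mindrouter_queue_size Current scheduler queue depth
# TYPE mindrouter_queue_size gauge
mindrouter_queue_size 4
mindrouter_requests_total{endpoint="/v1/chat/completions",status="200"} 1523
"""
print(parse_metrics(sample))
```

This covers only un-timestamped counter/gauge samples; a real scraper should use a Prometheus client library.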

Admin API Endpoints

All admin endpoints require membership in a group with is_admin = true. They are mounted under /api/admin/.

Backend Management

Method | Path                                          | Description
POST   | /api/admin/backends/register                  | Register a new inference backend
GET    | /api/admin/backends                           | List all backends
PATCH  | /api/admin/backends/{id}                      | Update backend properties
POST   | /api/admin/backends/{id}/disable              | Disable a backend
POST   | /api/admin/backends/{id}/enable               | Enable a disabled backend
POST   | /admin/backends/{id}/drain                    | Initiate graceful drain (stops new requests, waits for in-flight to complete, then disables). Dashboard route only, not available via API.
POST   | /api/admin/backends/{id}/refresh              | Force-refresh capabilities and models
POST   | /api/admin/backends/{id}/ollama/pull          | Start pulling a model to an Ollama backend
GET    | /api/admin/backends/{id}/ollama/pull/{job_id} | Poll progress of an Ollama model pull
POST   | /api/admin/backends/{id}/ollama/delete        | Delete a model from an Ollama backend

Node Management

Method | Path                          | Description
POST   | /api/admin/nodes/register     | Register a new GPU node
GET    | /api/admin/nodes              | List all nodes
PATCH  | /api/admin/nodes/{id}         | Update node properties
DELETE | /api/admin/nodes/{id}         | Delete a node (fails if backends reference it)
POST   | /api/admin/nodes/{id}/refresh | Force-refresh sidecar data

User Management

Method | Path                           | Description
GET    | /api/admin/users               | List users (filterable by group, searchable by name/email)
POST   | /api/admin/users               | Create a new user with group-based quota defaults
GET    | /api/admin/users/{id}          | User detail with usage stats, API keys, monthly usage
PATCH  | /api/admin/users/{id}          | Update user profile, group, and quota overrides
DELETE | /api/admin/users/{id}          | Hard-delete a user and all associated data
POST   | /api/admin/users/{id}/api-keys | Create an API key for a user

Group Management

Method | Path                   | Description
GET    | /api/admin/groups      | List all groups with user counts
POST   | /api/admin/groups      | Create a new group with quota defaults
PATCH  | /api/admin/groups/{id} | Update group defaults (token budget, RPM, weight, etc.)
DELETE | /api/admin/groups/{id} | Delete a group (fails if users are assigned)

API Key Management

Method | Path                | Description
GET    | /api/admin/api-keys | List all API keys across users (searchable, filterable by status)

Quota Management

Method | Path                                  | Description
GET    | /api/admin/quota-requests             | List pending quota increase requests
POST   | /api/admin/quota-requests/{id}/review | Approve or deny a quota request

Conversations

Method | Path                            | Description
GET    | /admin/conversations/export     | Export conversations as CSV or JSON (form-based, filterable)
GET    | /api/admin/conversations/export | Bulk API export (JSON with content, for programmatic access)

Queue & Audit

Method | Path                    | Description
GET    | /api/admin/queue        | Scheduler queue statistics
GET    | /api/admin/audit/search | Search audit logs (filter by user, model, status, date, text)
GET    | /api/admin/audit/{id}   | Full audit detail including prompt and response content

Telemetry

Method | Path                                         | Description
GET    | /api/admin/telemetry/overview                | Cluster-wide telemetry (nodes, backends, GPUs)
GET    | /api/admin/telemetry/latest                  | Lightweight polling endpoint for dashboard
GET    | /api/admin/telemetry/backends/{id}/history   | Time-series telemetry for a backend
GET    | /api/admin/telemetry/gpus/{id}/history       | Time-series telemetry for a GPU device
GET    | /api/admin/telemetry/nodes/{id}/history      | Aggregated time-series for a node (all GPUs)
GET    | /api/admin/telemetry/export                  | Export raw telemetry as JSON or CSV

Web Dashboard

MindRouter includes a full web dashboard built with Bootstrap 5. All pages extend a common base template with navigation and accessibility features (WCAG 2.1 Level AA).

Public Pages

Page           | URL          | Description
Cluster Status | /            | Shows healthy backend count, available models, queue size, and overall cluster status
Login          | /login       | Username/password authentication (Azure AD SSO when configured)
Blog           | /blog        | Public blog listing with published posts
Blog Post      | /blog/{slug} | Individual blog post rendered from Markdown

User Dashboard

Page            | URL                             | Description
Dashboard       | /dashboard                      | Token usage progress bar, active API keys, quota usage history, change password
Change Password | POST /dashboard/change-password | Change password for local (non-SSO) accounts. Requires current password, new password (min 8 chars), and confirmation. Not available for Azure AD SSO users.
Request Quota   | /dashboard/request-quota        | Submit a quota increase request with justification
Key Created     | (after creation)                | Displays the full API key once (copy-to-clipboard)

The user dashboard includes:

  • Dark mode toggle — saved to browser localStorage, persists across sessions
  • Live token usage — usage statistics poll every second for real-time feedback without page refresh
  • Lifetime vs rolling usage — displays both Lifetime Token Usage (all-time total, never resets) and Current Period Usage (resets when the budget period rolls over)
  • Quota details — current RPM limit and max concurrent requests shown in the Quota Details card
  • API key expiration warnings — keys within 7 days of expiration show a yellow countdown; expired keys display an “Expired” badge; “Last Used” column shows last authentication time

Admin Dashboard

The admin dashboard has a persistent sidebar with links to all admin pages. Access requires membership in a group with is_admin = true.

Page               | URL                  | Description
Overview           | /admin               | System metrics overview, health alert banner for unhealthy backends/offline nodes, pending request badges, system force offline/online toggle
Backends           | /admin/backends      | Backend health, models, enable/disable/drain controls
Nodes              | /admin/nodes         | GPU node management, sidecar status, hardware specs, take offline/bring online/force drain controls
Models             | /admin/models        | Model management: multimodal and thinking overrides, full metadata overrides (family, parameters, quantization, context length, etc.), Ollama pull/delete
GPU Metrics        | /admin/metrics       | Real-time GPU utilization, memory, temperature, power charts with time range controls
Users              | /admin/users         | User accounts with search, group filter, and pagination
User Detail        | /admin/users/{id}    | Individual user profile, usage stats, API keys, monthly usage chart, quota overrides, masquerade (view dashboard as user)
Groups             | /admin/groups        | Group management: create, edit, delete groups with quota defaults and scheduler weights
API Keys           | /admin/api-keys      | All API keys across users with search, status filter, and pagination
Requests           | /admin/requests      | Pending API key and quota increase requests, approve/deny
Audit Log          | /admin/audit         | Inference request history with filtering, search, and CSV/JSON export
Conversations      | /admin/conversations | Browse and export all user conversations
Chat Configuration | /admin/chat-config   | Configure chat UI defaults: core models, default model, system prompt, max tokens, temperature, thinking mode
Blog Management    | /admin/blog          | Blog CMS: create, edit, publish/unpublish, and delete blog posts
Site Settings      | /admin/settings      | Global settings: display timezone, Ollama context length enforcement

System Force Offline/Online

Administrators can force the entire MindRouter system offline from the admin overview page (/admin). This is useful for planned maintenance windows or emergency situations.

  • Force Offline (POST /admin/system/toggle-online) — Stops the backend polling loop, marks all backends as unhealthy in the database, and sets an internal _force_offline flag. While offline, no health checks or discovery runs, and no inference requests can be served.
  • Force Online — Clears the offline flag, closes and reloads all backend adapters and sidecar clients from the database, restarts the polling loop, and immediately runs a full poll cycle to restore backend health status.

The toggle is available as a button on the admin overview page. The system status is visible to all users on the public status page.

Node Lifecycle Management

Beyond basic node registration and editing, administrators can manage node operational state from the Nodes page (/admin/nodes):

Action               | Route                                 | Description
Take Offline         | POST /admin/nodes/{id}/take-offline   | Disables all backends on the node and marks the node status as OFFLINE. New requests will not be routed to any backend on this node.
Bring Online         | POST /admin/nodes/{id}/bring-online   | Re-enables all backends on the node and marks the node status as ONLINE. Backends will begin receiving requests again after the next health poll.
View Active Requests | GET /admin/nodes/{id}/active-requests | Returns a JSON count of in-flight requests across all backends on the node. Useful for monitoring drain progress before taking a node offline.
Force Drain          | POST /admin/nodes/{id}/force-drain    | Force-cancels all active (in-flight) requests on the node's backends. Returns the number of cancelled requests. Use this when you need to take a node offline immediately without waiting for requests to complete naturally.

Recommended workflow for node maintenance:

  1. Take the node offline to stop new requests from being routed there.
  2. Monitor active requests using the active requests endpoint until the count reaches zero.
  3. If requests are stuck, use force drain to cancel them.
  4. Perform maintenance (upgrade vLLM, restart Ollama, etc.).
  5. Bring the node back online.
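Steps 2 and 3 of this workflow can be automated. The sketch below takes the HTTP calls as injected callables (which would wrap GET .../active-requests and POST .../force-drain); it is an illustration of the polling logic, not a MindRouter utility:

```python
import time

def wait_for_drain(get_active_count, force_drain, timeout_s=300, poll_s=5):
    """Poll the in-flight request count; force-drain if still busy at timeout.

    get_active_count: callable returning the node's active request count
    force_drain:      callable that cancels remaining in-flight requests
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_active_count() == 0:
            return "drained"       # safe to start maintenance
        time.sleep(poll_s)
    force_drain()                  # step 3: cancel stuck requests
    return "force-drained"

# Example with stubbed callables: the count drops to zero after two polls.
counts = iter([3, 1, 0, 0])
print(wait_for_drain(lambda: next(counts), lambda: None, timeout_s=60, poll_s=0))
```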

Admin Models Page

The Models page at /admin/models provides comprehensive model management. Models are grouped by name (since the same model may appear on multiple backends) and presented with their current metadata and override status.

Capability Overrides

MindRouter auto-detects model capabilities during discovery, but admins can override the detected values:

  • Multimodal override — Toggle or reset the supports_multimodal flag for all instances of a model. Use this when auto-detection incorrectly classifies a model (e.g., a vision model whose name does not match the usual patterns).
  • Thinking override — Toggle or reset the supports_thinking flag for all instances. Use this for models that support thinking/reasoning mode but are not detected automatically.
  • Resetting an override returns the field to auto-detected values from the next discovery cycle.

Metadata Overrides

Admins can override any model metadata field across all instances of a model via the metadata edit form. Overridable fields:

Field               | Type   | Description
family              | string | Model family (e.g., llama, qwen2)
parameter_count     | string | Model size (e.g., 7B, 70B)
quantization        | string | Quantization level (e.g., Q4_K_M, FP16)
context_length      | int    | Effective context window (injected as num_ctx). Overrides the auto-discovered value.
embedding_length    | int    | Embedding dimension length
head_count          | int    | Number of attention heads
layer_count         | int    | Number of transformer layers
feed_forward_length | int    | Feed-forward network dimension
model_format        | string | Model format (e.g., gguf, safetensors)
parent_model        | string | Parent model identifier
capabilities        | string | Comma-separated list of capabilities (stored as JSON array)
description         | string | Human-readable model description
model_url           | string | URL to model documentation or repository

Clearing a field (setting it to blank) removes the override and reverts to the auto-discovered value. Admins can also reset all overrides at once via the "Reset All Overrides" action.

Ollama Model Pull and Delete

The Models page lists all Ollama backends and provides UI controls to:

  • Pull a model — Trigger a model download on a specific Ollama backend via the admin API (POST /api/admin/backends/{id}/ollama/pull). Progress can be polled until completion.
  • Delete a model — Remove a model from a specific Ollama backend via the admin API (POST /api/admin/backends/{id}/ollama/delete).

Admin Masquerade

Administrators can "masquerade" as another user to view the user dashboard from that user's perspective. This is useful for troubleshooting user-reported issues or verifying quota and usage display.

  • Start masquerade (POST /admin/masquerade/{target_user_id}) — Sets a signed cookie (mindrouter_masquerade) containing the target user ID. The cookie is signed using itsdangerous.URLSafeTimedSerializer with the application's SECRET_KEY and a "masquerade" salt. The cookie expires after 24 hours.
  • During masquerade — The user dashboard (/dashboard) shows the target user's data (usage stats, API keys, quota, conversations) instead of the admin's own data. The masquerade applies only to read-only dashboard views — admin routes and actions are never affected.
  • Stop masquerade (POST /admin/masquerade/stop) — Deletes the masquerade cookie and redirects back to the admin users page.
  • Security — Only users in an admin group can start a masquerade. The target user ID is verified to exist before the cookie is set. The signed cookie prevents tampering.

Masquerade is initiated from the user detail page (/admin/users/{id}) via a "View as User" button.

Audit Log Export

Administrators can export audit log data as CSV or JSON via GET /admin/audit/export. The export supports the same filters as the audit log UI:

Parameter       | Type   | Description
format          | string | Output format: csv (default) or json
search          | string | Free-text search across request fields
user_id_filter  | int    | Filter by user ID
model_filter    | string | Filter by model name
status_filter   | string | Filter by request status
start_date      | string | Start date (ISO 8601 format)
end_date        | string | End date (ISO 8601 format)
include_content | bool   | Include prompt/response content in export (default: false)

Export fields: request_uuid, created_at, user_id, model, endpoint, status, prompt_tokens, completion_tokens, total_tokens, total_time_ms, error_message. When include_content=true, additional fields are included: messages, prompt, parameters, response_content, finish_reason.

Exports are limited to 10,000 records. The file is downloaded as an attachment (audit_log.csv or audit_log.json).

Admin Chat Configuration

The Chat Configuration page at /admin/chat-config allows administrators to control default settings for the built-in chat interface. All settings are stored in the AppConfig database and take effect immediately.

Setting       | Description                                        | Details
Core Models   | Subset of models shown in the chat model selector  | Select from all available (non-embedding) models on healthy backends. When set, only these models appear in the chat UI dropdown. Empty means show all models.
Default Model | Pre-selected model in the chat UI                  | The model automatically selected when a user starts a new conversation. Blank means no default.
System Prompt | Global system prompt for all chat conversations    | Prepended to every chat conversation. Blank removes the override and reverts to the built-in default. Can also be explicitly reset via a "Reset" button.
Max Tokens    | Default max tokens for chat requests               | Range: 256–131072. Default: 16384. Clamped to the allowed range on save.
Temperature   | Default temperature for chat responses             | Range: 0.0–2.0. Blank means use the model's default temperature.
Thinking Mode | Default thinking/reasoning mode                    | Options: true, false, low, medium, high, or blank (no default). Controls whether thinking is enabled by default for models that support it.

Chat Interface

| Page | URL | Description |
|---|---|---|
| Chat | /chat | Full-featured chat UI with model selection, streaming, file upload, multimodal support |

The chat interface supports:

  • Collapsible conversation sidebar
  • Model and backend selection
  • Real-time streaming responses
  • File upload via button or drag-and-drop anywhere in the chat window (images, PDFs, DOCX, XLSX, CSV, JSON, Markdown, etc.)
  • Vision model support with automatic image handling
  • Web search toggle — when enabled, queries the Brave Search API and injects results as context into the system prompt (requires BRAVE_SEARCH_API_KEY)
  • Code syntax highlighting
  • LaTeX rendering
  • Message editing and deletion
  • Dark mode toggle (stored in browser localStorage)
  • Advanced models toggle — show or hide non-core models in the model selector
  • Per-request thinking controls — enable/disable thinking mode and set reasoning effort for supported models
  • Collapsible thinking blocks in assistant responses
  • Keyboard shortcuts — Shift+Enter for newline, Enter to send
  • Sidebar collapse for a wider chat area
  • Copy buttons on code blocks and assistant messages
  • Image lightbox for viewing uploaded images full-size
  • Auto-titling of conversations based on the first message

Users, Groups & Quotas

Group System

MindRouter uses a database-driven group system for authorization, quota defaults, and scheduler weights. Each user belongs to exactly one group. Groups are fully manageable via the admin UI at /admin/groups.

Admin privileges are determined by the group's is_admin flag. Users in an admin group can access the admin dashboard and admin API endpoints.

Default Groups

| Group | Token Budget | RPM | Max Concurrent | Sched. Weight | Admin |
|---|---|---|---|---|---|
| students | 100,000 | 30 | 2 | 1 | No |
| staff | 500,000 | 60 | 4 | 2 | No |
| faculty | 1,000,000 | 120 | 8 | 3 | No |
| researchers | 1,000,000 | 120 | 8 | 3 | No |
| admin | 10,000,000 | 1,000 | 50 | 10 | Yes |
| nerds | 500,000 | 60 | 4 | 2 | No |
| other | 100,000 | 30 | 2 | 1 | No |

These 7 groups are created automatically by the database migration. Admins can create additional groups, edit defaults, or delete empty groups via the admin UI or API.

User Profiles

Each user has the following profile fields (editable by admins at /admin/users/{id}):

  • Username and Email — unique identifiers
  • Full Name — display name
  • Group — determines default quotas, scheduler weight, and admin access
  • College and Department — organizational affiliation
  • Intended Use — free-text description of how the user plans to use the service

Quota Inheritance & Overrides

When a user is created, their quota defaults are inherited from their group. Admins can override any quota value per-user:

  • Token budget — inherited from group, overridable per user
  • RPM limit — inherited from group, overridable per user
  • Max concurrent — inherited from group, overridable per user
  • Weight override — if set, overrides the group's scheduler weight for this user

API Key Lifecycle

  1. Generation — Keys use the format mr2_<random_urlsafe_base64> (48+ characters total).
  2. Storage — The raw key is shown once at creation. Only the Argon2 hash and a prefix (mr2_<first 8 chars>) are stored in the database.
  3. Verification — Lookup by prefix (fast), then full Argon2 hash verification.
  4. Expiration — Optional expires_at timestamp.
  5. Revocation — Keys can be revoked (soft-delete) without deleting the audit trail.
  6. Usage tracking — last_used_at and usage_count are updated atomically on each request.
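
The generate-once / verify-by-prefix flow above can be sketched as follows. This is an illustrative stand-in, not MindRouter's actual code: the real implementation hashes with Argon2, while `hashlib.sha256` is used here only to keep the example self-contained, and the helper names are hypothetical.

```python
import hashlib
import secrets

def generate_key():
    """Return (raw_key, prefix, stored_hash). The raw key is shown only once."""
    raw = "mr2_" + secrets.token_urlsafe(36)          # mr2_<random_urlsafe_base64>
    prefix = raw[:12]                                 # "mr2_" + first 8 chars
    stored_hash = hashlib.sha256(raw.encode()).hexdigest()  # Argon2 in the real system
    return raw, prefix, stored_hash

def verify_key(candidate, prefix, stored_hash):
    """Fast prefix lookup first, then a full constant-time hash comparison."""
    if not candidate.startswith(prefix):
        return False
    return secrets.compare_digest(
        hashlib.sha256(candidate.encode()).hexdigest(), stored_hash
    )

raw, prefix, h = generate_key()
assert len(raw) >= 48 and raw.startswith("mr2_")
assert verify_key(raw, prefix, h)
assert not verify_key(raw + "x", prefix, h)
```

The prefix lookup narrows the database search before the expensive hash verification runs, which is why only the hash plus a short prefix need to be stored.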

Quota System

Each user has a quota record with:

  • Token budget — Total tokens allowed per period. Deducted on each completed request (prompt + completion tokens).
  • RPM limit — Maximum requests per minute.
  • Max concurrent — Maximum simultaneous in-flight requests.

When a quota is exceeded, the request is rejected with HTTP 429.

Rolling Budget Period

Token budgets use rolling 30-day windows per user, not calendar months. Usage is tracked continuously and the oldest usage falls off as the window advances.
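
A rolling window can be sketched with a simple sum over timestamped usage events (hypothetical helpers, not MindRouter's actual implementation): events older than 30 days fall out of the sum, so there is no calendar-month reset.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)

def tokens_used(events, now):
    """Sum token counts for usage events inside the rolling 30-day window."""
    cutoff = now - WINDOW
    return sum(tokens for ts, tokens in events if ts >= cutoff)

def over_budget(events, now, budget):
    return tokens_used(events, now) >= budget

now = datetime(2025, 6, 1)
events = [
    (now - timedelta(days=45), 60_000),   # outside the window: ignored
    (now - timedelta(days=10), 50_000),
    (now - timedelta(days=1), 40_000),
]
assert tokens_used(events, now) == 90_000
assert not over_budget(events, now, budget=100_000)
```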

Quota Increase Requests

Users can submit quota increase requests via the dashboard (/dashboard/request-quota). The request includes:

  • Desired token budget
  • Written justification

Admins review requests at /admin/requests or via POST /api/admin/quota-requests/{id}/review, which can approve (with a custom granted token amount) or deny the request.


Backend Management

Node/Backend Model

MindRouter separates the concept of physical GPU servers (Nodes) from inference endpoints (Backends):

  • A Node represents a physical server with GPUs and a sidecar agent.
  • A Backend is an Ollama or vLLM instance running on a node.
  • One node can host multiple backends, each assigned specific GPUs via gpu_indices.
  • Backends without a node_id work as standalone endpoints (no GPU telemetry).

Node: gpu-server-1 (4x A100-80GB, sidecar at :8007)
+-- Backend: vllm-large  (gpu_indices: [0, 1])  ← uses GPUs 0-1
+-- Backend: vllm-small  (gpu_indices: [2])      ← uses GPU 2
+-- Backend: ollama-misc (gpu_indices: [3])      ← uses GPU 3

Supported Engines

| Engine | Health Check | Model Discovery | Telemetry Source |
|---|---|---|---|
| Ollama | GET /api/tags | GET /api/tags + POST /api/ps (loaded models) | Sidecar agent |
| vLLM | GET /health (fallback: GET /v1/models) | GET /v1/models | GET /metrics (Prometheus format) |

Registration

Register a node:

curl -X POST http://localhost:8000/api/admin/nodes/register \
  -H "Authorization: Bearer admin-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpu-server-1",
    "hostname": "gpu1.example.com",
    "sidecar_url": "http://gpu1.example.com:8007",
    "sidecar_key": "your-sidecar-secret-key"
  }'

Register a backend on that node:

curl -X POST http://localhost:8000/api/admin/backends/register \
  -H "Authorization: Bearer admin-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ollama-gpu1",
    "url": "http://gpu1.example.com:11434",
    "engine": "ollama",
    "max_concurrent": 4,
    "node_id": 1,
    "gpu_indices": [0, 1]
  }'

Enable/Disable/Drain/Refresh

  • Disable a backend to take it out of rotation without deleting it: POST /api/admin/backends/{id}/disable
  • Enable to bring it back: POST /api/admin/backends/{id}/enable
  • Drain for graceful offline maintenance: POST /admin/backends/{id}/drain (dashboard-only route)
  • Refresh to force re-discovery of models and capabilities: POST /api/admin/backends/{id}/refresh

Drain Mode

Drain mode provides graceful backend shutdown for maintenance. When a backend is set to draining:

  1. The scheduler stops routing new requests to the backend.
  2. All in-flight requests are allowed to complete normally.
  3. When the backend's queue depth reaches 0, it automatically transitions to disabled.

This avoids abruptly killing active requests when you need to restart an inference engine, upgrade models, or perform node maintenance. Use it before upgrading vLLM, restarting Ollama, or taking a GPU node offline.
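
The three-step transition above amounts to a small state machine. A minimal sketch (hypothetical class, not MindRouter's actual backend model): a draining backend stops accepting new work and flips to disabled the moment its last in-flight request completes.

```python
class Backend:
    def __init__(self):
        self.state = "enabled"
        self.inflight = 0

    def accepts_new_requests(self):
        return self.state == "enabled"    # draining/disabled get no new work

    def drain(self):
        self.state = "draining"
        self._maybe_disable()             # already idle? disable right away

    def finish_request(self):
        self.inflight -= 1
        self._maybe_disable()

    def _maybe_disable(self):
        # Auto-transition when the queue depth reaches 0.
        if self.state == "draining" and self.inflight == 0:
            self.state = "disabled"

b = Backend()
b.inflight = 2
b.drain()
assert b.state == "draining" and not b.accepts_new_requests()
b.finish_request()
b.finish_request()
assert b.state == "disabled"
```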

Concurrency Alignment

The max_concurrent value registered in MindRouter must match the concurrency limit on the inference engine:

| Engine | Engine Setting | MindRouter Setting |
|---|---|---|
| vLLM | --max-num-seqs N | "max_concurrent": N |
| Ollama | OLLAMA_NUM_PARALLEL=N | "max_concurrent": N |

Why this matters: MindRouter uses max_concurrent to decide how many requests to route to a backend. If MindRouter thinks a backend can handle 8 concurrent requests but vLLM is configured with --max-num-seqs 4, the extra requests queue silently inside vLLM. MindRouter cannot see this hidden queue, so it continues routing requests there instead of spreading load to other backends. The result is uneven load distribution and unpredictable latency — the fair-share scheduler is effectively bypassed for those excess requests.

Context Length (num_ctx)

MindRouter automatically manages context length for Ollama backends:

  • Auto-discovery — During model discovery, context_length is set to min(model_max_context, 32768) to prevent small models from consuming excessive VRAM. The architectural maximum is stored separately as model_max_context.
  • num_ctx injection — For every Ollama request, MindRouter injects num_ctx matching the model's configured context_length.
  • Enforcement toggle — By default, MindRouter overrides any user-supplied num_ctx to prevent GPU memory oversubscription. Admins can toggle this in Site Settings (/admin/settings) to allow users to set their own num_ctx.
  • Manual override — Admins can set context_length_override per model via the admin UI to use a custom value instead of the auto-discovered one.
  • Ollama 0.17+ — Ollama automatically adjusts num_ctx downward if the requested value doesn't fit in GPU memory.

vLLM backends handle context length natively via --max-model-len and do not need num_ctx injection.
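
The clamp-and-inject policy above can be expressed in a few lines (illustrative functions, not the actual discovery code):

```python
CONTEXT_CAP = 32768   # discovery cap; the architectural max is stored separately

def discovered_context_length(model_max_context):
    """Auto-discovery: cap the configured context at 32768."""
    return min(model_max_context, CONTEXT_CAP)

def inject_num_ctx(options, configured, enforce=True):
    """Inject num_ctx into an Ollama request's options.

    With enforcement on (the default), any user-supplied num_ctx is
    overridden; with enforcement off, a user value is respected.
    """
    out = dict(options)
    if enforce or "num_ctx" not in out:
        out["num_ctx"] = configured
    return out

assert discovered_context_length(131072) == 32768
assert discovered_context_length(8192) == 8192
assert inject_num_ctx({"num_ctx": 200000}, 32768)["num_ctx"] == 32768
assert inject_num_ctx({"num_ctx": 4096}, 32768, enforce=False)["num_ctx"] == 4096
```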


Scheduling & Fair Share

MindRouter implements Weighted Deficit Round Robin (WDRR) to ensure fair GPU access across users.

How It Works

  1. Share weights are determined by the user's group (e.g., students=1, staff=2, faculty=3, admin=10). Per-user weight overrides are supported via the quota system.
  2. Each user has a deficit counter that tracks how much service debt they've accrued.
  3. On each scheduling round, users with the highest deficit (most underserved) are served first.
  4. Burst credits allow full cluster utilization when the cluster is idle.
  5. Heavy user deprioritization kicks in when a user exceeds their fair share within the fairness window.
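
A toy version of deficit round robin illustrates the weight/deficit interaction (this is a pedagogical sketch, not the scheduler's actual code; burst credits and heavy-user deprioritization are omitted). Each pass, a waiting user's deficit grows by weight × quantum, and they may dequeue work as long as the deficit covers its cost, so a weight-3 user gets roughly 3x the service rate of a weight-1 user.

```python
def wdrr_schedule(queues, weights, quantum=100):
    """queues: per-user FIFO of request costs. Returns the service order."""
    deficit = {u: 0 for u in queues}
    order = []
    while any(queues.values()):
        for u in queues:
            if not queues[u]:
                continue                          # idle users accrue no deficit
            deficit[u] += weights[u] * quantum    # credit proportional to weight
            while queues[u] and deficit[u] >= queues[u][0]:
                deficit[u] -= queues[u].pop(0)    # serve, paying down deficit
                order.append(u)
    return order

queues = {"student": [100] * 6, "faculty": [100] * 6}
order = wdrr_schedule(queues, {"student": 1, "faculty": 3})
# While both users have work queued, faculty is served 3x as often.
assert order[:8].count("faculty") == 6 and order[:8].count("student") == 2
```

Once the faculty queue empties, the student drains the rest of their queue unopposed, which mirrors the burst behavior: spare capacity always goes to whoever is still waiting.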

Backend Scoring

When multiple backends can serve a request, the scheduler scores them on:

  • Model already loaded (+100 points) — avoids cold-loading the model
  • Low GPU utilization (+50 points) — prefers idle GPUs
  • Low latency (+40 points) — based on EMA of recent response times
  • Short queue (+30 points) — prefers backends with fewer queued requests
  • High throughput (+20 points) — based on recent tokens/second
  • Priority (+N × 10 points) — from the backend's configured priority value

Hard constraints (multimodal capability, embedding support, model availability) are checked before soft scoring.
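
The additive scoring can be sketched as below. The point values come from the list above; the normalization of utilization, latency, and queue depth into [0, 1] is an assumption for illustration, and the function name is hypothetical.

```python
def score_backend(b):
    s = 0.0
    if b["model_loaded"]:
        s += 100                        # avoid cold-loading the model
    s += 50 * (1 - b["gpu_util"])       # prefer idle GPUs (utilization in [0, 1])
    s += 40 * (1 - b["latency_norm"])   # EMA latency, normalized to [0, 1]
    s += 30 * (1 - b["queue_norm"])     # queue depth, normalized to [0, 1]
    s += 20 * b["throughput_score"]     # recent tokens/second, in [0, 1]
    s += b["priority"] * 10             # configured backend priority
    return s

hot = dict(model_loaded=True, gpu_util=0.2, latency_norm=0.1,
           queue_norm=0.0, throughput_score=0.9, priority=1)
cold = dict(model_loaded=False, gpu_util=0.0, latency_norm=0.0,
            queue_norm=0.0, throughput_score=1.0, priority=1)
# A backend with the model already loaded outranks an idle one without it.
assert score_backend(hot) > score_backend(cold)
```

Note the loaded-model bonus (+100) outweighs any combination of the soft signals short of priority, which is why a warm backend is almost always preferred.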

Priority Gating

When multiple requests are waiting for the same model, the scheduler ensures the highest-priority waiter proceeds first. Priority is determined by the user's group scheduler weight and deficit counter, preventing lower-priority requests from starving higher-priority ones during contention.


Retry & Failover

MindRouter automatically retries failed inference requests to improve reliability across the backend cluster.

  • Automatic retries — Up to 3 attempts on 5xx errors, timeouts, and connection failures. Configurable via the BACKEND_RETRY_MAX_ATTEMPTS environment variable.
  • Fast fail on 4xx — Client errors (400, 401, 404, etc.) are never retried and return immediately.
  • Streaming constraints — Retries can only occur before the first chunk is sent to the client. Once streaming has begun, a backend failure is terminal and the stream simply ends (see Mid-Stream Error Behavior below).
  • Backend rotation — Each retry attempt selects a different backend when multiple backends are available for the requested model, maximizing the chance of success.
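
The attempt loop above can be sketched as follows (hypothetical code; `MAX_ATTEMPTS` mirrors BACKEND_RETRY_MAX_ATTEMPTS, and the exception names are illustrative stand-ins for 4xx vs 5xx/transport failures):

```python
MAX_ATTEMPTS = 3                  # default for BACKEND_RETRY_MAX_ATTEMPTS

class ClientError(Exception): ...   # 4xx: never retried
class BackendError(Exception): ...  # 5xx / timeout / connection failure

def send_with_retry(backends, send):
    last_err = None
    for attempt in range(MAX_ATTEMPTS):
        backend = backends[attempt % len(backends)]   # rotate across backends
        try:
            return send(backend)
        except ClientError:
            raise                     # fail fast: client errors return immediately
        except BackendError as e:
            last_err = e              # retryable: try the next backend
    raise last_err

calls = []
def flaky(b):
    calls.append(b)
    if b == "b1":
        raise BackendError("503")
    return f"ok from {b}"

assert send_with_retry(["b1", "b2"], flaky) == "ok from b2"
assert calls == ["b1", "b2"]          # failed on b1, rotated to b2
```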

Mid-Stream Error Behavior

Once streaming has begun and the first chunk has been sent to the client, a backend failure terminates the SSE stream immediately with no error event. The client receives an abrupt end-of-stream. Retries are not possible after the first chunk because the response format has already been committed.


Circuit Breaker

MindRouter uses a per-backend circuit breaker to avoid routing requests to backends that are experiencing persistent failures.

  • Trip threshold — After 3 consecutive failures (configurable), the backend is marked as "open" and excluded from routing for 30 seconds (configurable).
  • Integration with retry — The circuit breaker works alongside retry. Backends with an open circuit are skipped during failover selection, so retries are directed to healthier backends.
  • Automatic recovery — After the exclusion window expires, the backend is eligible for routing again. A successful request resets the failure counter.
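
A minimal breaker with the defaults above (trip after 3 consecutive failures, exclude for 30 seconds) could look like this; the class is a sketch, not MindRouter's actual implementation.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, recovery_seconds=30.0):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def allow(self, now=None):
        """False while open; True again once the exclusion window expires."""
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        return now - self.opened_at >= self.recovery_seconds

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic() if now is None else now

    def record_success(self):
        self.failures = 0                 # success resets the counter
        self.opened_at = None

cb = CircuitBreaker()
for _ in range(3):
    cb.record_failure(now=100.0)
assert not cb.allow(now=110.0)            # open: skipped during failover
assert cb.allow(now=131.0)                # window elapsed: probe allowed
cb.record_success()
assert cb.allow(now=131.0) and cb.failures == 0
```

During failover, the scheduler would simply filter candidate backends through `allow()` before scoring them.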

Rate Limiting

Note: RPM and concurrent request rate limiting are defined in the codebase but not currently enforced. The rate limiter middleware is not registered in the application. Only token quota (monthly budget) enforcement is active, returning HTTP 429 when the token budget is exceeded.

  • Requests per minute (RPM) — Configurable per group but not yet enforced at runtime.
  • Concurrent request cap — Configurable per group but not yet enforced at runtime.
  • Token quota (active) — Monthly token budget is enforced. Returns HTTP 429 when exceeded. Clients should implement exponential backoff.

Tool Calling

MindRouter supports tool calling (function calling) across all three API formats, with transparent translation between them.

  • Tool definitions — Pass an OpenAI-style tools array describing available functions, with optional tool_choice ("auto", "none", or a specific function name).
  • Tool results — Submit results back via role: "tool" messages with a matching tool_call_id.
  • Cross-format support — Tool calls work whether the request arrives in OpenAI, Ollama, or Anthropic format. MindRouter translates tool definitions, tool calls, and tool results between all formats automatically.
  • Backend requirement — The selected backend must support tool calling. For vLLM, this requires --enable-auto-tool-choice and the appropriate --tool-call-parser flag. For Ollama, tool calling is supported natively on compatible models.

Request ID

Every inference response includes a unique request ID for tracing and debugging.

  • Auto-generated IDs — MindRouter generates request IDs with format-specific prefixes: chatcmpl-* for chat completions, cmpl-* for text completions, embd-* for embeddings, etc.
  • Custom IDs — You can provide your own request ID by setting the X-Request-ID header. When provided, MindRouter uses your ID instead of generating one, making it easier to correlate requests across your systems.

Translation Layer

MindRouter's translation layer enables cross-engine routing: a request arriving in OpenAI, Ollama, or Anthropic format can be served by any Ollama or vLLM backend. All translation passes through a canonical internal schema.

Request Flow

OpenAI Request    --> OpenAIInTranslator    --> CanonicalChatRequest
                                                       |
Ollama Request    --> OllamaInTranslator    --> CanonicalChatRequest
                                                       |
Anthropic Request --> AnthropicInTranslator --> CanonicalChatRequest
                                                       |
                                                       v
                                                [Scheduler selects backend]
                                                       |
                           +---------------------------+-------------------+
                           v                                               v
                 OllamaOutTranslator                             VLLMOutTranslator
                 (Ollama backend)                                (vLLM backend, OpenAI format)

Canonical Schemas

The canonical internal representation (backend/app/core/canonical_schemas.py) includes:

  • CanonicalChatRequest — model, messages, temperature, top_p, max_tokens, stream, tools, tool_choice, response_format, think (bool or string), reasoning_effort, etc.
  • CanonicalMessage — role (system/user/assistant/tool), content (text or multimodal content blocks, nullable), tool_calls, tool_call_id
  • ContentBlock — TextContent, ImageUrlContent, or ImageBase64Content
  • CanonicalToolCall / CanonicalFunctionCall — tool call with id, function name, and arguments (JSON string)
  • CanonicalToolDefinition — tool definition with function name, description, and parameters schema
  • CanonicalEmbeddingRequest — model, input, encoding_format, dimensions
  • CanonicalChatResponse / CanonicalStreamChunk — response and streaming types (including tool call deltas)

Key Translation Mappings

| Concept | OpenAI | Ollama | Anthropic | Canonical |
|---|---|---|---|---|
| Max tokens | max_completion_tokens or max_tokens | options.num_predict | max_tokens (required) | max_tokens |
| Stream default | false | true | false | — |
| System prompt | messages with role: system | messages with role: system | Top-level system field | CanonicalMessage(role=SYSTEM) |
| Stop sequences | stop | options.stop | stop_sequences | stop |
| JSON schema | response_format | format: {schema} | output_config.format | response_format |
| Parameters | Top-level fields | options dict | Top-level fields | Top-level fields |
| Images | image_url block | images array (base64) | image block with source | ImageBase64Content / ImageUrlContent |
| Tool definitions | tools | tools | tools (with input_schema) | tools (CanonicalToolDefinition) |
| Tool choice | tool_choice | — | tool_choice (auto/any/tool) | tool_choice |
| Tool calls | tool_calls (JSON string args) | tool_calls (dict args) | tool_use content blocks | CanonicalToolCall (JSON string args) |
| Tool results | role: "tool" + tool_call_id | — | tool_result content blocks | CanonicalMessage(role=TOOL, tool_call_id) |
| Thinking mode | think, thinking.type, chat_template_kwargs, reasoning_effort | think (bool or "low"/"medium"/"high") | thinking.type (enabled/disabled) | think (bool or string) |
| User ID | user | — | metadata.user_id | user |
| Stream format | SSE (data: {...}) | NDJSON | SSE (Anthropic events) | CanonicalStreamChunk |

Translators

| Translator | Direction | Purpose |
|---|---|---|
| OpenAIInTranslator | API → Canonical | Translates incoming OpenAI-format requests |
| OllamaInTranslator | API → Canonical | Translates incoming Ollama-format requests |
| AnthropicInTranslator | API → Canonical | Translates incoming Anthropic Messages API requests; also formats responses and SSE stream events back to Anthropic format |
| OllamaOutTranslator | Canonical → Backend | Translates outgoing requests to Ollama backends |
| VLLMOutTranslator | Canonical → Backend | Translates outgoing requests to vLLM backends |

All translators use static methods — no instantiation needed.

Model-Specific Behaviors

  • Qwen3-32B on vLLM — This model embeds <think>...</think> tags directly in the content field instead of using the reasoning_content field. MindRouter automatically extracts these tags and moves the reasoning text to the canonical reasoning field for both streaming and non-streaming responses.
  • Qwen3.5 on vLLM with thinking disabled — When thinking is disabled but the model still returns reasoning content with an empty content field, MindRouter promotes the reasoning content to the content field so the response is not blank.

Telemetry & Monitoring

GPU Sidecar Agent

Each GPU node runs a lightweight FastAPI sidecar agent (sidecar/gpu_agent.py) that exposes per-GPU hardware metrics:

Collected metrics per GPU:

  • Utilization (GPU % and memory %)
  • Memory (used/free/total GB)
  • Temperature (GPU and memory)
  • Power draw and limit (watts)
  • Fan speed, SM/memory clocks
  • Running processes (PID + memory)
  • Device identity (name, UUID, compute capability)
  • Driver and CUDA versions

Authentication: Requires SIDECAR_SECRET_KEY env var. All requests must include X-Sidecar-Key header (constant-time comparison).

Prerequisites: Each GPU node must have NVIDIA drivers and the NVIDIA Container Toolkit installed so Docker can access GPUs via --gpus all:

# RHEL/Rocky Linux
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit

# Debian/Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure and restart Docker
sudo nvidia-ctk runtime configure --driver=docker
sudo systemctl restart docker

# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Deployment options:

  1. Docker Compose — docker compose --profile gpu up gpu-sidecar
  2. Standalone Docker — Build from sidecar/Dockerfile.sidecar, run with --gpus all
  3. Direct Python — pip install fastapi uvicorn nvidia-ml-py && python sidecar/gpu_agent.py

Health Polling

The Backend Registry runs an adaptive polling loop:

  • Startup fast polls: On container start, two immediate full poll cycles run (with a 5-second gap) so backends and nodes are marked healthy within seconds of a restart instead of waiting for the normal 30-second cycle.
  • Normal interval: 30 seconds (configurable via BACKEND_POLL_INTERVAL)
  • Fast interval: 10 seconds after a backend becomes unhealthy (configurable)
  • Fast duration: 120 seconds before returning to normal polling

Each poll cycle has two phases:

  1. Poll sidecar agents (one per physical node) for GPU snapshots
  2. Poll each backend adapter for health, models, and engine-specific telemetry

Health Alerts

The admin dashboard (/admin) displays a prominent warning banner when any backend is unhealthy/unknown or any node is offline/unknown. The alert includes counts and names of affected items with direct links to the backends or nodes management pages. Intentionally disabled backends are excluded from the alert — only unexpected health issues are flagged.

Circuit Breaker

Per-backend circuit breaker protects against cascading failures:

  • Threshold: 3 consecutive failures before opening (configurable via BACKEND_CIRCUIT_BREAKER_THRESHOLD)
  • Recovery: 30 seconds before allowing a probe request (BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS)
  • States: Closed (healthy) → Open (failing) → Half-Open (probe) → Closed (recovered)

Latency Tracking

Exponential Moving Average (EMA) tracks per-backend latency:

  • Alpha: 0.3 (30% current observation, 70% history)
  • Metrics: Total latency EMA and TTFT (time-to-first-token) EMA
  • Throughput score: 1.0 / (1.0 + latency_ms / 5000.0) — used in backend scoring
  • Persistence: EMAs are periodically saved to the database for recovery after restart
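
A worked example of the EMA update and throughput score, using the constants above (alpha = 0.3 and the stated formula; the helper names are illustrative):

```python
ALPHA = 0.3   # 30% current observation, 70% history

def update_ema(ema, observed_ms):
    if ema is None:
        return observed_ms            # first observation seeds the EMA
    return ALPHA * observed_ms + (1 - ALPHA) * ema

def throughput_score(latency_ms):
    # The scoring formula from above: 1.0 / (1.0 + latency_ms / 5000.0)
    return 1.0 / (1.0 + latency_ms / 5000.0)

ema = update_ema(None, 1000.0)
ema = update_ema(ema, 2000.0)         # 0.3 * 2000 + 0.7 * 1000 = 1300
assert ema == 1300.0
assert throughput_score(5000.0) == 0.5
assert throughput_score(0.0) == 1.0
```

The low alpha means a single slow request nudges the EMA rather than swinging it, which keeps backend scoring stable under bursty latency.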

Prometheus Metrics

Scrape /metrics for Prometheus-compatible metrics. See the Health & Metrics Endpoints section for the full list.

Telemetry API

Admin users can access detailed telemetry via the API:

  • Cluster overview — All nodes, backends, GPUs with current metrics
  • Historical data — Time-series with configurable resolution (1m, 5m, 15m, 1h, 6h, 1d)
  • Per-GPU history — Individual GPU device telemetry over time
  • Export — Download telemetry data as JSON or CSV

Redis (Optional)

Redis is an optional dependency configured via REDIS_URL. When available, it serves three roles:

  • Inflight streaming token counting — Atomically tracks tokens currently being streamed across all workers, enabling accurate real-time throughput metrics.
  • Per-user quota token caching — Caches per-user token counters (quota:tokens:{user_id}) for fast atomic increments without hitting the database on every request.
  • Graceful degradation — All features that depend on Redis continue to work without it. Quota enforcement falls back to database queries, and inflight token counts default to zero. No functionality is lost; only caching benefits are reduced.

Chat System

MindRouter includes a built-in chat interface at /chat with full conversation management.

Conversations

  • Each user has their own conversation history
  • Conversations store: title, selected model, creation/update timestamps
  • Users can rename, switch models, or delete conversations
  • Up to 50 conversations shown in the sidebar (most recent first)

Messages

  • Messages include role (user/assistant/system) and content
  • Assistant messages are streamed in real-time
  • Messages can be edited or deleted after creation
  • Attachments are linked to individual messages

File Upload

Supported file types and processing:

| Category | Extensions | Processing |
|---|---|---|
| Images | .jpg, .jpeg, .png, .gif, .webp | Resized to max 1536px, compressed JPEG q85, thumbnail generated |
| Documents | .pdf | Text extracted from all pages, first-page thumbnail generated |
| Documents | .docx | Text extracted from all paragraphs |
| Spreadsheets | .xlsx | All sheets read, formatted as tab-separated text |
| Text files | .txt, .md, .csv, .json, .html, .htm, .log | Read as-is |

Limits:

  • Max upload size: 10 MB (configurable via CHAT_UPLOAD_MAX_SIZE_MB)
  • Artifact storage path: /data/artifacts (configurable via ARTIFACT_STORAGE_PATH)
  • Artifact max size: 50 MB (configurable via ARTIFACT_MAX_SIZE_MB)
  • Artifact retention: 365 days

Storage layout:

/artifacts/YYYY/MM/DD/<sha256_prefix>/<full_sha256>_<uuid>.<ext>
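
A helper building that layout might look like the sketch below. The two-character shard prefix is an assumption (the layout only shows `<sha256_prefix>`), and the function is hypothetical, not MindRouter's actual code.

```python
import hashlib
import uuid
from datetime import date

def artifact_path(data, ext, today):
    """Build /artifacts/YYYY/MM/DD/<sha256_prefix>/<full_sha256>_<uuid>.<ext>."""
    sha = hashlib.sha256(data).hexdigest()
    name = f"{sha}_{uuid.uuid4().hex}.{ext}"
    # Assumed 2-char shard prefix to keep per-directory entry counts bounded.
    return f"/artifacts/{today:%Y/%m/%d}/{sha[:2]}/{name}"

p = artifact_path(b"hello", "png", date(2025, 6, 1))
assert p.startswith("/artifacts/2025/06/01/2c/")   # sha256("hello") starts 2cf2...
assert p.endswith(".png")
```

Content-addressing by SHA-256 makes duplicate uploads easy to detect, while the UUID suffix keeps names unique even for identical content.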

Multimodal Model Support

  • Models with multimodal capability are automatically detected by name patterns (e.g., llava, -vl-, vision)
  • When images are sent to a multimodal model, they are included as base64-encoded content blocks
  • When images are sent to a non-multimodal model, they are replaced with a placeholder: [Image omitted -- model does not support multimodal input: filename]
  • A warning modal is shown in the chat UI when uploading images to a non-multimodal model
  • Admins can override multimodal detection per model via the Models admin page
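
The name-pattern detection plus admin override can be sketched as below (the pattern tuple uses only the examples named above and is illustrative; the real list and the per-model override live in the database):

```python
VISION_PATTERNS = ("llava", "-vl-", "vision")   # example patterns from the docs

def is_multimodal(model_name, admin_override=None):
    if admin_override is not None:
        return admin_override          # per-model admin setting always wins
    name = model_name.lower()
    return any(p in name for p in VISION_PATTERNS)

assert is_multimodal("llava:13b")
assert is_multimodal("Qwen2-VL-7B-Instruct")     # matches "-vl-" case-insensitively
assert not is_multimodal("llama3:8b")
assert is_multimodal("llama3:8b", admin_override=True)
```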

Streaming

Chat responses are streamed in real-time:

  • Backend streaming uses NDJSON (Ollama) or SSE (vLLM/OpenAI)
  • The chat UI renders tokens as they arrive
  • TTFT (time-to-first-token) is tracked for latency monitoring
  • If the client disconnects, the backend request is not cancelled (to prevent DB corruption)

Conversation Retention

Conversations older than CONVERSATION_RETENTION_DAYS (default 730 days / 2 years) are automatically purged by a background task that runs every CONVERSATION_CLEANUP_INTERVAL seconds (default 86400 / once per day). Associated messages and attachments are deleted along with the conversation.


Blog / CMS

MindRouter includes a built-in blog system with public viewing and admin content management. Blog posts are written in Markdown and rendered to HTML with syntax highlighting, tables, fenced code blocks, and table of contents support.

Public Blog

  • Blog listing at /blog — displays all published posts, most recent first
  • Individual posts at /blog/{slug} — renders the post's Markdown content as styled HTML
  • Posts are accessible without authentication

Admin Blog Management

Admin users can manage blog posts at /admin/blog. The admin interface provides a full CMS workflow:

| Action | Route | Description |
|---|---|---|
| List all posts | GET /admin/blog | View all posts (published, draft, and soft-deleted) with status indicators |
| New post | GET /admin/blog/new | Form to create a new blog post |
| Create post | POST /admin/blog/new | Submit new post with title, slug, Markdown content, excerpt, and publish toggle |
| Edit post | GET /admin/blog/{id}/edit | Edit form for an existing post |
| Update post | POST /admin/blog/{id}/edit | Save changes to a post |
| Toggle publish | POST /admin/blog/{id}/publish | Publish a draft or unpublish a live post |
| Delete post | POST /admin/blog/{id}/delete | Soft-delete a post (not permanently removed from the database) |

Post Fields

  • Title — Display title of the post
  • Slug — URL-safe identifier (auto-generated from title or manually set). Used in the public URL as /blog/{slug}
  • Content — Full post body written in Markdown. Supports fenced code blocks with syntax highlighting, tables, and table of contents
  • Excerpt — Optional short summary shown in the blog listing
  • Published — Toggle for publish state. The published_at timestamp is set automatically when a post is first published
  • Author — Automatically set to the admin user who creates the post

Soft Delete

Deleting a post is a soft delete — the post is marked as deleted but remains in the database. Soft-deleted posts are hidden from the public blog listing but still visible in the admin list.


AppConfig System

MindRouter stores runtime configuration in a key-value AppConfig database table. Values are JSON-encoded and can be read or written at any time without restarting the application. This powers several features that need runtime-adjustable settings.

API

Configuration is accessed via two async CRUD functions in backend/app/db/crud.py:

  • get_config_json(db, key, default) — Read a config value, JSON-decoded. Returns default if the key does not exist.
  • set_config(db, key, value, description=None) — Upsert a config value (JSON-encoded). Creates the key if it does not exist, updates it otherwise.

Known Configuration Keys

| Key | Type | Default | Description | Managed Via |
|---|---|---|---|---|
| app.timezone | string | None | Display timezone for the dashboard (e.g. "America/Los_Angeles") | Site Settings |
| ollama.enforce_num_ctx | bool | true | Whether MindRouter overrides user-supplied num_ctx values | Site Settings |
| chat.core_models | list | [] | Subset of models shown in the chat interface model selector. Empty means show all. | Chat Configuration |
| chat.default_model | string | null | Pre-selected model in the chat interface | Chat Configuration |
| chat.system_prompt | string | null | Global system prompt prepended to all chat conversations. Blank removes the override. | Chat Configuration |
| chat.max_tokens | int | 16384 | Default max tokens for chat requests (range: 256–131072) | Chat Configuration |
| chat.temperature | float | null | Default temperature for chat requests (range: 0.0–2.0). Null means use model default. | Chat Configuration |
| chat.think | bool/string | null | Default thinking mode for chat: true, false, "low", "medium", "high", or null (no default) | Chat Configuration |

All AppConfig values take effect immediately upon save — no application restart is required. These settings are managed via the Admin Dashboard under Site Settings and Chat Configuration.


Configuration Reference

All settings are loaded from environment variables or .env / .env.prod files. Variable names are case-insensitive.

Application

| Variable | Type | Default | Description |
|---|---|---|---|
| APP_NAME | str | MindRouter | Application name |
| APP_VERSION | str | (from pyproject.toml) | Application version (read dynamically at startup) |
| DEBUG | bool | false | Enable debug mode |
| RELOAD | bool | false | Auto-reload on code changes (development) |

Database

| Variable | Type | Default | Description |
|---|---|---|---|
| DATABASE_URL | str | mysql+pymysql://... | MariaDB/MySQL connection string |
| DATABASE_POOL_SIZE | int | 30 | Connection pool size |
| DATABASE_MAX_OVERFLOW | int | 20 | Max overflow connections beyond pool |
| DATABASE_ECHO | bool | false | Log SQL queries |

Cache

| Variable | Type | Default | Description |
|---|---|---|---|
| REDIS_URL | str | None | Redis connection string (optional) |

Security

| Variable | Type | Default | Description |
|---|---|---|---|
| SECRET_KEY | str | dev-secret-key-... | JWT/session signing key (change in production) |
| JWT_ALGORITHM | str | HS256 | JWT signing algorithm |
| JWT_EXPIRATION_HOURS | int | 24 | JWT token lifetime |
| SESSION_COOKIE_NAME | str | mindrouter_session | Session cookie name |
| SESSION_COOKIE_SECURE | bool | false | HTTPS-only cookies |
| SESSION_COOKIE_HTTPONLY | bool | true | JavaScript-inaccessible cookies |
| SESSION_COOKIE_SAMESITE | str | lax | SameSite cookie policy |
| API_KEY_HASH_ALGORITHM | str | argon2 | API key hashing algorithm |

Azure AD SSO (Optional)

| Variable | Type | Default | Description |
|---|---|---|---|
| AZURE_AD_CLIENT_ID | str | None | Azure AD application (client) ID |
| AZURE_AD_CLIENT_SECRET | str | None | Azure AD client secret |
| AZURE_AD_TENANT_ID | str | None | Azure AD tenant ID |
| AZURE_AD_REDIRECT_URI | str | https://<host>/login/azure/authorized | OAuth2 redirect URI |
| AZURE_AD_DEFAULT_GROUP | str | other | Default group for new Azure AD users |

When AZURE_AD_CLIENT_ID and AZURE_AD_TENANT_ID are set, a "Sign in with Microsoft" button appears on the login page. Users are JIT-provisioned on first login — their jobTitle from Microsoft Graph determines group assignment (student/faculty/staff/other). Pre-existing accounts are linked by email.

Azure AD Group Mapping

Group assignment uses substring matching on the user's jobTitle field from Microsoft Graph:

  • jobTitle contains "student" → students group
  • jobTitle contains "faculty" or "professor" → faculty group
  • jobTitle contains "staff" → staff group
  • No match → falls back to AZURE_AD_DEFAULT_GROUP (default: other)

Web Search (Optional)

| Variable | Type | Default | Description |
|---|---|---|---|
| BRAVE_SEARCH_API_KEY | str | None | Brave Search API key (enables web search in chat) |
| BRAVE_SEARCH_MAX_RESULTS | int | 5 | Maximum search results to inject as context |

When configured, a search toggle appears in the chat input area. Enabling it queries the Brave Search API with the user's message and injects results into the system prompt as additional context.
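The injection step can be sketched as follows. This is a hypothetical illustration: the helper name, the result fields (`title`, `snippet`, `url`), and the prompt layout are assumptions, not MindRouter's actual format.

```python
def build_system_prompt(base_prompt: str, results: list[dict],
                        max_results: int = 5) -> str:
    """Sketch: fold web search results into the system prompt as context."""
    lines = [base_prompt, "", "Web search results:"]
    for r in results[:max_results]:  # cap mirrors BRAVE_SEARCH_MAX_RESULTS
        lines.append(f"- {r['title']}: {r['snippet']} ({r['url']})")
    return "\n".join(lines)
```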

Conversation Retention

| Variable | Type | Default | Description |
|---|---|---|---|
| CONVERSATION_RETENTION_DAYS | int | 730 | Conversation retention period (2 years) |
| CONVERSATION_CLEANUP_INTERVAL | int | 86400 | Cleanup interval in seconds (24 hours) |

Artifact Storage

| Variable | Type | Default | Description |
|---|---|---|---|
| ARTIFACT_STORAGE_PATH | str | /data/artifacts | File storage directory |
| ARTIFACT_MAX_SIZE_MB | int | 50 | Max artifact file size |
| ARTIFACT_RETENTION_DAYS | int | 365 | Artifact retention period |

Quotas

Quota defaults are now managed per-group in the database via /admin/groups. The environment variables below are deprecated (used only for initial migration seeding) and will be removed in a future release.

| Variable | Type | Default | Description |
|---|---|---|---|
| DEFAULT_TOKEN_BUDGET_STUDENT | int | 100000 | Deprecated — use group defaults |
| DEFAULT_TOKEN_BUDGET_STAFF | int | 500000 | Deprecated — use group defaults |
| DEFAULT_TOKEN_BUDGET_FACULTY | int | 1000000 | Deprecated — use group defaults |
| DEFAULT_TOKEN_BUDGET_ADMIN | int | 10000000 | Deprecated — use group defaults |
| DEFAULT_RPM_STUDENT | int | 30 | Deprecated — use group defaults |
| DEFAULT_RPM_STAFF | int | 60 | Deprecated — use group defaults |
| DEFAULT_RPM_FACULTY | int | 120 | Deprecated — use group defaults |
| DEFAULT_RPM_ADMIN | int | 1000 | Deprecated — use group defaults |
| DEFAULT_MAX_CONCURRENT_STUDENT | int | 2 | Deprecated — use group defaults |
| DEFAULT_MAX_CONCURRENT_STAFF | int | 4 | Deprecated — use group defaults |
| DEFAULT_MAX_CONCURRENT_FACULTY | int | 8 | Deprecated — use group defaults |
| DEFAULT_MAX_CONCURRENT_ADMIN | int | 50 | Deprecated — use group defaults |

Scheduler

Scheduler weights are now managed per-group in the database. The per-role weight variables below are deprecated and will be removed in a future release.

| Variable | Type | Default | Description |
|---|---|---|---|
| SCHEDULER_WEIGHT_STUDENT | int | 1 | Deprecated — use group scheduler_weight |
| SCHEDULER_WEIGHT_STAFF | int | 2 | Deprecated — use group scheduler_weight |
| SCHEDULER_WEIGHT_FACULTY | int | 3 | Deprecated — use group scheduler_weight |
| SCHEDULER_WEIGHT_ADMIN | int | 10 | Deprecated — use group scheduler_weight |
| SCHEDULER_FAIRNESS_WINDOW | int | 300 | Fairness tracking window (seconds) |
| SCHEDULER_DEPRIORITIZE_THRESHOLD | float | 0.5 | Usage threshold for deprioritization |
| SCHEDULER_SCORE_MODEL_LOADED | int | 100 | Score bonus for pre-loaded model |
| SCHEDULER_SCORE_LOW_UTILIZATION | int | 50 | Score bonus for low GPU utilization |
| SCHEDULER_SCORE_LATENCY | int | 40 | Score factor for low latency |
| SCHEDULER_SCORE_SHORT_QUEUE | int | 30 | Score factor for short queue |
| SCHEDULER_SCORE_HIGH_THROUGHPUT | int | 20 | Score factor for high throughput |
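As an illustration of how these score terms could combine when ranking candidate backends, here is a hypothetical sketch; the real selection logic may normalize and weight signals differently.

```python
# Default score weights (mirroring the SCHEDULER_SCORE_* settings above).
WEIGHTS = {
    "model_loaded": 100,      # SCHEDULER_SCORE_MODEL_LOADED
    "low_utilization": 50,    # SCHEDULER_SCORE_LOW_UTILIZATION
    "latency": 40,            # SCHEDULER_SCORE_LATENCY
    "short_queue": 30,        # SCHEDULER_SCORE_SHORT_QUEUE
    "high_throughput": 20,    # SCHEDULER_SCORE_HIGH_THROUGHPUT
}

def score_backend(signals: dict[str, float]) -> int:
    """Hypothetical ranking score: each signal is a 0.0-1.0 value saying how
    well the backend satisfies that criterion; missing signals count as 0."""
    return round(sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items()))
```

A backend that already has the model loaded and sits at low utilization would then outrank an idle backend that must first load the model.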

Latency Tracking

| Variable | Type | Default | Description |
|---|---|---|---|
| LATENCY_EMA_ALPHA | float | 0.3 | EMA smoothing factor |
| LATENCY_EMA_PERSIST_INTERVAL | int | 30 | EMA persistence interval (seconds) |
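The EMA update itself is the standard exponential-moving-average formula, sketched here with the default LATENCY_EMA_ALPHA (the function name is illustrative):

```python
def update_latency_ema(previous_ema: float, sample: float,
                       alpha: float = 0.3) -> float:
    """Standard EMA: higher alpha weights recent samples more heavily."""
    return alpha * sample + (1 - alpha) * previous_ema
```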

Backend Registry

| Variable | Type | Default | Description |
|---|---|---|---|
| BACKEND_POLL_INTERVAL | int | 30 | Health check interval (seconds) |
| BACKEND_HEALTH_TIMEOUT | int | 5 | Health check timeout (seconds) |
| BACKEND_UNHEALTHY_THRESHOLD | int | 3 | Failed checks before marking unhealthy |
| BACKEND_CIRCUIT_BREAKER_THRESHOLD | int | 3 | Failures before circuit opens |
| BACKEND_CIRCUIT_BREAKER_RECOVERY_SECONDS | int | 30 | Circuit breaker recovery time |
| BACKEND_ADAPTIVE_POLL_FAST_INTERVAL | int | 10 | Fast poll interval after unhealthy |
| BACKEND_ADAPTIVE_POLL_FAST_DURATION | int | 120 | Duration of fast polling (seconds) |
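The two circuit-breaker settings imply behavior along these lines. This is a minimal sketch of the pattern, not MindRouter's actual implementation:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial request
    again once `recovery_seconds` have elapsed (half-open)."""

    def __init__(self, threshold: int = 3, recovery_seconds: float = 30.0):
        self.threshold = threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()  # trip the breaker

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the breaker

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: only allow once the recovery window has passed
        return time.monotonic() - self.opened_at >= self.recovery_seconds
```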

Request Handling

| Variable | Type | Default | Description |
|---|---|---|---|
| MAX_REQUEST_SIZE | int | 52428800 | Max HTTP request body (50 MB) |
| BACKEND_REQUEST_TIMEOUT | int | 300 | Total request timeout (seconds) |
| BACKEND_REQUEST_TIMEOUT_PER_ATTEMPT | int | 180 | Per-attempt timeout (seconds, high for large model prefills) |
| BACKEND_RETRY_MAX_ATTEMPTS | int | 3 | Max total retry attempts |
| BACKEND_RETRY_ATTEMPTS | int | 2 | Default retry attempts (deprecated — not currently used by retry logic) |
| BACKEND_RETRY_BACKOFF | float | 1.0 | Retry backoff multiplier (deprecated — not currently used by retry logic) |
| STRUCTURED_OUTPUT_RETRY_ON_INVALID | bool | true | Retry on invalid structured output |
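A sketch of how the retry and total-timeout settings can interact is shown below. The helper name and error handling are illustrative assumptions, not the actual routing code:

```python
import time

def call_with_retries(send, max_attempts: int = 3,
                      total_timeout: float = 300.0):
    """Retry up to max_attempts (BACKEND_RETRY_MAX_ATTEMPTS) while staying
    inside the overall deadline (BACKEND_REQUEST_TIMEOUT)."""
    deadline = time.monotonic() + total_timeout
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() >= deadline:
            break  # out of total budget; stop retrying
        try:
            return send(attempt)  # per-attempt timeout handled inside send()
        except Exception as exc:  # a real client would narrow this
            last_error = exc
    raise last_error if last_error else TimeoutError("no attempts made")
```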

Logging

| Variable | Type | Default | Description |
|---|---|---|---|
| LOG_LEVEL | str | INFO | Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
| LOG_FORMAT | str | json | Log format (json or text) |
| LOG_FILE | str | None | Log file path (optional, stdout if not set) |

Audit Logging

| Variable | Type | Default | Description |
|---|---|---|---|
| AUDIT_LOG_ENABLED | bool | true | Enable audit logging |
| AUDIT_LOG_PROMPTS | bool | true | Log user prompts |
| AUDIT_LOG_RESPONSES | bool | true | Log LLM responses |

Telemetry & GPU

| Variable | Type | Default | Description |
|---|---|---|---|
| TELEMETRY_RETENTION_DAYS | int | 30 | Telemetry data retention period |
| TELEMETRY_CLEANUP_INTERVAL | int | 3600 | Cleanup interval (seconds) |
| SIDECAR_TIMEOUT | int | 5 | Sidecar HTTP call timeout (seconds) |

Observability

| Variable | Type | Default | Description |
|---|---|---|---|
| METRICS_ENABLED | bool | true | Enable Prometheus metrics |
| METRICS_PREFIX | str | mindrouter | Metrics name prefix |
| OTEL_ENABLED | bool | false | Enable OpenTelemetry |
| OTEL_EXPORTER_OTLP_ENDPOINT | str | None | OpenTelemetry exporter endpoint |

CORS

| Variable | Type | Default | Description |
|---|---|---|---|
| CORS_ORIGINS | list | ["http://localhost:3000", "http://localhost:8000"] | Allowed origins (JSON array or comma-separated) |

Chat UI

| Variable | Type | Default | Description |
|---|---|---|---|
| CHAT_FILES_PATH | str | /data/chat_files | Chat file upload directory |
| CHAT_UPLOAD_MAX_SIZE_MB | int | 10 | Max upload file size (MB) |
| CHAT_UPLOAD_ALLOWED_EXTENSIONS | list | See below | Allowed upload file extensions |

Default allowed extensions: .txt, .md, .csv, .json, .html, .htm, .log, .docx, .xlsx, .pdf, .jpg, .jpeg, .png, .gif, .webp
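An allow-list check equivalent to this default can be sketched as follows (the helper name is hypothetical; MindRouter's actual validation may also enforce size and MIME-type limits):

```python
from pathlib import Path

# Default CHAT_UPLOAD_ALLOWED_EXTENSIONS, as listed above
ALLOWED_EXTENSIONS = {
    ".txt", ".md", ".csv", ".json", ".html", ".htm", ".log",
    ".docx", ".xlsx", ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".webp",
}

def is_allowed_upload(filename: str) -> bool:
    """Case-insensitive extension check against the allow-list."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```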

Tokenizer

| Variable | Type | Default | Description |
|---|---|---|---|
| DEFAULT_TOKENIZER | str | cl100k_base | Default tokenizer encoding |

Status Enums Reference

MindRouter uses the following status enumerations throughout the system:

BackendStatus

  • HEALTHY — Backend is online and accepting requests
  • UNHEALTHY — Backend is failing health checks
  • DISABLED — Backend has been manually disabled by an admin
  • DRAINING — Backend is finishing in-flight requests but not accepting new ones
  • UNKNOWN — Backend status has not yet been determined

NodeStatus

  • ONLINE — Node sidecar is reachable and reporting metrics
  • OFFLINE — Node sidecar is unreachable
  • UNKNOWN — Node status has not yet been determined

RequestStatus

  • QUEUED — Request is waiting for a backend slot
  • PROCESSING — Request is being handled by a backend
  • COMPLETED — Request finished successfully
  • FAILED — Request encountered an error
  • CANCELLED — Request was cancelled (e.g., client disconnect)

Production Security Hardening

When deploying MindRouter to production, apply the following security measures:

  • Secure session cookies — Set SESSION_COOKIE_SECURE=True to ensure session cookies are only sent over HTTPS.
  • Security headers — Add Strict-Transport-Security (HSTS), X-Frame-Options, and Content-Security-Policy (CSP) headers at the reverse proxy layer (e.g., nginx).
  • Restrict CORS origins — Set CORS_ORIGINS to only the specific domains that need API access. Do not use * in production.

Deployment

MindRouter is designed for deployment on Linux servers with NVIDIA GPUs. The full deployment guide covers:

  • Rocky Linux 8 prerequisites and dependency installation
  • SSL/TLS configuration (self-signed and Let's Encrypt)
  • Apache reverse proxy setup
  • Firewall and SELinux configuration
  • Docker Compose production stack
  • Database migrations
  • GPU sidecar agent deployment
  • Node and backend registration
  • Verification and ongoing operations

GPU Sidecar Deployment

The GPU sidecar agent runs on each inference node to expose per-GPU hardware metrics and enable auto-discovery of inference endpoints. Build and deploy directly from GitHub — no need to clone the repository.

Create the env file (once per node)

sudo mkdir -p /etc/mindrouter
python3 -c "import secrets; print('SIDECAR_SECRET_KEY=' + secrets.token_hex(32))" | sudo tee /etc/mindrouter/sidecar.env
sudo chmod 600 /etc/mindrouter/sidecar.env

Build a specific version

docker build -t mindrouter-sidecar:v2.0.0 \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar

Build latest from master

docker build -t mindrouter-sidecar:latest \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git:sidecar

Run the sidecar

# Run bound to localhost only (nginx will proxy external traffic)
docker run -d --name gpu-sidecar \
  --gpus all \
  -p 127.0.0.1:18007:8007 \
  --env-file /etc/mindrouter/sidecar.env \
  --restart unless-stopped \
  mindrouter-sidecar:v2.0.0

Upgrade to a new version

docker build -t mindrouter-sidecar:v2.0.0 \
  -f Dockerfile.sidecar \
  https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
docker stop gpu-sidecar && docker rm gpu-sidecar
docker run -d --name gpu-sidecar \
  --gpus all \
  -p 127.0.0.1:18007:8007 \
  --env-file /etc/mindrouter/sidecar.env \
  --restart unless-stopped \
  mindrouter-sidecar:v2.0.0

In production, bind the sidecar to localhost only (as shown above) and use an nginx reverse proxy on port 8007 to handle external traffic. The env file at /etc/mindrouter/sidecar.env persists the secret key across upgrades — generate it once per node and it's reused automatically.
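A minimal nginx site config for that proxy arrangement might look like this (server name and certificate paths are placeholders; adapt to your node):

```nginx
# Hypothetical example: TLS-terminate on :8007 and proxy to the
# localhost-bound sidecar container on 127.0.0.1:18007.
server {
    listen 8007 ssl;
    server_name gpu-node.example.edu;

    ssl_certificate     /etc/pki/tls/certs/gpu-node.crt;
    ssl_certificate_key /etc/pki/tls/private/gpu-node.key;

    location / {
        proxy_pass http://127.0.0.1:18007;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```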


Testing

MindRouter has a comprehensive test suite covering unit, integration, end-to-end, smoke, stress, and accessibility tests.

Quick Reference

| Command | Description |
|---|---|
| make test-unit | Run unit tests (525+ tests) |
| make test-int | Integration tests (requires live backends) |
| make test-e2e | End-to-end tests |
| make test-smoke | Smoke tests (full API surface) |
| make test-stress | Load/stress tests |
| make test-a11y | WCAG 2.1 accessibility tests |
| make test-sidecar | GPU sidecar agent tests |
| make test-all | Run all test suites |