[Live dashboard: simulated cluster throughput (tok/s, requests/min, active requests) and total tokens served: 512,847,293]

MindRouter

Open-Source LLM Inference Load Balancer

Route inference requests across GPU backends with protocol translation, fair-share scheduling, and full observability. Supports OpenAI, Ollama, and Anthropic APIs.

Multi-Backend
Load Balancing
3 API Protocols
OpenAI · Ollama · Anthropic
GPU Cluster
Management
Fair-Share
Scheduling
Open Source
Apache 2.0

Everything You Need for LLM Inference

MindRouter provides a complete platform for running, managing, and integrating large language models — from a browser-based chat to full OpenAI, Ollama, and Anthropic API compatibility.

Multi-API Support

OpenAI, Ollama & Anthropic APIs

MindRouter speaks three API protocols natively. Use the OpenAI, Ollama, or Anthropic SDK of your choice — MindRouter translates between them and routes to the best available backend.

  • Full /v1/chat/completions OpenAI compatibility
  • Native Ollama API (/api/chat, /api/generate, /api/tags)
  • Anthropic Messages API (/v1/messages)
  • Streaming, tool calling & structured output across all APIs
  • Image, multimodal & embedding support
curl — Chat Completions
curl -X POST https://your-server.example.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
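The Anthropic Messages endpoint takes a different request shape than chat completions. A minimal payload sketch for /v1/messages is shown below; the header names follow Anthropic's public API, and whether MindRouter requires the anthropic-version header is an assumption.

```python
import json

# Anthropic-style Messages request. Note max_tokens is required by the
# Messages API, unlike OpenAI chat completions where it is optional.
payload = {
    "model": "openai/gpt-oss-20b",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
}
# POST to https://your-server.example.com/v1/messages with headers:
#   x-api-key: YOUR_API_KEY
#   anthropic-version: 2023-06-01
body = json.dumps(payload)
```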
Python Integration

Works with the OpenAI Python SDK

Change two lines of code — base_url and api_key — and your existing OpenAI Python code works instantly with MindRouter. No new libraries to learn.

  • Use openai Python package as-is
  • Async support with AsyncOpenAI
  • Streaming with server-sent events
  • Compatible with LangChain, LlamaIndex, etc.
example.py
from openai import OpenAI

client = OpenAI(
    base_url="https://your-server.example.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "user",
         "content": "Write a haiku about GPUs."}
    ],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk's delta content is None
        print(delta, end="", flush=True)
Built-in Chat

Interactive Chat Interface

No setup required — just log in and start chatting. MindRouter includes a full-featured web chat with markdown rendering, code highlighting, file uploads, conversation history, and model switching.

  • Real-time streaming responses
  • Syntax-highlighted code blocks
  • File & image uploads
  • Persistent conversation history
  • Switch models mid-conversation
  • Thinking / reasoning mode
  • Web search integration
Deploy Your Own
MindRouter Chat
openai/gpt-oss-20b
Write a Python function that finds prime numbers using the Sieve of Eratosthenes.

Here's an efficient implementation:

def sieve(n):
    s = [True]*(n+1)
    s[:2] = False, False
    for i in range(2, int(n**0.5)+1):
        if s[i]:
            s[i*i::i] = [False]*len(s[i*i::i])
    return [i for i, p in enumerate(s) if p]

This runs in O(n log log n) time complexity...

Tool Calling

Native Function & Tool Calling

Build AI agents and agentic workflows. MindRouter supports the OpenAI tool calling protocol, enabling models to invoke functions, query databases, call APIs, and orchestrate complex multi-step tasks.

  • OpenAI-compatible tools parameter
  • Parallel tool calls supported
  • Works with LangChain agents
  • Structured JSON Schema outputs
tool_calling.py
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user",
               "content": "What's the weather in Moscow, ID?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    }],
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
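When the model decides to call a tool, the response carries the function name and JSON-encoded arguments; your code executes the function and returns the result in a follow-up tool message. A minimal dispatch sketch (get_weather here is a stand-in implementation, not a MindRouter API):

```python
import json

def get_weather(city: str) -> dict:
    # Stand-in implementation; a real version would query a weather service.
    return {"city": city, "temp_c": 21, "conditions": "clear"}

TOOLS = {"get_weather": get_weather}

def dispatch(name: str, arguments: str) -> str:
    """Run the named tool with JSON-encoded arguments; return a JSON result."""
    result = TOOLS[name](**json.loads(arguments))
    return json.dumps(result)

# In the OpenAI SDK, name and arguments come from
# response.choices[0].message.tool_calls[0].function
tool_result = dispatch("get_weather", '{"city": "Moscow, ID"}')
```

The JSON string returned by dispatch is what you append as a `{"role": "tool", ...}` message before calling the model again.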
Translation Architecture

Universal Protocol Translation

MindRouter's translation layer is the engine behind its flexibility. Rather than locking you into a single API format, it translates seamlessly between OpenAI, Ollama, and Anthropic protocols — on both the client and backend sides.

This means any tool, library, or application that speaks one of these protocols can connect to MindRouter — and MindRouter can route to any backend inference engine, regardless of its native API.

  • OpenAI-compatible — drop-in for any OpenAI SDK or tool
  • Ollama-compatible — works with Ollama clients and libraries
  • Anthropic-compatible — supports Claude SDK and tooling
  • vLLM & Ollama backends — route to any engine transparently
  • Dozens of models — deploy and serve cutting-edge models instantly
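To illustrate the idea (not MindRouter's internal schema), translating an Ollama /api/chat request into the OpenAI chat-completions shape might look like this:

```python
def ollama_to_openai(req: dict) -> dict:
    """Map an Ollama /api/chat request onto the OpenAI chat-completions
    shape. Illustrative sketch only, not MindRouter's internal schema."""
    out = {
        "model": req["model"],
        "messages": req["messages"],
        "stream": req.get("stream", False),
    }
    opts = req.get("options", {})
    if "temperature" in opts:
        out["temperature"] = opts["temperature"]
    if "num_predict" in opts:  # Ollama's max-tokens option
        out["max_tokens"] = opts["num_predict"]
    return out

translated = ollama_to_openai({
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "hi"}],
    "options": {"temperature": 0.7, "num_predict": 256},
})
```

A canonical intermediate form like this is what lets any client protocol pair with any backend engine.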
Translation Architecture
[Diagram] OpenAI, Ollama, and Anthropic client SDKs → canonical schema → MindRouter (Translate · Route · Balance) → vLLM (high-throughput) and Ollama (flexible serving) backends
Documentation & Blog

Comprehensive Docs & Blog

Everything you need to get started and go deep. The documentation covers API endpoints, authentication, model capabilities, rate limits, and integration guides. The blog features updates, tutorials, and best practices.

MindRouter Documentation
  • API Reference: Chat Completions · Embeddings · Models · Authentication
  • Getting Started: Quick Start Guide · Python SDK · cURL Examples · Rate Limits
  • Integrations: LangChain · LlamaIndex · Continue.dev · Open WebUI
  • Advanced: Tool Calling · Structured Output · Streaming · Multimodal
Institutional AI Sovereignty

Your AI. Your Hardware. Your Control.

Every model served by MindRouter runs on University-owned and University-managed GPUs. No data leaves institutional infrastructure. No third-party cloud providers process your prompts. No external APIs see your research, coursework, or sensitive data.

This isn't just a cost decision — it's a strategic one. Institutional AI sovereignty means full control over which models are deployed, how they behave, and where data flows.

Data Security
Prompts and responses never leave University infrastructure. Full FERPA and research data compliance.
Speed & Latency
On-premise GPUs eliminate network round-trips to distant cloud regions.
Power & Sustainability
Direct visibility into energy consumption per model and per request.
Model Control
Choose exactly which models to deploy, audit their behavior, and update on your own schedule.
Full Observability
Complete audit trail of every request. GPU metrics, token usage, and cost accounting.
Predictable Costs
No per-token API bills. Capital investment in GPUs with transparent, fixed operating costs.
University GPU Cluster
  • Node 1: 4x GPU · gpt-oss-120b · ACTIVE
  • Node 2: 4x GPU (tensor parallel) · qwen3.5-400b · ACTIVE
  • Node 3: 4x GPU · gpt-oss-120b / 20b · ACTIVE
  • Node 4: 2x GPU · Multimodal OCR · ACTIVE
On-premise · No cloud dependencies · Full institutional control
Telemetry & Monitoring

Full Cluster Observability

Monitor every aspect of your inference cluster in real time — from GPU utilization and power draw to token throughput and per-request audit trails.

GPU Telemetry

Utilization, memory, temperature, power draw, and fan speed per GPU with configurable time ranges (1h to 30d)

Live Throughput

Real-time token flow visualization, requests per minute, active request counts, and total tokens served

Audit & Export

Full request audit trail with filtering by user, model, status, and date — exportable as CSV, JSON, or JSON with message content included
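As a sketch of what an exported audit trail enables (the column names here are illustrative, not MindRouter's actual export schema), you can aggregate token usage per user in a few lines:

```python
import csv
import io

# Hypothetical CSV export snippet: one row per request.
export = """user,model,status,total_tokens
alice,openai/gpt-oss-20b,ok,512
bob,openai/gpt-oss-20b,ok,128
alice,qwen3.5-400b,error,0
"""

tokens_by_user: dict[str, int] = {}
for row in csv.DictReader(io.StringIO(export)):
    tokens_by_user[row["user"]] = (
        tokens_by_user.get(row["user"], 0) + int(row["total_tokens"])
    )
```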

Enterprise Infrastructure

Built for Production

MindRouter is designed for reliability, observability, and scale — load balancing across GPU backends with intelligent routing and fair-share scheduling.
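The routing idea can be sketched as least-connections selection over healthy backends (an illustrative sketch, not MindRouter's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Backend:
    url: str
    healthy: bool          # maintained by periodic health checks
    active_requests: int

def route(backends: list[Backend]) -> Backend:
    """Least-connections routing restricted to healthy backends."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("all backends down; trigger failover/alerting")
    return min(healthy, key=lambda b: b.active_requests)

pool = [
    Backend("http://gpu-1:8000", healthy=True, active_requests=4),
    Backend("http://gpu-2:8000", healthy=False, active_requests=0),
    Backend("http://gpu-3:8000", healthy=True, active_requests=2),
]
target = route(pool)  # healthy backend with the fewest in-flight requests
```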

Load Balancing

Intelligent routing across multiple GPU backends with health checks and failover

Fair-Share Scheduling

Weighted quotas, rate limiting, and fair access across users and groups

GPU Monitoring

Real-time GPU utilization, memory, temperature, and power metrics across the cluster

Protocol Translation

Seamless translation between OpenAI, Ollama, and Anthropic API formats on both client and backend sides

Observability

Prometheus metrics, structured logging, request audit trail, and admin dashboards

Authentication

API key management, Azure AD SSO, group-based access control, and user quotas
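The fair-share scheduling idea above can be sketched as picking, among users with pending requests, the one whose usage-to-weight ratio is lowest, i.e. the most under-served relative to their quota (illustrative only, not MindRouter's scheduler):

```python
def pick_next(pending: dict[str, int],
              usage: dict[str, float],
              weight: dict[str, float]) -> str:
    """Among users with pending requests, pick the one whose
    usage-to-weight ratio is lowest (most under-served)."""
    eligible = [u for u, n in pending.items() if n > 0]
    return min(eligible, key=lambda u: usage.get(u, 0.0) / weight[u])

# alice has twice bob's weight, so at equal usage alice is scheduled first
winner = pick_next({"alice": 3, "bob": 1},
                   {"alice": 100.0, "bob": 100.0},
                   {"alice": 2.0, "bob": 1.0})
```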

Documentation

Comprehensive guides covering API endpoints, authentication, model capabilities, rate limits, Python SDK integration, tool calling, structured output, and more.

View Full Documentation

Open Source

MindRouter is open-source software released under the Apache 2.0 license. Clone the repo, deploy on your own infrastructure, and contribute back to the project.

View on GitHub Apache 2.0
terminal
$ git clone https://github.com/ui-insight/MindRouter.git
$ cd MindRouter
$ docker compose up -d --build
# MindRouter is now running at http://localhost:8000

Get in Touch

Questions about MindRouter? Interested in deploying it at your institution? Drop us a line.