Generate QA test datasets from your docs. Score your RAG with LLM-as-judge. Privacy-first: works fully offline with Ollama.
# Step 1: Generate QA pairs from your docs
$ ragscore generate docs/
✓ Generated 50 QA pairs → output/generated_qas.jsonl
# Step 2: Evaluate your RAG system
$ ragscore evaluate http://localhost:8000/query
============================================================
✓ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
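For `ragscore evaluate` to reach your RAG, the endpoint has to accept a question and return an answer over HTTP. Below is a minimal sketch of the handler logic, assuming a JSON request with a `question` field and a JSON response with an `answer` field; these field names are illustrative assumptions, not RAGScore's documented contract.

```python
import json

def handle_query(raw_body: bytes) -> bytes:
    """Parse an incoming evaluation request and return a JSON answer.

    Field names ("question", "answer") are assumptions for illustration.
    """
    payload = json.loads(raw_body)
    question = payload["question"]
    # Your retrieval + generation pipeline would run here; this stub
    # just echoes a placeholder answer.
    answer = f"(answer for: {question})"
    return json.dumps({"answer": answer}).encode("utf-8")
```

Wire `handle_query` into whatever HTTP framework you already use; the evaluator only cares about the request/response shape.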
Three steps. No embeddings. No vector database.
1. Point RAGScore at your PDF, TXT, or Markdown files. It reads, chunks, and understands them.
   ragscore generate docs/

2. An LLM creates diverse question-answer pairs with rationale and evidence spans.
   → output/generated_qas.jsonl

3. Each question is sent to your RAG. An LLM-as-judge scores the answers 1-5 across five metrics.
   ragscore evaluate http://rag/query

No complex setup. No cloud required. Just results.
Run fully offline with Ollama. Your documents never leave your machine. Perfect for healthcare, legal, and finance.
Async concurrency with rate limiting. Evaluate 100 QA pairs in minutes, not hours.
OpenAI, Anthropic, Ollama, DeepSeek, Groq, Mistral, DashScope: all auto-detected from your env vars.
Correctness, Completeness, Relevance, Conciseness, Faithfulness: all scored in a single LLM call.
Auto-detects document language and generates QA pairs in the same language. Chinese, English, Japanese, and more.
Use RAGScore directly from Claude Desktop or any MCP-compatible AI assistant.
Works in Jupyter and Google Colab. Rich visualizations with result.plot() and result.df.
Use quick_test() in pytest. Set accuracy thresholds, get pass/fail, catch regressions automatically.
Get a list of incorrect answers with corrections. Inject them into your RAG to improve accuracy.
Python API for notebooks and scripts. CLI for the terminal.
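The async evaluation with rate limiting mentioned above can be sketched as a generic pattern. This is an illustration of the technique only, not RAGScore's internal scheduler; `ask` stands in for any coroutine that queries your RAG once.

```python
import asyncio

async def evaluate_all(questions, ask, max_concurrency=8):
    """Send questions concurrently, capped by a semaphore for rate limiting."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(question):
        async with sem:  # at most max_concurrency requests in flight
            return await ask(question)

    # Results come back in the same order as the input questions.
    return await asyncio.gather(*(one(q) for q in questions))
```

Raising `max_concurrency` trades politeness to the provider for wall-clock speed; this is how 100 QA pairs finish in minutes rather than sequentially.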
# One-liner RAG evaluation
from ragscore import quick_test

result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    detailed=True,
)

result.plot()        # Radar chart
result.df            # pandas DataFrame
result.corrections   # Items to fix
✓ PASSED: 8/10 correct (80%)
Average Score: 4.2/5.0
──────────────────────────────
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 3.9/5.0
Faithfulness: 4.6/5.0
──────────────────────────────
2 corrections available.
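Once a run reports corrections, you can turn them into review notes or extra context for your corpus. A sketch, assuming each correction carries the question, the RAG's answer, and a suggested fix; the field names below are hypothetical, check `result.corrections` for the actual shape.

```python
# Hypothetical correction items; field names are assumptions, not the
# documented RAGScore schema.
corrections = [
    {"question": "What is the refund window?",
     "rag_answer": "60 days",
     "suggested": "30 days, per section 4.2"},
]

def format_corrections(items):
    """Render corrections as plain-text notes to feed back into your RAG corpus."""
    lines = []
    for item in items:
        lines.append(f"Q: {item['question']}")
        lines.append(f"  got:      {item['rag_answer']}")
        lines.append(f"  expected: {item['suggested']}")
    return "\n".join(lines)
```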
RAGScore ships with a built-in MCP server. Use it directly from Claude Desktop or any MCP-compatible AI assistant.
pip install "ragscore[mcp]"

{
  "mcpServers": {
    "ragscore": {
      "command": "ragscore",
      "args": ["serve"]
    }
  }
}
"Generate QA pairs from my docs/ folder and evaluate my RAG at http://localhost:8000/query"
generate_qa_dataset: Generate QA pairs from PDFs, TXT, or Markdown files
evaluate_rag: Score your RAG endpoint against a QA dataset
quick_test_rag: Generate + evaluate in one call with pass/fail
get_corrections: Get incorrect answers with suggested fixes
Auto-detected from your environment variables. Zero config.
From solo developers to enterprise AI teams.
Test your RAG pipeline before deploying. Catch hallucinations and missing context early.
"We caught 15 hallucinations in our legal RAG before going live."
Add quick_test() to your test suite. Fail the build if accuracy drops below threshold.
assert quick_test(endpoint, docs).passed
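A slightly fuller CI gate can add its own accuracy bar on top of the pass/fail flag. The sketch below assumes the result object also exposes `correct` and `total` counts; only `.passed` is confirmed by the snippet above, so treat those fields as assumptions.

```python
from types import SimpleNamespace

MIN_CORRECT_RATIO = 0.8  # project-specific threshold

def gate(result) -> bool:
    """Pass only when the run succeeded AND clears the extra accuracy bar.

    `result.correct` / `result.total` are assumed fields, not confirmed API.
    """
    return bool(result.passed) and result.correct / result.total >= MIN_CORRECT_RATIO

# Stand-in result object; in a real suite you'd build it via
# quick_test(endpoint, docs) and call gate(...) inside a test function.
demo = SimpleNamespace(passed=True, correct=8, total=10)
```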
Evaluate RAG accuracy across departments. Finance, HR, Legal: each with different accuracy requirements.
Run offline with Ollama. No data leaves your VPC.
Deliver quantified RAG quality reports to clients. Show before/after improvement metrics.
"Accuracy improved from 62% to 91% after 3 rounds."
Get notified about new features, releases, and RAG evaluation tips. No spam.