Generate QA test datasets from your docs. Score your RAG with LLM-as-judge. Privacy-first: works fully offline with Ollama.
# Step 1: Generate QA pairs from your docs
$ ragscore generate docs/
✓ Generated 50 QA pairs → output/generated_qas.jsonl
# Step 2: Evaluate your RAG system
$ ragscore evaluate http://localhost:8000/query
============================================================
✓ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
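For `ragscore evaluate` to reach your RAG, the endpoint has to accept a question and return an answer over HTTP. Below is a minimal sketch of the handler logic, assuming a JSON request with a `question` field and a JSON response with an `answer` field; these field names are illustrative assumptions, not RAGScore's documented contract.

```python
import json

def handle_query(raw_body: bytes) -> bytes:
    """Parse an incoming evaluation request and return a JSON answer.

    Field names ("question", "answer") are assumptions for illustration.
    """
    payload = json.loads(raw_body)
    question = payload["question"]
    # Your retrieval + generation pipeline would run here; this stub
    # just echoes a placeholder answer.
    answer = f"(answer for: {question})"
    return json.dumps({"answer": answer}).encode("utf-8")
```

Wire `handle_query` into whatever HTTP framework you already use; the evaluator only cares about the request/response shape.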
Three steps. No embeddings. No vector database.
1. Point RAGScore at your PDF, TXT, or Markdown files. It reads, chunks, and understands them.
   ragscore generate docs/

2. An LLM creates diverse question-answer pairs with rationale and evidence spans.
   → output/generated_qas.jsonl

3. Each question is sent to your RAG. An LLM-as-judge scores the answers 1-5 across five metrics.
   ragscore evaluate http://rag/query

No complex setup. No cloud required. Just results.
Run fully offline with Ollama. Your documents never leave your machine. Perfect for healthcare, legal, and finance.
Async concurrency with rate limiting. Evaluate 100 QA pairs in minutes, not hours.
OpenAI, Anthropic, Ollama, DeepSeek, Groq, Mistral, DashScope: all auto-detected from your env vars.
Correctness, Completeness, Relevance, Conciseness, Faithfulness: all scored in a single LLM call.
Auto-detects document language and generates QA pairs in the same language. Chinese, English, Japanese, and more.
Use RAGScore directly from Claude Desktop or any MCP-compatible AI assistant.
Works in Jupyter and Google Colab. Rich visualizations with result.plot() and result.df.
Use quick_test() in pytest. Set accuracy thresholds, get pass/fail, catch regressions automatically.
Get a list of incorrect answers with corrections. Inject them into your RAG to improve accuracy.
Python API for notebooks and scripts. CLI for the terminal.
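The async evaluation with rate limiting mentioned above can be sketched as a generic pattern. This is an illustration of the technique only, not RAGScore's internal scheduler; `ask` stands in for any coroutine that queries your RAG once.

```python
import asyncio

async def evaluate_all(questions, ask, max_concurrency=8):
    """Send questions concurrently, capped by a semaphore for rate limiting."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(question):
        async with sem:  # at most max_concurrency requests in flight
            return await ask(question)

    # Results come back in the same order as the input questions.
    return await asyncio.gather(*(one(q) for q in questions))
```

Raising `max_concurrency` trades politeness to the provider for wall-clock speed; this is how 100 QA pairs finish in minutes rather than sequentially.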
# One-liner RAG evaluation
from ragscore import quick_test

result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    detailed=True,
)

result.plot()        # Radar chart
result.df            # pandas DataFrame
result.corrections   # Items to fix
✓ PASSED: 8/10 correct (80%)
Average Score: 4.2/5.0
──────────────────────────────
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 3.9/5.0
Faithfulness: 4.6/5.0
──────────────────────────────
2 corrections available.
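Once a run reports corrections, you can turn them into review notes or extra context for your corpus. A sketch, assuming each correction carries the question, the RAG's answer, and a suggested fix; the field names below are hypothetical, check `result.corrections` for the actual shape.

```python
# Hypothetical correction items; field names are assumptions, not the
# documented RAGScore schema.
corrections = [
    {"question": "What is the refund window?",
     "rag_answer": "60 days",
     "suggested": "30 days, per section 4.2"},
]

def format_corrections(items):
    """Render corrections as plain-text notes to feed back into your RAG corpus."""
    lines = []
    for item in items:
        lines.append(f"Q: {item['question']}")
        lines.append(f"  got:      {item['rag_answer']}")
        lines.append(f"  expected: {item['suggested']}")
    return "\n".join(lines)
```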
RAGScore ships with a built-in MCP server. Use it directly from Claude Desktop or any MCP-compatible AI assistant.
pip install "ragscore[mcp]"

{
  "mcpServers": {
    "ragscore": {
      "command": "ragscore",
      "args": ["serve"]
    }
  }
}
"Generate QA pairs from my docs/ folder and evaluate my RAG at http://localhost:8000/query"
generate_qa_dataset: Generate QA pairs from PDFs, TXT, or Markdown files
evaluate_rag: Score your RAG endpoint against a QA dataset
quick_test_rag: Generate + evaluate in one call with pass/fail
get_corrections: Get incorrect answers with suggested fixes
Auto-detected from your environment variables. Zero config.
From solo developers to enterprise AI teams.
Test your RAG pipeline before deploying. Catch hallucinations and missing context early.
"We caught 15 hallucinations in our legal RAG before going live."
Add quick_test() to your test suite. Fail the build if accuracy drops below threshold.
assert quick_test(endpoint, docs).passed
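A slightly fuller CI gate can add its own accuracy bar on top of the pass/fail flag. The sketch below assumes the result object also exposes `correct` and `total` counts; only `.passed` is confirmed by the snippet above, so treat those fields as assumptions.

```python
from types import SimpleNamespace

MIN_CORRECT_RATIO = 0.8  # project-specific threshold

def gate(result) -> bool:
    """Pass only when the run succeeded AND clears the extra accuracy bar.

    `result.correct` / `result.total` are assumed fields, not confirmed API.
    """
    return bool(result.passed) and result.correct / result.total >= MIN_CORRECT_RATIO

# Stand-in result object; in a real suite you'd build it via
# quick_test(endpoint, docs) and call gate(...) inside a test function.
demo = SimpleNamespace(passed=True, correct=8, total=10)
```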
Evaluate RAG accuracy across departments. Finance, HR, Legal: each with different accuracy requirements.
Run offline with Ollama. No data leaves your VPC.
Deliver quantified RAG quality reports to clients. Show before/after improvement metrics.
"Accuracy improved from 62% to 91% after 3 rounds."
Get notified about new features, releases, and RAG evaluation tips. No spam.