🚀 Pybricks RAG Starter Kit
This project provides a beginner-friendly RAG (Retrieval-Augmented Generation) solution for Pybricks development. It includes scripts to ingest official Pybricks documentation and tutorials, build ChromaDB collections from them, and expose useful tools through a simple MCP-style server.
🚀 Quick Start
Let's get you up and running with this starter kit. You'll need two key files:
- ingest_pybricks.py: Crawls the official Pybricks docs, tutorials, and key GitHub pages, then chunks the content and builds a ChromaDB collection.
- mcp_server.py: A minimal MCP-style tool server that exposes functions like docs.search, code.search, snippet.suggest, and compare.versions using the Chroma collection.
✨ Features
- Targeted Crawling: Focuses on the most useful and stable parts of the Pybricks documentation for beginners, including API references, tutorials, and changelogs.
- Multi-View Indexing: Creates multiple views of the documents (full text, headings/signatures, and identifiers) for better retrieval.
- Version Awareness: Tags each chunk with version information, allowing for version-filtered searches.
- Simple Tool Server: Exposes useful tools through a FastAPI server that your agent can call.
📦 Installation
First, make sure you have the necessary Python packages installed. You can install them using pip:
pip install requests beautifulsoup4 lxml markdownify tiktoken chromadb sentence-transformers fastapi uvicorn
💻 Usage Examples
Ingesting Data
To ingest the Pybricks documentation and build the ChromaDB collection, run the following command:
python ingest_pybricks.py
Starting the Tool Server
After ingesting the data, start the tool server:
uvicorn mcp_server:app --reload --port 8765
Using the Tools
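Your MCP agent can call the following endpoints (each accepts a POST request with a JSON body):
- /tools/docs.search: Dense vector search with optional metadata filters (for example version or source).
- /tools/code.search: Prioritizes identifier hits, falling back to docs.search when nothing matches.
- /tools/snippet.suggest: Returns the closest prebuilt snippet templates for a given goal and hardware.
- /tools/compare.versions: Checks the GitHub changelogs for mentions of a feature.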
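As a rough sketch of what a call could look like from Python, the helper below builds a POST request for /tools/docs.search using only the standard library. The payload fields (query, k, filters) mirror the SearchRequest model in mcp_server.py; the helper name and the localhost base URL are assumptions, with the port taken from the uvicorn command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765"  # assumption: server started with the uvicorn command above


def build_docs_search_request(query, k=5, version=None):
    """Build (but do not send) a POST request for /tools/docs.search."""
    payload = {"query": query, "k": k}
    if version is not None:
        # Filters are compared against chunk metadata, e.g. the "version" tag.
        payload["filters"] = {"version": version}
    return urllib.request.Request(
        BASE_URL + "/tools/docs.search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_docs_search_request("How do I turn a DriveBase 90 degrees?", k=3, version="latest")
print(req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["results"])
```

The request is only built here, not sent, so the sketch runs without the server; uncomment the last lines once uvicorn is up.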
ingest_pybricks.py
Code Example
"""
Build a beginner-friendly Pybricks RAG corpus into ChromaDB.
- Crawls/loads selected Pybricks docs/tutorials + key GitHub changelogs.
- Normalizes & chunks (HTML/Markdown-aware; preserves code blocks).
- Creates multi-view docs (full text + headings/signatures + identifiers).
- Indexes into ChromaDB collections: pybricks_docs and pybricks_snippets.
Requires: requests, beautifulsoup4, lxml, markdownify, tiktoken, chromadb, sentence-transformers
pip install requests beautifulsoup4 lxml markdownify tiktoken chromadb sentence-transformers
"""
import re
import time
import json
import hashlib
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import tiktoken
import chromadb
from chromadb.utils import embedding_functions
SEED_URLS = [
    "https://docs.pybricks.com/en/latest/robotics.html",
    "https://docs.pybricks.com/en/latest/parameters/",
    "https://docs.pybricks.com/",
    "https://docs.pybricks.com/en/v3.5.0/robotics.html",
    "https://docs.pybricks.com/en/v3.3.0/robotics.html",
    "https://pybricks.com/learn/getting-started/pybricks-environment/",
    "https://pybricks.com/learn/getting-started/install-pybricks/",
    "https://pybricks.com/projects/tutorials/wireless/hub-to-device/pc-communication/",
    "https://pybricks.com/projects/tutorials/wireless/hub-to-hub/broadcast/",
    "https://pybricks.com/projects/tutorials/wireless/remote-control/button-basics/",
    "https://pybricks.com/projects/sets/mindstorms-robot-inventor/other-models/quick-reference/",
    "https://code.pybricks.com/",
    "https://github.com/pybricks/pybricks-micropython/blob/master/CHANGELOG.md",
    "https://github.com/pybricks/pybricks-code/blob/master/CHANGELOG.md",
    "https://docs.pybricks.com/projects/pybricksdev/en/latest/api/",
]
ALLOWED_DOMAINS = {
    "docs.pybricks.com",
    "pybricks.com",
    "code.pybricks.com",
    "github.com",
}
MAX_PAGES = 80
CRAWL_TIMEOUT = 15
USER_AGENT = "pybricks-rag-ingestor/1.0 (+for educational/FLL use)"
DOC_CHUNK_TOKENS = 600
DOC_OVERLAP_TOKENS = 100
CHROMA_DIR = "./chroma_pybricks"
COLL_DOCS = "pybricks_docs"
COLL_SNIPPETS = "pybricks_snippets"
EMBED_MODEL = "all-MiniLM-L6-v2"
tok = tiktoken.get_encoding("cl100k_base")
def count_tokens(text: str) -> int:
    return len(tok.encode(text))
def hash_id(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:32]
def within_domain(url: str) -> bool:
    try:
        host = urlparse(url).netloc.lower()
        # Accept the domain itself or a true subdomain. A bare endswith() would
        # also accept look-alike hosts such as "notpybricks.com".
        return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
    except Exception:
        return False
def get(url: str) -> requests.Response:
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=CRAWL_TIMEOUT)
def extract_links(url: str, soup: BeautifulSoup):
    for a in soup.select("a[href]"):
        href = a["href"]
        if href.startswith("#"):
            continue
        u = urljoin(url, href)
        if within_domain(u):
            yield u.split("#")[0]
def html_to_markdown_keep_code(html: str) -> str:
    text = md(html, heading_style="ATX", strip=["nav","footer","script","style","noscript"])
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text
def normalize_markdown(text: str) -> str:
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text.strip()
HEADING_PATTERN = re.compile(r"^(#{1,3})\s+(.*)", re.MULTILINE)
def split_by_headings(markdown: str):
    """
    Split markdown by H1–H3. Keep fenced code blocks with their section.
    """
    lines = markdown.splitlines()
    sections = []
    current = {"title": None, "content": []}
    in_fence = False  # True inside a ``` block, where "# ..." is a code comment, not a heading
    def push():
        if current["content"]:
            sections.append({
                "title": current["title"] or "",
                "text": "\n".join(current["content"]).strip()
            })
    for line in lines:
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_PATTERN.match(line)
        if match:
            # A heading outside a fence closes the previous section and starts a new one.
            push()
            current = {"title": match.group(2), "content": [line]}
        else:
            current["content"].append(line)
    push()
    return [s for s in sections if s["text"]]
def window_tokens(text: str, size_tokens: int, overlap_tokens: int):
    ids = tok.encode(text)
    n = len(ids)
    if n <= size_tokens:
        yield text
        return
    start = 0
    while start < n:
        end = min(n, start + size_tokens)
        chunk_ids = ids[start:end]
        yield tok.decode(chunk_ids)
        if end == n:
            break
        start = max(0, end - overlap_tokens)
def identifiers_view(text: str) -> str:
    ids = re.findall(r"`([^`]+)`|([A-Z_]{2,})|([A-Za-z_][A-Za-z0-9_]+)|id=\"([^\"]+)\"", text)
    flat = []
    for g in ids:
        flat.extend([x for x in g if x])
    flat = [w for w in flat if len(w) > 2]
    return "\n".join(sorted(set(flat)))
def crawl():
    seen = set()
    queue = list(SEED_URLS)
    pages = []
    while queue and len(pages) < MAX_PAGES:
        url = queue.pop(0)
        if url in seen or not within_domain(url):
            continue
        seen.add(url)
        try:
            r = get(url)
            if r.status_code != 200:
                continue
            ct = r.headers.get("Content-Type", "")
            if ct and "html" not in ct:
                continue  # skip PDFs, images, and other non-HTML responses
            html = r.text
            soup = BeautifulSoup(html, "lxml")
            # Remove page chrome before converting to Markdown.
            for tag in soup(["nav","footer","script","style","noscript"]):
                tag.decompose()
            body = soup.body or soup
            main = body.select_one("main") or body
            markdown = html_to_markdown_keep_code(str(main))
            markdown = normalize_markdown(markdown)
            pages.append({
                "url": url,
                "markdown": markdown
            })
            # Only follow links from the docs and tutorial sites, not from GitHub.
            if any(url.startswith(prefix) for prefix in [
                "https://docs.pybricks.com/",
                "https://pybricks.com/learn/",
                "https://pybricks.com/projects/",
            ]):
                for u in set(extract_links(url, soup)):
                    if u not in seen and within_domain(u):
                        if len(pages) + len(queue) < MAX_PAGES:
                            queue.append(u)
            time.sleep(0.3)  # be polite to the servers
        except Exception:
            continue
    return pages
def build_records(pages):
    records = []
    for p in pages:
        url = p["url"]
        mdtext = p["markdown"]
        # Version tag: "latest", an explicit x.y.z from the URL, or "stable" as the fallback.
        m = re.search(r"/en/v?(\d+\.\d+\.\d+)/", url)
        version = "latest" if "/en/latest/" in url else (m.group(1) if m else "stable")
        sections = split_by_headings(mdtext) or [{"title":"","text":mdtext}]
        for sec in sections:
            for win in window_tokens(sec["text"], DOC_CHUNK_TOKENS, DOC_OVERLAP_TOKENS):
                rec_id = hash_id(url + "|" + sec["title"] + "|" + win[:200])
                records.append({
                    "id": rec_id,
                    "text": win,
                    "meta": {
                        "url": url,
                        "title": sec["title"],
                        "version": version,
                        "source": urlparse(url).netloc,
                        "breadcrumbs": sec["title"],
                        "type": "doc"
                    },
                    "identifiers": identifiers_view(win)
                })
    return records
def upsert_chroma(records):
    client = chromadb.PersistentClient(path=CHROMA_DIR)
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)
    # Drop records that share an ID (identical chunks); a batch must not repeat an ID.
    records = list({r["id"]: r for r in records}.values())
    docs = client.get_or_create_collection(name=COLL_DOCS, embedding_function=ef, metadata={"hnsw:space":"cosine"})
    ids = [r["id"] for r in records]
    texts = [r["text"] for r in records]
    metas = [r["meta"] for r in records]
    # upsert() rather than add(), so re-running the script refreshes existing chunks.
    docs.upsert(ids=ids, documents=texts, metadatas=metas)
    ident = client.get_or_create_collection(name=f"{COLL_DOCS}_identifiers", embedding_function=ef, metadata={"hnsw:space":"cosine"})
    ident.upsert(
        ids=[f"I_{i}" for i in ids],
        documents=[r["identifiers"] for r in records],
        metadatas=[{**r["meta"], "type":"identifiers"} for r in records]
    )
    snippets = [
        {
            "id": "snip_drivebase_basic",
            "text": """# Drive straight and turn
from pybricks.hubs import PrimeHub
from pybricks.pupdevices import Motor
from pybricks.parameters import Port
from pybricks.robotics import DriveBase
hub = PrimeHub()
left = Motor(Port.A)
right = Motor(Port.B)
bot = DriveBase(left, right, wheel_diameter=56, axle_track=114)
bot.straight(300) # mm
bot.turn(90) # + is clockwise/right
""",
            "meta": {"topic":"drivebase","version":"stable","url":"https://docs.pybricks.com/en/latest/robotics.html"}
        },
    ]
    sn = client.get_or_create_collection(name=COLL_SNIPPETS, embedding_function=ef, metadata={"hnsw:space":"cosine"})
    sn.upsert(
        ids=[s["id"] for s in snippets],
        documents=[s["text"] for s in snippets],
        metadatas=[s["meta"] for s in snippets]
    )
    return client
def main():
    print("Crawling…")
    pages = crawl()
    print(f"Crawled {len(pages)} pages")
    records = build_records(pages)
    print(f"Built {len(records)} records")
    client = upsert_chroma(records)
    print("ChromaDB ready at", CHROMA_DIR)
if __name__ == "__main__":
    main()
mcp_server.py
Code Example
"""
Minimal MCP-style tool server exposing Pybricks RAG functions over ChromaDB.
Run:
uvicorn mcp_server:app --reload --port 8765
Tools:
- /tools/docs.search
- /tools/code.search
- /tools/snippet.suggest
- /tools/compare.versions
"""
import re
from typing import List, Optional, Dict, Any
from fastapi import FastAPI, Body, Query
from pydantic import BaseModel
import chromadb
from chromadb.utils import embedding_functions
CHROMA_DIR = "./chroma_pybricks"
COLL_DOCS = "pybricks_docs"
COLL_DOCS_IDENT = f"{COLL_DOCS}_identifiers"
COLL_SNIPPETS = "pybricks_snippets"
EMBED_MODEL = "all-MiniLM-L6-v2"
client = chromadb.PersistentClient(path=CHROMA_DIR)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)
docs = client.get_or_create_collection(name=COLL_DOCS, embedding_function=ef)
ident = client.get_or_create_collection(name=COLL_DOCS_IDENT, embedding_function=ef)
snips = client.get_or_create_collection(name=COLL_SNIPPETS, embedding_function=ef)
app = FastAPI(title="Pybricks MCP Tools")
class SearchRequest(BaseModel):
    query: str
    k: int = 8
    filters: Optional[Dict[str, Any]] = None
class SearchResponse(BaseModel):
    results: List[Dict[str, Any]]
def apply_filters(metadatas: List[Dict[str, Any]], filters: Dict[str, Any]) -> List[int]:
    idxs = []
    for i, m in enumerate(metadatas):
        ok = True
        for k, v in filters.items():
            if str(m.get(k)) != str(v):
                ok = False
                break
        if ok:
            idxs.append(i)
    return idxs
@app.post("/tools/docs.search", response_model=SearchResponse)
def docs_search(req: SearchRequest):
    res = docs.query(query_texts=[req.query], n_results=max(req.k, 8), include=["documents","metadatas","distances"])
    docs_idxs = list(range(len(res["documents"][0])))
    if req.filters:
        keep = apply_filters(res["metadatas"][0], req.filters)
        docs_idxs = [i for i in docs_idxs if i in keep]
    results = []
    for i in docs_idxs[:req.k]:
        results.append({
            "text": res["documents"][0][i],
            "score": 1 - res["distances"][0][i],
            "meta": res["metadatas"][0][i],
        })
    return {"results": results}
class CodeSearchRequest(BaseModel):
    symbol: str
    k: int = 8
    version: Optional[str] = None
@app.post("/tools/code.search", response_model=SearchResponse)
def code_search(req: CodeSearchRequest):
    res = ident.query(query_texts=[req.symbol], n_results=max(req.k, 8), include=["documents","metadatas","distances"])
    out = []
    for i in range(len(res["documents"][0])):
        m = res["metadatas"][0][i]
        if req.version and m.get("version") != req.version:
            continue
        out.append({
            "text": res["documents"][0][i],
            "score": 1 - res["distances"][0][i],
            "meta": m,
        })
    if not out:
        return docs_search(SearchRequest(query=req.symbol, k=req.k, filters={"version": req.version} if req.version else None))
    return {"results": out[:req.k]}
class SnippetSuggestRequest(BaseModel):
    goal: str
    hardware: Optional[str] = None
    k: int = 3
@app.post("/tools/snippet.suggest", response_model=SearchResponse)
def snippet_suggest(req: SnippetSuggestRequest):
    query = req.goal + (" " + req.hardware if req.hardware else "")
    res = snips.query(query_texts=[query], n_results=req.k, include=["documents","metadatas","distances"])
    results = [{
        "text": res["documents"][0][i],
        "score": 1 - res["distances"][0][i],
        "meta": res["metadatas"][0][i],
    } for i in range(len(res["documents"][0]))]
    return {"results": results}
class CompareVersionsRequest(BaseModel):
    feature: str
    k: int = 6
@app.post("/tools/compare.versions", response_model=SearchResponse)
def compare_versions(req: CompareVersionsRequest):
    res = docs.query(query_texts=[f"CHANGELOG {req.feature}"], n_results=20, include=["documents","metadatas","distances"])
    hits = []
    for i in range(len(res["documents"][0])):
        m = res["metadatas"][0][i]
        if "github.com/pybricks" in m.get("url",""):
            hits.append({
                "text": res["documents"][0][i],
                "score": 1 - res["distances"][0][i],
                "meta": m
            })
    return {"results": hits[:req.k]}
📚 Documentation
What “Beginner-Friendly” Means
- Python-Centric: Explains what to import, which port to use, and what the units are, rather than focusing on firmware internals.
- Official API Preference: Prefers official API calls and examples over clever hacks.
- Runtime Mismatch Detection: Detects MicroPython vs CPython mismatches and provides appropriate guidance.
Source of Truth to Ingest
The following sources are ingested to build the corpus:
- Official API Docs: Both stable and latest versions, covering hubs, sensors, motors, parameters, and robotics helpers.
- Pybricks Code Basics: Information about the IDE, including pairing, running, and limits.
- Getting Started Posts: Useful for setup FAQs.
- Release/Changelog Snippets: To track breaking changes and new features.
- Selected Project Tutorials: For worked examples.
- Pybricks API Repo/PyPI: As a reference for versions.
- Landing Page: For a brief explanation of Pybricks.
Preprocessing & Chunking
Markdown/HTML Docs
- Split by H2/H3 headings, keeping fenced code, tables, and notes together.
- Overlap with 80-120 tokens.
- Include metadata such as page URL, version, breadcrumbs, and HTML id anchors.
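The overlap step can be illustrated with a dependency-free sketch. It windows on whitespace-separated words rather than tiktoken tokens, which is an assumption made so the example runs without extra packages; window_words is a hypothetical name, and the sizes are shrunk so the overlap is easy to see.

```python
def window_words(text, size, overlap):
    """Yield overlapping windows of `size` whitespace-separated words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if len(words) <= size:
        yield " ".join(words)
        return
    start = 0
    while start < len(words):
        end = min(len(words), start + size)
        yield " ".join(words[start:end])
        if end == len(words):
            break
        # Step back by `overlap` words so neighbouring chunks share context.
        start = end - overlap


chunks = list(window_words("a b c d e f g h i j", size=4, overlap=1))
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last word of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.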
Python Examples/Snippets
- Process per function/class using AST or Tree-sitter, including signature, docstring, and imports.
- If long, window with 80-120 token overlap and repeat the signature.
- Create a second “interface” vector with only the signature and docstring.
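One possible way to build the “interface” view is with the standard-library ast module, sketched below. The function name interface_views and the example snippet are illustrative assumptions; ast.unparse requires Python 3.9 or newer, and only top-level definitions are handled.

```python
import ast


def interface_views(source):
    """Return one {'name', 'signature', 'docstring'} dict per top-level def or class."""
    views = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature = f"def {node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            signature = f"class {node.name}"
        else:
            continue  # imports, assignments, and bare statements are skipped
        views.append({
            "name": node.name,
            "signature": signature,
            "docstring": ast.get_docstring(node) or "",
        })
    return views


example = '''
def drive_square(bot, side_mm=200):
    """Drive a square with the given side length."""
    for _ in range(4):
        bot.straight(side_mm)
        bot.turn(90)
'''

for view in interface_views(example):
    print(view["signature"], "->", view["docstring"])
```

The signature and docstring can then be embedded as a second, shorter document alongside the full function body.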
Indexing & Retrieval
- Hybrid Search: Combines dense embeddings with BM25/regex for symbol hits.
- Parent-Child Structure: Embeds small chunks and keeps a “parent” section/file for expansion.
- Multi-Vector per Chunk: Includes full text, headings/signature view, and identifiers-only view.
- Version Filter: Tags each chunk with version, channel, and source information.
- Reranking: Prioritizes exact symbol match, version proximity, and doc section type.
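To make the multi-vector idea concrete, here is a minimal sketch of merging hits from the full-text view and the identifiers view into one ranking. It assumes each view returns a mapping of id to similarity score and relies on the I_ prefix that ingest_pybricks.py gives identifier-view ids; the function name and the 0.5 weight are illustrative, not tuned values.

```python
def merge_views(full_hits, ident_hits, ident_weight=0.5):
    """Combine per-view scores into one ranking keyed by chunk id.

    full_hits:  {chunk_id: score} from the full-text collection.
    ident_hits: {"I_" + chunk_id: score} from the identifiers collection.
    """
    combined = dict(full_hits)
    for ident_id, score in ident_hits.items():
        # Map the identifier-view id back to the chunk it was derived from.
        chunk_id = ident_id[2:] if ident_id.startswith("I_") else ident_id
        combined[chunk_id] = combined.get(chunk_id, 0.0) + ident_weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


full = {"a": 0.6, "b": 0.5}
ident = {"I_b": 0.8, "I_c": 0.4}
merged = merge_views(full, ident)
print([chunk_id for chunk_id, _ in merged])  # ['b', 'a', 'c']
```

Chunk b overtakes chunk a because it matches in both views, which is the behaviour the multi-view index is meant to reward.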
MCP Server Design
The MCP server exposes the following tools:
- docs.search(query, filters): Returns passages and metadata, with optional filters for symbol, hub, version, and category.
- code.search(symbol): Performs an exact/regex search across examples.
- explain.error(message, context): Maps common runtime messages or gotchas to fixes (part of the design; not implemented in the mcp_server.py listing above).
- snippet.suggest(goal, hardware): Returns a template for a given goal and hardware.
- compare.versions(feature): Checks changelogs for mentions of a feature.
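The mcp_server.py listing does not define explain.error. One way it could start is a small table of regex patterns mapped to causes and fixes, as sketched below. The patterns and advice here are illustrative assumptions rather than an official list of Pybricks error messages.

```python
import re

# Hypothetical hint table: (pattern, likely cause, suggested fix).
ERROR_HINTS = [
    (
        re.compile(r"no module named '?(numpy|pandas|tkinter|requests)'?", re.IGNORECASE),
        "The program imports a desktop (CPython) library that is not available on the hub.",
        "Remove the import and use the pybricks modules instead.",
    ),
    (
        re.compile(r"ENODEV", re.IGNORECASE),
        "No device was detected on the port the program asked for.",
        "Check the cable and make sure the Port constant matches the physical port.",
    ),
]


def explain_error(message):
    """Return {'cause', 'fix'} for the first matching hint, or None if nothing matches."""
    for pattern, cause, fix in ERROR_HINTS:
        if pattern.search(message):
            return {"cause": cause, "fix": fix}
    return None


print(explain_error("ImportError: no module named 'numpy'"))
```

Returning None for unknown messages lets the caller fall back to docs.search instead of guessing.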
Guardrails & Beginner Heuristics
- Detects and corrects attempts to use desktop-only libraries or features not available on the hub.
- Explains units and sign conventions with short examples.
- Warns if a method requires specific sensors or hubs.
- Displays a version banner when an answer uses a new feature.
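The first guardrail could be approximated by scanning a program's imports before answering. The sketch below uses the standard-library ast module; the DESKTOP_ONLY set is an illustrative assumption, not an exhaustive or official list of modules missing from the hub.

```python
import ast

# Assumption: an illustrative denylist of desktop-only packages.
DESKTOP_ONLY = {"numpy", "pandas", "tkinter", "requests", "matplotlib"}


def find_desktop_imports(source):
    """Return the sorted DESKTOP_ONLY modules that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "matplotlib.pyplot" -> "matplotlib"
            if root in DESKTOP_ONLY:
                found.add(root)
    return sorted(found)


program = """
import numpy as np
from pybricks.pupdevices import Motor
from matplotlib import pyplot
"""
print(find_desktop_imports(program))  # ['matplotlib', 'numpy']
```

Parsing the imports rather than searching the raw text avoids false alarms when a module name only appears in a comment or string.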
Answer Patterns
- “How do I…?”: Provides an action recipe with imports, code, notes on signs/units, and a link to the docs.
- “What does this error mean?”: Maps error messages to common causes and provides a fix and explanation.
- “Why isn’t my motor moving?”: Walks through a checklist and provides a run_angle demo and a link to the Motor docs.
- “Which port constant do I use?”: Answers with port constants and mentions sensor vs motor ports, with a link to the parameters/constants docs.
Prebuilt Snippet Templates
The snippet.suggest tool provides templates for common tasks, such as DriveBase setup, single-motor movement, reading hub orientation, and BLE communication.
Evaluation Plan
A test set of 25 real questions is used to evaluate the answers based on correctness, actionable code, citation presence, and beginner clarity. Chunk sizes and rerankers are iterated to improve performance.
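A grading sheet for this plan might be tallied as in the sketch below. It assumes each answer gets a 0 or 1 mark per criterion, which the plan above does not specify; the criterion keys and function names are likewise illustrative.

```python
from statistics import mean

CRITERIA = ("correctness", "actionable_code", "citation", "beginner_clarity")


def score_answer(marks):
    """Average the four 0-or-1 rubric marks for one answer."""
    missing = [c for c in CRITERIA if c not in marks]
    if missing:
        raise ValueError(f"missing rubric marks: {missing}")
    return mean(marks[c] for c in CRITERIA)


def summarize(results):
    """Return the pass rate per criterion across all graded answers."""
    return {c: mean(r[c] for r in results) for c in CRITERIA}


graded = [
    {"correctness": 1, "actionable_code": 1, "citation": 0, "beginner_clarity": 1},
    {"correctness": 1, "actionable_code": 0, "citation": 1, "beginner_clarity": 1},
]
print(score_answer(graded[0]))  # 0.75
print(summarize(graded))
```

Comparing the per-criterion pass rates before and after a chunk-size or reranker change shows which aspect of the answers the change actually moved.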
Ops: Keeping it Current
- Crawl “stable” docs weekly and “latest” docs monthly.
- Monitor changelogs for features that affect the beginner experience.
- Mark snippets with a version guard if there is a conflict with a change.
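The crawl cadence could be enforced with a small due-date check like the one sketched here. It assumes "weekly" means 7 days and "monthly" means 30 days, and that the last crawl time per channel is stored somewhere by the caller; both the names and the storage are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: "weekly" is 7 days and "monthly" is 30 days.
CRAWL_INTERVALS = {"stable": timedelta(days=7), "latest": timedelta(days=30)}


def is_crawl_due(channel, last_crawled, now):
    """Return True if the channel was never crawled or its interval has elapsed."""
    if last_crawled is None:
        return True
    return now - last_crawled >= CRAWL_INTERVALS[channel]


now = datetime(2024, 3, 15)
print(is_crawl_due("stable", datetime(2024, 3, 1), now))  # True: 14 days >= 7
print(is_crawl_due("latest", datetime(2024, 3, 1), now))  # False: 14 days < 30
```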
Quick Pitfalls
The assistant should proactively catch issues such as suggesting desktop-only libraries, mixing hub-specific details, getting sign conventions wrong, and misunderstanding the gears parameter.
Day-One Stack Recommendation
- Embedder: Any strong general code/text embedding model can be used.
- Index: pgvector/FAISS with two collections: “api_docs” and “examples/snippets”.
- Rerank: Use BM25 over the top-50 dense hits and add a symbol regex bonus.
- Caching: Store resolved answers for faster classroom use.
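The rerank step can be sketched with a small self-contained Okapi BM25 scorer plus a flat bonus for an exact symbol match. This is not part of mcp_server.py; the tokenizer, the k1 and b defaults, and the bonus weight are assumptions chosen for illustration, and the candidate list stands in for the top dense hits.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase identifier-like tokens, keeping underscores (e.g. run_angle)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / max(n_docs, 1)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank_with_bm25(query, symbol, docs, symbol_bonus=1.0):
    """Order `docs` by BM25 score plus a flat bonus when `symbol` appears verbatim."""
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    scored = [
        (score + (symbol_bonus if pattern.search(doc) else 0.0), doc)
        for score, doc in zip(bm25_scores(query, docs), docs)
    ]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]


candidates = [
    "The hub has a built-in IMU for orientation.",
    "Use run_angle to rotate the motor by a fixed angle.",
    "Motors report their angle in degrees.",
]
ranked = rerank_with_bm25("motor run_angle", "run_angle", candidates)
print(ranked[0])
```

In practice the candidates would be the text of the top dense hits, so BM25 only has to score a few dozen short passages per query.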
🔧 Technical Details
Crawling and Indexing
The ingest_pybricks.py script crawls the seed URLs (up to 80 pages), strips navigation and scripts, converts each page to Markdown, splits it by heading, and windows long sections into 600-token chunks with a 100-token overlap. Each chunk is indexed into ChromaDB twice: once as full text and once as an identifiers-only view.
Server Design
The mcp_server.py script uses FastAPI to expose the tools as POST endpoints over HTTP. Each endpoint queries the matching ChromaDB collection and returns the text, a similarity score (1 minus the cosine distance), and the metadata for each hit.
📄 License
No license information was provided in the original README. If applicable, please ensure that you comply with the appropriate license terms.