Pybricks MCP: API Doc Search Tool for Beginners' Development

Pybricks MCP

This is a RAG assistant project designed for Pybricks beginners. It retrieves and suggests content from the official API documentation, tutorials, and code snippets, and is especially suitable for FLL students and coaches. The project covers document crawling, preprocessing, index building, and MCP tool services, with the aim of giving accurate, beginner-friendly answers.


rating : 2 points

downloads : 4.6K

update time : 2025-08-11


What is the Pybricks MCP Assistant?

This is an assistant server designed for beginners in LEGO robot programming. It searches the official Pybricks documentation, returns runnable code templates, and checks how features differ between versions to help FLL teams solve programming problems.

How to use the Pybricks Assistant?

Through simple API calls, you can query the documentation, obtain code snippets, or check version compatibility. The server comes with solution templates for common robot tasks.

Applicable Scenarios

Suitable for classroom programming teaching, FLL pre-competition preparation, solving robot control problems (such as motor steering and sensor usage), and learning the Pybricks API.

Main Features

Intelligent Documentation Search

Supports natural language queries of the official Pybricks documentation and can filter results by version.

Code Template Generation

Automatically generates runnable Python code snippets based on task descriptions, including necessary imports and parameter descriptions.

Version Comparison

Checks the availability and changes of specific functions in different Pybricks versions.

Hardware Awareness

Provides adapted code suggestions based on different hardware such as PrimeHub/Inventor.

Advantages

Instant retrieval of official documentation to ensure the accuracy of answers.

Code examples include complete imports and unit descriptions, ready to use out of the box.

Particularly optimized for answering common robot control problems.

Automatically marks the firmware version required for functions to avoid compatibility issues.

Limitations

Currently only supports English queries (most of the documentation is in English).

Cannot execute code directly; the returned code must be copied into the Pybricks IDE to run.

Advanced custom functions require technical knowledge for configuration.

How to Use

Start the Server

Run the FastAPI server after installing the dependencies.

Query the Documentation

Send a JSON POST request to the /tools/docs.search endpoint to query relevant content.

Get Code

Use /tools/snippet.suggest to get the code template for a specific task.

Usage Examples

Basic Movement Control

Make the robot move forward a specified distance and then turn.

Version Compatibility Check

Confirm the available version of the DriveBase.stalled() method.

Frequently Asked Questions

Why does the code report an error in the IDE?

How to specify the motor port?

Why is the turning angle opposite to the expectation?

Related Resources

Pybricks Official Documentation

Complete API reference and tutorials

Pybricks GitHub

Source code and change logs

FLL Official Forum

Competition-related discussion area

🚀 Pybricks RAG Starter Kit

This project provides a beginner-friendly RAG (Retrieval Augmented Generation) solution for Pybricks development. It includes scripts to ingest official Pybricks documentation and tutorials, build a ChromaDB collection, and expose useful tools through a simple MCP-style server.

🚀 Quick Start

Let's get you up and running with this starter kit. You'll need two key files:

ingest_pybricks.py: This script crawls the official Pybricks docs, tutorials, and key GitHub pages. It then chunks the content nicely and builds a ChromaDB collection.
mcp_server.py: A minimal MCP-style tool server that exposes functions like docs.search, code.search, snippet.suggest, and compare.versions using the Chroma collection.

✨ Features

Targeted Crawling: Focuses on the most useful and stable parts of the Pybricks documentation for beginners, including API references, tutorials, and changelogs.
Multi-View Indexing: Creates multiple views of the documents (full text, headings/signatures, and identifiers) for better retrieval.
Version Awareness: Tags each chunk with version information, allowing for version-filtered searches.
Simple Tool Server: Exposes useful tools through a FastAPI server that your agent can call.
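
As a rough illustration of the version tagging described above, the sketch below derives a version label from a docs URL. It assumes the URL convention visible in the seed list (/en/latest/ and /en/v3.5.0/); the function name version_from_url is hypothetical, and the "stable" fallback mirrors what the ingest script does for URLs without a version segment.

```python
import re

# Matches the version segment in docs URLs such as /en/v3.5.0/ or /en/3.5.0/.
_VERSION_RE = re.compile(r"/en/v?(\d+\.\d+\.\d+)/")

def version_from_url(url: str) -> str:
    """Return 'latest', an explicit 'X.Y.Z' version, or 'stable' as a fallback."""
    if "/en/latest/" in url:
        return "latest"
    match = _VERSION_RE.search(url)
    return match.group(1) if match else "stable"

for url in (
    "https://docs.pybricks.com/en/latest/robotics.html",
    "https://docs.pybricks.com/en/v3.5.0/robotics.html",
    "https://pybricks.com/learn/getting-started/install-pybricks/",
):
    print(version_from_url(url), url)
```

The resulting label can be stored in each chunk's metadata and later used as a search filter.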

📦 Installation

First, make sure you have the necessary Python packages installed. You can install them using pip:

pip install requests beautifulsoup4 lxml markdownify tiktoken chromadb sentence-transformers fastapi uvicorn

💻 Usage Examples

Ingesting Data

To ingest the Pybricks documentation and build the ChromaDB collection, run the following command:

python ingest_pybricks.py

Starting the Tool Server

After ingesting the data, start the tool server:

uvicorn mcp_server:app --reload --port 8765

Using the Tools

Your MCP agent can call the following endpoints:

/tools/docs.search: Dense vector search with optional version/source metadata filters.
/tools/code.search: Prioritizes identifier hits.
/tools/snippet.suggest: Returns a template for a given goal and hardware.
/tools/compare.versions: Checks changelogs for mentions of a feature.
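
The endpoints above accept JSON bodies via POST. The following is a minimal client sketch using only the standard library. It assumes the server runs locally on port 8765 (as in the uvicorn command above); the helper names build_tool_request and call_tool are hypothetical. The payload fields follow the request models in mcp_server.py.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"  # assumption: server started locally on port 8765

def build_tool_request(tool: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /tools/<tool>."""
    return urllib.request.Request(
        url=f"{BASE_URL}/tools/{tool}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_tool(tool: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_tool_request(tool, payload), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example payloads, one per tool.
search_payload = {"query": "How do I drive straight?", "k": 3, "filters": {"version": "latest"}}
symbol_payload = {"symbol": "DriveBase.stalled", "k": 5}
snippet_payload = {"goal": "drive forward then turn", "hardware": "PrimeHub"}
compare_payload = {"feature": "DriveBase.stalled"}

req = build_tool_request("docs.search", search_payload)
print(req.get_method(), req.full_url)
# With the server running:
# results = call_tool("docs.search", search_payload)["results"]
```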

`ingest_pybricks.py` Code Example

#!/usr/bin/env python3
"""
Build a beginner-friendly Pybricks RAG corpus into ChromaDB.

- Crawls/loads selected Pybricks docs/tutorials + key GitHub changelogs.
- Normalizes & chunks (HTML/Markdown-aware; preserves code blocks).
- Creates multi-view docs (full text + headings/signatures + identifiers).
- Indexes into ChromaDB collections: pybricks_docs and pybricks_snippets.

Requires: requests, beautifulsoup4, lxml, markdownify, tiktoken, chromadb, sentence-transformers
    pip install requests beautifulsoup4 lxml markdownify tiktoken chromadb sentence-transformers
"""

import re
import time
import json
import hashlib
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import tiktoken

import chromadb
from chromadb.utils import embedding_functions

############################
# Config
############################
SEED_URLS = [
    # API docs (latest & versioned)
    "https://docs.pybricks.com/en/latest/robotics.html",       # DriveBase, etc.
    "https://docs.pybricks.com/en/latest/parameters/",         # Port, Direction, Stop, etc.
    "https://docs.pybricks.com/",                              # root (to discover other key pages)
    "https://docs.pybricks.com/en/v3.5.0/robotics.html",       # older ref (version contrast)
    "https://docs.pybricks.com/en/v3.3.0/robotics.html",

    # Tutorials / Getting started
    "https://pybricks.com/learn/getting-started/pybricks-environment/",
    "https://pybricks.com/learn/getting-started/install-pybricks/",
    "https://pybricks.com/projects/tutorials/wireless/hub-to-device/pc-communication/",
    "https://pybricks.com/projects/tutorials/wireless/hub-to-hub/broadcast/",
    "https://pybricks.com/projects/tutorials/wireless/remote-control/button-basics/",
    "https://pybricks.com/projects/sets/mindstorms-robot-inventor/other-models/quick-reference/",
    "https://code.pybricks.com/",

    # Changelogs (version awareness)
    "https://github.com/pybricks/pybricks-micropython/blob/master/CHANGELOG.md",
    "https://github.com/pybricks/pybricks-code/blob/master/CHANGELOG.md",

    # pybricksdev (optional power tool docs)
    "https://docs.pybricks.com/projects/pybricksdev/en/latest/api/",
]

ALLOWED_DOMAINS = {
    "docs.pybricks.com",
    "pybricks.com",
    "code.pybricks.com",
    "github.com",
}

MAX_PAGES = 80           # keep crawl polite; bump when needed
CRAWL_TIMEOUT = 15
USER_AGENT = "pybricks-rag-ingestor/1.0 (+for educational/FLL use)"

# Chunking defaults
DOC_CHUNK_TOKENS = 600
DOC_OVERLAP_TOKENS = 100

# Chroma config
CHROMA_DIR = "./chroma_pybricks"
COLL_DOCS = "pybricks_docs"
COLL_SNIPPETS = "pybricks_snippets"  # short task-oriented code templates/snips

# Embeddings: sentence-transformers (local)
EMBED_MODEL = "all-MiniLM-L6-v2"

############################
# Helpers
############################
tok = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(tok.encode(text))

def hash_id(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:32]

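def within_domain(url: str) -> bool:
    # Accept an allowed domain itself or any of its subdomains, but not
    # look-alike hosts such as "notpybricks.com".
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        return False
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)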

def get(url: str) -> requests.Response:
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=CRAWL_TIMEOUT)

def extract_links(url: str, soup: BeautifulSoup):
    for a in soup.select("a[href]"):
        href = a["href"]
        if href.startswith("#"):
            continue
        u = urljoin(url, href)
        if within_domain(u):
            yield u.split("#")[0]

def html_to_markdown_keep_code(html: str) -> str:
    # Convert HTML to Markdown but keep code blocks and headings clean
    # md() handles most; we'll fix common issues
    text = md(html, heading_style="ATX", strip=["nav","footer","script","style","noscript"])
    # compact extra blank lines
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text

def normalize_markdown(text: str) -> str:
    # de-dup whitespace, make headings spaced
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text.strip()

############################
# Chunkers
############################
HEADING_PATTERN = re.compile(r"^(#{1,3})\s+(.*)", re.MULTILINE)

FENCE_PATTERN = re.compile(r"^\s*`{3,}")

def split_by_headings(markdown: str):
    """
    Split markdown by H1–H3. Keep fenced code blocks with their section.
    """
    sections = []
    current = {"title": None, "content": []}
    in_fence = False

    def push():
        if current["content"]:
            sections.append({
                "title": current["title"] or "",
                "text": "\n".join(current["content"]).strip()
            })

    for line in markdown.splitlines():
        # Track fenced code blocks so that "# comment" lines inside code
        # are not mistaken for headings.
        if FENCE_PATTERN.match(line):
            in_fence = not in_fence
        m = None if in_fence else HEADING_PATTERN.match(line)
        if m:
            # new section: flush the previous one and start with the heading line
            push()
            current = {"title": m.group(2), "content": [line]}
        else:
            current["content"].append(line)

    push()
    return [s for s in sections if s["text"]]

def window_tokens(text: str, size_tokens: int, overlap_tokens: int):
    ids = tok.encode(text)
    n = len(ids)
    if n <= size_tokens:
        yield text
        return
    start = 0
    while start < n:
        end = min(n, start + size_tokens)
        chunk_ids = ids[start:end]
        yield tok.decode(chunk_ids)
        if end == n:
            break
        start = max(0, end - overlap_tokens)

def identifiers_view(text: str) -> str:
    # Pull likely “identifier-ish” tokens: backticked names, UPPER_CASE, CamelCase, Port.X, Direction.X, drivebase methods, HTML ids
    ids = re.findall(r"`([^`]+)`|([A-Z_]{2,})|([A-Za-z_][A-Za-z0-9_]+)|id=\"([^\"]+)\"", text)
    flat = []
    for g in ids:
        flat.extend([x for x in g if x])
    # de-noise trivial words
    flat = [w for w in flat if len(w) > 2]
    return "\n".join(sorted(set(flat)))

############################
# Crawl
############################
def crawl():
    seen = set()
    queue = list(SEED_URLS)
    pages = []

    while queue and len(pages) < MAX_PAGES:
        url = queue.pop(0)
        if url in seen or not within_domain(url):
            continue
        seen.add(url)
        try:
            r = get(url)
            if r.status_code != 200:
                continue
            ct = r.headers.get("Content-Type","")
            if "html" not in ct.lower():
                continue  # skip non-HTML responses (PDFs, images, archives)
            html = r.text
            soup = BeautifulSoup(html, "lxml")
            # strip obvious boilerplate
            for tag in soup(["nav","footer","script","style","noscript"]): tag.decompose()
            body = soup.body or soup
            main = body.select_one("main") or body
            # Many docs are Sphinx → good structure + headings
            # Convert to markdown for easier heading split
            markdown = html_to_markdown_keep_code(str(main))
            markdown = normalize_markdown(markdown)

            pages.append({
                "url": url,
                "markdown": markdown
            })

            # discover more links from key roots (docs/tutorials)
            if any(url.startswith(prefix) for prefix in [
                "https://docs.pybricks.com/",
                "https://pybricks.com/learn/",
                "https://pybricks.com/projects/",
            ]):
                for u in set(extract_links(url, soup)):
                    if u not in seen and within_domain(u):
                        # keep within same area
                        if len(pages) + len(queue) < MAX_PAGES:
                            queue.append(u)

            time.sleep(0.3)  # be polite

        except Exception:
            continue

    return pages

############################
# Build records & index
############################
def build_records(pages):
    records = []
    for p in pages:
        url = p["url"]
        mdtext = p["markdown"]
        # version tags from URL: "latest", an explicit X.Y.Z, or "stable" as fallback
        m = re.search(r"/en/v?(\d+\.\d+\.\d+)/", url)
        version = "latest" if "/en/latest/" in url else (m.group(1) if m else "stable")
        # sectionize by H1–H3
        sections = split_by_headings(mdtext) or [{"title":"","text":mdtext}]
        for sec in sections:
            # windowing to stay within token budget
            for win in window_tokens(sec["text"], DOC_CHUNK_TOKENS, DOC_OVERLAP_TOKENS):
                rec_id = hash_id(url + "|" + sec["title"] + "|" + win[:200])
                records.append({
                    "id": rec_id,
                    "text": win,
                    "meta": {
                        "url": url,
                        "title": sec["title"],
                        "version": version,
                        "source": urlparse(url).netloc,
                        "breadcrumbs": sec["title"],
                        "type": "doc"
                    },
                    "identifiers": identifiers_view(win)
                })
    return records

############################
# ChromaDB
############################
def upsert_chroma(records):
    client = chromadb.PersistentClient(path=CHROMA_DIR)
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)

    docs = client.get_or_create_collection(name=COLL_DOCS, embedding_function=ef, metadata={"hnsw:space":"cosine"})

    ids = [r["id"] for r in records]
    texts = [r["text"] for r in records]
    metas = [r["meta"] for r in records]
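    # Upsert (rather than add) so that re-running the ingest refreshes existing chunks.
    # The identifiers view is indexed separately below; it could also be kept as a
    # metadata field for regex/BM25 matching on the caller's side.
    docs.upsert(ids=ids, documents=texts, metadatas=metas)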

    # Optional: create a lightweight identifiers collection to help rerank exact symbol hits
    ident = client.get_or_create_collection(name=f"{COLL_DOCS}_identifiers", embedding_function=ef, metadata={"hnsw:space":"cosine"})
    ident.upsert(
        ids=[f"I_{i}" for i in ids],
        documents=[r["identifiers"] for r in records],
        metadatas=[{**r["meta"], "type":"identifiers"} for r in records]
    )

    # Starter snippet/templates (task-oriented)
    snippets = [
        {
            "id": "snip_drivebase_basic",
            "text": """# Drive straight and turn
from pybricks.hubs import PrimeHub
from pybricks.pupdevices import Motor
from pybricks.parameters import Port
from pybricks.robotics import DriveBase

hub = PrimeHub()
left = Motor(Port.A)
right = Motor(Port.B)
bot = DriveBase(left, right, wheel_diameter=56, axle_track=114)

bot.straight(300)   # mm
bot.turn(90)        # + is clockwise/right
""",
            "meta": {"topic":"drivebase","version":"stable","url":"https://docs.pybricks.com/en/latest/robotics.html"}
        },
    ]
    sn = client.get_or_create_collection(name=COLL_SNIPPETS, embedding_function=ef, metadata={"hnsw:space":"cosine"})
    sn.upsert(
        ids=[s["id"] for s in snippets],
        documents=[s["text"] for s in snippets],
        metadatas=[s["meta"] for s in snippets]
    )

    return client

def main():
    print("Crawling…")
    pages = crawl()
    print(f"Crawled {len(pages)} pages")
    records = build_records(pages)
    print(f"Built {len(records)} records")
    client = upsert_chroma(records)
    print("ChromaDB ready at", CHROMA_DIR)

if __name__ == "__main__":
    main()

`mcp_server.py` Code Example

#!/usr/bin/env python3
"""
Minimal MCP-style tool server exposing Pybricks RAG functions over ChromaDB.

Run:
    uvicorn mcp_server:app --reload --port 8765

Tools:
- /tools/docs.search
- /tools/code.search
- /tools/snippet.suggest
- /tools/compare.versions
"""

from typing import List, Optional, Dict, Any
from fastapi import FastAPI
from pydantic import BaseModel
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DIR = "./chroma_pybricks"
COLL_DOCS = "pybricks_docs"
COLL_DOCS_IDENT = f"{COLL_DOCS}_identifiers"
COLL_SNIPPETS = "pybricks_snippets"
EMBED_MODEL = "all-MiniLM-L6-v2"

client = chromadb.PersistentClient(path=CHROMA_DIR)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)

docs = client.get_or_create_collection(name=COLL_DOCS, embedding_function=ef)
ident = client.get_or_create_collection(name=COLL_DOCS_IDENT, embedding_function=ef)
snips = client.get_or_create_collection(name=COLL_SNIPPETS, embedding_function=ef)

app = FastAPI(title="Pybricks MCP Tools")

class SearchRequest(BaseModel):
    query: str
    k: int = 8
    filters: Optional[Dict[str, Any]] = None  # e.g., {"version":"latest"} or {"source":"docs.pybricks.com"}

class SearchResponse(BaseModel):
    results: List[Dict[str, Any]]

def build_where(filters: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    # Translate simple {key: value} filters into a Chroma "where" clause.
    # Multiple conditions are combined with "$and".
    if not filters:
        return None
    clauses = [{k: v} for k, v in filters.items()]
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

@app.post("/tools/docs.search", response_model=SearchResponse)
def docs_search(req: SearchRequest):
    # Dense search. Metadata filters are applied inside the query (not on the
    # top hits afterwards), so a filtered search can still return up to k results.
    res = docs.query(
        query_texts=[req.query],
        n_results=req.k,
        where=build_where(req.filters),
        include=["documents","metadatas","distances"],
    )

    results = []
    for i in range(len(res["documents"][0])):
        results.append({
            "text": res["documents"][0][i],
            "score": 1 - res["distances"][0][i],
            "meta": res["metadatas"][0][i],
        })
    return {"results": results}

class CodeSearchRequest(BaseModel):
    symbol: str
    k: int = 8
    version: Optional[str] = None

@app.post("/tools/code.search", response_model=SearchResponse)
def code_search(req: CodeSearchRequest):
    # prioritize identifier view collection; fall back to docs
    res = ident.query(query_texts=[req.symbol], n_results=max(req.k, 8), include=["documents","metadatas","distances"])
    out = []
    for i in range(len(res["documents"][0])):
        m = res["metadatas"][0][i]
        if req.version and m.get("version") != req.version:
            continue
        out.append({
            "text": res["documents"][0][i],
            "score": 1 - res["distances"][0][i],
            "meta": m,
        })
    if not out:
        return docs_search(SearchRequest(query=req.symbol, k=req.k, filters={"version": req.version} if req.version else None))
    return {"results": out[:req.k]}

class SnippetSuggestRequest(BaseModel):
    goal: str
    hardware: Optional[str] = None
    k: int = 3

@app.post("/tools/snippet.suggest", response_model=SearchResponse)
def snippet_suggest(req: SnippetSuggestRequest):
    query = req.goal + (" " + req.hardware if req.hardware else "")
    res = snips.query(query_texts=[query], n_results=req.k, include=["documents","metadatas","distances"])
    results = [{
        "text": res["documents"][0][i],
        "score": 1 - res["distances"][0][i],
        "meta": res["metadatas"][0][i],
    } for i in range(len(res["documents"][0]))]
    return {"results": results}

class CompareVersionsRequest(BaseModel):
    feature: str
    k: int = 6

@app.post("/tools/compare.versions", response_model=SearchResponse)
def compare_versions(req: CompareVersionsRequest):
    # Look for the feature in CHANGELOG pages inside docs collection
    res = docs.query(query_texts=[f"CHANGELOG {req.feature}"], n_results=20, include=["documents","metadatas","distances"])
    hits = []
    for i in range(len(res["documents"][0])):
        m = res["metadatas"][0][i]
        if "github.com/pybricks" in m.get("url",""):
            hits.append({
                "text": res["documents"][0][i],
                "score": 1 - res["distances"][0][i],
                "meta": m
            })
    return {"results": hits[:req.k]}

📚 Documentation

What “Beginner-Friendly” Means

Python-Centric: Explains what to import, which port to use, and what the units are, rather than focusing on firmware internals.
Official API Preference: Prefers official API calls and examples over clever hacks.
Runtime Mismatch Detection: Detects MicroPython vs CPython mismatches and provides appropriate guidance.

Source of Truth to Ingest

The following sources are ingested to build the corpus:

Official API Docs: Both stable and latest versions, covering hubs, sensors, motors, parameters, and robotics helpers.
Pybricks Code Basics: Information about the IDE, including pairing, running, and limits.
Getting Started Posts: Useful for setup FAQs.
Release/Changelog Snippets: To track breaking changes and new features.
Selected Project Tutorials: For worked examples.
Pybricks API Repo/PyPI: As a reference for versions.
Landing Page: For a brief explanation of Pybricks.

Preprocessing & Chunking

Markdown/HTML Docs

Split by H2/H3 headings, keeping fenced code, tables, and notes together.
Overlap adjacent chunks by 80-120 tokens.
Include metadata such as page URL, version, breadcrumbs, and HTML id anchors.
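
The overlap rule above can be sketched as a sliding window. This is a simplified illustration, not the ingest script's implementation: whitespace-separated words stand in for tokenizer tokens, and the function name window_words is hypothetical.

```python
def window_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows

sample = " ".join(f"w{i}" for i in range(10))
for chunk in window_words(sample, size=4, overlap=1):
    print(chunk)
```

Rejecting an overlap that is not smaller than the window size avoids a loop that never advances.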

Python Examples/Snippets

Process per function/class using AST or Tree-sitter, including signature, docstring, and imports.
If long, window with 80-120 token overlap and repeat the signature.
Create a second “interface” vector with only the signature and docstring.
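
A minimal sketch of the per-function processing, using Python's built-in ast module (Tree-sitter would be an alternative). The function name function_interfaces and the record layout are hypothetical. The sample source is only parsed, never executed, so Pybricks does not need to be installed.

```python
import ast

def function_interfaces(source: str) -> list[dict]:
    """Return one record per top-level function or class with its
    signature, docstring, and the module-level imports."""
    tree = ast.parse(source)
    imports = [ast.unparse(n) for n in tree.body
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    records = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature = f"def {node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            signature = f"class {node.name}"
        else:
            continue
        records.append({
            "name": node.name,
            "signature": signature,
            "docstring": ast.get_docstring(node) or "",
            "imports": imports,
        })
    return records

example = '''
from pybricks.robotics import DriveBase

def drive_square(bot, side_mm=200):
    """Drive a square with the given side length in millimetres."""
    for _ in range(4):
        bot.straight(side_mm)
        bot.turn(90)
'''

for rec in function_interfaces(example):
    # The "interface" text is what would be embedded as the second vector.
    print(rec["signature"], "|", rec["docstring"])
```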

Indexing & Retrieval

Hybrid Search: Combines dense embeddings with BM25/regex for symbol hits.
Parent-Child Structure: Embeds small chunks and keeps a “parent” section/file for expansion.
Multi-Vector per Chunk: Includes full text, headings/signature view, and identifiers-only view.
Version Filter: Tags each chunk with version, channel, and source information.
Reranking: Prioritizes exact symbol match, version proximity, and doc section type.
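
One possible shape for the reranking step is sketched below. It assumes hits shaped like the docs.search results (text, score, meta). The bonus weights and sample passages are made up for illustration, and rerank is a hypothetical helper, not part of the server code.

```python
import re

def rerank(hits, symbol=None, preferred_version=None,
           symbol_bonus=0.3, version_bonus=0.1):
    """Re-order dense hits: boost passages containing the exact symbol and
    passages tagged with the preferred version."""
    # Whole-identifier match, so "turn" does not match inside "return".
    pattern = re.compile(rf"(?<!\w){re.escape(symbol)}(?!\w)") if symbol else None
    rescored = []
    for hit in hits:
        score = hit["score"]
        if pattern and pattern.search(hit["text"]):
            score += symbol_bonus
        if preferred_version and hit["meta"].get("version") == preferred_version:
            score += version_bonus
        rescored.append({**hit, "score": round(score, 4)})
    return sorted(rescored, key=lambda h: h["score"], reverse=True)

# Made-up sample hits, for illustration only.
hits = [
    {"text": "Passage about Motor.run_angle", "score": 0.62, "meta": {"version": "latest"}},
    {"text": "Passage mentioning DriveBase.stalled() directly", "score": 0.55, "meta": {"version": "latest"}},
    {"text": "General notes on driving", "score": 0.60, "meta": {"version": "3.3.0"}},
]
for h in rerank(hits, symbol="DriveBase.stalled", preferred_version="latest"):
    print(h["score"], h["text"])
```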

MCP Server Design

The MCP server design covers the following tools. The mcp_server.py example above implements all of them except explain.error:

docs.search(query, filters): Returns passages and metadata, with optional filters for symbol, hub, version, and category.
code.search(symbol): Performs an exact/regex search across examples.
explain.error(message, context): Maps common runtime messages or gotchas to fixes.
snippet.suggest(goal, hardware): Returns a template for a given goal and hardware.
compare.versions(feature): Checks changelogs for mentions of a feature.
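
Since explain.error is not part of the example server, the following sketch shows one way such a tool could map messages to fixes with a small regex table. The patterns and advice texts are illustrative assumptions, not an official list of Pybricks errors, and the context argument from the design is omitted for brevity.

```python
import re

# Hypothetical pattern-to-advice table; entries are illustrative only.
ERROR_HINTS = [
    (re.compile(r"no module named '?([\w.]+)'?", re.IGNORECASE),
     "Module '{0}' is not available. Check the import path, and note that "
     "desktop-only packages do not run on the hub."),
    (re.compile(r"name '?(\w+)'? (?:is not|isn't) defined", re.IGNORECASE),
     "'{0}' is used before it is defined or imported. Check the spelling "
     "and the import lines."),
    (re.compile(r"unexpected indent|unindent does not match", re.IGNORECASE),
     "Indentation is inconsistent. Use the same number of spaces for every "
     "line in a block."),
]

def explain_error(message: str) -> dict:
    """Return the advice for the first matching pattern, if any."""
    for pattern, advice in ERROR_HINTS:
        match = pattern.search(message)
        if match:
            return {"matched": True, "advice": advice.format(*match.groups())}
    return {"matched": False,
            "advice": "No known pattern matched; search the docs with the full message."}

print(explain_error("ImportError: no module named 'numpy'")["advice"])
```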

Guardrails & Beginner Heuristics

Detects and corrects attempts to use desktop-only libraries or features not available on the hub.
Explains units and sign conventions with short examples.
Warns if a method requires specific sensors or hubs.
Displays a version banner when an answer uses a new feature.
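
The first guardrail can be approximated by scanning a program's imports before it is sent to the hub. The deny-list below is an illustrative, incomplete assumption, and find_desktop_imports is a hypothetical helper. The code is parsed with ast, not executed.

```python
import ast

# Assumption: an illustrative, incomplete deny-list of desktop-only packages.
DESKTOP_ONLY = {"numpy", "pandas", "tkinter", "matplotlib", "requests"}

def find_desktop_imports(source: str) -> list[str]:
    """Return the sorted top-level names of imported deny-listed modules."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "matplotlib.pyplot" -> "matplotlib"
            if root in DESKTOP_ONLY:
                found.add(root)
    return sorted(found)

program = (
    "import numpy as np\n"
    "from pybricks.hubs import PrimeHub\n"
    "from matplotlib import pyplot\n"
)
print(find_desktop_imports(program))
```

An assistant could attach a short warning for each returned name and suggest a hub-compatible alternative.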

Answer Patterns

“How do I…?”: Provides an action recipe with imports, code, notes on signs/units, and a link to the docs.
“What does this error mean?”: Maps error messages to common causes and provides a fix and explanation.
“Why isn’t my motor moving?”: Walks through a troubleshooting checklist and provides a run_angle demo and a link to the Motor docs.
“Which port constant do I use?”: Answers with port constants and mentions sensor vs motor ports, with a link to the parameters/constants docs.

Prebuilt Snippet Templates

The snippet.suggest tool provides templates for common tasks, such as DriveBase setup, single-motor movement, reading hub orientation, and BLE communication.
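
A hardware-aware template could be filled in as sketched below. The hub class names exist in pybricks.hubs. The lookup keys, the helper name suggest_drivebase, and the default wheel and axle values (taken from the starter snippet in the ingest script) are assumptions to be adjusted per robot.

```python
# Hub classes from pybricks.hubs; the lookup keys are hypothetical.
HUB_CLASSES = {
    "primehub": "PrimeHub",
    "inventor": "InventorHub",
    "essential": "EssentialHub",
    "technic": "TechnicHub",
}

DRIVEBASE_TEMPLATE = """\
from pybricks.hubs import {hub}
from pybricks.pupdevices import Motor
from pybricks.parameters import Port
from pybricks.robotics import DriveBase

hub = {hub}()
left = Motor(Port.A)
right = Motor(Port.B)
bot = DriveBase(left, right, wheel_diameter={wheel}, axle_track={track})

bot.straight({distance})  # mm
bot.turn({angle})         # degrees
"""

def suggest_drivebase(hardware="primehub", wheel=56, track=114,
                      distance=300, angle=90) -> str:
    """Fill the DriveBase template; unknown hardware falls back to PrimeHub."""
    key = hardware.lower().replace(" ", "")
    hub = HUB_CLASSES.get(key, "PrimeHub")
    return DRIVEBASE_TEMPLATE.format(hub=hub, wheel=wheel, track=track,
                                     distance=distance, angle=angle)

print(suggest_drivebase("Inventor", distance=500))
```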

Evaluation Plan

A test set of 25 real questions is used to evaluate the answers based on correctness, actionable code, citation presence, and beginner clarity. Chunk sizes and rerankers are iterated to improve performance.

Ops: Keeping it Current

Crawl “stable” docs weekly and “latest” docs monthly.
Monitor changelogs for features that affect the beginner experience.
Mark snippets with a version guard if there is a conflict with a change.
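
The crawl cadence above could be checked with a small helper like the one sketched here. Approximating "monthly" as 30 days is an assumption, as are the names RECRAWL_INTERVALS and is_recrawl_due.

```python
from datetime import datetime, timedelta

# Weekly for "stable", monthly (approximated as 30 days) for "latest".
RECRAWL_INTERVALS = {
    "stable": timedelta(days=7),
    "latest": timedelta(days=30),
}

def is_recrawl_due(channel: str, last_crawled: datetime, now: datetime) -> bool:
    """Return True once the channel's interval has elapsed since the last crawl."""
    return now - last_crawled >= RECRAWL_INTERVALS[channel]

now = datetime(2025, 8, 11)
print(is_recrawl_due("stable", datetime(2025, 8, 1), now))
print(is_recrawl_due("latest", datetime(2025, 8, 1), now))
```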

Quick Pitfalls

The assistant should proactively catch issues such as suggesting desktop-only libraries, mixing hub-specific details, getting sign conventions wrong, and misunderstanding the gears parameter.

Day-One Stack Recommendation

Embedder: Any strong general code/text embedding model can be used.
Index: pgvector/FAISS with two collections: “api_docs” and “examples/snippets”.
Rerank: Use BM25 over the top-50 dense hits and add a symbol regex bonus.
Caching: Store resolved answers for faster classroom use.
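
For the rerank step, a compact Okapi BM25 scorer over the dense hits might look like the sketch below. It uses a simple identifier-style tokenizer and the common defaults k1=1.5 and b=0.75; both are assumptions, and a maintained BM25 library may be preferable in practice. A symbol bonus can be added to these scores afterwards.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase identifier-like tokens (letters, digits, underscores)."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

passages = [
    "DriveBase straight turn",
    "Motor run_angle speed",
    "hub imu heading",
]
scores = bm25_scores("drivebase turn", passages)
best = max(range(len(passages)), key=scores.__getitem__)
print(passages[best])
```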

🔧 Technical Details

Crawling and Indexing

The ingest_pybricks.py script crawls the seed URLs (up to MAX_PAGES pages), strips navigation and other boilerplate, converts each page to Markdown, and splits it by headings into windows of about 600 tokens with a 100-token overlap. It then indexes the chunks into ChromaDB together with an identifiers-only view of each chunk.

Server Design

The mcp_server.py script uses FastAPI to expose the tools over HTTP as POST endpoints under /tools/. It queries the ChromaDB collections built by the ingest script and returns matching passages with scores and metadata.