Recently, I wrote a post about an educational app I’d developed using AI tools, and the design decisions I made along the way.

When I showed the prototype of my activity-based learning app to a few educators, one suggestion came up repeatedly that was drawn from their own experience hunting for creative ideas on platforms like Pinterest and TikTok. They wanted a feature that could pull project ideas from across the internet based on practical search criteria: the materials they have access to, and what they’d like the end product to look like.

The app already has a basic search that returns results from its own activity data, but that data is still limited at this stage. Generating results from outside the app felt like something LLMs are well suited to handle.

I was also curious to learn how you actually teach a small LLM — not the kind that needs enormous datasets and compute (which I don’t have access to), but the mechanics of it, for learning’s sake. And, like in my previous post, I wanted to think through the design choices that go into it:

  • What are the technicalities behind teaching a small LLM to handle a K12 use case?
  • How, and on what data, do you train such a model?
  • How do you ensure the model is child friendly?
  • What does it take to integrate the model into your app?

In this post, I’ll document everything I learned about training such a model and integrating it as a feature in my educational prototype.

Table of Contents

Prerequisites

This is a hands-on tutorial, so here’s what will help you follow along or train the model yourself.

Skills you’ll want

  • Using Claude on the command line.
  • Basic Python: reading code, installing and using packages, calling APIs, and making sense of output like log files.
  • Reading a bit of TypeScript, since that’s what the app’s frontend is built in.
  • Most importantly, being comfortable following Claude’s reasoning, weighing the options it lays out, and deciding what to do next. That back-and-forth — not any single command — is really the core skill this kind of project asks for.

You don’t need a background in machine learning. The post tries to explain the ML concepts as it goes, in plain language.

Setup you’ll need

  • An Apple Silicon Mac (M1/M2/M3 or newer). The fine-tuning step uses MLX, Apple’s framework, which only runs on Apple Silicon.
  • Python 3 with a virtual environment (python3 -m venv).
  • Ollama installed, with the Qwen 2.5 7B model pulled (ollama pull qwen2.5:7b), for generating the training data locally. You’ll want enough RAM to run a 7B model.
  • Claude on the command line, for working through the build.

Dataset Preparation

For this experiment, I wanted the activity data to be grounded in local cultures from around the world. This would help the model suggest creative project ideas that inspire the facilitation of cultural activities in educational settings.

I’d come across a lot of Wikipedia articles on local arts and traditions over the years. Wikipedia is my favorite resource for information: it’s human-first, its content is updated frequently, and as an open source project its APIs are free to use. So I decided to use Wikipedia data to teach my model.

The genuinely hands-on part of this stage was seeding the right categories. In a Python script, I defined ~40 seed categories and grouped them under 9 STEAM labels with suggestions from Claude on which categories to scrape and how to avoid noise in the fetched data.

For extracting text from the sections of each article, Claude suggested a Python wrapper for the Wikipedia API. This let me fetch each article as a section-structured record. To keep noise down, I limited the crawl to one sub-category level deep and only kept articles above a certain content size.

# Seed categories grouped by STEAM domain.
SEED_CATEGORIES = {
    "Crafts & making": [
        "Category:Crafts",
        "Category:Origami",
        "Category:Pottery",
        "Category:Kites",
    ],
    "Arts": [
        "Category:Folk art",
        "Category:Textile arts",
        "Category:Indigenous art",
        "Category:Masks",
    ],
    "Science": [
        "Category:Ethnobotany",
        "Category:Food preservation",
        "Category:Gardening",
    ],
# ... Media arts, Engineering, Mathematics, Music & sky, Play & learning
}

MAX_DEPTH = 1             # descend only one sub-category level
MIN_CONTENT_CHARS = 800   # skip stubs (summary + sections)

Filtering the Corpus

The previous step wrote ~19,000 articles during scraping. This step makes sure the content stays relevant to STEAM topics. Relevance filtering itself runs in two stages: removing obvious noise, then semantic filtering.

The first stage drops obvious non-activity content like music, films, TV, biographies, plant/animal species using category, title, and section-heading patterns.

The second, semantic stage converts each article’s title and summary into a vector using a small sentence-transformer model (all-MiniLM-L6-v2). It then compares it against two sets of example sentences: positive and negative anchors.

The positive anchors describe sentences relevant to STEAM activities and the negative anchors describe less relevant ones. Each article gets a score based on how close it sits to the positive examples versus the negative ones, and we keep every article that leans positive. We do this with the sentence-transformers library.

Writing these anchor sentences is the most human step in the process. With this filtering, I brought the corpus down to ~6,600 articles.

# Filtering the raw scrape to articles useful for STEAM activity suggestions.

POSITIVE_ANCHORS = [
    "a hands-on craft that children can make using simple materials and a technique",
    "a traditional cultural art or making technique such as weaving, carving, pottery or paper folding",
]
NEGATIVE_ANCHORS = [
    "a species of plant, animal or fungus",
    "a biography of a person",
    "a city, region, building or geographic place",
]

    # Embed article + anchors, then keep whatever leans positive.
    pos_sim = util.cos_sim(emb, pos).max(dim=1).values # closest positive anchor
    neg_sim = util.cos_sim(emb, neg).max(dim=1).values # closest negative anchor
    scores = (pos_sim - neg_sim).tolist()

Generating Training Pairs

The next step is to generate input → output training pairs from the filtered corpus. We do this by distilling it through a pretrained, local open-source model (Qwen 2.5 7B, running via Ollama).

For each article, you send the model the title, summary, cultural context, and a few content sections. You also send it a system prompt that explains the task, specifies the output format (valid JSON, in this case), and includes one example training pair to anchor the format.

Constructing this prompt well is where human intervention matters most: the schema, the rules, and that single worked example are what determine the quality of every pair the model generates.

After generation, we cleaned and prepared the pairs for fine-tuning. The local model tended to invent its own category labels (“Ceramics,” “Crafts & Making,” “Circuits (metaphorical)”…). So this step maps every category onto the app’s fixed set of 10 canonical categories (Art, Science, Coding, Circuits, Engineering, Storytelling, Drama, Film, Music, Nature), clamps each activity’s age range into the K12 band, converts the pairs into chat format, and finally splits the data into three sets: train, validate, and test.

// The schema every generated training pair must match (valid JSON only).
{
  "input": {
    "materials": ["3-6 realistic classroom materials"],
    "age_range": [min_int, max_int],
    "theme": "optional string or null"
  },
  "output": {
    "ideas": [{
      "title": "catchy, max 60 chars",
      "description": "2-3 sentences",
      "category": "one of: Art, Science, Coding, Circuits, Engineering, ...",
      "cultural_origin": "specific region or culture",
      "materials_used": ["subset of input materials"],
      "materials_missing": ["anything else needed"],
      "estimated_minutes": integer,
      "steps": ["3-6 short steps, one sentence each"],
      "learning_objectives": ["2-4 objectives"],
      "safety_note": "string or null"
    }]
  }
}

Fine-Tuning

This is the step where the model learns how to behave and generate a desired response in the appropriate format. It involves fine-tuning a pretrained model (Qwen2.5-1.5B-Instruct-4bit) via MLX on the dataset using the LoRA technique.

Fine-tuning with LoRA is a cheap and lightweight approach: it doesn’t retrain the whole model, but instead adds a tiny correction layer that adjusts the final behavior while the original model stays frozen.

Given the constraints of this project — working on a personal laptop with a small dataset of ~400 pairs — full fine-tuning would have needed significantly more memory and compute, which would be overkill here. So LoRA was the right choice.

The LoRA Fine-Tuning Cycle

LoRA (Low-Rank Adaptation) works by inserting small trainable matrices into specific layers of the frozen base model. During training, only these matrices are updated, which dramatically reduces the number of parameters that need to be learned. The result is a lightweight adapter that can be swapped in or out without modifying the underlying model weights.

The fine-tuning loop runs for a set number of iterations, feeding batches of training pairs to the model and adjusting the LoRA adapter weights based on how far the model’s output drifts from the expected output. Validation loss is tracked throughout to catch overfitting early.

Once training is complete, the adapter weights are merged back into the base model and exported as a standalone model file ready for inference.

Evaluating the Fine-Tuned Model

Evaluation compares the fine-tuned model against the base model across a held-out test set. The key metrics are format compliance (does the output parse as valid JSON matching the schema?), field-level accuracy (are the category labels, age ranges, and material lists correct?), and output quality (are the project ideas coherent and culturally grounded?).

The fine-tuned model consistently outperforms the base model on format compliance, since the base model has no knowledge of the expected schema. Quality scores improve as well, though with a dataset of ~400 pairs the gains are modest — enough to validate the approach, but not a production-ready result.

Building the Index & RAG Retrieval

Rather than relying solely on the fine-tuned model to recall facts from training, the integration uses Retrieval-Augmented Generation (RAG). At query time, the app retrieves the most relevant Wikipedia article chunks from an index and passes them as context to the model alongside the user’s search criteria.

Building the index means embedding all ~6,600 filtered articles using the same sentence-transformer model used during filtering, then storing those vectors in a FAISS index for fast approximate nearest-neighbor search. The index is built once and saved to disk.

At retrieval time, the user’s input (materials, age range, optional theme) is embedded and compared against the index. The top-k most similar article chunks are returned and injected into the model’s prompt as reference material, giving it grounded, specific cultural content to draw from when generating ideas.

Integrate the Model with the Feature

On the frontend, the feature is a search form that collects the educator’s available materials, the intended age range, and an optional theme. On submit, it calls a backend endpoint that runs the full RAG pipeline: embed the query, retrieve relevant chunks, assemble the prompt, run inference, parse the JSON response, and return the structured list of project ideas to the UI.

The TypeScript frontend renders each idea as a card showing the title, description, category badge, cultural origin, estimated time, step-by-step instructions, learning objectives, and any safety note. Materials the educator already has are highlighted; missing materials are listed separately.

The endpoint is designed to fall back gracefully if the model returns malformed JSON: it logs the raw output, returns a user-friendly error, and never crashes the app.

Making Content Safe

Child safety filtering runs as a post-processing step after the model returns its response and before anything reaches the frontend. Each text field in the generated output is checked against a blocklist of terms and a set of heuristic rules. Any idea that triggers a flag is dropped from the response entirely rather than sanitized, since partial sanitization can leave unsafe context intact.

The system prompt also carries explicit instructions telling the model to avoid violence, adult content, dangerous materials, and politically sensitive framing. Prompt-level guidance reduces the frequency of flagged outputs at the source; the post-processing filter catches anything that slips through.

For a production deployment, this two-layer approach — prompt instructions plus output filtering — should be supplemented with a dedicated content moderation API and human review of edge cases.

Conclusion

Training a small LLM for a specific K12 use case is tractable on a personal laptop with modest data and compute, provided you make the right trade-offs: a curated but manageable corpus, a lightweight fine-tuning method like LoRA, and a RAG layer to compensate for the limits of a small training set. The hardest parts are not the model training itself but the upstream decisions — which data to include, how to filter it, how to design the schema, and how to write the prompts that shape every generated example. Getting those right is fundamentally a human judgment call, and it’s where most of the meaningful work happens.

Resources