Master's Thesis · MSc Artificial Intelligence · 2025

Multi-Ontology Augmentation for LLM-Based Cooking Instruction

Does giving an LLM structured knowledge actually make it a better teacher? A multi-agent study of ontology-grounded models, and why the answer turns out to depend heavily on which model you use.

Degree: MSc Artificial Intelligence

Institution: Vrije Universiteit Amsterdam

First Supervisor: Jiahuan Pei

Second Reader: Ilias Gerostathopoulos

3 LLM providers compared

270 simulated teaching conversations

8 pedagogical dimensions evaluated

13.9% performance drop for Claude 3.5 Sonnet

Smart, but ungrounded and unsafe

Large language models are increasingly used in educational settings, but they operate as black boxes: they hallucinate plausible-but-wrong information, lack grounding in specialised domain knowledge, and offer little of the transparency that verifiable, reliable instruction requires.

Cooking instruction is a sharp test case for this. Most prior work focuses on single-agent recipe generation, but real teaching is interactive and safety-critical: an ingredient substitution suggested without regard to allergens or food safety isn't just unhelpful, it's dangerous. I wanted to know whether grounding an LLM in structured knowledge sources could improve both its reliability and its effectiveness as a teacher.

A multi-agent, ontology-grounded teaching system

I designed a multi-agent conversational AI system in which two LLMs interact as a chef and a trainee, generating structured teaching dialogues. On top of this, I built an ontology-driven ingredient-substitution mechanism that combines the FoodOn ontology with USDA nutritional databases, grounding the model's suggestions in verified, structured data rather than free-form generation.
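To make the substitution idea concrete, here is a minimal Python sketch of one way such a mechanism could work. It is not the thesis implementation: the `Ingredient` class, the `suggest_substitutes` function, the category labels, and all nutrient values are hypothetical stand-ins. A real system would take classes from FoodOn and nutrient values from USDA data rather than from a hand-written list. The sketch keeps only candidates that share an ontology category with the target, removes any that carry an allergen the learner must avoid, and ranks the rest by nutritional closeness.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Ingredient:
    name: str
    category: str          # hypothetical stand-in for an ontology parent class
    allergens: frozenset   # allergen labels attached to the ingredient
    nutrients: tuple       # (protein, fat, carbs) per 100 g; made-up toy values


def nutrient_distance(a: Ingredient, b: Ingredient) -> float:
    """Euclidean distance between two toy nutrient profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a.nutrients, b.nutrients)))


def suggest_substitutes(target, catalog, avoid_allergens=frozenset(), top_k=3):
    """Return up to top_k same-category, allergen-safe substitutes, closest first."""
    candidates = [
        item for item in catalog
        if item.name != target.name
        and item.category == target.category
        and not (item.allergens & avoid_allergens)
    ]
    candidates.sort(key=lambda item: nutrient_distance(target, item))
    return [item.name for item in candidates[:top_k]]


# Toy catalog with illustrative values only (not real FoodOn or USDA data).
butter = Ingredient("butter", "cooking_fat", frozenset({"milk"}), (1.0, 81.0, 0.0))
catalog = [
    butter,
    Ingredient("margarine", "cooking_fat", frozenset(), (0.0, 80.0, 1.0)),
    Ingredient("ghee", "cooking_fat", frozenset({"milk"}), (0.0, 99.0, 0.0)),
    Ingredient("olive_oil", "cooking_fat", frozenset(), (0.0, 100.0, 0.0)),
    Ingredient("peanut_butter", "nut_spread", frozenset({"peanut"}), (25.0, 50.0, 20.0)),
]

print(suggest_substitutes(butter, catalog, avoid_allergens={"milk"}))
```

Applying the allergen filter before ranking means an unsafe option can never surface just because it is nutritionally close, which is the safety property the section argues free-form generation lacks.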

To measure the effect rigorously, I built an evaluation pipeline that combines deterministic benchmarking with LLM-as-judge scoring. It runs a controlled comparison of baseline (LLM-only) and ontology-augmented approaches across three providers (GPT-4 Mini, Claude 3.5 Sonnet, and Grok 3 Mini) over 270 simulated teaching conversations. The conversations vary in conversation type and user experience level, and each is scored on eight pedagogical dimensions. The framework also tracks tool usage and produces structured analytics for comparing knowledge-augmented LLM systems.
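The baseline-versus-augmented comparison can be illustrated with a short Python sketch. It assumes, hypothetically, that each judged conversation yields records with `provider`, `condition`, `dimension`, and `score` fields, and that the comparison is expressed as the relative change in mean score. The field names, the condition labels, and the toy scores are illustrative and do not come from the thesis.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(records):
    """Average judge scores per (provider, condition, dimension)."""
    grouped = defaultdict(list)
    for r in records:
        key = (r["provider"], r["condition"], r["dimension"])
        grouped[key].append(r["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


def relative_change(records):
    """Percent change of the ontology-augmented mean relative to the baseline mean."""
    means = mean_scores(records)
    changes = {}
    for (provider, condition, dimension), base in means.items():
        if condition != "baseline":
            continue
        augmented = means.get((provider, "ontology", dimension))
        if augmented is None or base == 0:
            continue  # skip incomplete pairs and avoid division by zero
        changes[(provider, dimension)] = 100.0 * (augmented - base) / base
    return changes


# Toy records with placeholder names and made-up scores on a 0-5 scale.
records = [
    {"provider": "model_a", "condition": "baseline", "dimension": "safety", "score": 4.0},
    {"provider": "model_a", "condition": "baseline", "dimension": "safety", "score": 4.0},
    {"provider": "model_a", "condition": "ontology", "dimension": "safety", "score": 3.0},
    {"provider": "model_a", "condition": "ontology", "dimension": "safety", "score": 4.0},
]

changes = relative_change(records)
print(changes)
```

Keeping the aggregation step deterministic means any run-to-run variation comes only from the conversations and the judge, not from the analysis itself.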

Knowledge integration helps, but not universally

The headline result is that ontology grounding produced provider-specific outcomes rather than a universal improvement. The same structured knowledge left two models roughly unchanged and made a third markedly worse:

  • GPT-4 Mini showed minimal change with ontology integration (−0.2% to +3.1% across pedagogical metrics), and Grok 3 Mini stayed similarly stable (−3.2% to +0.7%).
  • Claude 3.5 Sonnet, by contrast, experienced a 13.9% overall performance drop, concentrated in ingredient-substitution accuracy (−17.7%) and safety management (−6.6%).
  • Safety Risk Management was the single weakest dimension across every provider (scoring just 2.25 to 2.56 on a 0–5 scale), a consistent blind spot regardless of model.
  • Each provider showed a distinct pedagogical 'personality': GPT-4 Mini adapts its readability to the learner, Claude 3.5 Sonnet uses sophisticated vocabulary suited to advanced users, and Grok 3 Mini maintains consistent simplicity across experience levels.

Why this matters for deploying educational AI

These findings challenge the common assumption that adding structured knowledge reliably makes an LLM more capable. Instead, the effect of ontology grounding depends on the underlying model, so educational-AI systems need provider-specific integration strategies rather than a one-size-fits-all knowledge layer.

The work also surfaces safety as the consistent weak point of these systems, and contributes reusable tooling (automated evaluation and structured analysis for comparative assessment of knowledge-augmented LLMs) for the broader study of when, and for whom, grounding actually helps.