Lawrance Reddy

10 min read

Grounding agents in government policy — lessons from building Pfula

When your knowledge base is messy government documentation translated across languages, grounding is the whole product. A practical look at Foundry Agent Service tools, MCP, and the guardrails that keep a bilingual assistant honest.

  • Azure AI Foundry
  • Foundry Agent Service
  • MCP
  • Grounding
  • Multilingual
  • Social good

RAG is not enough when the documents lie

South African government documentation has a specific failure mode that does not show up in the examples most grounding tutorials start with. The documents do not just have gaps. They actively disagree with each other. A SASSA circular from 2023 will specify one income threshold for the SRD grant; a departmental FAQ published six months later will quietly update that threshold; a printed leaflet at a local office — which is the document the actual grant applicant is holding — will have the older number on it. All three are, in some sense, “official.” None of them is the answer a frontline assistant should give.

Pfula is an isiZulu / English AI assistant for South African government services — SASSA, Home Affairs, UIF, SARS, eThekwini Municipality, CIPC, the Deeds Office. When somebody types “awusebenzi kahle, kwase kuphele izinsuku eziyi-14” (roughly: “it isn’t working properly, and the 14 days have already passed”) into it, the product has to do three things in sequence: figure out which service this complaint is about, pull the authoritative policy for that service, and produce a response (or a formal escalation letter) grounded in that policy and nothing else. The central engineering question is where the “authoritative” in “authoritative policy” lives.

A naive RAG setup is a bad answer to that question. If you shove ten thousand pages of PDFs into a vector store and retrieve the top-k chunks, the ranker will happily surface the 2023 circular alongside the 2025 update alongside the stale leaflet, and the model will produce a fluent, plausible, subtly-wrong synthesis of all three. You cannot audit which chunk produced which sentence, because the prompt is one undifferentiated wall of retrieved text. And you cannot tell the model “prefer the newer one” in any way that actually sticks across turns.

The move Pfula makes — and the move that Foundry Agent Service quietly encourages — is to stop thinking of the knowledge base as a pile of documents and start thinking of it as a set of tools.

Tools-as-knowledge-base

Pfula’s knowledge base is not a vector index. It is seven hand-curated JSON files, one per government service. Each file is a strict tree — grants, eligibility, application steps, timelines, escalation bodies, legislation — edited by somebody who understands the policy rather than by a scraper. The total surface area is about four thousand lines of structured content, not ten thousand pages of unstructured text.
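For concreteness, here is roughly what a fragment of one of those files looks like. The keys below are illustrative rather than Pfula's actual schema, but they follow the tree described above: grants, eligibility, steps, timelines, escalation, legislation.

{
  "service": "SASSA",
  "grants": {
    "srd": {
      "eligibility": ["..."],
      "application_steps": ["..."],
      "timelines": { "decision": "..." }
    }
  },
  "escalation": {
    "next_step_if_rejected": "...",
    "default_body": {
      "name": "Public Protector South Africa",
      "contact": "..."
    }
  },
  "legislation": ["Section 6 of PAJA", "..."]
}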

Each file is exposed to the escalation agent as a single function tool whose whole job is to return that file as a JSON string:

import json


def lookup_sassa() -> str:
    """Return the SASSA knowledge section —
    grants, appeals, escalation bodies."""
    return json.dumps(_load_kb("sassa"), ensure_ascii=False)


def lookup_home_affairs() -> str:
    """Return the Home Affairs knowledge section —
    IDs, passports, civil records."""
    return json.dumps(_load_kb("home_affairs"), ensure_ascii=False)

# ... and so on for UIF, SARS, municipal, deeds_office, CIPC
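The post doesn't show _load_kb or _SERVICE_TOOLS. A minimal sketch of the missing pieces, continuing the module above and assuming the seven JSON files sit in a kb/ directory next to the code (the directory name is my assumption, not Pfula's actual layout):

from pathlib import Path

_KB_DIR = Path(__file__).parent / "kb"  # assumed location of the seven files

def _load_kb(slug: str) -> dict:
    # Read the file fresh on every call, so a policy edit is visible
    # to the next agent run with no re-indexing or re-embedding step.
    with open(_KB_DIR / f"{slug}.json", encoding="utf-8") as f:
        return json.load(f)

# The set of callables handed to FunctionTool below.
_SERVICE_TOOLS = {
    lookup_sassa,
    lookup_home_affairs,
    # ... plus lookup_uif, lookup_sars, lookup_municipal,
    # lookup_deeds_office and lookup_cipc
}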

Those functions are registered as Agent Service function tools:

import os

# `project` is the Foundry project client (AIProjectClient in the
# azure-ai-projects SDK); _SERVICE_TOOLS is the set of lookup functions above.
toolset = ToolSet()
toolset.add(FunctionTool(functions=_SERVICE_TOOLS))

project.agents.create_agent(
    model=os.getenv("FOUNDRY_AGENT_MODEL", "gpt-4o"),
    name="pfula-escalation-agent",
    instructions=_AGENT_INSTRUCTIONS,
    toolset=toolset,
)
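For completeness, here is roughly how a single turn gets driven against that agent with the same SDK. Method and keyword names have shifted across the azure-ai-projects preview releases, so treat this as a sketch of the shape rather than exact code.

# Assumes the create_agent(...) call above was assigned to `agent`.
thread = project.agents.create_thread()

project.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="awusebenzi kahle, kwase kuphele izinsuku eziyi-14",
)

# create_and_process_run polls the run to completion and resolves the
# model's tool calls against the registered lookup_* functions.
run = project.agents.create_and_process_run(
    thread_id=thread.id,
    agent_id=agent.id,  # keyword name differs in some preview versions
)

# The run's step history records which lookup tool was called, which is
# the audit trail discussed below.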

Two things change when you do this.

The first is that the model’s decision about which policy to use becomes a first-class, legible step in the run. The agent run log shows “called lookup_sassa” — not “retrieved three chunks with cosine similarity > 0.82.” If the produced letter cites the wrong legislation, you can see exactly which knowledge section the model pulled, because the model had to pick one and only one, by name, in the course of its run. That is the difference between “this agent made a mistake” and “this agent made a mistake and we can audit why.”

The second is that the knowledge base becomes composable by the policy owner, not the ML engineer. A social worker who understands SASSA appeals writes a new entry into sassa.json under escalation.next_step_if_rejected. The agent picks that up on the next run because the tool reads the file fresh. There is no re-indexing, no embedding cost, no “does this new chunk retrieve well against the old query distribution” anxiety. The unit of update matches the unit of authorship.

Why one tool per service, not one tool with a service enum

The first version of Pfula’s agent layer had a single tool, lookup_service(service_slug: str), with a service_slug parameter restricted to an enum of the seven services. It was simpler to write, uglier to live with, and I regretted it within a week.

The reason to prefer one tool per service is that the model’s tool descriptions are the routing signal. A well-named tool with a descriptive docstring is a much more reliable way for the model to decide what the user is asking about than a parameter on a general-purpose tool. When the tools are distinct, every function’s docstring reads to the model like a domain label: “use me if this is about unemployment benefits,” “use me if this is about municipal rates.” When the tool is general, you collapse all seven labels into one parameter description, and the model’s selection accuracy sags.

The second reason is operational. Adding an eighth service — say, the Department of Employment and Labour’s compensation fund — is a one-file change under the one-tool-per-service shape: add lookup_compensation_fund(), include it in _SERVICE_TOOLS, redeploy the function app, and the agent picks it up on the next create. Under the single-tool-with-enum shape, adding a service is an enum change in the tool signature, which changes the tool’s hash, which makes the model invalidate its cached understanding of that tool, which occasionally produces a week of “why is it ignoring the new service” bugs while things settle. Not a theoretical concern.
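In code, that one-file change is just another function in the same shape; the docstring wording here is illustrative:

def lookup_compensation_fund() -> str:
    """Return the Compensation Fund knowledge section:
    workplace injury and occupational disease claims."""
    return json.dumps(_load_kb("compensation_fund"), ensure_ascii=False)

# Add it to the registered set and redeploy; the agent sees the new tool
# on the next create_agent call.
_SERVICE_TOOLS.add(lookup_compensation_fund)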

The third reason is that the agent’s instructions can talk about these tools by name:

“Before drafting, decide which service the complaint belongs to and call the matching lookup tool (for example, lookup_sassa for a grant complaint).”

That sentence, in plain prose inside the agent instructions, is doing serious work. It gives the model a concrete example of the routing pattern using the actual tool name it will see in its tool list. That kind of instruction does not exist — is not even coherent — when the tool is a single generic lookup.

Where translation lives in the pipeline

isiZulu user input, English knowledge base. The natural-but-wrong answer is to translate the knowledge base once, offline, and serve Zulu content to Zulu users. That answer is wrong for two reasons.

First, SASSA does not publish in isiZulu. The authoritative policy is in English. A Zulu translation of SASSA policy is a translation, not authority; if you serve that translation as authoritative, you have laundered a translation choice into a grounding claim. If a Zulu-speaking user’s case ever goes to legal review, the reviewing attorney will work from the English source, not your translation. Your Zulu grounding has to reduce to English grounding or you are opening a legal hole.

Second, the vocabulary drift is real. “Ungeziwe” (unemployed) is used across isiZulu dialects but the formal SASSA phrasing is “ongasebenzi” in some regions. If you translate once at the knowledge base layer, you pick one dialect choice at build time and it is wrong for the other. Translation at turn boundaries picks the dialect the user is already using, because the model has their input as grounding for its own output.

So translation in Pfula lives at the turn boundary. The user’s isiZulu message reaches the model as-is. The model decides in English whether this is a SASSA issue, calls lookup_sassa, reads the English policy, and produces a response. If the original input was Zulu, the response is rendered in Zulu at generation time; if it was English, English out. The policy section the model grounded against is the English one, every time. That is auditable in a way that a pre-translated knowledge base is not.

The honest tension here is latency. Passing Zulu input to a model that has to reason in English and respond in Zulu is slightly slower than serving Zulu output from a Zulu corpus. In practice the difference is in the low hundreds of milliseconds, and that is a trivial price for “you can always defend the authority of what you said.”

The “I don’t know” policy

The hardest piece of prompt engineering in the whole product is not “be helpful.” It is “say ‘I don’t know’ cleanly when the knowledge section does not contain the answer, and route to the Public Protector by default.”

If you do not actively design for this, you get a confident, plausible, entirely-hallucinated escalation address. The model has seen enough generic South African civic infrastructure in its pretraining to guess at a Public Protector office and a complaint hotline, and it will do so with the same fluent tone it uses for the correct answer. A pilot where that happens twice in the first week dies. Real users call those numbers. Real users write to those addresses.

Pfula’s agent instructions pin this down explicitly:

“If the knowledge base does not contain an escalation body for the service in question, say so in the letter and route it to the Public Protector as a default (contact details in lookup_sassa under the escalation stanza). Never invent legislation or escalation addresses.”

Two things to note. The model is told where to find the default contact — the escalation stanza of the SASSA KB, which we canonically populate with the Public Protector’s authoritative contacts. So even the “default” route is grounded in a looked-up tool result, not model memory. And the explicit “never invent legislation or escalation addresses” clause gives the model permission to produce a letter that admits a gap, which is better than a letter that invents one.
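Pulled together, the relevant slice of the agent instructions reads roughly like this. The wording condenses the clauses quoted in this post; it is not Pfula's actual prompt verbatim:

_AGENT_INSTRUCTIONS = """
You draft responses and formal escalation letters for South African
government service complaints.

Before drafting, decide which service the complaint belongs to and call the
matching lookup tool (for example, lookup_sassa for a grant complaint).
Ground every claim in the knowledge section that tool returns, and nothing
else.

Respond in the language of the user's message (isiZulu or English), but
always ground against the English knowledge section.

If the knowledge base does not contain an escalation body for the service in
question, say so in the letter and route it to the Public Protector as a
default (contact details in lookup_sassa under the escalation stanza).
Never invent legislation or escalation addresses.

Return a single JSON object with recipient, subject, body and
legislation_cited fields.
"""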

This is also where the strict JSON contract earns its keep. The agent returns:

{
  "recipient": "...",
  "subject": "FORMAL COMPLAINT: ...",
  "body": "...",
  "legislation_cited": ["Section 6 of PAJA"]
}

If legislation_cited is empty, the downstream code can surface that to the human reviewer as “the model could not find applicable legislation — please check the KB.” That is a recoverable product state. “The model confidently cited the wrong act” is not.
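Downstream, that check is a few lines. The function name and the review flag here are placeholders for whatever Pfula actually does with the parsed payload:

import json

def triage_letter(raw: str) -> dict:
    """Parse the agent's JSON reply and flag gaps for a human reviewer."""
    letter = json.loads(raw)  # raises if the model broke the contract

    if not letter.get("legislation_cited"):
        # Recoverable state: the model admitted the gap, so a person
        # checks the KB instead of a user receiving a wrong citation.
        letter["needs_review"] = "no applicable legislation found in KB"

    return letter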

MCP as a seam for updating the knowledge base

The KBs currently live as JSON files in the repo. That works, but it has a ceiling: every policy update requires a git commit, a review, and a redeploy. Paralegals and social workers — the people who actually understand whether the SRD threshold has changed — are not in that loop.

The natural next step, which Foundry supports through Model Context Protocol (MCP) endpoints, is to expose the knowledge base as an MCP server. A small CMS — Airtable, a Dataverse surface, a purpose-built admin tool — becomes the source of truth. A thin MCP wrapper exposes each service’s KB as a named tool resource. Foundry Agent Service consumes the MCP endpoint, and the agent’s tool list is now sourced from wherever the policy team actually authors. A SASSA threshold change is an edit in the CMS; the next agent run sees it; no pipeline fires.
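A minimal sketch of that wrapper, assuming the official Python mcp package and its FastMCP helper; fetch_from_cms is a hypothetical stand-in for whatever the policy team actually edits:

import json
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("pfula-knowledge")

def fetch_from_cms(slug: str) -> dict:
    # Placeholder: read the per-service KB from the CMS the policy team
    # edits (Airtable, Dataverse, an admin tool), fresh on every call.
    raise NotImplementedError

@mcp.tool()
def lookup_sassa() -> str:
    """Return the SASSA knowledge section: grants, appeals, escalation bodies."""
    return json.dumps(fetch_from_cms("sassa"), ensure_ascii=False)

# ... one tool per service, exactly as with the local function tools

if __name__ == "__main__":
    mcp.run(transport="sse")  # serve over HTTP so the agent's MCP config can reach it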

MCP is the right seam here because it speaks to both sides of the problem. For the agent, MCP tools are indistinguishable from local function tools — same schema, same invocation path, same audit trail in the run history. For the policy team, it is a plain HTTP-shaped integration they can hand to a junior developer or a low-code platform. Nothing about the agent’s grounding story changes; everything about who can update it does.

What this pattern generalises to

The tools-as-knowledge-base pattern is a specific answer to a specific problem: the knowledge is small enough to curate by hand, stable enough that “read the whole service’s file” is a sensible tool contract, and political enough that authorship matters as much as retrieval. That description fits most frontline government-services assistants, but it also fits clinical protocols in a specific specialty, compliance runbooks for a single firm’s jurisdictions, and first-responder scripts for a specific region’s procedures.

It is a bad answer for genuinely large, fast-changing corpora — legal case law, academic literature, commodity-price history. Those still want vector retrieval. What the Pfula experience suggests, though, is that a surprising number of “knowledge base” problems in social-services and government-services AI are not actually retrieval problems. They are curation problems that have been misdiagnosed as retrieval problems because vector stores were the cheapest tool to reach for. Foundry Agent Service gives you a second option, and for this class of problem it is the better one.