Lawrance Reddy

12 min read · Flagship

Pfula — an isiZulu AI assistant for South African government services on Azure AI Foundry

A bilingual (isiZulu / English) government-services assistant built on Azure OpenAI and Foundry Agent Service — where the knowledge base is the tool-set and citizen-facing streaming is first-class.

  • Azure AI Foundry
  • Foundry Agent Service
  • Azure OpenAI
  • isiZulu
  • Social good
  • Government services

Pfula is a bilingual (isiZulu / English) assistant that walks South Africans through SASSA, Home Affairs, SARS, UIF, municipal services, the Deeds Office, and CIPC processes — in their own language, on a phone, with escalation letters generated on request. The name is Xitsonga for “to open”, because government services should be open to everyone.

Who Pfula is for

Every South African has a government-services story. You wait four hours at Home Affairs for an ID that never arrives. Your SASSA grant is rejected with no reason given. Your eThekwini municipal bill triples overnight. You phone the call centre, and three days later a different number phones you back. The people most likely to need government services — the unemployed, the elderly, the first-time taxpayer, the new small-business owner — are the least likely to have the time, the airtime, or the literacy in bureaucracy-English to navigate it unaided.

Pfula’s target user is that person. The shape is deliberately familiar: a WhatsApp-style chat, answered in isiZulu when the user writes in isiZulu and in English when they don’t. The assistant knows the real office addresses, the real form numbers, the 0800 numbers that actually pick up, the appeal deadlines that actually apply, and the specific South African legislation that a well-written complaint letter should cite. When the conversation escalates from “help me understand this” to “help me do something about it,” Pfula can generate a formal complaint letter — correctly addressed, with the right Act cited, with a 14-business-day deadline, ready to print or email.

Pfula was publicly demoed at the Data & AI Community Day Durban: AI Unplugged event on 14 March 2026. This post is about the architecture underneath the demo — how Pfula is built on Microsoft Azure AI Foundry, what that shape buys, and where it is going next.

Why Foundry is the right shape

Four things about Pfula’s problem shape make Azure AI Foundry the right fit rather than an alternative inference stack.

First, data residency. Pfula handles real South African citizen queries — even without persisting sensitive data, the inference path should run in a region the South African government recognises as local. Azure OpenAI through Foundry runs in South Africa North. That is a narrative consideration more than a strict legal one, but narrative matters when the target user is government-adjacent and the funders most likely to back the work are state or state-aligned.

Second, identity posture. Managed identity is Azure’s killer feature for public-sector workloads: long-lived API keys never have to appear in the running surface, and the hosting environment’s system-assigned identity authenticates to the model endpoint directly. For a project that has to pass due-diligence conversations about how services authenticate to each other, “the application carries no long-lived credentials in production” is a materially shorter answer than any alternative.

Third, the knowledge base fits Foundry’s mental model. Pfula’s knowledge base is seven JSON files, one per government service. Foundry Agent Service has a first-class notion of function tools: the agent decides which function to call, the function returns the data, and the agent reasons over the result. That shape — “which of seven service playbooks is this complaint about?” — is exactly what agent-service function tools are for.

Fourth, the Microsoft MVP angle. Building on Microsoft AI rather than around it is directly aligned with the AI Services (Foundry) contribution area and with the Microsoft-aligned funder pipeline that a social-good pilot like Pfula lives inside.

How the conversational path is built

The conversational path — /api/chat for the request/response UI and /ws/chat/{conv_id} for the WebSocket-streaming UI — is the hot path. It needs token-level streaming and low latency. It does not need tool calls; the system prompt carries the persona and the detected service’s knowledge section, and the model emits plain text.

The Azure OpenAI client is constructed once and shared across requests:

import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AsyncAzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)
client = AsyncAzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

The non-obvious piece here is DefaultAzureCredential. It is a credential chain — in production it resolves to the App Service or Container App’s system-assigned identity; on a developer laptop it resolves to az login; inside a CI pipeline it resolves to workload identity federation. One credential construction, three environments, no if-branches in application code. That is the shape that makes managed identity practical.
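
For illustration, here is a simplified version of that chain written out explicitly — a sketch only, since DefaultAzureCredential's real resolution order includes several more credential types (environment variables, workload identity federation, and so on):

from azure.identity import (
    AzureCliCredential,
    ChainedTokenCredential,
    ManagedIdentityCredential,
)

# Roughly what DefaultAzureCredential does in the two most common cases:
# try the hosting environment's managed identity first, then fall back
# to the developer's az login session.
credential = ChainedTokenCredential(
    ManagedIdentityCredential(),  # production: system-assigned identity
    AzureCliCredential(),         # developer laptop: az login
)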

Request assembly folds the system prompt into the messages array as the first message with role="system", followed by the rolling conversation history:

openai_messages = [
    {"role": "system", "content": request_system_prompt},
    *messages,
]
response = await client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
    max_tokens=1024,
    messages=openai_messages,
)
assistant_message = response.choices[0].message.content or ""

Streaming is where the citizen-facing latency story lives. On a 3G connection with a 400 ms round-trip, a user who sees the first words of a response within 600 ms experiences a conversation; a user who waits three seconds for a complete response experiences a form submission. The streaming handler unpacks server-sent-event chunks and forwards each delta to the WebSocket:

full_response = ""
stream = await client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
    max_tokens=1024,
    messages=openai_messages,
    stream=True,
)
async for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    full_response += delta
    await websocket.send_json({"type": "stream", "content": delta})

Two small shapes are worth calling out. The if not chunk.choices guard catches chunks that legitimately arrive with an empty choices array — on Azure, the first chunk often carries only content-filter metadata. The if not delta guard catches the role-assignment chunk, whose delta sets the role but carries no content. Both are normal parts of the streaming protocol; miss either guard and the stream crashes partway through, seemingly at random.
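
The post does not show the endpoint that hosts this loop; the websocket.send_json call suggests FastAPI, so a minimal sketch — reusing the client from earlier — might look like the following. build_system_prompt, the message envelope keys, and the "done" event are assumptions, not the project's actual contract:

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/chat/{conv_id}")
async def ws_chat(websocket: WebSocket, conv_id: str) -> None:
    await websocket.accept()
    history: list[dict] = []  # rolling conversation history for this socket
    try:
        while True:
            incoming = await websocket.receive_json()
            user_text = incoming["message"]  # hypothetical envelope key
            history.append({"role": "user", "content": user_text})
            openai_messages = [
                # build_system_prompt is a hypothetical stand-in for the
                # persona + detected-service-section assembly described above.
                {"role": "system", "content": build_system_prompt(user_text)},
                *history,
            ]
            full_response = ""
            stream = await client.chat.completions.create(
                model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
                max_tokens=1024,
                messages=openai_messages,
                stream=True,
            )
            async for chunk in stream:
                if not chunk.choices:
                    continue
                delta = chunk.choices[0].delta.content
                if not delta:
                    continue
                full_response += delta
                await websocket.send_json({"type": "stream", "content": delta})
            history.append({"role": "assistant", "content": full_response})
            await websocket.send_json({"type": "done"})  # assumed end-of-turn signal
    except WebSocketDisconnect:
        pass  # client went away mid-conversation; nothing to clean up here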

Managed-identity-first auth

The committed default for authentication is managed identity. DefaultAzureCredential resolves to the hosting environment’s system-assigned identity in production and to az login credentials on a developer laptop. The App Service or Container App’s identity gets two role assignments — Cognitive Services OpenAI User on the Azure OpenAI resource and Azure AI Developer on the Foundry project — and that is the entire auth story. No key rotations to schedule, no secret-scanning false positives, no “oops, this leaked in a screen-share.”

An AZURE_OPENAI_API_KEY environment variable is supported as a documented fallback for local development, because az login inside Docker Desktop can be finicky. The .env.example is explicit: do not set it in production. The production posture is that there is no long-lived credential between the application and the model endpoint — the bearer token is minted on demand from the DefaultAzureCredential token provider, and rotates automatically.
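
Wired up, the fallback might look like this — a sketch that reuses the token_provider from earlier; the exact branch shape is an assumption, not the project's verbatim code:

import os

from openai import AsyncAzureOpenAI

api_key = os.environ.get("AZURE_OPENAI_API_KEY")  # local-dev fallback only

if api_key:
    # Documented escape hatch for local development inside Docker Desktop.
    # Never set AZURE_OPENAI_API_KEY in production.
    client = AsyncAzureOpenAI(
        api_key=api_key,
        api_version="2024-10-21",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    )
else:
    # Production posture: bearer tokens minted on demand via managed identity.
    client = AsyncAzureOpenAI(
        azure_ad_token_provider=token_provider,
        api_version="2024-10-21",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    )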

This is the single most load-bearing operational decision in the project. Every due-diligence conversation with a potential funder starts with some version of “how do your services authenticate to each other?” A one-sentence answer — managed identity, no long-lived credentials — closes that part of the conversation and opens the next one.

Escalation letters on Foundry Agent Service

The /api/escalation-letter endpoint is the piece that justifies the rest of the architecture.

Letter generation is inherently agentic: decide which service the complaint belongs to, fetch that service’s canonical knowledge, emit a structured letter. The first two steps are tool calls; the model is the orchestrator.

The agent wiring lives in backend/agent.py. The tool surface is seven function tools, one per government service:

import json

def lookup_sassa() -> str:
    """Return the SASSA knowledge section — grants, appeals, escalation bodies."""
    return json.dumps(_load_kb("sassa"), ensure_ascii=False)

def lookup_home_affairs() -> str:
    """Return the Home Affairs knowledge section — IDs, passports, civil records."""
    return json.dumps(_load_kb("home_affairs"), ensure_ascii=False)

# ...and five more: UIF, SARS, eThekwini Municipality, Deeds Office, CIPC.

Each function is a plain synchronous callable with a descriptive docstring. The docstring is not documentation for humans — it is the tool description the model reads when choosing which tool to call. That is why the docstrings are written in the tool-selection register (“Return the SASSA knowledge section — grants, appeals, escalation bodies”) rather than the function-documentation register (“Loads the SASSA JSON from disk and returns it.”). The model’s decision quality is bounded by the docstring quality.
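
_load_kb itself is not shown above; a minimal sketch, assuming the seven JSON files sit in a kb/ directory beside the module — the path and the per-process caching are assumptions:

import json
from functools import lru_cache
from pathlib import Path

_KB_DIR = Path(__file__).parent / "kb"  # assumed layout: backend/kb/sassa.json, ...

@lru_cache(maxsize=None)
def _load_kb(service: str) -> dict:
    """Read one service's knowledge section from its JSON file, cached per process."""
    return json.loads((_KB_DIR / f"{service}.json").read_text(encoding="utf-8"))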

The agent is created once per process and cached:

from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import FunctionTool, ToolSet

# Same DefaultAzureCredential chain as the chat path; the connection-string
# environment variable name here is illustrative.
project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["FOUNDRY_PROJECT_CONNECTION_STRING"],
)

toolset = ToolSet()
toolset.add(FunctionTool(functions=_SERVICE_TOOLS))

agent = project.agents.create_agent(
    model=os.getenv("FOUNDRY_AGENT_MODEL", "gpt-4o"),
    name="pfula-escalation-agent",
    instructions=_AGENT_INSTRUCTIONS,
    toolset=toolset,
)

…and the endpoint collapses to a single delegation:

letter = await generate_escalation_letter_via_agent(
    service_type=service_type,
    problem_description=problem_description,
    citizen_name=citizen_name,
    citizen_id=citizen_id,
    reference_number=reference_number,
)

The agent reads the complaint, decides whether this is a SASSA problem or a Home Affairs problem or a municipal problem, calls the correct lookup_* tool, reads the KB section it just fetched, and returns the structured letter. The four-key contract — recipient, subject, body, legislation_cited — is enforced by the agent’s instructions, not by a fragile single-shot prompt.
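
One way to make that contract explicit at the boundary — a sketch; the post does not show this check, and the function name is illustrative:

REQUIRED_KEYS = {"recipient", "subject", "body", "legislation_cited"}

def validate_letter(letter: dict) -> dict:
    """Fail loudly if the agent's letter is missing any of the four keys."""
    missing = REQUIRED_KEYS - letter.keys()
    if missing:
        raise ValueError(f"agent returned an incomplete letter: missing {missing}")
    return letter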

Knowledge-base-as-tools: the central design choice

The most important design decision in Pfula’s architecture is where the knowledge base lives in the control flow.

The conversational path detects the service in Python (a keyword-matching function), fetches that service’s JSON, and injects it into the system prompt. That is a deliberate cost choice — seven services at ~7–10k tokens each would blow out the per-request input budget, so injecting only one section keeps the rate-limit footprint reasonable. Keyword detection on a chat turn is acceptable because the detected section is just extra context for a persona, not a decision the model needs to explain.
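
A minimal sketch of what such a detector might look like — the keyword lists here are illustrative, not the project’s actual lists:

_SERVICE_KEYWORDS: dict[str, tuple[str, ...]] = {
    "sassa": ("sassa", "grant", "srd", "isibonelelo"),
    "home_affairs": ("home affairs", "smart id", "passport", "birth certificate"),
    "sars": ("sars", "tax", "efiling"),
    # ...and four more: UIF, eThekwini Municipality, Deeds Office, CIPC.
}

def detect_service(message: str) -> str | None:
    """Return the first service whose keywords appear in the message, else None."""
    lowered = message.lower()
    for service, keywords in _SERVICE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return service
    return None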

Escalation-letter generation is different. There, the agent chooses which tool to call, and if it isn’t sure it can call two. The agent also commits to the choice in a way that injecting context into the prompt never does — because a tool call is an explicit artefact visible in the run log, not an opaque attention pattern. When a complaint touches Home Affairs, the Deeds Office, and potentially SARS — for example, “can I get a certificate of my late father’s estate?” — the agent run log shows which tools got called. If the produced letter cites the wrong legislation, the run log reveals which KB section the model grounded against.

Put differently: injecting knowledge into the prompt is how you give a model context; tool calls are how you give it affordances. Context is fine when the work is generative. Affordances are better when the work is decision-making.

Operating notes from building on GPT-4o

Three things are worth flagging for anyone building a similar bilingual agent on Foundry.

GPT-4o’s isiZulu has a specific register. The model leans formal and slightly literal in isiZulu — closer to officialese than the warmth a citizen assistant needs. The fix is tuning the persona examples in the system prompt: small, colloquial turns of phrase that steer the model toward the register the user experience needs. The lesson is that persona tuning is a first-class step for any non-English agent, not a cosmetic one.

Strict schemas at the boundary beat strict schemas in the middle. The agent instructions specify legislation_cited as an array of strings. Most runs comply; occasionally the model returns a single string. The backend normalises at the boundary: if the field is a string, wrap it in a list. Two lines of code, and a reminder that boundary-layer resilience is cheaper than prompt discipline in the long run.
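
The normalisation in question — a sketch of those two lines, assuming the letter has already been parsed into a dict:

legislation = letter.get("legislation_cited", [])
if isinstance(legislation, str):
    legislation = [legislation]  # the model occasionally returns a bare string
letter["legislation_cited"] = legislation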

The agent shape keeps the application small. The escalation-letter endpoint is around twenty lines of delegation code. The system prompt no longer needs to say “return ONLY JSON, no markdown fences, no preamble” — the agent handles structured output. The caller-side parsing is a single json.loads with a defensive markdown-fence strip, and even that only runs occasionally. Less code because the SDK does more of the work. That is usually a good trade, and it is an especially good trade when the extra SDK work is grounded, auditable, and agentic rather than magical.
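
That defensive strip might look like this — a sketch; the real caller-side parsing may differ in detail:

import json

def parse_letter(raw: str) -> dict:
    """json.loads with a defensive strip of an optional markdown fence."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ```json and the trailing ``` if present.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)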

What comes next

Pfula is a pilot, not a product. The next milestones, roughly in order:

  • Voice input. Azure Speech recognition in both English and isiZulu, for users who can speak more comfortably than they can type. This is a natural pairing with the Azure Neural TTS (en-ZA-LeahNeural) that already ships.
  • A WhatsApp Business entry point. The UI simulates WhatsApp; the obvious next step is to hang the agent off an actual WhatsApp Business number so users don’t have to visit a website at all.
  • An evaluation harness. There is qualitative confidence that the current build does what it should. There is no quantitative data yet. Before any claim about impact can be made honestly, there has to be a tracked set of held-out citizen queries and human-rated responses, across both languages and all seven services.
  • Deployment into a real citizen context. A stage demo is not an app. The harder work — the partnership with an NGO, the consented telemetry, the moderation, the fallback to a human — is the work that turns Pfula into something that actually opens doors.
  • MCP as the knowledge-base update seam. The KB currently lives as JSON files in the repo. Exposing those files as a Model Context Protocol server lets the policy team (paralegals, social workers — the people who actually understand whether the SRD threshold has changed) edit the knowledge base without a git commit and redeploy; a sketch of that seam follows below.
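
A minimal sketch of that seam, assuming the official MCP Python SDK’s FastMCP server and the _load_kb helper from earlier — the resource URI scheme is illustrative:

import json

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("pfula-kb")

@mcp.resource("kb://{service}")
def read_kb(service: str) -> str:
    """Serve one service's knowledge section to any MCP client."""
    return json.dumps(_load_kb(service), ensure_ascii=False)

if __name__ == "__main__":
    mcp.run()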

Closing

The architectural posture of Pfula — Foundry-native, managed-identity-first, agent-orchestrated, knowledge-base-as-tools — is the posture every small public-interest AI project should start from. Managed identity instead of long-lived keys. Data residency in a region the target user recognises as local. Tools as affordances for decision-making, prompt-injected context only for generative work. A consolidated Microsoft estate that the MVP and funder surfaces already speak to.

Pfula is small, it is a pilot, and it may or may not become the thing that helps the next generation of South Africans navigate the system that wasn’t designed for them. But the shape is the shape that gives it the best chance of surviving contact with the real world — and, if this post gives another builder a shortcut to that shape, it has done its job.

Pfula was demoed at Data & AI Community Day Durban: AI Unplugged on 14 March 2026. It is built on Azure AI Foundry as part of the MVP 2026 submission under AI Services (Microsoft Foundry).