How to Build a Knowledge Base Your AI Can Actually Use

The structure, format, and maintenance habits that produce accurate AI chat responses — and the common mistakes that cause hallucinations.

ellix.ai Team · April 29, 2026 · 6 min read

The most common complaint we hear about AI chat: "it gives wrong answers." In 90% of cases, the problem isn't the AI — it's the knowledge base. Vague documentation produces vague answers. Contradictory docs produce confused answers. Outdated docs produce wrong answers.

Here's a complete guide to building and maintaining content that your AI assistant can actually use.

Structure Beats Length

Long, narrative documentation is hard to chunk meaningfully. A 3,000-word product overview produces chunks that mix feature descriptions, pricing notes, and setup instructions together. At retrieval time, the AI gets an unfocused mix of context and produces a meandering answer.

Short, focused documents by topic produce focused chunks. A document titled "How to connect your Shopify store" that covers exactly that topic in 400 words will be retrieved precisely when a user asks about Shopify — and the AI's answer will be accurate.

Rule of thumb: one document, one topic. If you find yourself using "also" or "additionally" frequently, split the document.
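This rule of thumb can be turned into a crude lint. A minimal sketch — the connective word list and threshold here are arbitrary illustrative choices, not part of any product:

```python
def needs_split(text: str, threshold: int = 5) -> bool:
    """Heuristic: many topic-shift connectives suggest the document
    covers more than one topic and should be split."""
    words = text.lower().split()
    hits = sum(words.count(w) for w in ("also", "additionally", "furthermore"))
    return hits >= threshold
```

Run it over your docs folder as a pre-upload check; anything it flags is a candidate for splitting by topic.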

Document Hierarchy and Chunk Relevance

Document structure affects retrieval quality in non-obvious ways. A flat hierarchy — a folder full of equally-weighted articles — produces inconsistent results for hierarchical queries.

Consider the difference between these two organizational approaches:

Flat (worse):

  • billing.md — 2,000 words covering plans, payment methods, invoices, cancellation, refunds, and enterprise pricing

Hierarchical (better):

  • billing/plans-and-pricing.md — covers tier features and prices
  • billing/payment-methods.md — covers card, crypto, invoice billing
  • billing/cancellation.md — covers cancellation steps and data retention
  • billing/refunds.md — covers refund eligibility and process

The hierarchical approach produces chunks that are semantically coherent. When a user asks "how do I cancel my subscription?", the retrieval surface is billing/cancellation.md rather than a chunk that happens to include cancellation content buried in a 2,000-word billing overview.
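One way to get from the flat layout to the hierarchical one is to split the monolith mechanically before uploading. A minimal sketch that cuts a markdown file at its H2 headings, assuming each H2 marks exactly one topic:

```python
import re

def split_by_heading(text: str) -> dict[str, str]:
    """Split a monolithic markdown doc into one document per H2 section."""
    sections: dict[str, str] = {}
    current_title, current_lines = None, []
    for line in text.splitlines():
        match = re.match(r"^## (.+)", line)
        if match:
            if current_title:
                sections[current_title] = "\n".join(current_lines).strip()
            current_title, current_lines = match.group(1), []
        elif current_title is not None:
            current_lines.append(line)
    if current_title:
        sections[current_title] = "\n".join(current_lines).strip()
    return sections

billing = """## Plans and Pricing
Starter is $29/month.
## Cancellation
Cancel from the billing page."""

docs = split_by_heading(billing)
# docs maps each H2 title to its body, ready to upload as separate files
```

Each resulting section can then be uploaded as its own document (e.g. billing/cancellation.md), so retrieval lands on a semantically coherent chunk.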

Good vs. Bad Documentation Chunks

The difference between a chunk that helps the AI and one that doesn't is specificity. Here's a concrete comparison:

Bad chunk:

Our billing system offers flexible options for businesses of all sizes. We're committed to providing value at every tier. Contact our sales team to discuss what might work for you.

This chunk contains no retrievable facts. An AI summarizing it produces equally vague output.

Good chunk:

Starter plan: $29/month, up to 500 AI conversations, 1 website widget, 50MB knowledge base storage. Growth plan: $99/month, up to 5,000 conversations, 3 widgets, 500MB storage, team seats (up to 5). Pro plan: $299/month, up to 25,000 conversations, unlimited widgets, 5GB storage, unlimited team seats, priority support.

This chunk answers "what does each plan include?" completely. The AI doesn't need to interpolate or summarize — it retrieves and reports.

The quality of your AI's answers is a direct proxy for the quality of your documentation. If the AI is vague, your docs are vague.

Content Auditing: Finding Stale Content Programmatically

Knowledge bases degrade over time. Pricing changes. Features get renamed. Integration partners come and go. Manual audits don't scale — you need a programmatic approach.

A simple stale content detector:

import httpx
from datetime import datetime, timedelta, timezone

async def find_stale_content(tenant_id: str, api_key: str) -> list[dict]:
    """Return documents not updated in 90+ days that are still being retrieved."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.aiassist.chat/v1/documents",
            headers={"Authorization": f"Bearer {api_key}"},
            params={"tenant_id": tenant_id, "limit": 500},
        )
        response.raise_for_status()
        documents = response.json()["documents"]

    stale_threshold = datetime.now(timezone.utc) - timedelta(days=90)
    stale = []

    for doc in documents:
        # Normalize a trailing "Z" so fromisoformat yields an aware datetime
        last_updated = datetime.fromisoformat(doc["updated_at"].replace("Z", "+00:00"))
        if last_updated < stale_threshold:
            # Only flag documents that are actually being retrieved
            if doc.get("retrieval_count_30d", 0) > 0:
                stale.append({
                    "document_id": doc["id"],
                    "title": doc["title"],
                    "last_updated": doc["updated_at"],
                    "retrieval_count_30d": doc["retrieval_count_30d"],
                })

    # Sort by retrieval count — most-retrieved stale docs are highest priority
    return sorted(stale, key=lambda d: d["retrieval_count_30d"], reverse=True)

Run this weekly (for example with asyncio.run from a scheduled job). Documents that are both old and frequently retrieved are your highest-priority update targets — they're being served to users regularly and may be wrong.

Handling Multilingual Content

If your product serves multiple markets, your knowledge base needs multilingual content. The embedding approach here matters:

text-embedding-3-small and text-embedding-3-large are multilingual — they produce semantically comparable vectors across languages. A question asked in French will retrieve French-language content more readily than English-language content, even if both are semantically relevant.

This means you can maintain separate document sets per language without needing a translation layer at retrieval time. Our recommendation:

  1. Maintain your canonical documentation in your primary language
  2. Upload translated versions as separate documents, tagged with a language metadata field
  3. At query time, detect the user's language (from browser headers or the message itself) and apply a pre-filter on the language metadata field

This approach keeps retrieval quality high for each language without mixing languages in the context window.

Protected Content and Access Control

Not all knowledge base content should be publicly accessible via the chat widget. Internal pricing notes, customer-specific configurations, and draft documentation shouldn't be surfaced to anonymous visitors.

aiassist.chat supports document-level access tiers:

  • Public — retrieved for any conversation on your embedded widget
  • Authenticated — only retrieved when the chat session is associated with a logged-in user (passed via a signed JWT when initializing the widget)
  • Internal — only retrieved via the API with your server-side API key, not the public widget

Configure the access tier when uploading documents. If a user asks a question that would require internal context to answer fully, the AI will give a partial answer based on public content and can optionally acknowledge that more information is available upon authentication.
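Server-side, the signed JWT for the Authenticated tier might be minted like this. This is a generic HS256 sketch using only the standard library; the claims your widget actually expects beyond the standard sub and exp are product-specific assumptions, so check your widget configuration:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, per the JWT spec."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_widget_jwt(user_id: str, secret: str, ttl_seconds: int = 3600) -> str:
    """Mint a short-lived HS256 JWT identifying the logged-in user."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(
        {"sub": user_id, "exp": int(time.time()) + ttl_seconds}
    ).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"
```

Keep the secret server-side only; the widget receives the finished token at initialization and never sees the key.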

Using Analytics to Improve Your Knowledge Base

The aiassist.chat dashboard exposes three reports that are directly actionable for knowledge base improvement:

Unanswered questions report — queries that returned no confident answer (below your confidence threshold). These are knowledge gaps. Sort by frequency to prioritize what to write first.

Low-confidence answers report — queries that returned an answer but with a confidence score below 0.80. These indicate areas where the knowledge base has something but not enough. The answer might be technically correct but too vague to be useful.

Source document performance — which documents are being retrieved most frequently, and what their average confidence score is when retrieved. Documents that are retrieved frequently but produce low-confidence answers have a content quality problem. Documents that are never retrieved may be redundant or poorly structured.

Review these three reports monthly. A knowledge base that was 70% AI-resolved on day one should be 85%+ AI-resolved at 90 days if you act on the data.
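Acting on the unanswered-questions report can be as simple as ranking by frequency. A sketch, assuming each report entry carries a question and a 30-day count — the exact field names in your export may differ:

```python
def prioritize_gaps(unanswered: list[dict], min_count: int = 3) -> list[dict]:
    """Rank knowledge gaps from the unanswered-questions report.
    One-off questions are dropped; the rest sort by frequency,
    so the top of the list is what to write first."""
    frequent = [q for q in unanswered if q.get("count_30d", 0) >= min_count]
    return sorted(frequent, key=lambda q: q["count_30d"], reverse=True)

report = [
    {"question": "Do you support SSO?", "count_30d": 14},
    {"question": "Can I export chats?", "count_30d": 2},
    {"question": "Is there an on-prem option?", "count_30d": 6},
]
# prioritize_gaps(report) ranks SSO first, on-prem second; the one-off is dropped
```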

What to Upload First

If you're starting fresh: FAQ, pricing page, feature list, integration docs, getting-started guide. These five categories cover 70–80% of pre-sales and early-lifecycle support queries. Start there, then expand based on what questions actually come in — the unanswered questions report will tell you exactly where to focus next.