Before You Reach for a Vector Database: Keyword Retrieval in the Agent Era

For the past couple of years, the moment anyone said “RAG,” the first thing that came to mind was a vector database. The recipe goes roughly like this:

Chunk the docs → turn them into embedding vectors → store in a vector DB → at query time, find the most similar chunks

It’s a powerful approach, especially good at “finding semantically similar content.” For example, if you ask:

Is there any section that talks about the “employee offboarding process”?

even if those exact words never appear — the doc instead says “account deactivation,” “access revocation,” “handover procedure” — vector search might still find it.

But lately, in the world of AI agents — coding agents in particular — a curious thing has surfaced:

Often, the most useful search tool isn’t a vector database. It’s plain old grep. The humble keyword search.

It sounds counterintuitive. We have LLMs, embeddings, vector databases — why reach back for grep? The answer isn’t that grep is smarter than embeddings. It’s that, given how agents work, grep has a few crucial advantages:

It’s simple.
It’s fast.
It’s (mostly) explainable.
It fails relatively clearly.
It can be used, corrected, and verified by the agent over and over.

That’s the theme of this post:

In the age of AI agents, good retrieval isn’t necessarily the most complex retrieval — it’s the retrieval that’s easiest for an agent to use, correct, verify, and trace.

A caveat up front: this “grep moment” is a 2025-to-early-2026 observation. By mid-2026 the conversation had already moved on — the return of keyword search is just one small piece of a bigger thing (context engineering), and on the hardest coding tasks, semantic retrieval is quietly coming back. More on that in sections 8 and 9.

1. Vector search is powerful, but it’s not the default answer to everything

Let’s be clear: this post is not arguing that embeddings are useless. Embeddings are very useful, especially for finding content that is close in meaning to a query.

They help most when:

The user doesn’t know the exact wording used in the docs.
There are lots of synonyms.
The question is conceptual, not keyword-shaped.
The corpus is large and you need semantic recall first.
The docs say the same thing many different ways.

In these cases, vector search is valuable.

But a lot of internal docs, technical docs, manuals, and code search aren’t that kind of problem. Often the user is asking:

How do I use this API?
Where does this error message appear?
How do I apply for an IP-MAC address?
Where are shared-folder permissions configured?
Who calls this function?
Which document mentions this field?

These don’t necessarily need “semantic similarity.” What they need is:

exact hit → find the source → verifiable → citable

And that is exactly what keyword search, grep, and BM25 are good at.

2. Why do coding agents still lean so heavily on grep?

Start with coding agents. When a coding agent modifies code, it usually doesn’t know the answer up front. It explores the codebase like an engineer would. For example:

1. Search for the error message
2. Find the relevant file
3. Read the function definition
4. Search for who calls this function
5. Find the test file
6. Edit the code
7. Run the tests
8. If it fails, search with different keywords

The point of this flow isn’t “nail the perfect answer in one search.” It’s:

search → read → reason → verify → search again

That’s agentic retrieval. In this style of work, grep shines because its feedback is very clear for corpora like code, English, and error strings.

If you search:

grep -rn "PermissionDenied" .

and it hits, the agent sees the filename, line number, and context. If it misses, the agent usually knows that path is a dead end and can try:

grep -rni "permission denied" .
grep -rni "access denied" .
grep -rn "403" .

That kind of failure, in a code corpus, is actionable. The agent doesn’t just know it failed — it has a rough idea of how to fix the next step. That matters a lot.
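The retry pattern above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular coding agent implements its search tool: `grep` and `search_with_fallbacks` are hypothetical helper names, and patterns are treated as regular expressions.

```python
import re
from pathlib import Path

def grep(root: str, pattern: str, ignore_case: bool = False):
    """Yield (path, line_number, line) for every line under root that matches."""
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                yield str(path), lineno, line.strip()

def search_with_fallbacks(root: str, attempts):
    """Try (pattern, ignore_case) pairs in order; return the first one that hits."""
    for pattern, ignore_case in attempts:
        hits = list(grep(root, pattern, ignore_case))
        if hits:
            return pattern, hits
    return None, []  # every attempt missed: a clear dead end
```

Returning the pattern that finally hit is the point: the caller learns not only that something matched, but which phrasing worked, and an empty result says plainly that all of the listed phrasings are dead ends.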

By contrast, when many complex retrieval systems fail, the feedback is murky. Vector search returns a pile of “looks similar” passages, but if the answer isn’t among them, the agent can’t easily tell whether it was:

a bad query?
the wrong embedding model?
chunks too big? too small?
top-k too small?
the data was never indexed at all?
the reranker mis-ranked?

grep’s strength isn’t that it’s sophisticated — it’s that, on code-like corpora, it’s relatively transparent.

⚠️ An honest caveat: grep’s “explainability” isn’t unconditional. For long Chinese sentences, a zero-result doesn’t necessarily tell you whether the cause was bad tokenization, the user’s synonym, or a page that was never indexed. In other words, grep’s feedback is very actionable on English/code, and much less magical on Chinese natural language — which is exactly why section 5 has to deal with tokenization specifically.

3. grep isn’t smarter — it’s better suited to being operated by an agent

Avoid one misunderstanding here: I’m not saying grep is technically superior to embeddings.

grep doesn’t understand meaning. It doesn’t know “offboarding” might relate to “access revocation,” or that “request an account” and “create a user” might be the same thing. So of course grep has limits.

But in many agent workflows, grep’s advantage is that it makes a great tool. It’s like a pocket knife — not flashy, but:

quick to pick up
intuitive to use
you can see exactly where it cuts
easy to correct if you cut wrong
you can cut many times in a row

For an agent, this matters enormously. An agent’s power doesn’t come from a single retrieval — it comes from iterating:

Search 1: nothing
Search 2: different keyword
Search 3: narrow the path
Search 4: found the relevant file
Search 5: trace the call chain
Search 6: confirm the tests

That flow is natural with grep / ripgrep / find / glob. It’s why some practitioners now describe agents as converging on the filesystem itself as their primary interface: if each doc page is a file and each section a directory, then grep, cat, ls, and find are most of what an agent needs — no separate retrieval service required. So the more precise statement is:

grep matters again not because it understands more, but because it’s simple, transparent, and iterable — which happens to fit how AI agents work. It moves complexity out of “offline index engineering” and into the “online reasoning loop.”

4. So what is BM25, and how does it relate to grep?

If grep is the most intuitive keyword search, BM25 is a slightly more advanced version.

grep mostly answers: “Does this word appear? On which line?” BM25 goes further: “Which passage is most likely relevant to the query?”

It considers a few things:

Whether the query terms appear (term presence)
How many times — but with diminishing returns (term frequency, which saturates)
Whether those terms are rare (IDF — rarer is more discriminative)
Whether document length skews the judgment (length normalization)

Here’s a detail that many people overlook, yet it is the soul of BM25: the “how many times” in the second item isn’t linear. It saturates.

That is, a term’s 2nd occurrence adds a fair bit over the 1st; but the 20th adds almost nothing over the 19th. BM25 deliberately uses a saturation function to cap the contribution of high-frequency terms. This is exactly why BM25 beats naive TF-IDF: a passage that “stuffs the keyword 50 times” doesn’t get rocketed to the top. For the “cover page / keyword-farm page pollutes ranking” problem in section 6, this saturation already blocks part of it — but not all of it, which is why you still need data governance.
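To make the saturation concrete, here is a small sketch of the term-frequency part of the standard BM25 formula. The parameter values k1 = 1.2 and b = 0.75 are commonly used defaults, and the document is assumed to be of average length; `bm25_tf` is a name made up for this example.

```python
def bm25_tf(tf: int, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 100.0, avg_len: float = 100.0) -> float:
    """Term-frequency part of BM25; the IDF weight is multiplied in separately."""
    length_norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

for tf in (1, 2, 19, 20, 50):
    print(tf, round(bm25_tf(tf), 3))
```

With these settings the second occurrence lifts the value from 1.0 to about 1.375, while going from 19 to 20 occurrences adds less than 0.01. The value can never exceed k1 + 1 = 2.2, no matter how many times the keyword is stuffed in.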

For example, query “shared folder request” against two passages: passage A mentions “folder” once; passage B has “shared,” “folder,” “request,” and “permission.” BM25 usually ranks B first — not because B has more words, but because it hits more rare-and-relevant terms.

So you can think of BM25 as: keyword search that ranks better than grep and isn’t fooled by keyword stuffing. It has no embedding-style semantic understanding, but it’s transparent, fast, cheap, and a great fit for term-dense, closed document sets.

That’s why a lot of internal-doc retrieval doesn’t need a vector database from day one. SQLite FTS5 ships a built-in bm25() ranking function — a solid baseline with zero external dependencies.
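As a rough sketch of that baseline, the “shared folder request” example from earlier can be reproduced with Python’s standard sqlite3 module. This assumes your Python build ships SQLite with FTS5 enabled (common, but not guaranteed); the table name, labels, and filler passages are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(label UNINDEXED, body)")
conn.executemany(
    "INSERT INTO docs (label, body) VALUES (?, ?)",
    [
        ("A", "Open the folder to see your files."),
        ("B", "Shared folder request: submit the request form to get permission."),
        ("C", "Reset your password from the login page."),
        ("D", "The printer on floor three is out of toner."),
        ("E", "Book meeting rooms through the calendar."),
    ],
)

def search(query: str):
    # bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the best passage first.
    return conn.execute(
        "SELECT label, bm25(docs) FROM docs "
        "WHERE docs MATCH ? ORDER BY bm25(docs)",
        (query,),
    ).fetchall()

# Explicit OR: a bare multi-word FTS5 query is an implicit AND,
# which would exclude passage A entirely instead of ranking it lower.
results = search("shared OR folder OR request")
```

Both A and B come back, with B ranked first. The filler passages matter: with only two documents, almost every term appears in half the corpus and the IDF signal largely disappears.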

5. The hard part of Chinese docs: it’s not search that fails, it’s tokenization

English search is easier because words are separated by spaces. shared folder request naturally splits into shared / folder / request.

But Chinese has no spaces. How should 共用資料夾申請 (“shared folder request”) be split? It could be 共用 / 資料夾 / 申請, or a system (e.g., SQLite’s default unicode61 tokenizer) might mistakenly treat it as one long token 共用資料夾申請.

If the whole sentence becomes a single token, then a user searching for 資料夾 (“folder”) may fail to find 共用資料夾申請. This is a classic pitfall of Chinese full-text search.

There’s more than one solution

To be clear: there are several routes for Chinese tokenization — there isn’t a single “correct answer”:

Word-level segmentation: plug in a Chinese segmenter like jieba or CKIP to split 共用資料夾 into 共用 / 資料夾. Closest to the language, but adds a dependency, and segmenters make mistakes too.
n-gram: use FTS5’s built-in trigram tokenizer, or roll your own bigram. Somewhere in between.
unigram (character-by-character): index Chinese one character at a time. The crudest, with the fewest dependencies.

This post demonstrates the last one, because it best embodies the “grep spirit” — simple, controllable, zero external dependencies. The approach: insert a space between every Chinese character, indexing Chinese one character at a time.

共用資料夾  →  共 用 資 料 夾

So when a user searches 資料夾, the system also turns the query into 資 料 夾; both sides split the same way, making a hit much more likely.

But note that English and codes must not be split blindly. IP-MAC, ARES, API_KEY should stay intact — they’re proper nouns, and breaking them up hurts precision. So a better approach is:

Chinese: split per character
English & codes: keep intact
At Chinese↔English boundaries: insert a space

申請IP-MAC位址  →  申 請 IP-MAC 位 址

Be clear-eyed about the cost of unigram

unigram is not a free lunch. Its biggest cost is that recall widens and false hits increase, because a plain query for 資料夾 will match any passage containing those three characters, even if they’re nowhere near each other and have nothing to do with “folder.”

So unigram can’t be used alone; it needs two companions to be accurate:

AND-chunking: require the characters of a single term to hit adjacently / together as much as possible (the two-stage query in section 6).
BM25 ranking: use the saturation + IDF from section 4 to push the genuinely relevant passages up.
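The false-hit problem, and the effect of requiring adjacency, can be seen in a small FTS5 experiment. This is a sketch under assumptions: it uses an FTS5 phrase query as one way to express “characters must be adjacent” (the post does not prescribe this exact mechanism), the second passage is a made-up distractor, and `cjk_space` here is a simplified helper that only covers common BMP ideographs.

```python
import re
import sqlite3

CJK = r"\u3400-\u4dbf\u4e00-\u9fff"  # assumption: BMP ideographs only

def cjk_space(text: str) -> str:
    """Pad every CJK character with spaces; leave other runs intact."""
    return " ".join(re.sub(f"([{CJK}])", r" \1 ", text).split())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
for raw in ("共用資料夾申請", "資訊部的材料放在夾層"):
    conn.execute("INSERT INTO pages (body) VALUES (?)", (cjk_space(raw),))

def match(fts_query: str):
    rows = conn.execute(
        "SELECT body FROM pages WHERE pages MATCH ? ORDER BY rowid", (fts_query,)
    )
    return [body.replace(" ", "") for (body,) in rows]

# Loose: the three characters may appear anywhere in the passage.
loose = match('"資" AND "料" AND "夾"')
# Adjacent: a phrase query requires the characters in consecutive positions.
adjacent = match('"資 料 夾"')
```

Here `loose` returns both passages, including the distractor that merely contains the three characters in scattered positions, while `adjacent` returns only 共用資料夾申請.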

In other words, unigram converts the “tokenization” problem into a “ranking and query-strategy” problem. In many closed-document scenarios that’s a worthwhile trade, but it is a trade-off, not “the way Chinese must be done.” What we’re after isn’t perfect language understanding but stable hits, easy debugging, explainable results, and low build cost.

An irony worth naming: grep’s best virtue is weakest exactly here

There’s a deeper tension this post should own rather than bury. grep’s headline selling point — the one sections 2 and 3 lean on — is transparency: clear, actionable failure. But that virtue is weakest precisely where this post’s main use case lives.

On English and code, a zero-result is usually an actionable dead end: try a synonym, a different flag, a broader path. On Chinese natural language, a zero-result is ambiguous — it could be bad tokenization, a synonym mismatch, or a page that was never indexed — and unigram’s wide-recall / high-false-hit profile only muddies the signal further. So the honest takeaway is not “lexical search is cleanly debuggable, therefore use it for Chinese docs.” It’s the opposite gradient:

The cleaner the corpus (code, English, dense proper nouns), the stronger the pure-lexical case. The messier and more semantic the language, the sooner you should reach for embeddings and a reranker.

For Chinese enterprise docs specifically, “add embeddings” probably comes earlier than the English-centric source material — most of which is about code and English — would suggest. Worth keeping in mind before you commit to a lexical-only baseline for a Chinese corpus.

6. How might you implement this?

Say we’re building an internal document search system; the sources might be PDFs, slides, Markdown, manuals, technical docs, internal FAQs. We can design a very lightweight pipeline.

1. Process documents offline

First, turn documents into structured data: document name, page number, section title, page text, and a page screenshot where needed. Slides or PDFs can be processed page by page. This step matters because later we want results to trace back to their original source.

2. Clean the data and exclude noise

Not all content belongs in the search index. Typical noise includes cover pages, table-of-contents pages, section dividers, pages with only a big heading and no content, and pages that repeat lots of keywords with no real information. If indexed, they tend to dilute ranking even with BM25’s saturation, because they may still hit some rare terms. So a better approach:

Main table keeps every page          → still searchable, can jump to a specific page
FTS index only holds pages with real content → doesn't pollute BM25 ranking
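One way to lay this out in SQLite is sketched below. The table names, the `has_real_content` heuristic, and its thresholds are all assumptions made for illustration; a real corpus would need its own rules, and the whitespace-based word count here suits English text (for Chinese you would apply it to text that has already been split per character).

```python
import sqlite3

def has_real_content(text: str, min_chars: int = 40,
                     min_unique_ratio: float = 0.3) -> bool:
    """Heuristic: skip near-empty pages and pages that repeat a few words."""
    words = text.split()
    if len(text.strip()) < min_chars or not words:
        return False
    return len(set(words)) / len(words) >= min_unique_ratio

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, doc TEXT, page INTEGER, body TEXT)"
)
conn.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(body)")

def add_page(doc: str, page: int, body: str) -> int:
    """Every page goes into the main table; only content pages are indexed."""
    cur = conn.execute(
        "INSERT INTO pages (doc, page, body) VALUES (?, ?, ?)", (doc, page, body)
    )
    if has_real_content(body):
        # Reuse the main-table id as the FTS rowid so hits join back to the source.
        conn.execute(
            "INSERT INTO pages_fts (rowid, body) VALUES (?, ?)", (cur.lastrowid, body)
        )
    return cur.lastrowid
```

Because the FTS rowid equals the main-table id, any hit can be joined back to its document name and page number, and the excluded pages remain reachable by page lookup.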

This stage is also a good time to do PII redaction: scrub names, employee IDs, internal emails, and phone numbers before they hit the database. The engineering effort that the grep philosophy saves you is exactly what you should spend on this kind of quality-affecting data governance.
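A very rough redaction pass might look like the following. Treat it as a sketch only: the regular expressions are deliberately simple, the `EMP-` employee ID format is hypothetical, and names are not handled at all, since they usually need a staff directory or a named-entity model rather than a regex.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")
EMPLOYEE_ID = re.compile(r"\bEMP-\d{4,}\b")  # hypothetical ID format

def redact(text: str) -> str:
    """Replace emails, employee IDs, and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    # IDs are replaced before phone numbers so their digits are never
    # picked up by the looser phone pattern.
    text = EMPLOYEE_ID.sub("[EMPLOYEE_ID]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Running the redaction before anything is written to the main table or the index means there is only one place to audit when the rules change.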

3. Handle Chinese tokenization

Before writing to FTS5, process the text:

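import re

def cjk_space(text: str) -> str:
    # Chinese: split per character by padding each CJK character with spaces
    # (assumes common BMP ideographs; extend the range for rarer characters)
    # English & digits: keep intact, since non-CJK runs are never touched
    # The padding also inserts a space at Chinese↔English boundaries
    spaced = re.sub(r"([\u3400-\u4dbf\u4e00-\u9fff])", r" \1 ", text)
    return " ".join(spaced.split())  # collapse the extra whitespace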

Use this function when indexing, and use the same function for user queries. This is critical — if indexing and querying split differently, you get the “it’s clearly in the doc but search can’t find it” problem.

4. Query exact first, then loosen

Search runs in two stages. Stage one uses stricter conditions (AND within a chunk, OR across chunks): the main tokens of the user’s input should all hit if possible. If found, return directly.

If nothing is found (common with long Chinese sentences), loosen to a pure-OR query and let BM25 rank: candidates qualify by hitting some of the tokens. This balances precision, recall, and tolerance for long user inputs.

More importantly, results can be tagged:

matched_via = "and"          → exact hit, more reliable
matched_via = "or-fallback"  → found after loosening, use more conservatively

This little field is useful to an agent: it tells the upper layer how reliable a result is, so it can decide whether to search again. This is precisely “actionable feedback.”
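The two stages can be sketched against FTS5 as follows. Several things here are assumptions rather than the post’s exact implementation: a “chunk” is taken to be a whitespace-separated piece of the user’s input, the table is named `pages_fts`, the index was built with the same `cjk_space` splitting, and Python’s sqlite3 has FTS5 available.

```python
import re
import sqlite3

CJK = r"\u3400-\u4dbf\u4e00-\u9fff"  # assumption: BMP ideographs only

def cjk_space(text: str) -> str:
    """Pad every CJK character with spaces; leave other runs intact."""
    return " ".join(re.sub(f"([{CJK}])", r" \1 ", text).split())

def quoted_tokens(chunk: str) -> list[str]:
    """Split one chunk with cjk_space and quote each token for FTS5."""
    tokens = [t for t in cjk_space(chunk).split() if any(c.isalnum() for c in t)]
    return ['"' + t.replace('"', '""') + '"' for t in tokens]

def search(conn: sqlite3.Connection, user_query: str, limit: int = 5):
    """Stage 1: AND within each chunk, OR across chunks. Stage 2: pure OR."""
    chunks = [quoted_tokens(c) for c in user_query.split()]
    chunks = [c for c in chunks if c]
    if not chunks:
        return []
    strict = " OR ".join("(" + " AND ".join(c) + ")" for c in chunks)
    loose = " OR ".join(t for c in chunks for t in c)
    sql = ("SELECT rowid, body FROM pages_fts WHERE pages_fts MATCH ? "
           "ORDER BY bm25(pages_fts) LIMIT ?")
    for matched_via, fts_query in (("and", strict), ("or-fallback", loose)):
        rows = conn.execute(sql, (fts_query, limit)).fetchall()
        if rows:
            return [{"rowid": r, "body": b, "matched_via": matched_via}
                    for r, b in rows]
    return []  # both stages missed
```

A short query such as 資料夾 is answered by the strict stage. A long sentence typed as one unbroken chunk almost never has every character present on a single page, so it falls through to the OR stage and is tagged `or-fallback`.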

7. What really drives RAG quality is often not the retrieval algorithm — it’s how results are presented

People building RAG tend to pour all their attention into: which embedding model? what chunk size? how many top-k? reranker or not? Those matter. But in practice, something else matters just as much, often more: how do you hand the search results to the model?

This isn’t me talking off the cuff. Research that put grep and vector search head to head (Sen et al., 2026) found that how the tool output is fed to the model matters at least as much as which retrieval method you pick — with the same underlying data, returning results inline versus writing them to a file the model reads back yields very different accuracy. The paper’s own framing is careful: grep generally beat vector retrieval in their comparison, and overall scores depended strongly on the harness and tool-calling style, even on identical data. So the lesson isn’t “format beats method” as a clean ranking; it’s that presentation is a first-class variable that most RAG builders under-weight.

One honest caveat, which I’ll return to in section 9: that study measured retrieval over long conversation memory (the LongMemEval benchmark), not a static enterprise document corpus like contracts, SOPs, or 10-Ks. The presentation lesson travels well across both. The raw “grep beats vector” headline, however, should be extrapolated to enterprise document search more cautiously — the document distribution is genuinely different.

So don’t just dump a wall of text at the model. A better search result looks like this:

Title: VPN request process
Source: it_onboarding.pdf
Page: 12
Matched via: and
Body:
To request VPN access, first complete employee account activation...

For slides or PDFs, you can also attach a screenshot of the corresponding page. And the text and its matching image should be paired and locked on the retrieval side — always from the same result, never mismatched.
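A small sketch of how such a result could be carried and rendered. `SearchHit` and `render` are hypothetical names; keeping the optional screenshot path inside the same record is one simple way to make sure text and image always come from the same result.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SearchHit:
    title: str
    source: str
    page: int
    matched_via: str                  # "and" or "or-fallback"
    body: str
    screenshot: Optional[str] = None  # image of the same page, if any

def render(hit: SearchHit) -> str:
    """Format one hit as a labeled block of evidence for the model."""
    lines = [
        f"Title: {hit.title}",
        f"Source: {hit.source}",
        f"Page: {hit.page}",
        f"Matched via: {hit.matched_via}",
    ]
    if hit.screenshot:
        lines.append(f"Screenshot: {hit.screenshot}")
    lines += ["Body:", hit.body]
    return "\n".join(lines)
```

The record is frozen on purpose: once retrieval has paired a passage with its source, page, and image, nothing downstream should be able to mix fields from different results.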

This way the model gets not a chaotic blob, but a structured piece of evidence. It knows: which document and page this came from; whether it was an exact hit or a fallback; whether text and image are from the same page; which source to cite when answering. This dramatically lowers the odds the model makes something up.

For enterprise document retrieval, traceability isn’t a bonus — it’s a baseline requirement. Because what users actually need isn’t “the AI looks good at answering,” but: can the AI’s answer be verified after the fact?

8. When should you use grep / BM25, and when embeddings?

A simple test.

Favor grep / BM25 when:

The corpus is small and closed
Users frequently search for exact nouns
The docs are full of product codes, APIs, error messages, field names
Answers need sources attached
The data is mainly text
You want the system to be easy to debug

Then BM25 / FTS5 is probably already a great starting point.

Add embeddings when:

Users ask fuzzy, conceptual questions
There are many synonyms
The corpus is large and needs cross-document synthesis
There are lots of images, tables, screenshots
Keywords alone often fail

Then embeddings / a reranker are well worth adding.

A counter-signal worth stating plainly: on hard code tasks, semantic retrieval is coming back

Let me update a conclusion that’s often misread. “Coding agents all use grep” is a 2025-to-early-2026 observation, but it’s conditional — and the frontier is already shifting.

In harder scenarios — multi-file, cross-language, changes that require understanding call relationships — pure grep hits a ceiling. A concrete data point: on harder benchmarks like SWE-Bench Pro, the average reference solution touches 4.1 files and changes ~107 lines — “you can’t grep your way through these,” as the Augment team put it. And leading code tools like Augment Code, even for seemingly rule-based code search, no longer rely on grep alone — they fine-tune dedicated embedding models for code semantics and build a semantic context engine.

In other words, the right picture isn’t “grep replaced embeddings,” but:

Simple, rule-based, term-dense search → grep / BM25 is enough, even better.
Cross-file search that relies on structure and semantic relationships → semantic retrieval is coming back.

The most practical approach

Many systems end up not picking one side, but mixing:

BM25 handles exact hits
embeddings handle semantic recall
a reranker handles final ranking
the agent handles repeated querying, reading, and verification

That is, not grep vs embedding, but grep + embedding + agentic loop. One thing worth flagging in 2026 stacks: the reranker is frequently the real quality lever — more than the choice between BM25 and embeddings — so it deserves to be treated as a first-class component, not an afterthought you bolt on at the end. The only other rule: don’t assume from the start that a vector database must be the core.
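If you do mix BM25 and embeddings, the two ranked lists have to be merged somehow before a reranker sees them. One widely used option, which this post does not otherwise prescribe, is reciprocal rank fusion. The sketch below assumes each retriever returns document ids ordered best first, and uses the conventional constant k = 60.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

bm25_hits = ["p12", "p3", "p7"]        # hypothetical lexical ranking
embedding_hits = ["p3", "p9", "p12"]   # hypothetical semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
```

Because it only looks at ranks, this kind of fusion avoids having to put BM25 scores and cosine similarities on a common scale. Pages that both retrievers agree on rise to the top, and the reranker then works on the fused shortlist.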

9. The bigger picture: from RAG to context engineering

At this point, we can pull the camera back.

“grep or embedding” is, by mid-2026, no longer the frontier of the conversation. The bigger shift underway: RAG was never the destination, only the start; entering the agent era, “retrieval” is being absorbed into a larger discipline — context engineering.

What’s the difference?

The old question: how do I fetch the most relevant top-k in one shot?
The new question: at each reasoning step, which data, which tools, which memories should the agent put into context — and just as importantly, what should it drop?

Retrieval (grep, BM25, or embeddings) only answers the “what data” part. Context engineering also handles three parts this post barely touched:

1. Memory. This is where the irony from section 7 comes home. The very study used to argue “grep beats vector” ran on LongMemEval — a benchmark that is entirely about long-conversation memory, not document search. That’s a reminder that the grep-vs-vector result was measured in the memory regime, not the enterprise-doc regime, and also that memory is its own layer. Modern agent systems commonly split it:

Semantic memory: knowledge, facts — often vectors + knowledge graphs
Episodic memory: past reasoning traces and outcomes, summarized and stored

so the agent learns from “how it solved this last time” instead of starting from zero every time. This is a dimension a pure-retrieval framing never touches.

2. Context assembly and compaction. This is arguably the heart of context engineering, and it’s the cell most teams underestimate. Even with perfect retrieval, stuffing everything into the window degrades reasoning — the “context rot” problem, where accuracy falls as the window fills with marginally-relevant material. So a real agent loop has to decide, every step, what to keep verbatim, what to summarize, what to evict, and when to compact the running history. Retrieval feeds this; it doesn’t solve it.
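A toy version of the keep-or-evict decision, to show its shape rather than a recommended policy. Everything here is an assumption: priorities are supplied by the caller, cost is approximated by word count instead of real tokens, and evicted items are simply reported so that a later step could summarize or drop them.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    name: str
    text: str
    priority: int  # higher means more important to keep verbatim

def assemble(items, budget_words: int):
    """Greedy sketch: keep high-priority items verbatim until the budget is spent."""
    kept, evicted, used = [], [], 0
    for item in sorted(items, key=lambda i: -i.priority):
        cost = len(item.text.split())
        if used + cost <= budget_words:
            kept.append(item.name)
            used += cost
        else:
            evicted.append(item.name)  # candidate for summarizing or dropping
    return kept, evicted
```

Even this crude version makes the point that retrieval output is just one input: a perfectly relevant page can still be evicted if higher-priority material has already filled the window.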

3. Structure and provenance. GraphRAG and knowledge graphs bring “relationships between entities” into retrieval, so answers aren’t just “similar passages” but “evidence with a traceable chain of relationships.” For enterprise scenarios that need traceability and auditability, this is increasingly a hard requirement.

So a fuller mental model is:

Context Engineering
├── Retrieval:        grep / BM25 / embedding / reranker   (mostly what this post covers)
├── Memory:           episodic + semantic
├── Context mgmt:     what to keep, summarize, evict, compact each step (context rot)
└── Tools & Provenance: knowledge graphs, structured sources, verification hooks

The lexical baseline this post describes is the lightweight, solid way to do the “Retrieval” cell of that diagram. But don’t forget the other cells exist — when your agent starts needing to “remember context,” “manage a crowded window,” and “explain relationships,” retrieval alone isn’t enough.

10. What this means for AI engineers

The real value of “Grep is all you need” isn’t telling everyone to abandon new tech. Its real reminder is simpler: don’t over-complicate the problem too early.

Many RAG systems reach for a vector database, an embedding pipeline, a reranker, and a complex chunking strategy from the start — but the real problem ends up being: documents weren’t cleaned, metadata wasn’t designed, sources weren’t kept, Chinese tokenization wasn’t handled, the result format wasn’t model-friendly, there was no feedback on search failure, the agent didn’t know how to fix the next step.

None of these is auto-solved by swapping in a different embedding model. Instead, a simple BM25 baseline often lets you see the system’s real bottleneck faster — because it’s transparent and debuggable, you can clearly tell:

Why was this result found?
Why was that result not found?
Which keyword hit?
Which page polluted the ranking?
Was the user query too vague?

These are precious engineering signals.

11. Conclusion: not a return to old tech, but a re-understanding of retrieval

So — is grep all you need? The answer is: not necessarily.

But if the question is “does every RAG system have to start from a vector database?”, the answer is no, it doesn’t. A better approach:

Start with simple, transparent, traceable lexical search as a baseline.
When you hit semantic queries, synonyms, multimodal data, large-scale ranking,
or cross-file structural problems, add embeddings / a reranker / a knowledge graph.
And treat "memory" and "how context is assembled" as problems just as important as retrieval.

In the age of AI agents, retrieval is no longer a “fetch top-k once” problem. It’s more like a workflow:

search → read → reason → verify → correct → search again

And tools like grep, BM25, and FTS5 — traditional as they are — fit beautifully into that workflow. Their value isn’t being flashy; it’s being reliable.

To sum it up in one line:

In the agent era, the best retrieval system isn’t necessarily the most complex one; it’s the one that is easiest to use, correct, verify, and trace. And retrieval itself is only the first cell of the larger context-engineering picture.

References

Sahil Sen et al., Is Grep All You Need? How Agent Harnesses Reshape Agentic Search, arXiv:2605.15184 (May 2026) — the empirical grep-vs-vector comparison. Note its key finding is twofold: grep generally beat vector retrieval and the harness / tool-output format mattered strongly, even on identical data. Note also that its corpus is conversational memory (LongMemEval), not enterprise documents.
Doug Turnbull, Is grep all you need for RAG? — the “agents converging on the filesystem as their interface” framing, the argument that constraints force the agent to work smarter, plus the hooks verification loop.
Jason Liu / Colin Flaherty (Augment), Why Grep Beat Embeddings in Our SWE-Bench Agent — the experience that grep/find was enough on SWE-Bench, with the explicit “smaller repo” caveat.
Augment Code, Auggie tops SWE-Bench Pro — the counter-data: pure grep hits a ceiling on harder tasks (avg. 4.1 files / ~107 lines per fix) while a semantic context engine pulls ahead.
RAGFlow, From RAG to Context — A 2025 year-end review of RAG — context on index-free RAG, and “even for code search, leading products fine-tune code embeddings.”
Is RAG Dead? The Rise of Context Engineering and Semantic Layers for Agentic AI (Towards Data Science) — the framing shift from RAG to context engineering.