How AI Engines Choose Sources: The Research Behind GEO
Contents
- Why this matters more than vendor blog posts
- The four-stage pipeline
- Stage 1: Retrieval (DPR, ColBERT, ColBERTv2)
- Stage 2: Re-ranking and the position bias problem (Liu 2023)
- Stage 3: Attribution and how LLMs decide what to cite (Bohnet, Gao, Menick)
- Stage 4: Verification and hallucination filtering (Manakul, Min)
- Why niche brands need explicit grounding (Kandpal 2023)
- What this implies for GEO practice
- What this implies for AI visibility measurement
Why this matters more than vendor blog posts
Most "how AI search works" content on the open web in 2026 was written by SEO agencies citing other SEO agencies, with very few primary-source citations to the academic literature that actually defines how these systems behave. The result is an echo chamber of confidently asserted tactics whose underlying research is often misrepresented or absent.
The academic literature is open, available on arXiv, and surprisingly readable for the operationally relevant findings. This article walks through the four mechanisms AI engines use to select sources, citing the primary papers for each one. The optimization tactics that Formative Digital ships in client work map directly to these papers, not to vendor opinion. The point is not academic posture; it is that an SEO program built on the actual mechanisms is more durable than one built on someone's blog post about someone else's blog post.
The four-stage pipeline
Modern AI search engines (ChatGPT Search, Perplexity, Google AI Overviews, Gemini, Microsoft Copilot, Apple Intelligence) follow a recognizable four-stage pipeline. The labels and exact implementations vary by engine, but the conceptual structure is consistent because the same academic literature underlies all of them.
- Retrieval. The engine pulls a candidate set of pages or passages relevant to the user query. This is the search step proper.
- Re-ranking. The engine orders the candidate set for the language model's context window. Order matters more than most operators realize.
- Attribution. The model decides which subset of the candidates to cite as it generates the answer.
- Verification. Increasingly, the model performs a self-check on factual claims before finalizing the response, dropping or rewriting unverifiable assertions.
Each stage has a distinct optimization implication. The stages can be optimized separately, and the gains compound. Each stage is covered below with its primary research and the practical takeaway.
Stage 1: Retrieval, the question of "is your page even in the candidate set"
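Primary research: Karpukhin et al. (2020) "Dense Passage Retrieval for Open-Domain Question Answering" introduced DPR, a dual-encoder architecture that became a standard baseline for dense retrieval. Khattab and Zaharia (2020) introduced ColBERT, a late-interaction architecture that compares query and passage at the token level and improved retrieval quality. Santhanam et al. (2021) released ColBERTv2, which improved quality further while shrinking the index footprint. Commercial AI engines do not publish their retrieval stacks, so read these papers as the documented lineage of the technique rather than a description of any specific product.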
What the research shows: Modern retrievers compute dense vector embeddings of every passage in the corpus, embed the user query into the same space, and return the passages whose vectors are nearest to the query vector by cosine similarity (or a learned scoring function). This is fundamentally different from classical keyword-match retrieval. A page can rank for a query whose exact terms it does not contain, if the meaning is close enough.
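The two scoring styles described above can be sketched in a few lines. This is a minimal illustration with hand-written toy vectors, not any engine's implementation: real embeddings come from a trained encoder and have hundreds of dimensions. The DPR paper scores by dot product; cosine is used here to match the description above. The function names and passage IDs are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dense_rank(query_vec, passage_vecs):
    """Single-vector (DPR-style) ranking: one embedding per passage."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    return [pid for _, pid in sorted(scored, reverse=True)]

def maxsim(query_tokens, passage_tokens):
    """Late-interaction (ColBERT-style) score: for each query token vector,
    take its best-matching passage token vector, then sum those maxima."""
    return sum(max(cosine(q, p) for p in passage_tokens) for q in query_tokens)

# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
passages = {
    "pricing-page": [0.9, 0.1, 0.0],
    "how-it-works": [0.2, 0.9, 0.1],
    "careers": [0.0, 0.1, 0.9],
}
query = [0.3, 0.8, 0.0]
ranking = dense_rank(query, passages)
print(ranking)
```

Nothing in the ranking depends on shared keywords, only on vector proximity, which is why a page can surface for a query whose exact terms it never uses.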
Operational implication: Pages need to cover the conceptual territory of a query, not just contain its keywords. The Aggarwal et al. (2023) GEO paper tested nine optimization methods and reported that keyword stuffing produced little or no visibility gain, while content-level methods such as adding citations, quotations, and statistics did. Vector 4 (Resonate) codifies this in our methodology: map the conceptual neighborhood of every target query, not just the literal phrasing.
Why your page might be invisible at this stage: a robots.txt file blocking the engine's crawler, JavaScript-rendered content the crawler cannot parse, slow time-to-first-byte that causes timeouts, or topical thinness so that the embedding does not match anything users search for. The first three are technical; the fourth is content depth.
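The robots.txt cause is the easiest to rule out. The sketch below uses Python's standard-library `urllib.robotparser` to check a robots.txt body against a list of crawler tokens. The token list and the sample file are assumptions for illustration; confirm each vendor's current token against its own documentation.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of crawler tokens; verify against each vendor's docs.
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"]

def crawler_access(robots_txt, url, agents=AI_CRAWLERS):
    """Return {agent: allowed} for one URL under the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

# Hypothetical robots.txt that blocks one AI crawler site-wide.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

report = crawler_access(SAMPLE, "https://example.com/guides/geo")
for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")
```

Against a live site, the same parser can load the file with `set_url(...)` and `read()` instead of `parse(...)`.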
Stage 2: Re-ranking, and the position-bias problem nobody warns you about
Primary research: Liu et al. (2023) "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172) is the most operationally important AI search paper most operators have never read. Liu and colleagues tested how language models perform when relevant information appears at different positions in long input contexts. The finding: model accuracy drops substantially when the relevant information is in the middle of the context, even though the information is identical in every other respect.
What the research shows: The position bias is U-shaped. Models attend to the beginning of context (highest accuracy) and the end of context (next highest), and lose accuracy on information placed in the middle. This holds across model sizes and architectures tested. The effect is large enough to flip the answer to whether the model gets a question right.
Operational implication: When an AI engine retrieves multiple candidate passages and places them in the model's context window, the order matters. Passages at positions 1 and N are more likely to influence the answer than passages in the middle. This is the academic basis for the lead-with-answer pattern: leading with the answer puts your most extractable content at the top of every passage the engine pulls. The Aggarwal GEO paper did not test a lead-with-answer method as such; its headline result is a visibility boost of up to 40% from its best-performing methods, so tying that lift to answer placement is our inference rather than a published finding.
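To make the U-shape concrete, here is one way a pipeline could order passages so that the strongest candidates land at the edges of the context. This is a hypothetical mitigation sketched for illustration; no engine is documented as ordering its context this way, and `order_for_context` is an invented name.

```python
def order_for_context(passages_best_first):
    """Place the strongest passages at the edges of the context window.

    Alternates items between the front and the back, so the weakest
    passages end up in the middle, where Liu et al. (2023) observed the
    largest accuracy drop.
    """
    front, back = [], []
    for i, passage in enumerate(passages_best_first):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

ranked = ["p1", "p2", "p3", "p4", "p5"]  # p1 is the best-scoring passage
ordered = order_for_context(ranked)
print(ordered)  # ['p1', 'p3', 'p5', 'p4', 'p2']
```

The best passage sits first, the second-best sits last, and the weakest ends up in the middle.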
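Practical translation: Every cornerstone we ship at Formative Digital opens with a 40 to 60 word direct answer block. The position bias documented by Liu et al. is the rationale for this, not vendor folklore. Liu et al. measured the position of documents within a context rather than the position of an answer within a page, so applying the result to page structure is an extrapolation, though a cheap one to act on. Our working assumption is that the opening words of each candidate passage carry the most weight, which is why pages that bury the answer in section seven can be retrieved and still underperform.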
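The 40 to 60 word rule is simple to check automatically before publishing. The sketch below assumes the page has already been reduced to plain text with paragraphs separated by blank lines; `direct_answer_check` is a hypothetical helper name.

```python
def direct_answer_check(page_text, low=40, high=60):
    """Return (word_count, in_range) for the first non-empty paragraph."""
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0, False
    count = len(paragraphs[0].split())
    return count, low <= count <= high

# Synthetic page: a 45-word opening paragraph followed by body text.
page = " ".join(["word"] * 45) + "\n\nThe rest of the article follows here."
print(direct_answer_check(page))  # (45, True)
```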
Stage 3: Attribution, the question of "does the model cite your page even when it has it"
Primary research: Three papers are essential here. Bohnet et al. (2022) "Attributed Question Answering" formalized the evaluation of attribution in LLMs and gave researchers a framework for measuring whether a model's claims are actually supported by cited passages. Menick et al. (2022) "Teaching language models to support answers with verified quotes" (the GopherCite paper from DeepMind) showed how to train models to cite specific quotations when answering. Gao et al. (2023) "Enabling Large Language Models to Generate Text with Citations" introduced the ALCE benchmark and demonstrated practical citation-generation methods.
What the research shows: Attribution is a separate skill from generation. A model can have the correct answer in its context and still fail to cite the source it pulled it from, or fabricate a citation, or cite a different source. Modern AI engines have improved citation behavior because they are trained explicitly on attribution. Liu et al. (2023) "Evaluating Verifiability in Generative Search Engines" (arXiv:2304.09848) measured attribution quality across Bing Chat, Perplexity, NeevaAI, and YouChat and found that fluency and verifiability were inversely correlated in some engines; the most fluent answers were also the most likely to contain unsupported claims.
Operational implication: Pages that are easy to cite get cited more often. The structural properties that make a page easy to cite are: short extractable passages with clear quote boundaries, named-author attribution that the model can attach a credibility signal to, and verifiable factual claims with primary-source links the model can interpret as a citation. The GopherCite work specifically rewards quotation-supported answers; the architecture of modern engines rewards content that quotes well.
Why direct quotes work: The Aggarwal GEO paper independently found that "Quotation Addition" (inserting quoted statements from named authorities) was among the highest-lift optimization methods tested. This is not a coincidence; the Menick GopherCite training rewards models for grounding answers in quotes, so models trained that way preferentially draw quotable passages from the candidate set. Pages with named-expert quoted statements are structurally easier for the attribution layer to consume.
Stage 4: Verification, the silent filter that drops half your candidates
Primary research: Manakul et al. (2023) "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection" demonstrated that LLMs can detect their own hallucinations by sampling multiple responses and checking consistency. Min et al. (2023) "FActScore" introduced a fine-grained method for measuring factual precision at the atomic-fact level. Ji et al. (2022) and Zhang et al. (2023) provide comprehensive surveys of hallucination types and detection methods.
What the research shows: Modern AI engines increasingly run a verification pass before finalizing answers. Claims that the model judges as low-confidence or contradicted by other retrieved sources get filtered or rewritten. The selection pressure is asymmetric: a candidate page that produces a high-confidence verifiable answer outranks one that produces a fluent but uncertain answer at this stage.
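The sampling-and-consistency idea behind SelfCheckGPT can be sketched as follows. This is a deliberately crude stand-in: the paper scores consistency with methods such as BERTScore, question answering, n-gram models, and NLI, whereas this sketch uses plain token overlap so that it runs without any model. The sample texts, the threshold, and the function names are all assumptions.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(sentence, samples):
    """Average fraction of the sentence's tokens that reappear in each sample."""
    sent = tokens(sentence)
    if not sent or not samples:
        return 0.0
    return sum(len(sent & tokens(s)) / len(sent) for s in samples) / len(samples)

def flag_unsupported(answer_sentences, samples, threshold=0.6):
    """Sentences that the resampled responses mostly fail to repeat."""
    return [s for s in answer_sentences if support_score(s, samples) < threshold]

# Hypothetical main answer, plus three resampled answers to the same prompt.
answer = ["Acme was founded in 2011.", "Acme has 900 employees."]
samples = [
    "Acme was founded in 2011 in Austin.",
    "Founded in 2011, Acme sells widgets.",
    "Acme was founded in 2011.",
]
flagged = flag_unsupported(answer, samples)
print(flagged)  # ['Acme has 900 employees.']
```

A claim that every sample repeats is treated as stable. A claim that appears only once is treated as a likely hallucination and becomes a candidate for dropping or rewriting.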
Operational implication: In our experience, Anthropic's Claude weights factual accuracy heavily; we covered this at Claude SEO Optimization. Claude is trained with Constitutional AI methods (Bai et al. 2022, arXiv:2212.08073), which train the model against a written set of principles. That paper focuses on harmlessness rather than on citation behavior, so the link to source selection is our inference, not a published result. On that inference, a single verifiable factual error on an otherwise good page can reduce citation probability. We have found ChatGPT and Perplexity slightly more forgiving, though both still apply verification pressure.
What this means for content: Cite primary sources, not secondary blog summaries of primary sources. Use specific dated statistics rather than vague "studies show" attributions. Validate every quote against the actual source. The verification stage rewards content that survives independent fact-checking and discounts content that does not.
Why niche brands and small businesses need explicit grounding
One paper in the new library deserves its own section because its operational implication is so direct. Kandpal et al. (2023) "Large Language Models Struggle to Learn Long-Tail Knowledge" (arXiv:2211.08411, ICML 2023) tested whether LLMs accurately recall facts about entities of varying popularity in the training corpus. The finding: model accuracy on factual recall correlates strongly with the frequency of the entity in the training data. Common entities (major brands, famous people, well-documented historical events) are recalled accurately. Niche entities (small businesses, regional brands, less-documented topics) are recalled inaccurately or hallucinated.
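The empirical curve is steep. Kandpal et al. measured question-answering accuracy against the number of pre-training documents relevant to each question and found that accuracy falls sharply as that count shrinks. The paper studied general factual questions rather than brands, so the following is a rough illustration only: a brand with a large training-corpus footprint is likely to be recalled accurately, a brand with a small footprint produces frequent errors, and a brand with almost none is functionally invisible. For a typical small business in 2026, the natural training-corpus footprint sits in the low range without deliberate intervention.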
The operational implication is the entire reason the Wikidata-anchoring playbook works. Wikidata serves as structured ground truth that AI engines can read deterministically, bypassing the long-tail knowledge gap. We covered this in detail at Wikidata as AI Truth Infrastructure; the academic foundation is Kandpal et al. (2023). Schema-rich pages with verifiable factual claims serve the same function: they give the verification stage a strong signal that prevents the long-tail accuracy collapse.
What this implies for GEO practice
Mapping the four stages to actual content decisions:
For Stage 1 (retrieval), ensure the page is technically retrievable (robots.txt allows AI crawlers, content renders without JavaScript, server response time stays under 2 seconds) and topically substantial (covers the conceptual neighborhood of the query, not just exact-match keywords). Vector embedding retrieval rewards depth over keyword density.
For Stage 2 (re-ranking), lead with a 40 to 60 word direct answer at the top of every page. Liu et al. (2023) shows that position within the context matters; Aggarwal et al. (2023) shows that content-level changes produce measurable visibility lift. Taken together, the two findings support the same operational behavior.
For Stage 3 (attribution), structure content for citation. Named expert authorship (Person schema with credentials), inline primary-source citations (4 to 8 per cornerstone), quoted statements from named authorities (the Quotation Addition lift in Aggarwal), and clear quote boundaries the model can extract verbatim.
For Stage 4 (verification), rigorous fact-check discipline. Every statistic dated and cited to its primary source; every quote verified against the actual statement; every named expert checked for accurate attribution; speculation labeled as speculation rather than asserted as fact. Constitutional-AI-trained models (Claude specifically) penalize verifiable factual errors aggressively.
For long-tail entity grounding, Wikidata anchoring (the highest-leverage one-time GEO move per our step-by-step guide) plus connected JSON-LD schema graphs that give the verification layer structured ground to fact-check against.
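A connected JSON-LD graph of the kind described above might look like the sketch below, built as a Python dict and serialized with `json`. The schema.org types and properties (`Organization`, `Person`, `Article`, `sameAs`, `worksFor`, `author`, `publisher`) are real; every name, URL, date, and the Wikidata QID are placeholders, and the `dangling_references` helper is hypothetical.

```python
import json

ORG_ID = "https://example.com/#organization"
AUTHOR_ID = "https://example.com/team/jane-doe#person"

graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Organization",
            "@id": ORG_ID,
            "name": "Example Co",
            "url": "https://example.com/",
            # Placeholder QID: replace with the entity's real Wikidata item.
            "sameAs": ["https://www.wikidata.org/wiki/Q00000000"],
        },
        {
            "@type": "Person",
            "@id": AUTHOR_ID,
            "name": "Jane Doe",
            "jobTitle": "Head of Research",
            "worksFor": {"@id": ORG_ID},
        },
        {
            "@type": "Article",
            "headline": "Example article headline",
            "datePublished": "2026-01-15",
            "author": {"@id": AUTHOR_ID},
            "publisher": {"@id": ORG_ID},
        },
    ],
}

def dangling_references(doc):
    """Return @id values that are referenced but never defined in the graph."""
    nodes = doc["@graph"]
    defined = {n["@id"] for n in nodes if "@id" in n}
    referenced = {
        v["@id"]
        for n in nodes
        for v in n.values()
        if isinstance(v, dict) and set(v) == {"@id"}
    }
    return referenced - defined

json_ld = json.dumps(graph, indent=2)
print(json_ld)
```

The `@id` cross-references are what make the graph "connected": the article, its author, and the organization resolve to one another instead of sitting on the page as three unrelated blobs.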
What this implies for AI visibility measurement
Each pipeline stage produces a different observable signal that an AI visibility tracker can measure.
Stage 1 (retrieval) observability: server access logs show whether AI crawlers are visiting your priority pages. PerplexityBot, GPTBot, OAI-SearchBot, and ClaudeBot user agents appearing in your logs at meaningful frequency mean Stage 1 is working. Their absence means Stage 1 is broken before any other optimization matters. Google-Extended is a robots.txt control token rather than a request user agent, so it will not appear in logs; Google fetches with its existing crawler user agents.
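A first-pass log check can be as simple as counting requests whose user-agent field contains a known crawler token. The sketch below makes several assumptions: the log lines are invented, the user-agent strings are simplified rather than the vendors' real ones, and the token list should be verified against current vendor documentation. Google-Extended is left out because it is a robots.txt control token and does not show up as a request user agent.

```python
from collections import Counter

# Assumed tokens to look for; verify against each vendor's docs.
AI_CRAWLER_TOKENS = ["PerplexityBot", "GPTBot", "OAI-SearchBot", "ClaudeBot"]

def count_ai_crawler_hits(log_lines, tokens=AI_CRAWLER_TOKENS):
    """Count log lines per crawler token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to at most one crawler
    return hits

# Made-up access-log lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /guides/geo HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [10/Jan/2026:10:02:11 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:10:03:40 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    '203.0.113.5 - - [10/Jan/2026:10:05:02 +0000] "GET /about HTTP/1.1" 200 3072 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

hits = count_ai_crawler_hits(SAMPLE_LOG)
for token in AI_CRAWLER_TOKENS:
    print(f"{token}: {hits[token]}")
```

User-agent strings can be spoofed, so a production check would also validate source IPs against the ranges each vendor publishes.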
Stage 2 (re-ranking) observability: position-aware visibility. When your brand is in an AI engine's answer, what position does it occupy? In our experience, first-mentioned brands in a comparative answer tend to earn higher click-through than brands mentioned third or fourth. Tracking platforms like Profound, Otterly, and AthenaHQ surface this; the rationale for why position matters draws on Liu et al. (2023).
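For a single captured answer, mention position can be computed directly. This is a minimal sketch using case-insensitive substring matching; the brand names and answer text are invented, and it does not claim to reflect how any tracking platform works. A real tracker would also need to handle aliases and brand names that are substrings of other words.

```python
def mention_order(answer_text, brands):
    """Return the brands found in the answer, ordered by first appearance."""
    lowered = answer_text.lower()
    found = []
    for brand in brands:
        index = lowered.find(brand.lower())
        if index != -1:
            found.append((index, brand))
    return [brand for _, brand in sorted(found)]

# Hypothetical AI answer and brand watchlist.
answer = (
    "For small teams, Beta Tools is the usual pick, "
    "followed by Acme and then Gamma Suite."
)
watchlist = ["Acme", "Beta Tools", "Gamma Suite", "Delta"]
order = mention_order(answer, watchlist)
print(order)  # ['Beta Tools', 'Acme', 'Gamma Suite']
```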
Stage 3 (attribution) observability: citation rate per mention. When your brand is mentioned, what fraction of those mentions include a clickable source link? Mention without citation produces no traffic; mention with citation produces referral. The gap is the attribution layer's selectivity.
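The citation-rate metric is a simple ratio once each prompt run has been labeled. The sketch below assumes each observation is a `(mentioned, cited)` pair recorded by hand or by a tracker; the data is invented.

```python
def citation_rate(observations):
    """Fraction of brand mentions that carried a clickable source link.

    observations: iterable of (mentioned, cited) booleans, one per prompt run.
    Returns None when the brand was never mentioned.
    """
    observations = list(observations)
    mention_count = sum(1 for mentioned, _ in observations if mentioned)
    if mention_count == 0:
        return None
    cited_count = sum(1 for mentioned, cited in observations if mentioned and cited)
    return cited_count / mention_count

# Four hypothetical prompt runs: three mentions, two of them with a link.
runs = [(True, True), (True, False), (False, False), (True, True)]
rate = citation_rate(runs)
print(f"{rate:.0%}")  # 67%
```

Returning `None` rather than zero when there are no mentions keeps "never mentioned" distinct from "mentioned but never cited", which are different problems at different stages.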
Stage 4 (verification) observability: factual accuracy of brand-related claims. AI engines sometimes hallucinate facts about your brand; the verification stage normally suppresses these but does not always succeed. Quarterly prompt-battery audits (manual or automated) detect verification failures so you can correct the upstream training data via Wikidata edits, About-page updates, and earned-media corrections.
The full measurement framework is at Tracking AI Citations: Vector 11. The methodology is at How to Measure AI Visibility. The tool landscape is at Best AI Visibility Platforms.
For the broader 12-Vector framework that synthesizes all four pipeline stages into actionable program design, see The 12 Vectors. For the foundational definition of GEO that started this whole literature, see What is Generative Engine Optimization. For our team to run a program built on this research foundation, see Formative Digital services.
Primary sources cited (12 papers)
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2023). "GEO: Generative Engine Optimization." arXiv 2311.09735.
- Karpukhin, V., et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." arXiv 2004.04906.
- Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." arXiv 2004.12832.
- Santhanam, K., et al. (2021). "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction." arXiv 2112.01488.
- Liu, N. F., Lin, K., Hewitt, J., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv 2307.03172.
- Bohnet, B., Tran, V. Q., Verga, P., et al. (2022). "Attributed Question Answering." arXiv 2212.08037.
- Menick, J., Trebacz, M., et al. (2022). "Teaching language models to support answers with verified quotes." arXiv 2203.11147.
- Gao, T., Yen, H., Yu, J., & Chen, D. (2023). "Enabling Large Language Models to Generate Text with Citations." arXiv 2305.14627.
- Liu, N. F., Zhang, T., & Liang, P. (2023). "Evaluating Verifiability in Generative Search Engines." arXiv 2304.09848.
- Manakul, P., Liusie, A., & Gales, M. J. F. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection." arXiv 2303.08896.
- Min, S., et al. (2023). "FActScore: Fine-grained Atomic Evaluation of Factual Precision." arXiv 2305.14251.
- Kandpal, N., Deng, H., Roberts, A., Wallace, E., & Raffel, C. (2023). "Large Language Models Struggle to Learn Long-Tail Knowledge." ICML / arXiv 2211.08411.