How LLMs Decide What to Cite: The Attribution Research

By Matt Griffin, founder of Formative Digital. Brantford, Ontario. Published 2026-04-28. 2,700 words.

Quick Answer: LLM attribution is a separate skill from generation. Bohnet et al. (2022) formalized the evaluation framework. Menick et al. (2022) trained GopherCite specifically to support answers with verified quotes. Gao et al. (2023) introduced the ALCE benchmark for citation generation. Liu et al. (2023) showed that fluency and verifiability are sometimes inversely correlated in production engines. The operational lesson: pages with quotable passages, named expert authorship, and verifiable factual claims are more likely to be cited, because the underlying training rewards models for grounding answers in extractable evidence.

Contents

  1. Why attribution is a separate skill from generation
  2. Bohnet 2022: How attribution is measured
  3. Menick 2022 GopherCite: Training models to quote
  4. Gao 2023 ALCE: The citation benchmark
  5. Liu 2023: When production engines actually fail
  6. What this implies for content structure
  7. Anti-patterns the research identifies
  8. Measuring citation quality on your own brand

Why attribution is a separate skill from generation

An intuitive but wrong assumption: a model that can answer a question correctly will also cite the right source. The academic literature shows this is false in the general case. Attribution is its own learned behavior, sometimes orthogonal to factual recall, and modern AI engines that produce well-cited answers do so because they were trained explicitly on attribution objectives.

The implication for content strategy: a page can be retrieved correctly, used as the factual basis of the answer, and still go uncited if its structural properties do not match what the attribution layer rewards. Understanding what the attribution layer rewards closes the gap between knowing your content was relevant and knowing whether the model named you as the source.

Bohnet 2022: How attribution is measured

Bernd Bohnet and colleagues at Google published "Attributed Question Answering" (arXiv:2212.08037) in late 2022. The paper's contribution was operational: it formalized how to evaluate whether an LLM's claims are actually supported by its cited sources, rather than just whether the answer happens to be correct.

Key finding

Models that produce factually accurate answers can fabricate or misattribute citations. Bohnet et al. introduced a rigorous evaluation that separates "did the answer say something true" from "did the cited source actually support what was said." The two scores diverge meaningfully across systems and tasks.

The operational implication: AI engines that show citations to users are running an attribution layer in addition to a generation layer. The attribution layer asks, for each claim in the answer, "which retrieved passage supports this claim?" and links to it. When the attribution layer cannot find a supporting passage, it either drops the citation, hedges the claim, or (worst case) cites a passage that does not actually support it.

Pages that produce extractable claim-passage matches feed the attribution layer cleanly. Pages where the claim is implicit, scattered, or buried in long prose make the attribution layer's job harder, which means lower citation probability even when the page is technically retrievable.
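To make the per-claim lookup concrete, here is a minimal Python sketch of a claim-to-passage matcher. It is an illustration only: the function names (`tokens`, `best_supporting_passage`), the 0.6 threshold, and the sample passages are our assumptions, and word overlap is a crude stand-in for the entailment-style judgments the attribution research relies on. It is not how any production engine is implemented.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_supporting_passage(claim: str, passages: list[str], threshold: float = 0.6):
    """Return (index, score) for the passage covering the most claim tokens.

    Returns (None, score) when no passage clears the threshold, which
    corresponds to the "drop the citation or hedge the claim" case.
    The threshold value is an arbitrary assumption.
    """
    claim_tokens = tokens(claim)
    if not claim_tokens or not passages:
        return None, 0.0
    # Fraction of the claim's tokens that each passage contains.
    scores = [len(claim_tokens & tokens(p)) / len(claim_tokens) for p in passages]
    best = max(range(len(passages)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Hypothetical retrieved passages: one states a claim directly, one is setup prose.
passages = [
    "ALCE evaluates fluency, correctness, and citation quality.",
    "Our team has spent years thinking about many aspects of search.",
]

print(best_supporting_passage("ALCE evaluates citation quality", passages))
print(best_supporting_passage("Perplexity was founded in 2022", passages))
```

The directly stated claim matches the first passage with full coverage. The second claim appears in neither passage, so the function returns `None` for the index, which is the situation where a real attribution layer would have nothing to link to.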

Menick 2022 GopherCite: Training models to quote

DeepMind's GopherCite paper (arXiv:2203.11147) by Jacob Menick and colleagues took a different tack: rather than evaluate attribution post-hoc, train the model to support its answers with verified quotes from the start. The training objective rewarded the model for providing quotes that, when extracted from the source, did in fact support the answer's claims.

Key finding

Models trained on quote-supported answer objectives produce citations at materially higher rates and with materially higher accuracy than models trained on generation objectives alone. The capability is teachable, and once taught, the model preferentially extracts quotable passages from candidate sources.

The strategic implication for GEO content: passages that are structurally easy to quote (short, self-contained, with clear sentence boundaries and named-author attribution) survive the quote-extraction step. Passages that require the model to paraphrase, combine multiple sentences, or interpret implicit meaning fail the extraction step more often. This is a plausible explanation for why the Aggarwal GEO paper found "Quotation Addition" (inserting quoted statements from named authorities) to be among the highest-lift optimization methods: the lift follows from how quote-trained engines select evidence, not from coincidence.
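One way to picture the quote-extraction step is as a verbatim check: a quote counts only if it appears word for word in the source. The sketch below is a simplified assumption on our part, not the GopherCite implementation; `normalize` and `is_verbatim_quote` are hypothetical helper names, and the sample text is made up for illustration.

```python
import re


def normalize(text: str) -> str:
    """Unify curly quotes, collapse whitespace, and lowercase,
    so trivial formatting differences do not fail the match."""
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def is_verbatim_quote(quote: str, source: str) -> bool:
    """True only if the normalized quote is a contiguous span of the source."""
    q = normalize(quote)
    return bool(q) and q in normalize(source)


# Made-up source text with messy whitespace, as it might come out of a crawl.
source = (
    "Attribution is its own learned behavior.   It is sometimes orthogonal\n"
    "to factual recall."
)

print(is_verbatim_quote("attribution is its own learned behavior.", source))
print(is_verbatim_quote("It is sometimes orthogonal to factual recall.", source))
print(is_verbatim_quote("Attribution is a skill models learn separately.", source))
```

The first two quotes pass because they are contiguous spans of the source once whitespace is normalized. The third is a fair paraphrase but not a quote, so it fails, which is the sense in which passages that need paraphrasing are harder to anchor to.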

Practical pattern: every Formative Digital cornerstone includes inline quoted statements from named authorities. The pattern is visible on this very page. The reason is not stylistic: engines trained on GopherCite-style objectives preferentially extract such statements when generating answers, so an engine is more likely to surface our content as a citation when it has quotable material to anchor to.

Gao 2023 ALCE: The citation benchmark

Tianyu Gao and colleagues at Princeton published "Enabling Large Language Models to Generate Text with Citations" (arXiv:2305.14627) at EMNLP 2023, introducing the ALCE benchmark. ALCE evaluates LLM citation generation along three dimensions: fluency (the answer reads well), correctness (the answer is factually right), and citation quality (the cited sources actually support the answer).

Key finding

State-of-the-art LLMs in 2023 fell well short of human-level citation quality even on questions they could answer correctly. The gap was largest on long-form answers where the model had to provide multiple citations across multiple claims. Modern AI engines have closed some of this gap through retrieval augmentation, but the underlying attribution challenge persists.

What ALCE makes measurable is the trade-off: a model can produce a beautifully written answer that is factually correct but poorly cited, or a heavily cited answer that reads choppily. Production engines operate within this trade-off space, and different engines strike the balance differently. Perplexity has historically leaned toward citation density at some cost to fluency; ChatGPT has historically leaned toward fluency at some cost to citation completeness; Claude varies by context.

The operational implication for content: pages that match what the attribution layer expects (specifically, claims with clear evidentiary anchors) get cited at higher rates because the model can integrate them into the answer without breaking the fluency-correctness-citation balance. This is structural, not stylistic. The Aggarwal "Cite Sources" optimization method (one of the three highest-lift methods in the GEO paper) maps directly to the ALCE methodology.

Liu 2023: When production engines actually fail

Nelson Liu, Tianyi Zhang, and Percy Liang at Stanford published "Evaluating Verifiability in Generative Search Engines" (arXiv:2304.09848), which empirically tested four production AI search engines available at the time (Bing Chat, NeevaAI, Perplexity, YouChat). They measured how often the cited sources actually supported the claims in the generated answers.

Key finding

Across the production engines tested, only about half of generated statements (51.5%) were fully supported by their citations, and only about three-quarters of citations (74.5%) actually supported the statements they were attached to. Fluency was inversely correlated with verifiability in some configurations: the most readable answers were also the most likely to contain unsupported claims.
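The two headline numbers correspond to what Liu et al. call citation recall and citation precision. As a rough sketch of the arithmetic, assuming each statement has already been judged by a human annotator: the `Statement` record, its field names, and the four sample judgments below are illustrative assumptions, not data from the paper, and the paper's handling of partial support is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    # Annotator judgment: do the attached citations, taken together,
    # fully support this statement?
    fully_supported: bool
    # Annotator judgment per attached citation: does this source support it?
    citation_supports: list[bool] = field(default_factory=list)


def citation_recall(statements: list[Statement]) -> float:
    """Share of statements that are fully supported by their citations."""
    if not statements:
        return 0.0
    return sum(s.fully_supported for s in statements) / len(statements)


def citation_precision(statements: list[Statement]) -> float:
    """Share of individual citations that support their statement."""
    judgments = [j for s in statements for j in s.citation_supports]
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Hypothetical annotations for one generated answer (made-up data).
answer = [
    Statement("Claim A", fully_supported=True, citation_supports=[True]),
    Statement("Claim B", fully_supported=True, citation_supports=[True, True]),
    Statement("Claim C", fully_supported=False, citation_supports=[False]),
    Statement("Claim D", fully_supported=False, citation_supports=[]),  # uncited
]

print(citation_recall(answer))     # 2 of 4 statements fully supported
print(citation_precision(answer))  # 3 of 4 citations support their statement
```

Note that the two measures fail differently: an uncited statement (Claim D) lowers recall without touching precision, while a citation that does not support its statement (Claim C) lowers both.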

The Liu verifiability paper has two important takeaways for GEO. First, attribution failures are not theoretical; they happen frequently in production engines. Brands that monitor their AI engine mentions occasionally find their pages cited as the source of claims their pages do not actually make. Second, the inverse correlation between fluency and verifiability means the engines themselves face a structural trade-off, and content that is BOTH highly readable AND clearly verifiable resolves the tension in the engine's favor.

How content resolves the tension: short paragraphs with single primary claims, named-author attribution, inline citations to primary sources, and structural cues (h2/h3 hierarchy, FAQ schema, lead-with-answer blocks) that let the engine extract a verifiable claim cleanly without sacrificing the fluency of the answer it composes.

What this implies for content structure

Synthesizing the four papers into operational guidance:

1. Lead with extractable claims, not narrative buildup. Bohnet's attribution evaluation favors content where claims are stated directly and explicitly. Pages that start with "in this article we will explore" make the model wait for the actual claim, and the attribution layer cannot extract a supporting passage from setup prose. The 40-to-60-word direct answer block at the top of every page is the structural fix.

2. Use named expert quoted statements as evidentiary anchors. Menick's GopherCite specifically rewards quote-supported answers. Pages with inline quoted statements from named authorities give the engine extractable, attributable evidence. The Aggarwal Quotation Addition method is the same finding from the optimization-side angle.

3. Pair every statistic with a primary-source citation. Gao's ALCE benchmark measures whether cited sources actually support claims. Pages where statistics link to the underlying paper or government report (rather than to a secondary blog summary) feed the attribution layer cleaner evidence and survive the verification step better.

4. Maintain readability and verifiability simultaneously. Liu's verifiability finding is that fluency and verifiability sometimes trade off in engines. Content that does both (short paragraphs, single claim per paragraph, inline citation, named author) helps the engine resolve the tension by surfacing your page as the source.
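Item 1 is the easiest of the four to check mechanically. A minimal sketch, assuming the answer block has already been pulled out of the page as plain text: `answer_block_word_count_ok` is a hypothetical helper of ours, and splitting on whitespace is only an approximation of how words are counted.

```python
def answer_block_word_count_ok(block: str, min_words: int = 40, max_words: int = 60):
    """Return (is_within_range, word_count) for a direct answer block.

    The 40 and 60 defaults mirror the range recommended in item 1;
    words are counted by splitting on whitespace.
    """
    count = len(block.split())
    return min_words <= count <= max_words, count


# Placeholder blocks built from repeated words, purely to exercise the check.
good_block = " ".join(["word"] * 45)
short_block = "In this article we will explore how engines choose sources."

print(answer_block_word_count_ok(good_block))
print(answer_block_word_count_ok(short_block))
```

A check like this only confirms length. Whether the block actually states an extractable claim still needs a human read.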

Anti-patterns the research identifies

By implication from the same papers, several content patterns underperform attribution layers.

Multi-claim paragraphs without internal structure. A paragraph that runs five facts together without clear sentence boundaries forces the attribution layer to treat the whole paragraph as a single unit, which is too coarse. Break complex paragraphs into single-claim sentences with their own evidence anchors.

Vague attribution language. "Studies show," "research suggests," "experts agree" without naming the study, the research, or the expert. The attribution layer cannot ground "studies show" in a specific source the engine can read. Replace with named studies (Aggarwal et al. 2023), specific researchers (Liu, Zhang, and Liang 2023), and verifiable institutions (Stanford NLP group).

AI-generated boilerplate citations. These are citations to claims that the underlying source does not actually contain. Hallucination and factuality evaluators such as Manakul's SelfCheckGPT and Min's FActScore are designed to catch unsupported claims of this kind. AI-assisted writing can introduce hallucinated citations; human verification of every cited source is the fix.

Quote-style language without actual quotes. Phrases like "according to industry experts" without an actual extractable quote from a named expert. The Menick GopherCite training rewards real quotes, not quote-styled prose. Either include a real quoted statement or do not gesture at one.

Hidden content behind JavaScript or accordions. Retrieval and attribution layers typically work from the HTML as served, not the rendered DOM. Content that loads only after a click event is then invisible to the attribution layer and cannot be cited even when relevant.
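Two of these anti-patterns, vague attribution language and script-injected content, can be screened for automatically before publishing. The sketch below uses only the Python standard library. The phrase list, the function and class names, and the sample HTML are our assumptions. The HTML check only tells you whether text is present in the markup as served, which approximates what a non-rendering crawler would see; it says nothing about how a particular engine actually fetches pages.

```python
import re
from html.parser import HTMLParser

# Assumed starter list; extend it with the vague phrases your own drafts use.
VAGUE_PHRASES = ("studies show", "research suggests", "experts agree",
                 "according to industry experts")
VAGUE_RE = re.compile("|".join(re.escape(p) for p in VAGUE_PHRASES), re.IGNORECASE)


def find_vague_attributions(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for every vague attribution phrase found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in VAGUE_RE.finditer(line):
            hits.append((line_no, match.group(0).lower()))
    return hits


class ServedTextExtractor(HTMLParser):
    """Collects text nodes from raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_served_html(html: str, phrase: str) -> bool:
    """True if the phrase appears in the page text without running any JavaScript."""
    parser = ServedTextExtractor()
    parser.feed(html)
    parser.close()
    page_text = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(phrase.split()).lower() in page_text


draft = (
    "Studies show that citations matter.\n"
    "Liu, Zhang, and Liang (2023) measured it.\n"
    "Experts agree the problem is hard."
)
print(find_vague_attributions(draft))

# Made-up page: the heading is in the markup, the policy text is injected by script.
served_html = (
    '<h2>Refund policy</h2><div id="details"></div>'
    '<script>document.getElementById("details").textContent = '
    '"Refunds are issued within 30 days";</script>'
)
print(text_in_served_html(served_html, "Refund policy"))
print(text_in_served_html(served_html, "Refunds are issued within 30 days"))
```

In the sample, the linter flags lines 1 and 3 and leaves the named-study sentence alone. The heading is found in the served markup, while the policy text that exists only inside the script is not.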

Measuring citation quality on your own brand

The four-paper foundation gives you a measurement framework for your own brand:

  1. Mention rate (Bohnet-style): how often does the engine name your brand in response to a relevant prompt?
  2. Citation rate (ALCE-style): of the mentions, how many include a clickable source link to your domain?
  3. Attribution accuracy (Liu verifiability-style): when cited, does your page actually support the claim the engine made?
  4. Quote fidelity (GopherCite-style): does the engine quote your content accurately, or does it paraphrase in ways that change meaning?
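The first three measures reduce to simple ratios once each prompt result has been logged. A minimal sketch, assuming a hand-built audit log: the `PromptResult` record, its field names, and the sample rows are hypothetical and do not reflect any platform's export format. Quote fidelity is left out because it needs a text comparison rather than a count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptResult:
    prompt: str
    mentioned: bool                  # engine named the brand in its answer
    cited: bool                      # answer linked to the brand's domain
    claim_supported: Optional[bool]  # manual audit result; None if not audited


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Ratio, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def summarize(results: list[PromptResult]) -> dict:
    mentions = [r for r in results if r.mentioned]
    cited = [r for r in mentions if r.cited]
    audited = [r for r in cited if r.claim_supported is not None]
    return {
        # Share of prompts where the brand was named.
        "mention_rate": rate(len(mentions), len(results)),
        # Of the mentions, share that linked to the brand's domain.
        "citation_rate": rate(len(cited), len(mentions)),
        # Of audited citations, share where the page supports the claim.
        "attribution_accuracy": rate(
            sum(r.claim_supported for r in audited), len(audited)
        ),
    }


# Made-up audit log for four prompts.
log = [
    PromptResult("prompt 1", mentioned=True, cited=True, claim_supported=True),
    PromptResult("prompt 2", mentioned=True, cited=True, claim_supported=False),
    PromptResult("prompt 3", mentioned=True, cited=False, claim_supported=None),
    PromptResult("prompt 4", mentioned=False, cited=False, claim_supported=None),
]

print(summarize(log))
```

Each rate uses the previous stage as its denominator, matching the "of the mentions" and "when cited" wording above, so a low citation rate is not masked by a low mention rate.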

The first two are commonly tracked by AI visibility platforms (Profound, AthenaHQ, Otterly, BrandRank.AI). The third and fourth typically require manual quarterly audits because automated detection of attribution accuracy is itself a live research problem. We covered the operational side of measurement at How to Measure AI Visibility and the tool landscape at Best AI Visibility Platforms.

For the broader four-stage pipeline of which attribution is one stage, see How AI Engines Choose Sources. For why niche brands face an additional headwind beyond attribution mechanics, see The Long-Tail Knowledge Gap. For the foundational GEO definition, see What is Generative Engine Optimization. To have Formative Digital run the program for you, see our services page.

Primary sources cited

  1. Bohnet, B., Tran, V. Q., Verga, P., et al. (2022). "Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models." arXiv 2212.08037.
  2. Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., et al. (2022). "Teaching language models to support answers with verified quotes." arXiv 2203.11147 (GopherCite).
  3. Gao, T., Yen, H., Yu, J., & Chen, D. (2023). "Enabling Large Language Models to Generate Text with Citations." EMNLP / arXiv 2305.14627 (ALCE).
  4. Liu, N. F., Zhang, T., & Liang, P. (2023). "Evaluating Verifiability in Generative Search Engines." EMNLP / arXiv 2304.09848.
  5. Aggarwal, P., et al. (2023). "GEO: Generative Engine Optimization." arXiv 2311.09735.
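  6. Manakul, P., Liusie, A., & Gales, M. J. F. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." arXiv 2303.08896.
  7. Min, S., et al. (2023). "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." arXiv 2305.14251.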