Quick Answer: Vector 7 earns brand citations on the corpus AI engines train on and retrieve from. Two layers matter: Common Crawl's roughly 250-billion-page archive feeds pre-training, while real-time RAG queries reach industry publications, Reddit, G2, and licensed news outlets. Roughly 46.7% of Perplexity's top citations come from Reddit, where most agencies have no presence.

Figure: Brand-to-corpus distribution hub diagram with spokes to Common Crawl, Reddit, TechCrunch, G2, Wikipedia, and podcast transcripts (Vector 7: Distribute, of the 12 Vectors).

The hardest brand to land in an AI Overview is the one whose content lives only on its own website. Owned media is necessary but not sufficient. AI engines are trained on, and retrieve from, a much wider corpus than any single brand controls. The content that earns confident citations has been validated externally: discussed across multiple sources, indexed by Common Crawl, picked up by industry press, mentioned in Reddit threads where actual buyers congregate, and reviewed on G2 or Capterra by users with no relationship to the brand.

This is the work of Vector 7. The on-domain vectors that came before (the diagnostic, the entity anchor, the prompt inventory, the writing, the citations, the schema) all build a brand surface that is ready to be cited. Vector 7 earns the citations themselves on the corpus AI engines actually consume. The work runs across two distinct distribution layers, neither of which most agencies execute fully.

The Two Surfaces You Have to Distribute Onto

AI engines source citations from two structurally different layers, and the distribution work for each is different. The pre-training layer is the corpus models learn from during their training cycles, dominated by Common Crawl's open web archive plus licensed datasets from major news organizations and structured data sources. Content placed here propagates slowly, on training-cycle cadence (quarterly to annually depending on the model), but produces durable effects: the brand becomes part of the model's parametric memory, available even when the engine is not querying the live web.

The real-time retrieval layer is the corpus the engine queries during answer generation through retrieval-augmented generation. This layer includes industry publications the engine treats as authoritative, Reddit and community forums for experience signals, review platforms (G2, Capterra, Trustpilot, BrightLocal for local SEO), and the open web indexed through the engine's own crawlers. Content placed here propagates fast, often within weeks of publication, but its influence depends on continuing presence and freshness.

A complete Vector 7 distribution program addresses both. A program that addresses only one (the most common pattern in agency PR work, which usually concentrates on industry press while ignoring Reddit and review platforms entirely) wins one surface and loses the other. Reddit, in particular, is the surface most agencies cannot serve because it requires authentic community participation rather than press-release distribution, and most PR teams are structurally unequipped for it.

The Common Crawl Pre-Training Layer

Common Crawl, the open web archive that dominates LLM pre-training, holds roughly 250 billion pages collected over more than a decade. The dataset feeds into ChatGPT's pre-training, Claude's pre-training, Llama, Mistral, and most open-source foundation models. A page that is not in Common Crawl is much less likely to reach the training data, since licensed datasets cover only part of the remainder; a page that has been there for years has been seen by multiple model generations and accumulates parametric weight that newly published content cannot match.

Three implications for distribution work follow. First, age is an asset; older, well-cited pages carry training-data weight that newer pages have to earn. This is why long-running brand websites with consistent identity and clean URL structure outperform recently-rebranded sites at the parametric layer even when the new sites have stronger schema. Second, presence in cited sources matters as much as presence in own-domain content; if TechCrunch wrote about the brand in 2020 and that article is in Common Crawl's archive, the brand has training-data presence regardless of what the brand's own website says now. Third, robots.txt and the AI-crawler directives are the technical floor; blocking GPTBot, ClaudeBot, PerplexityBot, or Google-Extended in robots.txt is the most common cause of zero AI citations, and the fix is one line of configuration.

Matt Griffin, Formative Digital: "We audit robots.txt before we audit anything else, and roughly a third of the time we find the brand's developer team disabled AI crawlers two years ago to 'protect content' and never reversed the decision. The brand has been invisible to AI training data for two years and nobody noticed because the SEO reports kept showing organic rank. The cost is the entire pre-training visibility window. Fixing it is the cheapest single Vector 7 task on the list, and it changes the floor for everything else."
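Whether a given page (the brand's own, or a press article that mentions it) appears in a particular Common Crawl snapshot can be checked against Common Crawl's public URL index. The sketch below only builds the lookup URL and parses a response; it makes no network call. The crawl ID shown is an example and should be replaced with a current one from the index server's collection list, example.com is a placeholder, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Common Crawl's public URL index server.
CDX_BASE = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, url_pattern: str) -> str:
    """Build a lookup URL for one crawl snapshot.

    crawl_id is a snapshot name such as "CC-MAIN-2024-10" (an example;
    pick a current one from the collection list on the index server).
    url_pattern may end in /* to match every captured page under a path.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{CDX_BASE}/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list[dict]:
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Open this URL in a browser or fetch it with any HTTP client.
    print(build_index_query("CC-MAIN-2024-10", "example.com/*"))
```

An empty result for one snapshot does not prove absence from the archive, because each snapshot is a separate crawl; checking several snapshots gives a fairer picture.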

The Real-Time RAG Layer: Industry Press, Reddit, Reviews

Real-time RAG retrieval is where Vector 7 distribution work shows up fastest. The four high-leverage channels for service-business brands are industry press, community forums (Reddit primarily), review platforms, and earned podcast or video coverage that gets transcribed and indexed.

The Real-Time RAG Channel Map

  • Industry press: Search Engine Land, Search Engine Journal, MarTech, Marketing Brew, and the vertical-specific trade publications for the brand's niche. Earned coverage here feeds Perplexity citations within days and AI Overview citations within weeks. The standard is real expert quotes, a named author byline, and original data or perspective, not generic press releases.
  • Reddit: r/SEO, r/marketing, r/smallbusiness, plus the niche-specific subreddits for the brand's category. Roughly 46.7% of Perplexity's top citations come from Reddit. Distribution here requires authentic participation, not promotion. Brands that show up only to drop links get banned; brands that contribute substantively over months get woven into the citation network.
  • Review platforms: G2, Capterra, Trustpilot, BrightLocal, Google Reviews, BBB Canada. AI engines query these directly when answering buying-intent questions. Encouraging satisfied clients to write detailed, specific reviews on these platforms is the distribution channel most agencies under-execute.
  • Podcasts and video: AI engines increasingly consume podcast transcripts (especially when transcribed and posted to indexed pages) and YouTube captions. A 30-minute interview where the founder discusses methodology, named tools, and specific case data produces a citation candidate that text-only PR cannot match.

The Reddit point is worth emphasizing. The Perplexity citation distribution shows the engine pulling community-validated content at roughly the same rate it pulls major industry publications, and the implication is direct: a brand that has industry press coverage but no Reddit presence is competing on half the surface. The investment is participation, not promotion, and the returns compound over months rather than days.

The Technical Floor: robots.txt, llms.txt, and AI Crawlers

Before any distribution work matters, the technical floor has to be in place. Two configurations determine whether AI engines can read the brand's content at all:

Robots.txt directives for AI crawlers come first. The relevant tokens to allow are GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Google's control token for AI training use, which is permitted unless a site disallows it), CCBot (Common Crawl), and anthropic-ai (an older Anthropic token). Each one should either be explicitly allowed or omitted entirely (default-allow, provided no wildcard User-agent: * rule blocks the site); explicit Disallow directives are the most common cause of zero AI citation visibility on otherwise well-built sites. A short audit of robots.txt is the single highest-leverage Vector 7 task and takes about ten minutes.
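A minimal sketch of that audit, assuming Python's standard-library urllib.robotparser is a reasonable stand-in for how each crawler reads the file (real crawlers can differ in edge cases). The sample robots.txt and the example.com URL are hypothetical, and the token list is simply the one named above.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in this article; illustrative, not an exhaustive list.
AI_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "CCBot",
    "anthropic-ai",
]


def audit_robots(robots_txt: str, page_url: str = "https://example.com/") -> dict[str, bool]:
    """Return {token: True if allowed to fetch page_url} for each AI crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in AI_CRAWLERS}


# Hypothetical file: one explicit AI-crawler block plus a normal wildcard rule.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    for agent, allowed in audit_robots(SAMPLE).items():
        print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

With the sample file, GPTBot is reported as blocked and the other five tokens as allowed, because only the wildcard rule applies to them and it covers /admin/ alone. To audit a live site, pass the text of its /robots.txt to audit_robots in place of SAMPLE.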

llms.txt is an emerging, still-unofficial standard for telling AI models what the brand is about. The format is a structured Markdown file at the site root that describes the brand, its products, its methodology, and the canonical URLs for each topical area. Adoption is still early in 2026 but rising; a well-formed llms.txt file is intended to reduce hallucination risk by giving AI models an authoritative summary instead of leaving them to infer one from prose. The implementation cost is roughly one hour for a service-business site; the upside is asymmetric and growing.
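As a rough illustration of the file's shape, the sketch below renders a small llms.txt from structured data: a title line, a one-line blockquote summary, then sections of annotated links. The brand name, summary, section names, and example.com URLs are all hypothetical, and render_llms_txt is an illustrative helper rather than part of any official tooling.

```python
def render_llms_txt(
    name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render llms.txt-style Markdown.

    sections maps a heading to a list of (link title, URL, short note).
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(render_llms_txt(
        "Example Agency",
        "Hypothetical service business; replace with a one-sentence brand summary.",
        {
            "Services": [
                ("SEO audits", "https://example.com/services/audits", "What the audit covers"),
            ],
            "Methodology": [
                ("Overview", "https://example.com/methodology", "Canonical description of the method"),
            ],
        },
    ))
```

The output is saved as llms.txt at the site root. Keeping each note to one factual line makes the file easier to maintain as pages change.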

From Distribute to Refresh: The Vector 7 Handoff

Vector 7 is the distribution stage; Vector 8 is the freshness stage. The handoff is the recognition that earned citations decay if they go stale. A 2022 TechCrunch article that mentioned the brand was a powerful Vector 7 asset in 2023; by 2026 it is moderately weighted training-data residue that the engines partially discount as old. New distribution work is required continuously to keep the citation network refreshed.

The downstream measurement vector (Vector 11 Measure) reads the distribution outcomes by tracking which sources the engines cite when answering brand-direct and category prompts. When the cited sources start including the press placements, the Reddit threads, the review platforms, and the podcast transcripts the brand has earned, the Vector 7 work has landed. When the cited sources are still competitor domains and old aggregator listings, the distribution work has not yet reached the engines' retrieval layer and needs more time, more channels, or both.
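One way to read those outcomes is to tally the domains an engine cites for a set of prompts and compute what share belongs to placements the brand has earned. The sketch below assumes the cited URLs have already been collected by hand or by a monitoring tool; the URLs, the earned-domain list, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host.removeprefix("www.")


def is_earned(domain: str, earned_domains: set[str]) -> bool:
    """True if domain equals an earned domain or is a subdomain of one."""
    return any(domain == e or domain.endswith("." + e) for e in earned_domains)


def tally_sources(cited_urls: list[str], earned_domains: set[str]) -> tuple[Counter, float]:
    """Count cited domains and return (counts, share of citations that are earned)."""
    counts = Counter(domain_of(u) for u in cited_urls)
    total = sum(counts.values())
    earned = sum(n for d, n in counts.items() if is_earned(d, earned_domains))
    return counts, (earned / total if total else 0.0)


if __name__ == "__main__":
    # Hypothetical citations gathered for one category prompt.
    cited = [
        "https://old.reddit.com/r/SEO/comments/abc123/",
        "https://searchengineland.com/some-article",
        "https://www.g2.com/products/example/reviews",
        "https://competitor.example/blog/post",
    ]
    counts, share = tally_sources(cited, {"reddit.com", "searchengineland.com", "g2.com"})
    print(counts.most_common())
    print(f"earned share: {share:.0%}")
```

Tracking the earned share per prompt over successive months shows whether the cited set is shifting from competitor domains toward the brand's own placements.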

Frequently Asked Questions

What is Common Crawl and why does my brand need to be in it?

Common Crawl is the largest public web archive, containing roughly 250 billion pages, and it is the foundational pre-training corpus for ChatGPT and most large language models. A brand whose content does not appear in Common Crawl is harder for AI engines to learn about during training and harder to cite confidently during retrieval.

How important is Reddit for AI citations?

Very important for Perplexity specifically. Roughly 46.7% of Perplexity's top citations come from Reddit, where community-driven discussion provides the experience signal Perplexity weights heavily. ChatGPT also cites Reddit but proportionally less. For both engines, presence in industry-relevant subreddits is a measurable distribution channel.

Should I block AI crawlers like GPTBot and ClaudeBot?

Almost never for service businesses. Blocking AI crawlers is the most common reason for zero AI citations. Unless the brand has a specific reason to keep content out of training data (proprietary information, paywalled premium content, copyright concerns), the right default is to allow GPTBot, ClaudeBot, PerplexityBot, and Google-Extended access in robots.txt.

What is llms.txt and do I need one?

llms.txt is an emerging standard, similar to robots.txt or sitemap.xml, that gives AI models a structured summary of what your site is about, what your products do, and how to reference you. Adoption is still early, but well-structured llms.txt files reduce hallucination risk and produce a clear authoritative source AI models can ingest. The cost to implement is low; the upside is asymmetric.

Do AI engines actually cite TechCrunch or Harvard Business Review more than my own site?

Yes for first-mention contexts, especially in pre-training-derived answers. Major industry publications carry training-data weight that owned media usually cannot match. A feature in TechCrunch, a quote in Search Engine Land, or a case study in HBR are training-corpus events that compound over years and influence how AI models describe the brand long after the article was published.

How long until distribution work shows up in AI citations?

Real-time RAG retrieval surfaces (Perplexity, AI Overviews) reflect new distribution within weeks. Pre-training-driven answers (large parts of ChatGPT, Claude) lag by training-cycle cadence, quarterly to annually depending on the model. The honest framing for clients is that the bulk of measurable Vector 7 effect lands between months three and twelve, with pre-training propagation extending into year two.

Sources

  1. Common Crawl Foundation. From SEO to AIO: Why Your Content Needs to Exist in AI Training Data. commoncrawl.org
  2. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
  3. SparkToro (2026). How Can My Brand Appear in Answers from ChatGPT, Perplexity, Gemini, and Other AI/LLM Tools? sparktoro.com
  4. Surfer SEO (2025). 7 Tips to Get Cited by LLMs Like ChatGPT, Perplexity, and Google's AI Answers. surferseo.com
  5. Search Engine Land (2026). Generative engine citation behaviour and source distribution analysis. searchengineland.com
  6. llmstxt.org (2025). The /llms.txt file specification. llmstxt.org

Audit Your Distribution Floor

Formative Digital, Brantford, Ontario

This is Vector 7 inside the Formative Forces delivery system. Vector 7 follows Vector 6: Structure and feeds Vector 8: Refresh. The on-domain work that came before is necessary; the off-domain distribution work is the half most agencies skip and the half AI engines actually retrieve from. The cheapest place to start is the robots.txt audit, the Reddit presence audit, and the review-platform inventory. Each of the three takes under an hour, and the fixes they surface can start to register within the next crawl cycle.
