Quick Answer: AI engines are non-deterministic, so the same prompt returns different answers and different sources run to run, by design. One Thinking Machines Lab test got 80 unique outputs from 1,000 identical temperature-0 runs. So one screenshot is not a ranking. Honest tracking samples each prompt many times and reports a frequency.

Matt Griffin, Formative Digital: "An agency that shows you one flattering screenshot is showing you one roll of the dice. The honest question is never did we appear. It is how often do we appear, out of how many runs, on what date. Until you measure variance, you are not measuring visibility. You are collecting lucky moments."

Ask ChatGPT "who is the best HVAC company in Brantford" five times in a row and you will likely get five overlapping but different lists. That is not a bug, a cache problem, or a sign your optimization broke between coffee sips. It is how these systems work, and it changes what an honest AI visibility report is allowed to claim. Kevin Indig's early-2026 Growth Memo study of roughly 1.2 million ChatGPT citations found about 44% of citations come from the first 30% of a page, which tells you where to put your strongest material. This piece is about the harder problem the same data implies: the result you are trying to influence moves every time you look at it.

Why does AI give a different answer every time you ask?

AI engines give different answers because they generate text by predicting one token at a time from a probability distribution, and the path through that distribution is not fixed. At each step the model holds a ranked set of likely next words and picks from it. Small differences in how that pick happens, compounded across hundreds of tokens, send two runs of the identical prompt down different sentences. By the time the answer names a business, the two runs have already diverged.
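
A minimal sketch of that single step, assuming a toy three-word vocabulary with made-up scores. The business names and numbers are hypothetical, and real engines work over tens of thousands of tokens, but the shape of the pick is the same: temperature 0 takes the top score, anything higher draws from the distribution.

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    """Pick one token from a {token: score} dict.

    temperature == 0 is treated as greedy decoding (take the top score);
    anything above 0 draws from the softmax distribution.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature; subtract the max for numerical stability.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(score - top) for tok, score in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical scores for the word that follows "The best HVAC company is".
toy_logits = {"Acme": 2.0, "Borealis": 1.8, "Cedar": 1.5}

rng = random.Random(0)  # seeded only so the sketch is repeatable
greedy_picks = {sample_next_token(toy_logits, 0, rng) for _ in range(50)}
sampled_picks = {sample_next_token(toy_logits, 1.0, rng) for _ in range(200)}
```

In this toy, greedy_picks only ever contains the top-scored name, while sampled_picks contains more than one. Repeat that draw across hundreds of tokens and two runs of one prompt stop matching.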

Most existing writing on this describes the mechanism well from the user's seat: token-by-token prediction, the creativity setting, the way an earlier sentence steers the next one. That explanation is correct, and it is also where it stops. What it rarely does is connect the mechanism to the discipline that depends on it. If you are paying someone to track whether AI engines recommend your business, the variance is not trivia. It is the central measurement problem.

The three sources of run-to-run change

Sampling. Most consumer chat interfaces add deliberate randomness to word selection so answers feel natural rather than robotic. This is the part people mean when they say "temperature." Turn it down and answers get more repetitive, but, as the section on temperature 0 shows, not identical.

Conversation context. Your previous messages, any memory the engine keeps, and even the time of day feed into the response. Two users asking the same question get different answers partly because they are not actually asking from the same starting point.

Retrieval and the index. Engines that search the live web before answering, like Perplexity and Google AI Mode, pull a fresh set of pages each time. The web moved, the ranking moved, so the cited sources move. That last source has its own name, citation drift, and it gets its own section below.

Does temperature 0 make the output identical?

No, and this is the finding that reframes the whole topic. Temperature 0 means greedy decoding: at every step the model is told to take the single highest-probability token, which should in theory be perfectly repeatable. It is not. In September 2025, Horace He and the team at Thinking Machines Lab sampled 1,000 completions from the same model and the same prompt at temperature 0 and counted 80 unique outputs. The completions were identical for the first 102 tokens, then split at token 103.

The cause is subtle and worth stating plainly, because it kills the most common objection to variance tracking ("just set temperature to zero"). The arithmetic inside the inference server is not batch-invariant. When many users hit the model at once, the system groups requests into a batch, and the batch size changes constantly with server load. Kernels that are not batch-invariant produce slightly different floating-point results depending on that batch size. A microscopic numerical difference at any one step is enough to flip which token is "highest," and from that token on the answers diverge. You are sharing a server with the rest of the internet, and the rest of the internet's traffic is leaking into your result.
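
The underlying arithmetic fact can be shown in a few lines. This is a toy, not the Thinking Machines Lab experiment: the two token names and their score contributions are hypothetical, and the two summing functions stand in for a kernel that groups its additions differently at different batch sizes. Floating-point addition is not associative, so the same three numbers added in a different grouping can land on a slightly different total.

```python
from functools import reduce

def sum_left_to_right(terms):
    """Add terms in the order given: ((a + b) + c)."""
    return reduce(lambda acc, x: acc + x, terms)

def sum_right_to_left(terms):
    """Add the same terms with the opposite grouping: (a + (b + c))."""
    return reduce(lambda acc, x: x + acc, reversed(terms))

# Hypothetical score contributions for two candidate tokens. In exact
# arithmetic both totals are 0.6, so the tokens are tied.
contributions = {
    "Acme": (0.1, 0.2, 0.3),
    "Borealis": (0.3, 0.2, 0.1),
}

def greedy_pick(summer):
    """Greedy decoding: total each token's score, return the highest."""
    scores = {tok: summer(terms) for tok, terms in contributions.items()}
    return max(scores, key=scores.get)

pick_a = greedy_pick(sum_left_to_right)
pick_b = greedy_pick(sum_right_to_left)
grouping_changes_total = (
    sum_left_to_right(contributions["Acme"])
    != sum_right_to_left(contributions["Acme"])
)
```

With one grouping the greedy pick is "Acme"; with the other it is "Borealis". Nothing random was involved. The only thing that changed was the order of the additions, which is what a shifting batch size changes inside a real server.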

What the variance research actually shows

  • 80 unique outputs from 1,000 temperature-0 runs (Thinking Machines Lab, Sept 2025). Diverged at token 103. Greedy decoding did not make it deterministic.
  • Only 35% of cited domains repeat between runs of the same prompt in Google AI Mode (SE Ranking, 2026). Roughly two-thirds of the sources in any single answer vanish on the next run.
  • The academic norm is repeated sampling. The Generative Engine Optimization paper (Aggarwal et al., KDD 2024) runs every experiment across multiple random seeds and reports the average, precisely because a single generation is not trustworthy.

Read those three together and a rule falls out. If the people who study these systems for a living refuse to trust one run, a marketing report built on one run has no defence. Treat non-determinism as a measurement problem, not a curiosity.

What is citation drift, and how often do sources change?

Citation drift is the run-to-run change in which sources an engine cites, separate from the change in the prose. The answer text and the source list move independently, and the source list is the part that matters for visibility, because being cited is how a business gets surfaced. SE Ranking's analysis of Google AI Mode found only about 35% of cited domains repeat across runs of the same prompt, with roughly two-thirds dropping out between runs. SE Ranking's own conclusion is blunt: prompt tracking is "directional, not definitive."
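
One way to put a number on drift between two runs is a simple repeat rate. The sketch below is an assumption about how such a figure could be computed, not SE Ranking's published method: it reduces each cited URL to its domain and reports the share of run A's domains that show up again in run B. The URLs are placeholders.

```python
from urllib.parse import urlparse

def cited_domains(urls):
    """Reduce cited URLs to a set of lowercase hostnames without 'www.'."""
    domains = set()
    for url in urls:
        host = urlparse(url).netloc.lower()
        domains.add(host[4:] if host.startswith("www.") else host)
    return domains

def repeat_rate(run_a_urls, run_b_urls):
    """Share of the domains cited in run A that are cited again in run B."""
    first = cited_domains(run_a_urls)
    second = cited_domains(run_b_urls)
    if not first:
        return 0.0
    return len(first & second) / len(first)

# Placeholder citations from two runs of the same prompt.
run_a = [
    "https://www.example.com/hvac",
    "https://example.org/list",
    "https://example.net/reviews",
]
run_b = [
    "https://example.com/top-10",
    "https://example.edu/guide",
]
rate = repeat_rate(run_a, run_b)
```

Here one of run A's three domains repeats, so the rate is one in three. Averaging that figure over many pairs of runs gives a drift number you can watch over time.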

This is where Formative Digital's own data comes in, because it shows drift is not only run-to-run within one engine but also structural across engines. In our May 2026 analysis of 1,732 AI-engine citations across nine Ontario cities, run through DataForSEO, 83.7% of the sources the engines cited were unique to a single engine. ChatGPT leaned on google.com, Claude on the curated directory threebestrated.ca, Gemini wrapped everything through Vertex grounding, and Perplexity spread across HomeStars and similar sites. Four engines, asked the same local question, were reading largely different webs. Stack that cross-engine gap on top of the within-engine drift and the result is plain: a brand can be cited reliably by one engine, intermittently by a second, and never by a third, while still "appearing in AI search" in a single lucky screenshot. The screenshot hides which of the three is true.
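
A cross-engine figure of this kind can be computed from a flat list of (engine, domain) pairs. The sketch below assumes the unit is the distinct domain, which is one reasonable reading rather than a restatement of the exact counting rule used in the analysis, and the data is invented for illustration.

```python
from collections import defaultdict

def engine_unique_share(citations):
    """Fraction of distinct domains that exactly one engine cited.

    citations: iterable of (engine, domain) pairs.
    """
    engines_by_domain = defaultdict(set)
    for engine, domain in citations:
        engines_by_domain[domain].add(engine)
    if not engines_by_domain:
        return 0.0
    unique = sum(1 for engines in engines_by_domain.values() if len(engines) == 1)
    return unique / len(engines_by_domain)

# Invented citations: only a.example is shared between two engines.
toy_citations = [
    ("chatgpt", "a.example"),
    ("claude", "a.example"),
    ("claude", "b.example"),
    ("gemini", "c.example"),
    ("perplexity", "d.example"),
]
share = engine_unique_share(toy_citations)
```

In the toy data three of four distinct domains belong to one engine only, so the share is 0.75.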

The full cross-engine breakdown, with the per-engine source fingerprints, is the subject of our study of the AI consensus gap across Ontario engines. For this article the point is narrower: drift is the reason a frequency, not a result, is the only honest unit of measurement.

Why do ChatGPT, Gemini, Claude, and Perplexity each answer differently?

The four major engines answer differently because they were trained on different data, retrieve from different indexes, and run on different inference stacks, so their variance is not even the same shape. Two engines asked the identical question are not two attempts at one true answer. They are two different systems with two different default behaviours and two different ways of being inconsistent.

Some of the difference is the training mix: each model learned from a different slice of the web, so each has different priors about which businesses are notable. Some is retrieval: Gemini grounds through Vertex, Perplexity runs a live search, ChatGPT cites Maps and Knowledge Graph data heavily, Claude favours curated directories. And some is the inference plumbing, the batch-invariance issue covered above, which varies by provider. The practical upshot is that "track our AI visibility" is really four separate measurement jobs, each with its own baseline and its own noise floor. We go engine by engine in our comparison of how ChatGPT and Perplexity actually differ.

What run-to-run variance means for tracking your visibility

Variance means your AI visibility is a probability, not a position, so it has to be reported like one. Classic SEO let you say "you rank fourth for that keyword" and the number held still long enough to be useful. AI search has no equivalent fixed rank. The honest analogue is a mention frequency: out of N runs of this prompt on this engine on this date, your business was named in M of them. That M-out-of-N, with the date attached, is the real unit. Anything vaguer is a screenshot dressed up as a finding.
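
As a sketch, that unit fits in one small record. The class and field names below are illustrative, the answers are invented, and the substring match is deliberately crude. A real tracker would need to handle name variants and misspellings.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class MentionFrequency:
    engine: str
    prompt: str
    run_date: date
    mentions: int  # M: runs that named the business
    runs: int      # N: total runs of the prompt

    def __post_init__(self):
        if self.runs <= 0 or not 0 <= self.mentions <= self.runs:
            raise ValueError("need runs > 0 and 0 <= mentions <= runs")

    @property
    def rate(self):
        return self.mentions / self.runs

    def summary(self):
        return (
            f"{self.engine}, {self.run_date.isoformat()}: "
            f"named in {self.mentions} of {self.runs} runs ({self.rate:.0%})"
        )

def tally(engine, prompt, run_date, answers, business):
    """Count how many answers mention the business (case-insensitive substring)."""
    mentions = sum(business.lower() in answer.lower() for answer in answers)
    return MentionFrequency(engine, prompt, run_date, mentions, len(answers))

# Invented answers from five runs of one prompt on one engine.
answers = [
    "Try Acme Heating or Borealis Air.",
    "Borealis Air and Cedar Climate are popular.",
    "Acme Heating is a common pick.",
    "Cedar Climate, then Acme Heating.",
    "Borealis Air gets good reviews.",
]
report = tally("ChatGPT", "best HVAC company", date(2026, 5, 4), answers, "Acme Heating")
line = report.summary()
```

The summary line reads "ChatGPT, 2026-05-04: named in 3 of 5 runs (60%)". Engine, date, M, and N travel together, so nobody can quote the 60% without the sample behind it.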

What a Brantford or Ontario business should actually do

You do not need to run the experiments yourself, but you should know what an honest report looks like so you can tell whether yours is one. Ask your agency three questions. How many times was each prompt run, and over what window? Is the result reported as a frequency with a date, or as a single screenshot? Are the four engines tracked separately, since 83.7% of cited sources in our Ontario data were unique to one engine? An agency that measures variance like a scientist can answer all three without flinching. An agency that screenshots a lucky win cannot. That gap is the whole difference, and it is why a proper AI visibility diagnosis starts with a sampling plan, not a single query.

This is Vector 11, Measure, paired with Vector 12, Iterate. Measure means a stated sample size, a stated date, and a frequency rather than an anecdote. Iterate means re-running the same protocol on the same cadence so the frequency becomes a trend line you can act on. Statistics Canada's figures suggest this discipline is no longer optional: 12.2% of Canadian businesses used AI to produce goods or deliver services in 2025, roughly double the prior year, with professional and technical services near 31.7%. AI is mainstream enough that businesses need to track their visibility in it properly rather than from one screenshot.

How many times should you run a prompt before trusting it?

Run each prompt at least five times in a session and repeat that weekly over a window of at least 30 days. That is not a number we invented; it is roughly what SE Ranking recommends after studying prompt-tracking stability, about five consecutive runs of a prompt set per persona, tracked for 30 days or more. The academic GEO paper points the same direction from a different angle, averaging across multiple random seeds so that no single generation drives a conclusion. Keep the four engines on separate counts, never blended, since their variance is not even the same shape. Five runs will not give you a perfect probability, but it moves you from "we appeared once" to "we appeared in three of five runs this week, up from one of five last month," which is a sentence you can manage a business by.
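
To see how rough five runs are, it helps to put an interval around the frequency. The sketch below uses the Wilson score interval, a standard choice for small samples. That choice is ours for illustration and is not something SE Ranking or the GEO paper prescribes. Pooling four weeks into one count also assumes the underlying rate held steady over those weeks, which may not be true.

```python
import math

def wilson_interval(mentions, runs, z=1.96):
    """Approximate 95% Wilson score interval for a mention rate."""
    if runs <= 0 or not 0 <= mentions <= runs:
        raise ValueError("need runs > 0 and 0 <= mentions <= runs")
    p = mentions / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# One week: named in 3 of 5 runs.
week_low, week_high = wilson_interval(3, 5)

# Four weeks pooled at the same rate: named in 12 of 20 runs.
month_low, month_high = wilson_interval(12, 20)
```

Three of five gives an interval of roughly 23% to 88%, which is wide enough to explain why a single week should not be over-read. Twelve of twenty at the same rate narrows it noticeably, which is the practical argument for the 30-day window.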

How Formative Digital tracks AI answers honestly

Formative Digital tracks AI answers by running each prompt many times across all four engines on a fixed schedule and reporting a mention frequency with the sample size and date in plain sight. The Formative Forces, our orchestrated multi-agent system, makes the repeated sampling affordable at a scale a human-staffed team could not match by hand: many prompts, four engines, five-plus runs each, week after week. The output is not a triumphant screenshot. It is a frequency that goes up or down over time, which is the only honest way to show whether the work is moving the needle.

We saw this discipline pay off with our Brantford retail client, Mattress Miracle, where organic visibility grew from roughly 1,000 to more than 82,400 monthly organic visits (SEMrush, April 2026), tracked as trends rather than as one lucky capture. Results depend on industry, competition, and existing digital presence, and AI visibility will always carry run-to-run noise no method fully removes. The honest promise is not a fixed rank. It is that we measure the variance instead of hiding inside it, and report the frequency either way. You can see the full picture in the Mattress Miracle case study.

Frequently Asked Questions

Does setting temperature to 0 make AI answers identical?

No. Temperature 0 removes the deliberate randomness in word selection, but it does not make the output identical. Thinking Machines Lab sampled 1,000 completions at temperature 0 and still got 80 unique outputs. The reason is that the maths inside the server is not batch-invariant: when other users hit the model at the same time, the batch size changes, and the numbers come out slightly different. So the same prompt can still produce a different answer run to run with no randomness setting at all.

How many times should you run a prompt before trusting the result?

Run each prompt at least five times in one session and track it over time, not once. SE Ranking recommends roughly five consecutive runs of a prompt set per persona, repeated weekly over a window of at least 30 days. The academic GEO paper runs every experiment across multiple random seeds and averages them. The point is the same: report a mention frequency with a stated sample size and date, never a single screenshot.

Why is one screenshot of an AI answer almost worthless as proof?

Because the next run can return a different answer. SE Ranking's research on Google AI Mode found only 35% of cited domains repeat across runs of the same prompt, so roughly two-thirds of the sources in any single screenshot may vanish on the next attempt. A screenshot captures one sample from a distribution. It can show that an answer is possible, but it cannot show how often it happens, which is the only number that tells you whether your visibility is real or lucky.

Sources

  1. He, H. / Thinking Machines Lab. (2025, September 10). Defeating Nondeterminism in LLM Inference. 1,000 temperature-0 completions yielded 80 unique outputs, diverging at token 103; root cause is lack of batch invariance in inference kernels.
  2. SE Ranking. (2026). How to Choose Prompts to Track for AI Visibility. Only 35% of cited domains repeat across runs of the same prompt; prompt tracking is "directional, not definitive"; recommends ~5 consecutive runs weekly over 30+ days.
  3. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). GEO: Generative Engine Optimization (arXiv:2311.09735). Experiments run across multiple random seeds and averaged to reduce run-to-run variance.
  4. Indig, K. / Growth Memo. (2026, February 16). The Science of How AI Pays Attention. Analysis of ~1.2M ChatGPT citations: ~44% come from the first 30% of a page.
  5. Statistics Canada. (2025). Analysis on artificial intelligence use by businesses in Canada, second quarter of 2025. 12.2% of Canadian businesses used AI to produce goods or deliver services in 2025, roughly double the prior year.

Get Your Free AI Visibility Audit

Formative Digital, Brantford, Ontario

The audit runs your key prompts multiple times across ChatGPT, Gemini, Claude, and Perplexity, then reports a mention frequency with the sample size and date, so you see how often you actually appear rather than one lucky screenshot. You keep the report whether you engage further or not.
