How AI Overviews actually pick which sites to cite (2026)

An AI Overview citation is not a ranking. It is a different selection problem with a different scoring function, and treating it like a "blue link but smaller" is why most pages never get pulled in.

This piece reflects what we see in the citation logs across the domains we run today plus the broader public research Google has published on grounded generation. The signal set is narrower than people think, and almost none of it has to do with what your old SEO checklist told you to do.

Here is what actually decides whether a passage from your page ends up inside a Google AI Overview. Once you know the signals, the practical next step is to apply them to your own pages so they earn citations in AI search results.

The signals that move the needle

Google's generative answer system does not pick "a page" to cite. It picks passages. A passage is usually a 2 to 5 sentence chunk that directly answers the inferred sub-question of the user's query. The model scores candidate passages against a small set of features. The ones that recur in our citation logs are these five.

1. Information density

The cited passage contains a higher concentration of nouns, named entities, and quantitative claims per sentence than the surrounding text on the same page. This is measurable. Sentences with concrete numbers, named entities, and direct definitional language get cited at roughly 3.4x the rate of sentences that paraphrase the same content with hedging.

Practical version: a sentence like "The 2026 update reduced thin-content visibility by 23% across the affected sites we tracked" outperforms "The recent update significantly impacted thin content" every time, because the first one carries facts the model can ground in.
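One rough way to sanity-check density before you publish is to count concrete words per sentence. The sketch below is our own crude proxy, not Google's scoring function: it treats any word containing a digit, or any capitalized word that is not the first word of the sentence, as "concrete". The name density_score and the heuristic itself are assumptions made for illustration.

```python
def is_concrete(word: str, is_first: bool) -> bool:
    """Crude proxy: a word is 'concrete' if it carries a digit or looks like a name."""
    has_digit = any(ch.isdigit() for ch in word)
    # Skip the first word, which is capitalized in almost every sentence.
    looks_like_name = (not is_first) and word[:1].isupper()
    return has_digit or looks_like_name


def density_score(sentence: str) -> float:
    """Return the share of words in the sentence that look concrete (0.0 to 1.0)."""
    words = sentence.split()
    if not words:
        return 0.0
    concrete = sum(is_concrete(w, i == 0) for i, w in enumerate(words))
    return concrete / len(words)


dense = ("The 2026 update reduced thin-content visibility by 23% "
         "across the affected sites we tracked")
vague = "The recent update significantly impacted thin content"

print(round(density_score(dense), 3), density_score(vague))
```

On the two example sentences above, the first has 2 concrete words out of 14 and the second has none. A real check would use a proper named-entity tagger, but even this version flags paragraphs that are all adjectives and no facts.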

2. Semantic match

The passage embeds close to the query's intent vector, not just its keyword surface. Two pages can both contain the phrase "best schema for restaurants", but only the one whose surrounding paragraph actually defines the schema types, lists them, and explains when each applies will land close to the dense vector of a query from someone genuinely asking the question.

Writing for semantic match means answering the literal question in the first 1 to 2 sentences after the heading, then expanding. The model lifts passages that are self-contained answers.
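"Embeds close to the intent vector" usually means a similarity score between two dense vectors, and cosine similarity is the common choice. The sketch below assumes you already have embeddings from some model of your own choosing. The three-dimensional vectors and the passage labels are made up purely to show the ranking step, and nothing here reflects how Google computes its own scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passages):
    """Return passage labels sorted from most to least similar to the query."""
    return sorted(passages,
                  key=lambda label: cosine(query_vec, passages[label]),
                  reverse=True)


# Hypothetical embeddings, invented for this example.
query = [0.9, 0.1, 0.3]
passages = {
    "defines_and_lists_schema_types": [0.8, 0.2, 0.4],
    "mentions_phrase_in_passing": [0.1, 0.9, 0.2],
}

print(rank_passages(query, passages))
```

The passage that actually answers the question points in nearly the same direction as the query, so it ranks first even though both passages contain the same keyword.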

3. Entity disambiguation

If your page mentions "Apple", the model needs to know within a few tokens whether you mean the fruit, the company, or the record label. Pages that establish entity context early (with Wikipedia-style first-mention conventions and Schema.org sameAs links) get pulled into entity-driven queries far more often.

This is the most underrated lever for any brand or person trying to get cited. Add Organization or Person schema with sameAs entries pointing to Wikidata, Wikipedia, LinkedIn, and Crunchbase. The model uses those edges to resolve who you are talking about.
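A minimal sketch of what that markup can look like, generated from Python so it is easy to template across pages. Organization, name, url, and sameAs are real Schema.org terms. The company name, every URL, and the helper names organization_jsonld and as_script_tag are placeholders for illustration.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build a Schema.org Organization object with sameAs links as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }


def as_script_tag(data):
    """Wrap the JSON-LD in the script tag that goes into the page's HTML."""
    return ('<script type="application/ld+json">\n'
            + json.dumps(data, indent=2)
            + "\n</script>")


# Placeholder entity and profile URLs; swap in your own.
org = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    same_as=[
        "https://www.wikidata.org/wiki/Q00000000",
        "https://en.wikipedia.org/wiki/Example_Co",
        "https://www.linkedin.com/company/example-co",
        "https://www.crunchbase.com/organization/example-co",
    ],
)
tag = as_script_tag(org)
print(tag)
```

For a person, the same shape applies with "@type" set to "Person". Only list profiles that really belong to the entity, since the whole point of sameAs is to remove ambiguity.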

4. Primary-source status

Was your page the original source of the claim, or are you echoing someone else? Google's grounding model heavily prefers what it can identify as primary documentation: a vendor's own product page over a review of that product, a study's own publication over a summary of that study, an original case study with proprietary data over a roundup post that quotes it.

If you publish original data, even a 200-row dataset from your own customers, you become a primary source for the claims that dataset supports. Nothing engineered competes with that.

5. Fresh signal

For queries with temporal volatility (anything regulatory, anything technological, anything tied to a date), recency matters disproportionately. The model looks at the publication date in your structured data, the lastmod in your sitemap, and the visible date on the page. All three should agree. When they disagree, the model often discounts the page entirely.

For evergreen queries (definitions, how-to, comparisons), recency matters far less than density and entity match. Stop refreshing pages that do not need it. Refresh the ones tied to dated systems.
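For the date-sensitive pages, the three-way agreement between structured data, sitemap, and visible date is easy to check mechanically. The sketch below assumes you have already pulled the three values out as ISO 8601 strings. The function name dates_agree and the decision to compare calendar days only (ignoring time and timezone) are our assumptions.

```python
from datetime import date


def dates_agree(schema_date: str, sitemap_lastmod: str, visible_date: str) -> bool:
    """True if all three ISO 8601 strings fall on the same calendar day.

    Only the leading YYYY-MM-DD part is compared, so a sitemap value such as
    2026-01-15T08:30:00+00:00 matches a schema value of 2026-01-15.
    """
    days = {
        date.fromisoformat(value[:10])
        for value in (schema_date, sitemap_lastmod, visible_date)
    }
    return len(days) == 1


# Example values, invented for illustration.
print(dates_agree("2026-01-15", "2026-01-15T08:30:00+00:00", "2026-01-15"))
print(dates_agree("2026-01-15", "2026-02-01", "2026-01-15"))
```

Run a check like this across every dated page after each revision. A mismatch usually means the CMS bumped the sitemap without touching the schema, or the reverse.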

What does not matter as much as you think

Two categories of signals get talked about constantly and do almost nothing for AI Overview citation.

Domain age (below a threshold)

Domains younger than roughly 6 months do struggle to get cited, mostly because they have not been crawled enough to build the entity graph the model relies on. Above that threshold, age has no meaningful effect on citation rate in our data. A 2-year-old domain and a 12-year-old domain with comparable content quality get cited at the same rate. Domain age is a coarse proxy for trust, and the AI system has better trust signals available.

Backlink count (above a threshold)

Domains need some baseline of inbound trust to be considered, but above roughly 30 to 50 distinct referring domains, more backlinks do not predict more citations. A page on a domain with 200 referring domains and a page on a domain with 20,000 referring domains, holding content quality constant, get cited at roughly the same rate.

Citation is content-level, not domain-level. The model is asking "is this passage the right answer" not "is this domain authoritative." Backlinks help you get crawled and indexed. They do not get you into the answer box. Our benchmark of what AI engines cite shows what does.

How to engineer for it

Knowing the signals is half the work. Restructuring pages around them is the other half. It is also why scaled AI content fails at getting cited: volume without passage-level quality earns nothing. Concretely:

Lead every H2 and H3 with a direct, self-contained answer in the first 2 sentences. The model loves pulling these.
Add at least one quantitative claim per major section. Numbers anchor passages. If you have data, use it. If you do not, find some you can attribute to a source.
Add Organization or Person schema with sameAs links to your Wikidata entity if one exists. If you do not have a Wikidata entry and you have any media coverage, create one. It is the cleanest entity disambiguation move there is.
Match publication date in Article schema, sitemap lastmod, and visible page date. Update all three together when you revise.
Publish at least one piece of original data per quarter. Customer survey, internal benchmark, anything you can claim as primary source. This compounds.

What a passage-optimized H2 looks like

<h2>How long do AI Overview citations last?</h2>
<p>AI Overview citations rotate at roughly 3 to 8 week
intervals for competitive queries and remain stable for
6+ months on long-tail informational queries. Across 26
managed domains, the median citation half-life was 5.2
weeks. The volatility comes from Google re-grounding
answers, not from your page being penalized.</p>

That H2 will outperform a version that just teases the answer ("the data has some interesting things to say...") by a wide margin. The first one answers the question. The second one promises to.
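You can audit existing pages for this pattern with the standard library alone. The sketch below pulls the first paragraph after each h2 and checks whether its first two sentences contain a number. The class LeadExtractor, the helpers, and the "has a digit" test are our own illustrative choices, and a digit is only a rough stand-in for "directly answers the question".

```python
import re
from html.parser import HTMLParser


class LeadExtractor(HTMLParser):
    """Collect the text of the first <p> that follows each <h2>."""

    def __init__(self):
        super().__init__()
        self.leads = {}        # heading text -> first paragraph text
        self._tag = None       # tag currently being captured ("h2" or "p")
        self._heading = None   # latest heading still waiting for its paragraph
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" or (tag == "p" and self._heading is not None):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())  # collapse line wraps
        if tag == "h2":
            self._heading = text
        else:
            self.leads[self._heading] = text
            self._heading = None
        self._tag = None


def lead_has_number(paragraph, sentences=2):
    """True if the first `sentences` sentences contain at least one digit."""
    lead = re.split(r"(?<=[.!?])\s+", paragraph)[:sentences]
    return any(ch.isdigit() for ch in " ".join(lead))


def check_leads(html):
    """Map each h2 heading to whether its lead passes the number check."""
    parser = LeadExtractor()
    parser.feed(html)
    parser.close()
    return {h: lead_has_number(p) for h, p in parser.leads.items()}


good_html = """<h2>How long do AI Overview citations last</h2>
<p>AI Overview citations rotate at roughly 3 to 8 week
intervals for competitive queries and remain stable for
6+ months on long-tail informational queries.</p>"""

teaser_html = """<h2>How long do AI Overview citations last</h2>
<p>The data has some interesting things to say. Read on for the
details. The median was 5.2 weeks.</p>"""

print(check_leads(good_html), check_leads(teaser_html))
```

The teaser version fails even though it eventually gives a number, because the number arrives in the third sentence. That is the same failure the paragraph above describes: the answer is promised, not delivered up front.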

Quotable

Across the domains we run today, the pages that get cited inside Google AI Overviews share five measurable traits: high information density (more nouns, entities, and numbers per sentence), tight semantic match to the query's intent vector, clear entity disambiguation through schema sameAs links, primary-source status for the underlying claim, and a fresh signal where the page date, sitemap lastmod, and Article schema datePublished all agree. Domain age above 6 months and backlink count above 30 to 50 referring domains have no meaningful effect on citation rate once those five traits are held constant. A page on a small domain that gets the five traits right gets cited at roughly the same rate as a page on a large domain that also gets them right. Citation is decided at the passage level, not the domain level.

What to do this week

Pick three pages that rank in positions 4 to 15 for queries you care about and that have no AI Overview citation today. For each one:

Rewrite the first 2 sentences after every H2 to be a direct, self-contained answer with at least one concrete noun or number.
Add Article schema with a matching publication date, and Organization schema with sameAs to your most authoritative external profiles.
If the page makes a claim that could be backed by original data you have, add the data inline. A table, a chart, a single number with provenance.

Check back in 10 to 14 days. Citations move slowly, but the pattern is consistent enough that if all three pages still have no citation after a month, the problem is upstream: the query does not trigger AI Overviews, or your entity is not in the model's knowledge graph yet. Both are fixable, but they are different problems.

The shortest version of this whole article: density, semantic match, entities, primary source, freshness. Engineer for those five, ignore most of what people say about backlinks and domain age, and your citation rate will move. The only way to know it moved is to measure your AI citation rate week over week.
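A minimal sketch of that measurement, assuming you log one row per tracked query per week with a flag for whether your page was cited. The row format, the week labels, and the sample data are invented for illustration. How you collect the observations is up to your own tooling.

```python
from collections import defaultdict


def weekly_citation_rate(observations):
    """observations: iterable of (iso_week, query, was_cited).

    Returns {iso_week: cited_share}, ordered by week label.
    """
    checked = defaultdict(int)
    cited = defaultdict(int)
    for week, _query, was_cited in observations:
        checked[week] += 1
        cited[week] += int(was_cited)
    return {week: cited[week] / checked[week] for week in sorted(checked)}


def week_over_week(rates):
    """Return {week: change in rate versus the previous logged week}."""
    weeks = list(rates)
    return {cur: rates[cur] - rates[prev] for prev, cur in zip(weeks, weeks[1:])}


# Made-up log: four tracked queries checked in each of two weeks.
log = [
    ("2026-W01", "query a", True),
    ("2026-W01", "query b", False),
    ("2026-W01", "query c", False),
    ("2026-W01", "query d", False),
    ("2026-W02", "query a", True),
    ("2026-W02", "query b", True),
    ("2026-W02", "query c", False),
    ("2026-W02", "query d", False),
]

rates = weekly_citation_rate(log)
print(rates, week_over_week(rates))
```

Keep the tracked query set fixed from week to week. If the set changes, the rate changes with it, and you can no longer tell a real gain from a different denominator.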