The Content Relevance Problem and How Google Actually Decides What Your Page Is About

Every SEO professional talks about content relevance. Make your content relevant. Write relevant content. Optimize for relevance. It's the kind of advice that sounds helpful until you realize nobody is explaining what relevance actually means inside the system that determines your rankings.

Here's the problem: most advice about content relevance treats it like a writing quality. If you write well, target the right keywords, and cover your topic thoroughly, your content is "relevant." That framing is about fifteen years out of date. Google doesn't evaluate content relevance the way a human editor would. It runs your page through mathematical models that determine, with numerical precision, what your page is about and whether that aboutness matches a given search query.

I call this the aboutness problem, and it's the gap between what SEO practitioners think relevance means and what Google's systems are actually measuring.

What content relevance means for SEO inside Google's ranking systems

Content relevance is a mathematical score that measures how closely your content matches what a searcher is looking for. Google determines content relevance by converting both the search query and your page into vector embeddings, then measuring the distance between them.

Google doesn't read your page the way a human does. It converts your content into numerical representations, compares those representations against the query, and assigns a relevance score based on mathematical similarity.

The old version of this was keyword matching. If a user searched "content relevance" and your page used that phrase twelve times, you were relevant. That approach died when Google rolled out BERT in 2019 and then MUM in 2021. Both models use contextual embeddings, which means they interpret each word in light of the words around it. The word "apple" on a page about farming means something completely different from "apple" on a page about technology, and Google's embedding models capture that distinction without needing you to spell it out.

This matters because content relevance is no longer about whether you used the right words. It's about whether the entire semantic meaning of your page, converted into a high-dimensional vector, sits close to the query vector in that same space. As iPullRank's relevance engineering framework puts it, relevance is not a qualitative measure for search engines. It's a mathematical one.

The API leak confirmed Google scores your topic focus

In 2024, a massive leak of Google's internal API documentation revealed more than 14,000 ranking attributes. I wrote about how the API leak changed everything when it first happened, and six months later the implications are still unfolding.

Two attributes from the leak matter enormously for content relevance: siteFocusScore and siteRadius.

The siteFocusScore attribute quantifies how concentrated a site is on specific topics. It's a 0-to-100 scale where higher means more topically focused. The siteRadius attribute measures how far any individual page's topic drifts from the site's core theme. Together, they form a mathematical system for evaluating aboutness at both the page and site level.

What this means in practice: Google vectorizes your entire site and individual pages, then compares those embeddings. A page about "content relevance" on an SEO-focused site will score differently than the same content on a general marketing blog. The site's topical DNA matters, and pages that drift from it get penalized through a higher siteRadius score.
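The leak names these attributes but does not publish the formulas behind them, so what follows is only a rough sketch of the idea, not Google's calculation. I'm assuming the site embedding is the average of the page embeddings and that drift is one minus the cosine similarity to that average. The helper names (centroid, page_drift, mean_drift) and the two-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of the page vectors (assumed stand-in for a site embedding)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def page_drift(page_vectors):
    """Distance of each page from the site centroid (assumed: 1 - cosine similarity)."""
    site_vector = centroid(page_vectors)
    return [1.0 - cosine_similarity(v, site_vector) for v in page_vectors]

def mean_drift(page_vectors):
    """Average drift across the site; lower means tighter topical focus."""
    drifts = page_drift(page_vectors)
    return sum(drifts) / len(drifts)

# Toy embeddings: every page on the focused site points the same way.
focused_site = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
scattered_site = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

focused = mean_drift(focused_site)
scattered = mean_drift(scattered_site)
print(f"focused: {focused:.3f}  scattered: {scattered:.3f}")
```

The scattered site ends up with the larger average drift, which is the behavior a radius-style measure would be expected to show.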

I've seen this play out in audits. Sites that publish scattered content across unrelated topics almost always underperform sites with tight topical focus, even when the drifting site has stronger backlinks. The API leak just confirmed what the data was already showing. When I examined the leak's implications six months later, the pattern only got clearer.

How embeddings replaced search keywords for determining aboutness

Before BERT, Google determined aboutness primarily through lexical signals: keyword frequency, placement in titles and headings, proximity of related terms. That was the era of keyword density and TF-IDF optimization.
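For contrast, here is a minimal sketch of what a lexical signal like keyword density looks like. The function name keyword_density and the sample sentences are my own illustration, not a reconstruction of anything Google ran.

```python
import re

def keyword_density(text: str, phrase: str) -> float:
    """Share of the page's words taken up by exact occurrences of `phrase`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if not words or n == 0:
        return 0.0
    # Count positions where the phrase appears word for word.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words)

exact_page = "Content relevance matters. Content relevance is how search engines score pages."
paraphrase_page = "How well a page matches what the searcher wants."

exact_density = keyword_density(exact_page, "content relevance")
paraphrase_density = keyword_density(paraphrase_page, "content relevance")
print(exact_density, paraphrase_density)
```

The paraphrased page scores zero even though it covers the same idea, which is the weakness of purely lexical matching.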

The shift to transformer-based embeddings changed the entire calculation. Now Google converts both the query and your page content into vectors in a shared semantic space. Relevance becomes a geometry problem: how close is your document vector to the query vector? Cosine similarity and dot-product calculations handle the math.
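The geometry is easy to see with toy numbers. This sketch uses made-up three-dimensional vectors (real embeddings have hundreds of dimensions or more) and plain Python rather than any particular embedding library; the variable names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by vector lengths, so the result lies in [-1, 1]."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical embeddings for a query and two candidate pages.
query = [0.9, 0.1, 0.0]
page_on_topic = [0.8, 0.2, 0.1]
page_off_topic = [0.1, 0.2, 0.9]

on_topic_score = cosine_similarity(query, page_on_topic)
off_topic_score = cosine_similarity(query, page_off_topic)
print(f"on topic: {on_topic_score:.3f}  off topic: {off_topic_score:.3f}")
```

The on-topic page points in nearly the same direction as the query, so its cosine similarity is close to 1, while the off-topic page scores much lower.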

This is a two-stage process. First, Google's retrieval system finds candidate documents whose embeddings are geometrically close to the query embedding. Then a reranking system, which is more compute-intensive, evaluates whether those pages actually satisfy the information need. Lumar's analysis of semantic search models breaks this down well.
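A retrieve-then-rerank pipeline can be sketched in a few lines. Everything here is a simplified assumption: the index layout, the function names retrieve and rerank, and especially the reranker, which uses simple query-term coverage as a stand-in for the far heavier model a real search engine would run.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k):
    """Stage 1: cheap lookup of the k pages whose vectors sit closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]["vector"]),
        reverse=True,
    )
    return [url for url, _ in ranked[:k]]

def rerank(query_terms, candidates, index):
    """Stage 2 (stand-in): reorder candidates by the share of query terms they cover."""
    def coverage(url):
        words = set(index[url]["text"].lower().split())
        return sum(term in words for term in query_terms) / len(query_terms)
    return sorted(candidates, key=coverage, reverse=True)

# Hypothetical mini-index of three pages.
index = {
    "/a": {"vector": [0.9, 0.1], "text": "how embeddings measure content relevance"},
    "/b": {"vector": [0.95, 0.05], "text": "relevance tips"},
    "/c": {"vector": [0.1, 0.9], "text": "stadium seating guide"},
}

candidates = retrieve([1.0, 0.0], index, k=2)
final_order = rerank(["content", "relevance", "embeddings"], candidates, index)
print(candidates, final_order)
```

Page /b is geometrically closest, so it leads after retrieval, but /a covers more of the information need and moves to the top after reranking. Closeness gets you into the candidate set; it doesn't guarantee the final position.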

Entities changed the game further. Google's Knowledge Graph allows it to disambiguate concepts at the entity level. As I wrote in my piece on how Google clusters entities and why your brand needs a knowledge panel, Google doesn't just understand what words are on your page. It understands which entities your page references and how those entities relate to each other.

A page that mentions "neural networks," "training data," and "transformer architecture" isn't just about those keywords. Google recognizes those entities as belonging to the machine learning domain, and your page's aboutness score reflects that topical cluster. This is why NLP models like BERT and MUM parse your content in ways that go far beyond what keyword research tools can measure.

Google's patents reveal they track when your topic drifts

Bill Slawski spent years analyzing Google's patent filings before his passing, and his work on aboutness remains some of the most valuable SEO analysis ever published. One patent he highlighted, US9031929B1, describes how Google calculates site quality scores based partly on topic consistency.

The patent describes a system that tracks what content a page is "about" using multiple representations and monitors whether that aboutness changes over time. Pages that maintain consistent topical focus score better than pages that shift subjects. This isn't speculation. It's in the patent language.

Slawski also identified how Google indexes phrase patterns on pages to understand aboutness. A page about baseball stadiums naturally contains phrases like "pitcher's mound," "outfield fence," and "home plate." Google's systems recognize these co-occurring phrases as signals of topical aboutness, independent of whether you explicitly targeted any of them as keywords.

Google's information gain patent adds another layer. It describes how Google measures the uniqueness of a page's content compared to everything else in its index. Content that says the same thing as ten other pages on the same topic contributes little or no information gain and gets scored accordingly. Your page's aboutness isn't just about what topic you cover. It's about whether you're adding something new to the conversation about that topic.
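One way to build intuition for this is a crude novelty score. To be clear, this is not the method described in the patent. It's a hypothetical proxy I'm using for illustration: novelty is one minus the highest cosine similarity between a new page and any page already covering the topic. The function name novelty and the vectors are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def novelty(candidate, existing):
    """Assumed proxy: 1 - similarity to the nearest page already in the set."""
    if not existing:
        return 1.0
    return 1.0 - max(cosine_similarity(candidate, v) for v in existing)

# Hypothetical embeddings of pages that already cover the topic.
existing_pages = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]

rehash = [0.9, 0.1, 0.0]       # says what an existing page already says
fresh_angle = [0.5, 0.1, 0.8]  # same topic, plus a dimension nobody else covers

rehash_novelty = novelty(rehash, existing_pages)
fresh_novelty = novelty(fresh_angle, existing_pages)
print(f"rehash: {rehash_novelty:.3f}  fresh: {fresh_novelty:.3f}")
```

The rehash lands at essentially zero because an equivalent page already exists, while the page with a new angle scores noticeably higher.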

The data shows topic focus beats broad coverage

The theoretical framework is one thing. The performance data is another.

A 2023 study by Graphite analyzed 332 URLs across 12 domains and found that content with high topical authority gains search visibility 57% faster than content with low topical authority. Those pages were also 62% more likely to generate traffic within their first week. That's not a marginal advantage. It's a structural one.

Separate research examining 253,800 search results found that page-level topical authority is the single largest on-page ranking factor, outperforming even domain-level traffic metrics. Sites that restructured around topical focus saw organic traffic increases of 60% within six months.

iPullRank's relevance engineering methodology takes this further by using vector embeddings and cosine similarity to calculate semantic closeness between articles and topic centroids. They assign objective relevance scores on a -1 to 1 scale, then prune or restructure content that falls below a threshold. It's the same kind of math Google's systems appear to use, applied proactively.
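A centroid-and-threshold audit along these lines might look like the sketch below. This is my own simplified version, not iPullRank's implementation: the function names, the toy vectors, and the 0.7 cutoff are all assumptions, and in practice the embeddings would come from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of the vectors that define the core topic."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def audit(articles, topic_centroid, threshold):
    """Split article URLs into (keep, review) by similarity to the topic centroid."""
    keep, review = [], []
    for url, vector in articles.items():
        score = cosine_similarity(vector, topic_centroid)
        (keep if score >= threshold else review).append(url)
    return keep, review

# Hypothetical embeddings: core service pages define the topic centroid.
core_pages = [[0.9, 0.1], [0.8, 0.2]]
articles = {
    "/seo-guide": [0.88, 0.12],
    "/hiring-tips": [0.1, 0.9],
}

keep, review = audit(articles, centroid(core_pages), threshold=0.7)
print("keep:", keep, "review:", review)
```

Anything in the review list is a candidate for pruning or restructuring, not an automatic deletion. The threshold is a judgment call that depends on the embedding model and the site.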

I've applied similar thinking in audits. One manufacturing client had a blog section that covered everything from hiring tips to industry awards. After we pruned the off-topic content and restructured around the core topics that supported their service pages, organic traffic to their key service pages increased 40% in four months. The content we removed wasn't bad. It was just about the wrong things.

If you've watched what happens when rankings drop after a site launch, you know that topic dilution is one of the most common culprits. New sites often launch with content that's scattered across too many subjects, and Google's aboutness scoring punishes them for it.

What this means for your on-page optimization

The aboutness problem has practical implications for how you build and optimize pages.

First, stop thinking about content relevance as a writing quality and start thinking about it as a mathematical property. Your page has an embedding. That embedding sits somewhere in vector space. The question is whether it sits close to the queries you want to rank for.

Second, topic focus matters more than topic breadth. Google's siteFocusScore rewards concentration. Build your content strategy around tight topical clusters rather than trying to cover everything.

Third, entity density beats keyword density. Use specific, recognized entities rather than generic phrases. Name the patents, the researchers, the systems. Google's entity recognition is sophisticated enough to identify what your page is "about" based on the entities you reference.

Fourth, check your site's topic drift. If you're publishing content that doesn't connect to your core offerings, you're raising your siteRadius and diluting your relevance scores across every page. An honest on-page optimization audit should evaluate not just individual pages but the topical coherence of your entire content inventory.

The aboutness problem isn't going away. As Google's language models get more sophisticated, the distance between keyword optimization and true content relevance will only grow. The sites that understand what relevance actually means, mathematically, will keep ranking. The ones still stuffing keywords into headings will keep wondering why their "relevant" content doesn't perform.