How SEO Embeddings Predict Which Content Will Rank

Last year I audited a manufacturer who could not figure out why their best product page would not move. The content was thorough. The page had links. On paper it should have ranked. So I stopped reading the page and vectorized it instead. I ran the page and the target query through an embedding model, scored the distance between the two vectors, and the number was ugly. The page sat nowhere near the query in vector space, and the rankings agreed with the math. Before Google told me anything, the embeddings already had.

That is the part most SEOs have not internalized yet. The relationship between your content and a query is now a measurable distance in vector space, and you can measure it yourself. You do not have to wait for the search results to tell you whether a page is relevant. The embedding tells you first, which is the whole reason SEO embeddings predict which content will rank.

What SEO embeddings and vector embeddings actually are

SEO embeddings are numerical representations of meaning. A search engine converts your page and the search query into vectors, sets of numbers that place text in a shared vector space, and then it measures how close those two vectors sit using cosine similarity. The closer they are, the more relevant your content looks, which is why SEO embeddings predict which content will rank before Google publishes a single position.

The idea is older than the current AI panic. Google patented it in 2013. The Word2Vec patent, which covers word vector representations, was filed by Tomas Mikolov and three colleagues at Google. The famous demonstration from that work still explains vector embeddings in one line. Take the vector for "king," subtract the vector for "man," add the vector for "woman," and the nearest vector is "queen." The math captured a relationship between words that no keyword match ever could. That is what an embedding does. It turns meaning into coordinates in vector space.
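The king and queen arithmetic can be reproduced with a toy example. This is a minimal sketch, assuming hand-made two-dimensional vectors in which one axis stands for royalty and the other for gender. The vectors, the `toy_vectors` dictionary, and the `analogy` helper are all illustrative assumptions; real Word2Vec vectors are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (assumption): axis 0 = "royalty", axis 1 = "gender".
# These are not real Word2Vec values.
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


print(analogy("king", "man", "woman", toy_vectors))  # prints "queen"
```

The input words are excluded from the candidates, which is the usual convention for this demonstration, because the result vector often stays closest to one of the words it was built from.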

From keyword matching to vectors and semantic search

People ask what the difference is between an embedding and a vector. The vector is the list of numbers. The embedding is a vector that a trained model assigned to a piece of text so that similar meanings land near each other. Word embeddings, the first practical breakthrough in natural language processing, did it for single words. Modern vector embeddings do it for whole passages, whole pages, and whole sites, and search engines now lean on them for nearly every query.

This is the shift from keyword matching to semantic search. Old search engines matched the literal string. A page about "affordable laptops" and a query for "cheap notebooks" looked unrelated to keyword matching, because none of the words matched. In vector space those two phrases sit almost on top of each other, because the embedding learned that cheap and affordable, laptops and notebooks, carry nearly identical meaning. Semantic search is just search engines using vector embeddings to read meaning and user intent instead of matching strings, and it is why stuffing the exact keyword stopped working years ago.

How cosine similarity turns semantic relevance into a ranking number

Here is the mechanism, because it drives everything that follows. When search engines crawl your page, they split the text into tokens, convert those tokens into vectors, and combine them into vector embeddings for your paragraphs and your page. They do the same to the query. Then they calculate cosine similarity, which is the cosine of the angle between two vectors. A score near 1 means the two vectors point the same way, so the texts are saying the same thing. A score near 0 means they are unrelated. A score near -1 means they are opposites. That cosine similarity score is your semantic relevance, the semantic similarity between your page and the query, expressed as a single number.
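For readers who want to see the calculation itself, here is a minimal sketch of cosine similarity in plain Python. The function name and the sample vectors are illustrative assumptions; production systems work with much longer vectors and optimized libraries, but the arithmetic is the same: the dot product divided by the product of the two vector lengths.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: score close to 1.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: score of 0.
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: score close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same, 3), round(unrelated, 3), round(opposite, 3))
```

Note that the first pair scores as identical even though one vector is twice as long. Cosine similarity compares direction only and ignores magnitude.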

This is where passage-level retrieval changes the work. Search engines and large language models do not score your page as one blob. They score it in chunks. A focused paragraph about one idea sits tighter to a query than a sprawling paragraph about four ideas, because the four ideas pull the vector in four directions and dilute its meaning. iPullRank's testing on chunking found that splitting a multi-topic passage into single-topic chunks can lift cosine similarity by 15 to 20 percent, and a query-aligned heading above the passage lifts the semantic relevance further. The structure is not decoration. It moves the number.

Cosine similarity lift from splitting a multi-topic passage into single-topic chunks: minimum 15 percent, maximum 20 percent. Source: iPullRank
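The dilution effect is easy to see with toy numbers. This is a minimal sketch, assuming four made-up topic vectors and assuming that a multi-topic paragraph behaves like the average of its topics (mean pooling). How any search engine actually pools passage vectors is not public, so the numbers below illustrate the direction of the effect only and are unrelated to the iPullRank figures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Average a list of vectors dimension by dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy setup (assumption): each axis represents one topic.
query = [1.0, 0.0, 0.0, 0.0]
focused_chunk = [0.9, 0.1, 0.0, 0.1]  # almost entirely about the query's topic
four_topics = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
sprawling_chunk = mean_pool(four_topics)  # one paragraph covering all four

focused_score = cosine_similarity(focused_chunk, query)
sprawling_score = cosine_similarity(sprawling_chunk, query)
print(round(focused_score, 3), round(sprawling_score, 3))
```

In this toy setup the sprawling paragraph scores about half as well as the focused one, even though a quarter of it is exactly on topic.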

The correlation holds up under study. Lumar ran an analysis of more than 2,000 search queries, scored the semantic relevance of the top ten results with vector models, and compared those scores to ranking position. Vector embeddings predicted position better than older term-frequency methods, and the tightest correlation showed up in short elements like titles and H1s. The pages that ranked were the pages whose vectors sat closest to the query, and the effect was strongest exactly where you would expect it, in the most concentrated text on the page. If you want the longer version of how retrieval engines decide what to surface, I wrote about how AI Overviews decide which sources to cite, and the same cosine similarity logic runs underneath it.

The Google API leak put page embeddings and site embeddings in the ranking system

For years this was theory you had to argue for. Then the Google API leak handed over the receipts. The leaked Content Warehouse documentation, unpacked by Search Engine Land and analyzed in depth by iPullRank, names the embeddings attributes directly. Google computes a pageEmbeddings vector for individual pages and a siteEmbeddings vector for the whole domain. There is a site2vecEmbedding, a sitewide vector built the same way Word2Vec builds a word vector. And there are two scores attached to those embeddings that should change how you think about your site.

The first is siteFocusScore, which measures how tightly your site sticks to one topic. The second is siteRadius, which measures how far an individual page's embedding drifts from your site embedding. Read that again. Google builds a topical identity for your whole site as a vector, then measures every page against that identity. A page that wanders from what your site is about does more than rank poorly on its own. It sits at a larger radius from your center in vector space, and that distance feeds the ranking system.
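The leaked documentation names these attributes but does not publish the formulas behind them, so the following is only a sketch of one plausible reading. It assumes the site embedding is the average of the page embeddings, that a page's radius is one minus its cosine similarity to that average, and that focus is the mean similarity across pages. The function names, the URLs, and the vectors are all hypothetical, and none of this should be taken as Google's computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def site_centroid(page_vectors):
    """Assumed site embedding: the dimension-wise mean of all page vectors."""
    vectors = list(page_vectors.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def page_radius(page_vector, centroid):
    """Assumed radius: cosine distance from the page to the site centroid."""
    return 1.0 - cosine_similarity(page_vector, centroid)


def focus_score(page_vectors):
    """Assumed focus: average similarity of every page to the centroid."""
    centroid = site_centroid(page_vectors)
    sims = [cosine_similarity(v, centroid) for v in page_vectors.values()]
    return sum(sims) / len(sims)


# Hypothetical two-dimensional page vectors for a cable manufacturer.
pages = {
    "/cable-specs": [0.9, 0.1],
    "/cable-install": [0.8, 0.2],
    "/office-snacks": [0.1, 0.9],
}
centroid = site_centroid(pages)
radii = {url: page_radius(vec, centroid) for url, vec in pages.items()}
focus = focus_score(pages)

for url, radius in sorted(radii.items(), key=lambda item: item[1], reverse=True):
    print(url, round(radius, 3))
print("focus:", round(focus, 3))
```

Under these assumptions the off-topic page shows up with the largest radius, which is the kind of outlier the two scores are described as capturing.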

None of the embeddings explainers ranking for this topic connect those dots. They tell you what a vector is. They do not tell you that Google's own leaked scoring function holds pageEmbeddings, siteEmbeddings, siteRadius, and siteFocusScore as live inputs. The leak even spelled out the tactic in plain language. Because embeddings are scored page by page and site wide, you should write your headings around real queries and make the paragraph under each heading answer that query directly. That is not an SEO blogger's opinion. That is the advice that fell out of Google's own documentation, and it is the same chunking discipline cosine similarity rewards.

How to predict which content will rank with a vector embeddings analysis

This is the part the title promises, and it is the part almost nobody does. You can run the same calculation search engines run, on your own content, before you publish it. A vector embeddings analysis is a prediction, not a postmortem.

Vectorize the page and score it against the query in vector space

Pull your page and your target query through an embedding model. I use Screaming Frog wired to an embeddings endpoint for this, the same workflow Mike King documented for vectorizing the web, though any setup that returns vectors and a cosine similarity score will do. Score the whole page against the query first. If the page scores low in vector space, it is not relevant in the way the search engine measures relevance, and no amount of new backlinks will fix a semantic relevance problem.
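The whole-page score can be sketched in a few lines once you have something that returns vectors. The example below is an assumption-heavy illustration: `score_page` is a hypothetical helper, and `toy_embed` is a stand-in that only counts words from a tiny fixed vocabulary. It is not semantic and is not the Screaming Frog workflow. In practice you would pass in a function that calls a real embedding model.

```python
import math
import re
from collections import Counter

# Tiny fixed vocabulary for the stand-in embedder (assumption, illustration only).
VOCAB = ["industrial", "cable", "rated", "voltage", "company", "founded", "history"]


def toy_embed(text):
    """Stand-in embedder: word counts over VOCAB. Replace with a real model."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in VOCAB]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_page(page_text, query, embed=toy_embed):
    """Embed the page and the query with the same function, then compare."""
    return cosine_similarity(embed(page_text), embed(query))


query = "industrial cable voltage"
on_topic = score_page(
    "Industrial cable rated for high voltage. Each cable lists its rated voltage.",
    query,
)
off_topic = score_page(
    "Our company was founded in 1950. Company history matters.",
    query,
)
print(round(on_topic, 2), round(off_topic, 2))
```

The one rule that carries over to a real setup is that the page and the query must go through the same model, because vectors from different models do not share a vector space.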

Break the page into chunks and find the weak vectors

Then score it paragraph by paragraph. Find your single best chunk for the query, see how its cosine similarity scores, and compare it to the chunk a competitor is winning with. With my manufacturer, the money page averaged a weak vector similarity to its head term because half the page talked about company history. We rewrote the page so the strongest paragraph answered the actual query in the first two sentences, cut the off-topic filler that was dragging the embedding sideways, and the cosine similarity jumped before the rankings did. The number moved first, the position followed. That predictive loop, scoring the embedding distance and decomposing the query into its component questions, is most of what I now do inside an enterprise SEO audit.
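The chunk-level pass is the same comparison repeated per paragraph and then sorted. This is a minimal sketch, assuming each chunk has already been embedded. The chunk texts, the two-dimensional vectors, the `rank_chunks` helper, and the 0.5 cutoff for a weak chunk are all hypothetical, and a sensible threshold depends on the embedding model you use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(chunks, query_vector):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(vec, query_vector)) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pre-computed vectors: axis 0 = product topic, axis 1 = company topic.
query_vector = [1.0, 0.0]
chunks = [
    ("Company history since 1950", [0.1, 0.9]),
    ("Voltage ratings for armored cable", [0.9, 0.2]),
    ("Meet the leadership team", [0.2, 0.8]),
]

WEAK_THRESHOLD = 0.5  # arbitrary cutoff for illustration

ranked = rank_chunks(chunks, query_vector)
weak = [text for text, score in ranked if score < WEAK_THRESHOLD]

for text, score in ranked:
    print(round(score, 3), text)
print("weak chunks:", weak)
```

The top row is the chunk to compare against a competitor's best passage, and the weak list is where to start cutting or rewriting.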

What SEO embeddings reward and what search engines punish

Once you can see the number, the optimization stops being mysterious. Vector embeddings reward some things and quietly punish others.

Topical focus, content pruning, and internal linking

Embeddings reward staying in your lane. If your siteEmbeddings vector says you are about industrial cable and you publish ten posts about office snacks, those posts sit at a wide siteRadius and drag your topical focus down. Pruning the outliers is one of the fastest wins I know, and it works because you are tightening the site's vector, not chasing a keyword. Vector embeddings also rebuild your internal linking for you. Once every page is a vector stored in a vector database, finding the closest pages to any given page by cosine similarity is a single query, and that ranked list is the internal linking map you should have had all along. The same vector math handles redirect mapping and content clustering, because every one of those tasks is really a question about which embeddings sit near which in vector space.
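The internal linking lookup is a nearest-neighbor search. Here is a minimal sketch that uses an in-memory dictionary in place of a vector database. The URLs, the three-dimensional vectors, and the `nearest_pages` helper are hypothetical, and a real vector database would run the same kind of query through its own index at much larger scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_pages(url, page_vectors, k=3):
    """Return the k pages most similar to `url`, excluding the page itself."""
    source = page_vectors[url]
    scored = [
        (other, cosine_similarity(source, vec))
        for other, vec in page_vectors.items()
        if other != url
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical page vectors (illustration only).
page_vectors = {
    "/armored-cable": [0.9, 0.1, 0.0],
    "/cable-glands": [0.8, 0.3, 0.1],
    "/cable-trays": [0.7, 0.2, 0.3],
    "/office-snacks": [0.0, 0.1, 0.9],
}

link_candidates = nearest_pages("/armored-cable", page_vectors, k=2)
for url, score in link_candidates:
    print(url, round(score, 3))
```

The same function covers redirect mapping if you search the live pages for the nearest neighbor of each retired URL's vector.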

Keyword matching, semantic search, and the hybrid system

I am not telling you keyword matching stopped mattering. Search engines still run lexical matching alongside the vector models in a hybrid system, so the exact words on the page still carry weight. The shift is that exact-match keyword density is no longer the thing that wins. Semantic relevance is, and so is matching the user intent behind the query. Gianluca Fiorelli made this case well in his work on entity-based semantic search, where the same entity and embeddings strategy that won classic search results also made his client a frequent citation in ChatGPT and AI Overviews with no LLM-specific work. The discipline I called relevance engineering is just this done deliberately, using vector embeddings to engineer semantic relevance instead of guessing at it.
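A hybrid score can be pictured as a blend of a lexical signal and a semantic one. The sketch below is an illustration under stated assumptions: the word-overlap function, the linear blend, the `alpha` weight, and the semantic scores passed in are all made up for the example. Real hybrid retrieval commonly uses a lexical scorer such as BM25, and how any search engine weights the two sides is not public.

```python
import re


def lexical_overlap(query, text):
    """Fraction of distinct query words that appear verbatim in the text."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def hybrid_score(lexical, semantic, alpha=0.5):
    """Linear blend: alpha weights the lexical side, (1 - alpha) the semantic side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * lexical + (1.0 - alpha) * semantic


query = "cheap notebooks"

# Semantic scores below are assumed values, not output from a real model.
synonym_page = hybrid_score(
    lexical_overlap(query, "Affordable laptops for students"), semantic=0.9
)
exact_page = hybrid_score(
    lexical_overlap(query, "Cheap notebooks on sale"), semantic=0.95
)
print(round(synonym_page, 3), round(exact_page, 3))
```

With these assumed inputs the synonym page still earns a meaningful score despite zero word overlap, while the page that matches both ways scores highest.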

SEO embeddings reward the content work you should already be doing

Here is what I tell clients who get nervous about the math. SEO embeddings are not a loophole, and they do not reward tricks. They reward content that covers one topic thoroughly and answers the query in clear language, published on a site that knows what it is about. That is the same advice good SEOs have given for fifteen years. The only thing that changed is that semantic relevance is now a measurable distance in vector space instead of a guess, which means you can predict which content will rank instead of hoping it does.

The embedding score is a flashlight, not a cheat code. It shows you where your content actually sits relative to the query so you can fix the gap before you ship, not after you lose the position. If you want the full picture of how vector embeddings fit into ranking across AI search, I laid it out in the AI search survival manual. Run the number on your most important page this week. The math will tell you the truth faster than the rankings will.

By Michael McDougald