In this chapter, we uncover the mysteries of latent semantic indexing and how it may or may not be used by Google's algorithm.
In this section, we’ll explore latent semantic indexing (LSI), so-called LSI keywords and compound keywords, and whether or not they can improve your SEO efforts.
What Are LSI Keywords?
LSI keywords (within the SEO community) are words or phrases that search engines like Google see as being semantically (or conceptually) related.
It’s worth noting LSI keywords are not direct synonyms (more on synonyms later), but instead, are terms that are closely related to the keyword that you are trying to target.
For example, if you’re writing an article about football, then a synonym would be “soccer”, but LSI keywords would be ball, pitch, referee, striker, defender, midfield etc.
That being said, Google’s John Mueller says they don’t exist.
There’s no such thing as LSI keywords — anyone who’s telling you otherwise is mistaken, sorry.
— 🍌 John 🍌 (@JohnMu) July 30, 2019
But if Google is saying that LSI keywords aren’t important, why mention them at all?
In order to understand this, let’s take a closer look at LSI.
What Is Latent Semantic Indexing?
Latent Semantic Indexing is a concept that is talked about a lot in the SEO community and pretty much every article you’ll read boils down to these two things:
- Google uses LSI to index pages on the web.
- If you use LSI keywords in your content, you will have a higher chance of ranking on Google
Neither of these statements is exactly true.
So, let’s take a look at what LSI actually is.
Latent Semantic Indexing is a natural language processing technique that predates not only SEO, but the Web itself. It was developed in the late 1980s by a team of researchers at Bell Communications Research (Bellcore) that included Susan Dumais, who later joined Microsoft, with the aim of indexing the contents of document collections that remained static, i.e. they were unlikely to change. Dumais was named as a co-inventor on the patent for the Latent Semantic Indexing process, titled “Computer information retrieval using latent semantic structure”, which was granted in 1989.
Here’s how the creators of the patent define the problem that they were trying to solve at the time:
“Because human word use is characterized by extensive synonymy and polysemy, straightforward term-matching schemes have serious shortcomings–relevant materials will be missed because different people describe the same topic using different words and, because the same word can have different meanings, the irrelevant material will be retrieved. The basic problem may be simply summarized by stating that people want to access information based on meaning, but the words they select do not adequately express intended meaning.”
In simpler terms, “The words a searcher uses are often not the same as those by which the information sought has been indexed.”
A quote from the great J. R. R. Tolkien comes to mind: “Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on?”.
In the above example, the phrase “good morning” can be read in several different ways because “good” is a polysemic word, i.e. it has multiple meanings.
What are synonyms?
A synonym (as defined by the instant answer from a Google search), is “a word or phrase that means exactly or nearly the same as another word or phrase in the same language, for example shut is a synonym of close”.
Synonyms of the word “lucky” would be: auspicious and fortunate.
Synonyms are troublesome because “Users in different contexts, or with different needs, knowledge or linguistic habits will describe the same information using different terms.”
What are polysemic words?
Polysemic words and phrases are those that have many different meanings, for example, the word “bass” could mean:
- The fish
- A deep, low voice
The patent explains why polysemic words also pose issues when it comes to LSI.
“In different contexts or when used by different people the same word takes on varying referential significance (e.g., “bank” in river bank versus “bank” in a savings bank). Thus the use of a term in a search query does not necessarily mean that a text object containing or labeled by the same term is of interest.”
How Does LSI Relate to SEO?
Let’s take a look at how search engines could use LSI to solve the problems caused by synonyms and polysemic words.
Suppose we have two otherwise identical web pages about chairs, but one uses the word “seat” in place of “chair”.
A search engine that was unable to identify that “chair” and “seat” are synonymous would only return one of these pages for the query “chairs”.
Similarly, polysemic words like “bank” would return results that perhaps aren’t what the searcher is looking for, because the search engine is not able to determine whether you mean a river bank or a financial institution.
The bottom line is that computers simply do not have the inherent understanding of semantic relationships between words that we humans do. A simple solution would be to tell the computer everything, but this would take a very, very long time to do and is unfeasible.
LSI helps solve this problem by using sophisticated mathematical formulae which derive the semantic relationships between words and phrases within a piece of text. This means that search engines are able to distinguish polysemic words and match synonyms to present more relevant search results to the user.
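To make this a little more concrete, here is a minimal sketch of the idea in Python, assuming NumPy is available. The tiny corpus, the choice of k = 2 latent topics and the helper names (fold_in, cosine) are our own illustrative assumptions and are not taken from the patent. The maths underneath is a truncated singular value decomposition (SVD) of a term-document matrix, which is the standard way LSI is formulated.

```python
import numpy as np

# Hypothetical toy corpus: three furniture documents, two banking documents.
docs = [
    "chair wood legs cushion",
    "seat wood legs cushion",
    "chair seat furniture wood",
    "bank money loan account",
    "bank savings account money",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document matrix: one row per term, one column per document.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Truncated SVD: keep only the k strongest latent "topics".
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each row is one document expressed in the k-dimensional latent space.
doc_vecs = Vt_k.T


def fold_in(text):
    """Project a bag of words into the same latent space as the documents."""
    q = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            q[index[w]] += 1
    return q @ U_k / s_k


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = fold_in("chair")
sims = [cosine(query, doc_vecs[j]) for j in range(len(docs))]
for d, sim in zip(docs, sims):
    print(f"{sim:+.2f}  {d}")
```

Plain term matching would never connect the query “chair” to the second document, because the word never appears in it. In the latent space, however, that document scores close to 1, while the two banking documents score close to 0.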
Does Google Use LSI?
Many SEOs claim that Google is using Latent Semantic Indexing, but what they really mean when they say this, is that Google is using synonyms and polysemic words in its algorithm.
On the surface, this appears to be correct – after all, if we search for the keyword “automobile”, we can see that Google returns the Wikipedia entry for “Car” as the first result.
Likewise, searching the polysemic term “mouse”, shows that Google is able to understand the difference between a computer mouse and the rodent.
Whilst Google is definitely looking at synonyms and polysemic words to derive semantic relationships, it does not mean that LSI is being used.
In fact, we’ve already seen that Google uses word vectors to index the content on a web page.
On top of this, Google representatives themselves have quashed any insinuations that LSI is part of the algorithm – we saw John Mueller debunk LSI keywords altogether earlier.
Semantic Topic Clustering and Phrase Based Indexing
This section is for those of you who want to know more about how latent semantic analysis works in natural language processing, and how Google uses techniques called semantic topic modelling and phrase based indexing to achieve this.
Before we dive into the nitty gritty details, let’s take a brief step back and look at what the Latent Semantic Analysis model is. The LSA model theorises how the meaning behind natural language may be learned by a machine without any direction or help from a human as to its structure.
In order to understand patterns from text, LSA makes the following assumptions:
- The meaning of a sentence is defined as the sum of the meanings of all of the words that appear within it.
- The model assumes that the semantic relationships between different words within a sentence are not explicit, but are latent in the sample of language.
Semantic Topic Modelling
The LSA model uses several mathematical formulae to get an idea of the meaning behind a document of text and is based on the concept of semantic topic modelling.
Semantic topic modelling (or semantic topic clustering) is a machine learning technique used to scan a set of documents to detect patterns between the words and phrases, and automatically form word groups and similar expressions that best summarise the set of documents.
A scientific paper from Google titled “Improving semantic topic clustering for search queries with word co-occurrence and bipartite graph co-clustering” provides insights into how Google is able to categorise search queries into topic clusters through two techniques: Word Co-Occurrence Clustering and Weighted Bigraph Clustering.
The paper explains how information from search queries can help provide interesting and helpful insights for businesses.
For example, knowing the search volume for particular products or brands tells you how many people are interested in those products. It also tells you what people may associate with those brands, i.e. topics, which in turn informs the creation of categories, e.g. clustering beauty products into a single category.
However, one of the drawbacks of trying to learn topics using this method is that most search queries tend to be very short, which means that the number of other terms that appear “near” them (i.e. related terms) becomes very restricted.
For instance, the word “Mars” might show up frequently near words that are related to “planets” and to “chocolate”.
Word Co-Occurrence Clustering
Word Co-Occurrence looks at when the same terms or phrases appear frequently in documents that rank highly for a search query. For example, when the term “strong coffee” appears in a web page, the term “espresso bean” probably also tends to occur. This means that these words or phrases are likely to be semantically related to the terms that they rank highly for.
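Word co-occurrence clustering relies on the following formula, which assigns each word wi a “lift score” or weight:
lift(wi, a) = P(wi | a) / P(wi)
Here, P(wi | a) is the probability of wi being searched for given the action a, and P(wi) is the general probability of wi being searched for.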
If for example the lift score is 5, then the probability of the query wi being searched for, given the action a, is five times higher than the general likelihood of wi being searched for.
The paper states that “A large lift score helps us to construct topics around meaningful rather than uninteresting words. In practice the probabilities can be estimated using word frequency in Google search history within a recent time window.”
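As a rough illustration, here is how a lift score could be estimated from word counts. We are assuming that lift is the ratio P(wi | a) / P(wi), which is what the explanation above implies. The query log figures, the words and the threshold of 2 are all made up for the example, and the function and variable names are our own.

```python
def lift(count_w_and_a, count_a, count_w, count_total):
    """lift(w; a) = P(w | a) / P(w), with probabilities estimated from counts."""
    p_w_given_a = count_w_and_a / count_a
    p_w = count_w / count_total
    return p_w_given_a / p_w


# Hypothetical query log: 1,000 queries overall, 100 of them in context a.
TOTAL_QUERIES = 1000
QUERIES_IN_CONTEXT = 100

# word -> (queries containing it in context a, queries containing it overall)
counts = {
    "espresso": (10, 20),
    "beans": (12, 40),
    "the": (30, 300),
}

scores = {
    word: lift(in_ctx, QUERIES_IN_CONTEXT, overall, TOTAL_QUERIES)
    for word, (in_ctx, overall) in counts.items()
}

# Rank the words by lift, then keep only those above the (made-up) threshold.
THRESHOLD = 2.0
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
associated = [word for word, score in ranked if score > THRESHOLD]
print(associated)
```

In this toy data, “espresso” is roughly five times more likely to be searched for in the given context than in general, whereas a filler word like “the” has a lift of about 1 and is filtered out.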
This is represented by a co-occurrence matrix which, if you haven’t guessed already, is another technology that Google has patented.
To generate the co-occurrence matrix, Google uses a similar method to TF-IDF vectorisation.
Given m documents and n words in the vocabulary, an m x n matrix can be constructed where each row represents a document, and each column represents a word.
TF-IDF (which stands for Term Frequency-Inverse Document Frequency) is a technique used by search engines to retrieve information about a piece of text. It weighs a term’s (or word’s) frequency (TF) against its inverse document frequency (IDF). In other words, each word or term is assigned its own TF and IDF score, and the product of these two values returns its TF*IDF weight.
The higher the TF*IDF score (weight), the more distinctive the term is for that page: it appears often in the content, but relatively rarely across the rest of the corpus (and vice versa).
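Here’s the formula for TF*IDF:
For a term t in document d, the weight Wt,d of term t in document d is:
Wt,d = TFt,d * log(N / DFt)
Here, TFt,d is the number of times t appears in d divided by the total number of words in d, DFt is the number of documents that contain t, and N is the total number of documents in the corpus.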
Let’s go through an example.
If our web page has 200 words, and it mentioned the word “dog” 20 times, the TF for the word “dog” is:
TFdog = 20/200 i.e. 0.10.
Now, let’s say our corpus (i.e. the big wide web) contains 1,000,000 documents, and let’s assume that 250,000 of them contain the term “dog”.
Then, the IDF (which measures how significant that term is in the whole corpus) is given by the logarithm (base 10) of the total number of documents (1,000,000) divided by the number of documents containing the term “dog” (250,000).
IDF (dog) = log (1,000,000/250,000) = 0.60
Therefore, Wdog = 0.10 (TF) * 0.60 (IDF) = 0.06
If we did the same for the word “puppy” and found that the TF*IDF is 0.10, then we can infer that “puppy” is the more distinctive term for this page: relative to how common each word is across the whole corpus, “puppy” is more heavily represented on the page than “dog”.
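The worked example above can be reproduced in a few lines of Python. This is only a sketch: the function names are our own, and we are assuming a base-10 logarithm because that is what produces the 0.60 figure used above. The second term at the end is a hypothetical addition to show how a rarer word earns a higher weight.

```python
import math


def tf(term_count, total_words):
    """Term frequency: the share of the page's words taken up by the term."""
    return term_count / total_words


def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)


# "dog" appears 20 times on a 200-word page,
# and in 250,000 of the 1,000,000 documents in the corpus.
tf_dog = tf(20, 200)
idf_dog = idf(1_000_000, 250_000)
w_dog = tf_dog * idf_dog
print(f"TF = {tf_dog:.2f}, IDF = {idf_dog:.2f}, TF*IDF = {w_dog:.2f}")

# Hypothetical comparison: a term used just as often on the page,
# but found in only 10,000 documents across the corpus.
w_rare = tf(20, 200) * idf(1_000_000, 10_000)
print(f"Rarer term TF*IDF = {w_rare:.2f}")
```

The rarer term ends up with the higher weight even though it appears on the page exactly as often, which is why a high TF*IDF score signals a term that is distinctive for that page.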
By applying this formula to each word within the document (or query), we form our co-occurrence matrix.
With these vectors (the rows of the matrix), the model is able to apply measures like cosine similarity to gauge how close two documents are to one another.
So for example, if two vectors have a high cosine similarity (i.e. the angle between them is small), then the algorithm infers that they are topically related.
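Here is a short sketch of cosine similarity in Python. The vocabulary and the TF*IDF weights for the three pages are invented purely for illustration. A score close to 1 means the two vectors point in almost the same direction, and a score of 0 means the pages share no weighted terms at all.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical TF*IDF weights over the vocabulary: dog, puppy, leash, loan
page_a = [0.06, 0.10, 0.04, 0.00]  # a page about dogs
page_b = [0.05, 0.08, 0.05, 0.00]  # another page about dogs
page_c = [0.00, 0.00, 0.00, 0.09]  # a page about loans

sim_ab = cosine_similarity(page_a, page_b)
sim_ac = cosine_similarity(page_a, page_c)
print(f"A vs B: {sim_ab:.2f}")
print(f"A vs C: {sim_ac:.2f}")
```

The two dog pages score very close to 1, while the dog page and the loan page score 0 because they have no terms in common.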
Google uses the lift score to “rank the words by importance and then threshold it to obtain a set of words highly associated with the context.”
According to Google’s tests, this method works well when “the queries are closely related, e.g. brand queries, so that the keywords expansion step can effectively extrapolate the scope of words to reach broader topics”.
However, if wi is not defined clearly, then the second method is applied: weighted bigraph clustering.
Weighted Bigraph Clustering
The bigraph clustering method involves taking words and phrases from many web pages that already rank for a particular search term, and combining them to pull out phrases that appear within these documents. These words/phrases are then clustered into groups based on how frequently they appear within the documents.
Weighted bigraph clustering is based on the following assumptions:
- “Users may phrase their query differently, including variations and misspellings, but a search engine understands they are close and present the same URL to the users. Hence, URLs can identify queries of similar meaning”.
- “URLs that are shown as top search results for a single query are somewhat similar. Hence, queries naturally group similar URLs to together”.
With this method, the search queries are compared against the top ranking web pages where query-URL pairs are created. These pairs are weighted according to users’ CTR (click through rates) and page impressions which allows the algorithm to identify similarities between the core keyword and related terms and therefore, create semantically related clusters.
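The sketch below illustrates the intuition only; it is not the co-clustering algorithm from the paper. We assume a handful of made-up query-URL pairs weighted by click counts, describe each query by the URLs it leads to, and group queries whose URL profiles are similar. The data, the example.com URLs, the 0.5 threshold and the greedy grouping step are all our own simplifications.

```python
import math
from collections import defaultdict

# Hypothetical (query, URL) -> click count pairs; not real data.
clicks = {
    ("green tea benefits", "example.com/green-tea"): 90,
    ("is matcha healthy", "example.com/green-tea"): 60,
    ("is matcha healthy", "example.com/matcha"): 30,
    ("iced tea recipe", "example.com/iced-tea"): 80,
    ("how to make cold brew tea", "example.com/iced-tea"): 50,
}

# One sparse vector of URL weights per query (one side of the bipartite graph).
query_vecs = defaultdict(dict)
for (query, url), weight in clicks.items():
    query_vecs[query][url] = weight


def cosine(u, v):
    """Cosine similarity between two sparse {url: weight} vectors."""
    dot = sum(w * v.get(url, 0) for url, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


def cluster(vecs, threshold=0.5):
    """Greedy grouping: a query joins the first cluster whose first query it resembles."""
    clusters = []
    for q in vecs:
        for c in clusters:
            if cosine(vecs[q], vecs[c[0]]) >= threshold:
                c.append(q)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([q])
    return clusters


clusters = cluster(query_vecs)
for c in clusters:
    print(c)
```

Notice that “green tea benefits” and “is matcha healthy” end up in the same group even though they do not share a single word, simply because users clicked through to the same URL.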
Above is an example from the paper of semantic clusters for “Lipton” brand related queries using bigraph clustering.
This method is great for grouping queries that are semantically close together because it utilises the information that is embedded into Google’s search results. As a result of this, weighted bigraph clustering is able to perform much better than word co-occurrence clustering even if the queries do not share any common words.
In summary, the two methods outlined above describe how Google can better understand topics that may be related.
The term “LSI keywords”, as we’ve seen in the introduction to this section, is a little misleading as LSI keywords don’t really exist. When SEOs talk about “LSI keywords”, what they are essentially referring to are the words, phrases, and entities related to the keyword you want to target.
Therefore at SUSO, we like to call these “compound keywords” instead.
We like to define compound keywords as keywords that are related, but do not necessarily include direct terms.
Examples of compound keywords would be questions, and topical phrases that relate to the core keyword that you are trying to target.
Including compound keywords within your text is very likely to improve your site’s search presence.
One of the reasons for this is the fact that we know from previous modules that RankBrain and Neural Matching, along with Natural Language Processing techniques, help Google gain a better understanding of concepts and topics within a piece of content.
In fact, this is also evident from this article published by Google:
“Just think: when you search for ‘dogs’, you probably don’t want a page with the word ‘dogs’ on it hundreds of times. With that in mind, algorithms assess if a page contains other relevant content beyond the keyword ‘dogs’ – such as pictures of dogs, videos or even a list of breeds.” – Google
Google is able to see that a list of dog breeds is semantically related to the core keyword “dogs”.
Say we had two pages, both of which contain the same number of mentions of “football”, but one (Page A) is about sports in general, and the other (Page B) is specifically about football.
🤔 Which one should rank higher?
Looking at the surrounding text helps determine which page should rank higher than the other.
Therefore, Page B would be ranked higher, because Google is able to use the compound keywords (related keywords) on the page to determine the article’s topical relevance to the core search term “football”.
The SUSO Method: Finding Compound Keywords
If you’re writing about a topic that you’re already familiar with and knowledgeable about, you’ll naturally include compound keywords within your copy. After all, that’s how it should be!
For example, when writing an article about the best Italian restaurants in London you would likely mention words and phrases like “pasta”, “pizza” and “spaghetti”.
That being said, for topics that may be more complex or unfamiliar, important compound keywords may be missed out.
So, let’s take a look at how you can (quickly and easily) find additional terms to include within your content.
Look At Google Autocomplete
One of the quickest ways to find compound keywords is using a simple Google search. Google’s autocomplete feature (which we’ve already covered previously), can give you an insight into what might be worth mentioning in your content.
For example, if we search “how to use apple pay”, we can see that Google presents several suggestions for closely related keywords. Importantly, this also shows you what users are likely looking for.
The above recommendations imply that you should write about how to use Apple Pay on specific Apple devices.
Look At Google Related Searches
Apart from autocomplete, Google also suggests related queries at the bottom of the search results page.
If we were to write an article about How To Make Slime, we can see that Google displays several suggestions that it believes are already related to this original query.
So if you’re going to write about slime, make sure to include terms like “toothpaste”, “borax”, “glue” etc.
Reverse-Engineer the Knowledge Graph
Google’s Knowledge Graph is a treasure trove of related information for entities, including data about people, places, things or even concepts. Importantly, on top of the core pieces of information about these entities, Google also stores the relationships between them.
Let’s take a look at an example.
In the Knowledge Graph for Donald Glover, we can see that Google provides a list of entities that people also search for. However, you can also use the list of movies and TV shows as pointers as to what terms/topics should appear too.
Use An “LSI Keyword” Generator
There are lots of tools out there that generate “LSI Keywords”, but of course, we now know that LSI keywords don’t actually exist and that these generators probably don’t have anything to do with Latent Semantic Indexing. However, regardless of what label we give them, they do offer quick and easy insights into what kinds of terms you should be using within your content.
A popular tool is LSIGraph, which generates “the most profitable semantically related keywords”. All you have to do is type in the keyword you want to find compound terms for and the tool does the rest.
For example, if we plug in the keyword “donald glover”, we can see that it provides a long list of possible related terms and phrases that could be used.
Reverse-Engineer Your Competitors
A great way of finding hidden gems that you may otherwise have overlooked is to look at the other keywords that the top competing pages for your core term are ranking for. For this, we would recommend using Ahrefs’ Keyword Explorer tool to find compound terms.
The “Also rank for” feature is perfect for this; see the example above for “how to make slime”.
If you don’t have Ahrefs, you can still reverse-engineer the top ranking pages for the keywords you want to rank for by manually looking at the pages themselves for topics or terms that you perhaps have overlooked.
The SUSO Method: Using Compound Keywords
Once you’ve got a long list of compound keywords and related terms, you need to use them effectively in your content.
It’s important to ensure that the keywords that you have collected satisfy the user intent and answer the core questions that the user might have about that particular topic or search query.
Remember, there are three main types of intent based keywords:
- Informational – keywords that aim to inform the user on a broader scale, e.g. “what is the movie Parasite about”
- Navigational – keywords used to reach a specific website or page, e.g. “Parasite IMDb”
- Transactional – keywords that relate to making a purchase, e.g. “Parasite dvd”
Once you’ve categorised the keywords based on their intent, you can then align these with your existing content and see where they fit in contextually. In some cases, you may need to add an entirely new section to help target a specific subset of compound keywords.
For instance, if you want to add some content about topics like “how to make slime”, you may have different subheadings for different ways to make slime, e.g. “without borax” or “with glue”.
Latent Semantic Indexing and LSI keywords aren’t exactly what they appear to be on the surface. Google almost definitely does not use Latent Semantic Indexing and has written off LSI keywords, but ultimately, what we can confirm is that semantically related words (compound keywords) are important, and Google is definitely looking at these to improve its understanding of language and content.
For this reason alone, including compound keywords within your content is beneficial for your SEO.
You just have to ensure that these related terms are used sparingly and within the right context to avoid keyword stuffing.