Marketers face a real challenge in deciding which AI prompts to monitor for visibility. Many panic about the multitude of ways users may phrase an inquiry to an AI engine and wonder whether they need to monitor every variation of each keyword.
However, based on a recent large-scale study, it appears that there is a predictable component to the wording. The phrasing space isn't chaotic; it has structure. By understanding this structure, your approach to tracking visibility for AI searches can change dramatically.
Over 90% of the differently phrased prompts had very similar meanings. Intent determined whether brand names carried through to any rewrite of a prompt. You therefore do not need to worry about matching the exact words of any prompt: as long as the underlying intent stays the same, brand mentions stay consistent.
The surprising addition is that style matters as much as meaning. For example, short keyword prompts or list-style prompts pull in about 20% more brands than open-ended prompts. The format of a prompt affects what the AI returns, regardless of intent.
The mathematics behind prompt similarity is far from simple.
People are saying basically the same thing with different prompts. In fact, fewer than about 10% of prompt pairs show real semantic drift, meaning a genuine difference in meaning rather than in wording.
How did we demonstrate this? We represented each prompt as an embedding and quantified the semantic distance between prompts using cosine similarity. Cosine similarity measures how closely two embedding vectors point in the same direction, so it reflects similarity of meaning rather than the length of the text.
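As a minimal sketch of the measure itself, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors are toy values chosen for illustration; real prompt embeddings come from an embedding model and have hundreds or thousands of dimensions, and the function name is an assumption rather than anything used in the study.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real prompt embeddings (illustrative only).
prompt_a = [1.0, 2.0, 0.0]
prompt_b = [2.0, 4.0, 0.0]  # same direction as prompt_a, twice the length
prompt_c = [0.0, 0.0, 3.0]  # orthogonal to prompt_a

same_direction = cosine_similarity(prompt_a, prompt_b)  # close to 1.0
orthogonal = cosine_similarity(prompt_a, prompt_c)      # 0.0
```

Because the score depends only on direction, doubling a vector's length leaves it unchanged, which is why a longer prompt is not penalised for its length alone.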
About 88% to 92% of human-generated prompt pairs score above 0.50 on cosine similarity, and approximately 95% score above 0.40. Users generate many different phrasings to express a commercial need, but most of those phrasings end up with similar meanings once they are processed mathematically.
This matters because, at first glance, users appear to produce randomly varied phrasings. Beneath the surface, however, those phrasings are often close enough in meaning that the same set of brands is surfaced to satisfy the user's need.
One user may use "best noise cancelling headphones under $200," while another user will use "which budget over-the-ear headphones have good noise cancelling." Although there is clearly a significant lexical difference between the prompts generated, the underlying commercial need is nearly identical.
Wording changes do affect brand mentions beyond a certain point in the dataset. Within a reference group of approximately equal size, the average probability of a brand being mentioned was 4.9%. When the similarity between a prompt variation and the core prompt fell to between 0.35 and 0.39 (the lowest similarity bin), brand visibility dropped by 2.40 percentage points, or about 50% relative to the baseline.
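The difference between percentage points and a relative drop is easy to blur, so here is the arithmetic as a short sketch. The two input figures are the ones quoted above; the variable names are illustrative.

```python
# Figures quoted in the text above.
baseline_pct = 4.9   # average brand mention probability, in percent
drop_points = 2.40   # decline in the lowest similarity bin, in percentage points

# Absolute view: what is left after the drop, still in percent.
remaining_pct = baseline_pct - drop_points

# Relative view: the drop as a fraction of the baseline.
relative_drop = drop_points / baseline_pct

print(f"remaining visibility: {remaining_pct:.1f}%")
print(f"relative drop: {relative_drop:.0%}")
```

A fall of 2.40 points sounds small in isolation, but measured against a 4.9% baseline it removes roughly half of the expected visibility.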
The sharp decline occurs almost exclusively within the left tail. As long as a prompt has at least 0.50 to 0.60 cosine similarity with a core prompt (espresso vs. latte), expected brand visibility does not change significantly. Wording differences lead to significant reductions in visibility only when the underlying meaning of the prompt drifts considerably from the original.
Since the vast majority of people write above the 0.50 to 0.60 threshold, the risk zone below it is much narrower than it may initially appear. Prompts with the same qualifying intent and similar semantics almost always surface brands at a similar frequency.
Do not confuse a high degree of similarity with identical intent. The phrases "car rental Charleston" and "car rental Charlestown" are about 95% similar, yet they typically target completely different markets. Changing a core qualifier, such as time, entity, location, demographic group, or brand type, will almost invariably change the search intent.
The way you build a prompt is the second of three components. The format you use changes the number of brands the AI returns in its answer. For example, asking for a comparison, chart, list, or ranking yields many more brands than an open-ended prompt.
Ranking prompts generate significantly more brand mentions than open-ended prompts. The average difference in brand visibility between ranking answers and open-ended answers is approximately 20%.
Keywords work better than conversations. Even though AI has a conversational interface, succinct keyword-based prompts, such as "best small business" or "CRM 2026," yield more brand mentions and higher average brand visibility (up to 25%). A keyword prompt gives the engine an extremely clear commercial retrieval anchor, whereas persona-based prompts (e.g., "as a consultant for IT") often broaden the query towards educational content with lower brand density.
This runs against the intuition that a conversational AI rewards conversational prompts. In general, however, models return more brand mentions for a sharp commercial anchor than for an elaborate persona-based prompt.
Answer engines respond differently to how a prompt is structured. When features or budget constraints are added to a prompt, the outcome depends on the engine. ChatGPT and Perplexity reduce the number of brands displayed to those that fit the given budget or features. In contrast, Gemini and Google AI Overviews significantly increase the number of brands displayed. One possible explanation is that the constraints trigger additional fan-out queries.
The amount of text you type does not matter: filler or conversational words make no difference to the brands shown in the answer. Even if you add 100 words of conversational context, the AI ignores it and pays attention only to the commercial intent of your prompt.
The phrasing of prompts, or keywords used to submit customer queries, has a different level of importance based on where they are in the customer buying journey. For example, top-of-funnel (TOFU) customer queries tend to be less sensitive to small variations of wording than those further down the journey. When customers ask broad category questions, like “what is CRM?” it doesn’t matter if they use different words to describe a customer relationship management solution (CRM). Therefore, there is typically little variability in the brands listed in the search results.
Middle-of-funnel (MOFU) customer queries tend to be more sensitive to wording than TOFU queries. If a customer searches for unbranded commercial terms, such as “best CRM for small remote teams,” small variations in how the keywords are presented can significantly affect the brands listed in the results. In fact, the brands listed already differ between the 0.60 and 0.65 similarity buckets. Wording can therefore decide which brands end up appearing in the answer.
Bottom-of-funnel (BOFU) customer queries typically show false stability because BOFU searches are often made using a specific brand name (branded queries). As such, BOFU results vary little, because the brand is the anchor for each query and locks in the results regardless of how the query is worded.
To understand MOFU prompts fully, you need to track more variations of them. Conversely, fewer variations are enough for TOFU and BOFU. A reasonable split is therefore 25% TOFU, 50% MOFU, and 25% BOFU of your total tracked prompts.
The distribution makes sense once you start to think about it. The top of the funnel is about being aware of a category. The bottom is all about retrieving a brand. The middle of the funnel is where describing the same category in different words can dramatically alter which companies show up when someone searches for it.
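To make the split concrete, here is a small sketch that turns a total prompt budget into per-stage counts. The 25/50/25 shares come from the text; the function name and the rule for handing rounding leftovers to the largest stage are assumptions made for illustration.

```python
# Shares from the text: 25% TOFU, 50% MOFU, 25% BOFU.
FUNNEL_SPLIT = {"TOFU": 0.25, "MOFU": 0.50, "BOFU": 0.25}


def allocate_prompts(total, split=FUNNEL_SPLIT):
    """Split a prompt budget across funnel stages using whole numbers."""
    if total < 0:
        raise ValueError("total must be non-negative")
    # Round each stage down first so the counts never exceed the budget.
    counts = {stage: int(total * share) for stage, share in split.items()}
    # Assumption: any leftover from rounding goes to the largest stage.
    largest_stage = max(split, key=split.get)
    counts[largest_stage] += total - sum(counts.values())
    return counts


plan_for_40 = allocate_prompts(40)  # {'TOFU': 10, 'MOFU': 20, 'BOFU': 10}
plan_for_10 = allocate_prompts(10)  # {'TOFU': 2, 'MOFU': 6, 'BOFU': 2}
```

With a budget that does not divide evenly, the leftover prompt lands in MOFU, which matches the idea that the middle of the funnel deserves the most coverage.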
The effect of wording runs in the same direction across AI engines, but its severity varies. For example, Gemini's effect fades quickly and has little impact even in the lower similarity buckets. Google AI Overviews shows a more consistent effect in the middle of the funnel, where small changes in wording affect visibility considerably more than in other engines.
Visibility penalties also vary widely across ChatGPT, Perplexity, and Google AI Mode. In ChatGPT, brand visibility drops as soon as phrasing falls below the 0.60 to 0.64 bucket.
Be careful when aggregating visibility data across models. You cannot simply pool all of the AI engines and express their cumulative visibility as one value, because each engine has its own level of sensitivity. Google AI Overviews is the most sensitive to changes in wording, Gemini is the least, and ChatGPT and Perplexity are moderately sensitive.
Segment your tracking by the stage of the buyer's journey. Top-of-funnel searches give you a baseline of your category's visibility, and bottom-of-funnel searches tell you how customers retrieve your brand in a search engine results page (SERP). In the middle of the funnel, however, the many ways to phrase a query produce different results, so you need more precise phrasing there than at the top or bottom and should track a larger number of queries as a result.
Use your buyer's language as your anchor. There is no "perfect" base prompt for every stage of the buyer's journey; an effective anchor aligns with your buyer's intent and persona. Have a few co-workers quickly write how they would type that specific prompt. If their versions do not average a similarity score of at least 0.50 with yours, your anchor does not cover the way people actually phrase the query, and you may want to add another anchor phrase to track.
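The co-worker check can be sketched as a simple average against the 0.50 threshold. This assumes you have already scored each co-worker phrasing against your anchor with an embedding model; the scores below are made-up values and the function name is hypothetical.

```python
from statistics import mean

ANCHOR_THRESHOLD = 0.50  # average similarity threshold mentioned in the text


def anchor_covers_phrasings(similarities, threshold=ANCHOR_THRESHOLD):
    """Return True when co-worker phrasings average at or above the threshold."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    return mean(similarities) >= threshold


# Hypothetical cosine similarities between each co-worker's phrasing
# and the anchor prompt (illustrative values, not study data).
close_team = [0.72, 0.64, 0.58, 0.41]
scattered_team = [0.45, 0.38, 0.52]

keep_single_anchor = anchor_covers_phrasings(close_team)    # mean 0.5875
needs_second_anchor = not anchor_covers_phrasings(scattered_team)  # mean 0.45
```

One low score does not fail the check on its own, since the test is on the average; a team whose phrasings are scattered below the threshold is the signal to add another anchor.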
Avoid mixing prompt types when tracking. Prompt styles differ in format, archetype, and constraint level, and each of these sets a different baseline for a search. For example, an open-ended prompt is not the same as a list prompt; if both kinds are tracked together without being tagged by format, you may mistake a difference in style for a real change in visibility.
Pay close attention to constraint attributes in the middle of your funnel. When there is no brand anchor, small changes in constraints (e.g., adding or removing an integration, team size, or budget constraint) can influence which brands appear. It is important to monitor several constraints that capture these subtle differences within the same persona group (e.g., what is the best CRM system for a small company team, versus what is the best CRM system for a large enterprise).
Do not track the long tail. Human variation naturally clusters, and visibility decreases drastically only when prompts drift down to the 0.40 to 0.50 similarity range or below. Focus your tracking budget on the semantic middle, where most prospective buyers phrase their queries; this reduces the need to track variations that are rarely used.
Track each AI engine independently. Establish tracking data per engine before developing any blended view. This shows whether a change in visibility comes from an overall market trend or from an anomaly within one engine. Aggregated data will sometimes conceal the actual result.
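The sketch below shows how a blended figure can hide engine-level movement. The visibility rates are invented for illustration, and the 0.05 cut-off for calling a change notable is an arbitrary assumption.

```python
from statistics import mean

# Hypothetical brand visibility rates per engine for two reporting periods.
before = {"ChatGPT": 0.30, "Perplexity": 0.28, "Gemini": 0.26, "Google AI Overviews": 0.32}
after = {"ChatGPT": 0.30, "Perplexity": 0.28, "Gemini": 0.34, "Google AI Overviews": 0.24}

# Blended view: one average across all engines.
blended_change = mean(after.values()) - mean(before.values())

# Per-engine view: the change for each engine on its own.
per_engine_change = {engine: round(after[engine] - before[engine], 2) for engine in before}

# Assumption: treat a shift of 0.05 or more as worth investigating.
engines_that_moved = [engine for engine, delta in per_engine_change.items() if abs(delta) >= 0.05]

print(f"blended change: {blended_change:+.2f}")
print(f"engines that moved: {engines_that_moved}")
```

In this made-up example the blended number is flat while two engines moved in opposite directions, which is exactly the case a single aggregated value would conceal.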
These patterns were observed across 37,804 AI responses, but a few caveats apply. Trends are not guaranteed. The percentages indicate patterns you can reasonably expect to observe; they should not be treated as fixed rules for every query.
Regulated industries may not behave like unregulated ones. The study tested 18 sub-verticals. Regulated categories, such as healthcare, typically have much stricter safety guardrails than unregulated categories, so you may want to test whether these patterns hold for industries with compliance requirements.
Engines change frequently. The actual percentages you see for any particular pattern may differ from the 18 published models over time due to model advancements or changes made to grounding systems. Only the underlying mechanics, such as wording thresholds, mid-funnel sensitivity, or baseline styles, will likely remain constant. Your tracking strategy will need to adapt as engines change so that you’re able to reach the most accurate conclusions when evaluating your performance on an ongoing basis.
IcyPluto's prompt-tracking system is based on a 6-step methodology, as discussed in the paper from our conference. The goal is to track the semantic middle, where most potential buyers phrase their searches, rather than every potential variation of the same words.
The tracking process begins by categorising prompts by funnel stage, so that our clients receive 25% top-of-funnel (awareness), 50% mid-funnel (competition & brands), and 25% bottom-of-funnel (brand retrieval) prompts. This split reflects how much phrasing varies at each stage of the funnel.
When setting up tracking, we work with our clients to convert their team's natural conversations into prompts and use that wording as the basis for our tracking process. Using the buyer's phrasing lets us match real buyer behaviour and reduces the likelihood of poor decisions based on an engineer's guesses about the best search phrases to use.
We tag each prompt by style so that we do not mix style signals when tracking visibility changes due to phrasing. We report format-specific visibility to help clients determine which prompt type performs best.
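A minimal sketch of format-specific reporting might look like the following. It is not the dashboard's actual implementation: the style tags, the record layout, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict

# One record per tracked response: (style tag, was the brand mentioned?).
# Tags and outcomes are illustrative, not real tracking data.
records = [
    ("list", True), ("list", True), ("list", False),
    ("open", True), ("open", False), ("open", False),
    ("keyword", True), ("keyword", True),
]


def visibility_by_style(rows):
    """Return the share of responses mentioning the brand, per style tag."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for style, mentioned in rows:
        totals[style] += 1
        if mentioned:
            hits[style] += 1
    return {style: hits[style] / totals[style] for style in totals}


style_report = visibility_by_style(records)
for style, rate in style_report.items():
    print(f"{style}: {rate:.0%}")
```

Keeping the style tag on every record means a shift in the mix of prompt formats shows up as a change in the per-style rows rather than as an unexplained swing in one overall number.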
We monitor variations in constraints during the middle stage of the sales funnel. For commercial inquiries, we generate several prompts that reflect diverse budget levels, team sizes, and required features. The optimal CRM for a startup may differ from the best choice for a company with 100 employees. We analyse both scenarios to gain a comprehensive view of the commercial discovery process.
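Constraint variants can be generated systematically instead of by hand. The sketch below combines a base query with every pairing of team size and budget; the phrases and the function name are illustrative assumptions, not prompts from the study.

```python
from itertools import product


def constraint_variants(base, team_sizes, budgets):
    """Build one prompt per combination of team size and budget constraint."""
    return [f"{base} for {team} {budget}" for team, budget in product(team_sizes, budgets)]


# Hypothetical constraint values for a mid-funnel CRM query.
variants = constraint_variants(
    "best CRM",
    team_sizes=["a 10-person startup", "a 100-person company"],
    budgets=["under $50 per user", "with no budget limit"],
)

for prompt in variants:
    print(prompt)
```

Two team sizes and two budgets give four prompts; adding a third constraint dimension multiplies the count again, so it is worth limiting the values to those your buyers actually use.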
Each engine is reported individually. Our dashboards display distinct metrics for ChatGPT visibility, Perplexity visibility, Google AI Overviews visibility, and Gemini visibility. We only combine views after fully understanding how each engine performs on its own. This approach helps us avoid overlooking specific changes related to individual engines that could be obscured by aggregated data.
The GEO Dashboard helps us evaluate prompts across various LLMs. It provides clarity on how brands are represented within specific topic clusters and identifies which AI models are most effective. Additionally, we assess Share of Voice in AI search; this metric indicates how frequently your brand appears in AI-generated responses compared to its competitors.
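The text does not give a formula for Share of Voice, so the sketch below uses one common definition as an assumption: your brand's mentions divided by all brand mentions in the same set of responses. The brand names and counts are placeholders.

```python
def share_of_voice(mention_counts, brand):
    """Return the brand's mentions as a fraction of all brand mentions."""
    total = sum(mention_counts.values())
    if total == 0:
        return 0.0  # no brands were mentioned at all
    return mention_counts.get(brand, 0) / total


# Placeholder mention counts across a set of AI-generated responses.
mentions = {"YourBrand": 30, "CompetitorA": 50, "CompetitorB": 20}

your_share = share_of_voice(mentions, "YourBrand")        # 0.3
missing_share = share_of_voice(mentions, "UnknownBrand")  # 0.0
```

Under this definition the shares of all tracked brands sum to 1, which makes the metric comparable across engines with different answer lengths.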
We model content on the high-visibility prompt styles so that it is optimised to appear in AI results. List and ranking prompt styles produce 20-25% more brands, so we present content as comparison tables, ranked lists, and bullet-point comparisons. The question is almost always answered within the first 1-2 sentences, followed by additional supporting details. This answer-first architecture mirrors the structure that AI models draw on when they cite sources.
We develop precise entity pages that allow AI to easily identify brands as distinct entities. To achieve this, we standardise the names of all organisations, products, and people and implement the schema.org Organization, Product, and Person types. We also ensure that each profile is identical across all directories and knowledge graphs. This increases the probability of the AI model identifying your brand as a unique entity within your category.
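As a rough illustration of entity markup, the sketch below builds a schema.org Organization object and serialises it as JSON-LD. The helper name, the company name, and the URLs are placeholders; which properties a real entity page should carry depends on the brand.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build a minimal schema.org Organization object for JSON-LD markup."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,              # use the exact same spelling everywhere
        "url": url,
        "sameAs": list(same_as),   # profiles that describe the same entity
    }


# Placeholder values for illustration only.
markup = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    same_as=["https://www.linkedin.com/company/example-co"],
)

# The serialised object would sit inside a script tag of type application/ld+json.
print(json.dumps(markup, indent=2))
```

Generating the markup from one source of truth helps keep the name and profile links identical across pages, which is the consistency the paragraph above describes.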
The results speak for themselves. Our clients have reported a 40-60% increase in external mentions, an 80% or greater growth in LLM referral traffic, and an increase in Share of Voice in AI search results. Most importantly, your brand is visible when decision makers are performing research using AI assistants.
You may feel you can't track prompts because everyone phrases things differently and you're unsure how your audience is typing. But don't worry! The wording space is not a flat, disorganised field of random differences; there is shape and structure.
You do not need to track every single word or phrase, nor do you need to chase an infinite number of variations. Instead, understand the overall intent you want to track and the various contexts in which that intent is expressed. Look at the meaning of a phrase rather than its wording, separate style from meaning, segment by funnel stage, and report on each AI engine separately.
This is how you can track AI prompts without chasing every single variation. Focus on intent. Categorise by format. Track the majority of variations in the middle of the funnel. Report by AI engine. Don't squander your tracking budget on the left tail of the distribution, where very few consumers actually phrase their queries.
The brands that succeed in AI search will be those that understand the structure of prompts, not those that try to track every prospective variation. Intent is more important than keywords. Style is as important as intent, and the way you write your prompts will help you win in the middle of the funnel.