Yandex Algorithm: Identifying the Russian 2016 Searches

Every December the Yandex research team publishes a list of the events, people and phenomena that defined the year, based on surges in search queries over the course of the year. The list isn’t compiled by editors but by an algorithm that looks for topics that caused a sudden surge of interest among Yandex users.

It’s logical to assume that people’s interest in a subject correlates with the number of search queries on that topic. However, raw query volume doesn’t help analysts identify the specific themes that defined a year. The list of the most popular topics has remained relatively steady for many years because it is dominated by everyday searches, such as people asking the search engine about the weather or about traffic jams on the roads. These searches drive a lot of traffic and yield high search volumes.

Therefore, the “search themes of the year” are not the topics people asked about the most but the topics that saw sudden, dramatic increases in queries. This can be a new event like “Brexit” or a phenomenon that becomes relevant again. One good example from this past year is Pokémon: its popularity reemerged after the release of Pokémon Go, which caused a sharp increase in the number of queries about Pokémon. When this search pattern happens with a particular topic, we call it a surge and note it as a trend for the year.

The graph below shows search topics throughout the year. The yellow line represents searches for Pokémon Go, showing the clear surge during July and the gradual decline as the phenomenon faded.


To identify relevant topics and organize them by importance, the algorithm analyzes these bursts. First, it identifies the queries that showed a drastic spike at some point during the year, and then it groups those queries by topic. The process seems simple, but because queries on one topic can be phrased in many different ways, it is quite complicated.
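
The first step, picking out queries with a drastic spike, can be sketched as follows. This is only an illustration, not Yandex's actual method: the weekly counts are invented, and the spike measure (peak week divided by the median week) and the `min_ratio` threshold are assumptions chosen for simplicity.

```python
from statistics import median

def surge_ratio(weekly_counts):
    """Peak weekly count divided by the median weekly count (the baseline)."""
    # Assumption: the median week stands in for "normal" interest.
    baseline = max(median(weekly_counts), 1)  # guard against a zero baseline
    return max(weekly_counts) / baseline

def find_surging_queries(series_by_query, min_ratio=5.0):
    """Return the queries whose peak is at least min_ratio times their baseline."""
    return [
        query
        for query, counts in series_by_query.items()
        if surge_ratio(counts) >= min_ratio
    ]

# Invented weekly counts, for illustration only.
series_by_query = {
    "weather": [900, 950, 920, 980, 940, 910],   # high volume, no spike
    "pokemon go": [10, 12, 15, 800, 300, 40],    # low baseline, sharp spike
}

surging = find_surging_queries(series_by_query)
print(surging)  # ['pokemon go']
```

Note that "weather" has far more queries in total, yet only "pokemon go" passes the filter, which is the distinction between volume and surge drawn earlier.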

To start, queries that contain the same word or set of words are grouped together, for example [louboutins song], [louboutins shoes], [louboutins Leningrad] and [louboutins buy]. The algorithm then looks at the other words that appear in these queries and identifies relationships between them. This helps it understand the intention behind each query and how the queries relate to one another. Combined with the word “louboutins”, “clip” and “listen” point to the users’ interest in the song by Leningrad, while “buy” and “price” clearly relate to the actual shoes. These are two different topics.
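
A minimal sketch of this grouping step, using the Louboutins example. The article does not say how the algorithm discovers relationships between words, so the hand-written `intent_words` table below is an assumption standing in for that step, and both function names are hypothetical.

```python
from collections import Counter, defaultdict

def cooccurring_words(queries, anchor):
    """Count the words that appear alongside the anchor word."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        if anchor in words:
            counts.update(w for w in words if w != anchor)
    return counts

def split_by_intent(queries, anchor, intent_words):
    """Assign each query containing the anchor to the first intent it matches."""
    topics = defaultdict(list)
    for query in queries:
        words = set(query.lower().split())
        if anchor not in words:
            continue
        label = next(
            (name for name, markers in intent_words.items() if words & markers),
            "other",  # no marker word matched
        )
        topics[label].append(query)
    return dict(topics)

queries = [
    "louboutins song", "louboutins shoes", "louboutins Leningrad",
    "louboutins buy", "louboutins clip", "louboutins price",
]
# Assumption: marker words per intent are supplied by hand here.
intent_words = {
    "song": {"song", "clip", "listen", "leningrad"},
    "shoes": {"shoes", "buy", "price"},
}

topics = split_by_intent(queries, "louboutins", intent_words)
for label, members in topics.items():
    print(label, members)
```

Here `cooccurring_words` shows which companion words exist for an anchor, and `split_by_intent` separates the song queries from the shoe queries.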

To make the list of topics more precise, the algorithm compares the search results returned for the different queries in these groups. Queries that return matching results are assigned to the same topic.
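
One way to picture this comparison is to measure how much two queries' result lists overlap. The sketch below uses Jaccard similarity and a simple greedy grouping; both are assumptions, since the article does not say how "matching" is measured. The URLs and the `min_overlap` threshold are invented.

```python
def jaccard(a, b):
    """Share of results two queries have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_results(results_by_query, min_overlap=0.5):
    """Greedily put each query into the first group it overlaps with enough."""
    groups = []
    for query, results in results_by_query.items():
        for group in groups:
            if any(
                jaccard(results, results_by_query[other]) >= min_overlap
                for other in group
            ):
                group.append(query)
                break
        else:
            groups.append([query])  # no existing group matched
    return groups

# Invented result lists, for illustration only.
results_by_query = {
    "louboutins song": ["video.example/clip", "music.example/leningrad", "lyrics.example/exhibit"],
    "louboutins clip": ["video.example/clip", "music.example/leningrad", "news.example/clip"],
    "louboutins buy": ["shop.example/pumps", "store.example/heels"],
    "louboutins price": ["shop.example/pumps", "store.example/heels", "compare.example/prices"],
}

groups = group_by_results(results_by_query)
print(groups)
```

With this data the song and clip queries share two of four distinct results and end up together, while the buy and price queries form a second group.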

After this list is prepared, the topics are ranked by the power of their surge of interest. This is done by looking at the number of queries on each topic and then analyzing the difference between the bursts of interest per topic. Power is estimated in points from one to one hundred. By comparing the scores, you can see which topics generated greater interest. To take a few examples from this year: the new season of “Game of Thrones” (100 points) interested users more than the elections in the USA (75 points) and a new iPhone (78 points).
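
As a rough illustration of a one-to-one-hundred scale, the sketch below rescales raw surge strengths so that the strongest topic gets 100 points. The scaling rule is an assumption, not the published formula, and the raw strengths are invented so that the output matches the scores quoted above.

```python
def power_points(raw_strength_by_topic):
    """Scale raw surge strengths so the strongest topic scores 100 (minimum 1)."""
    # Assumption: all raw strengths are positive numbers.
    top = max(raw_strength_by_topic.values())
    return {
        topic: max(1, round(100 * strength / top))
        for topic, strength in raw_strength_by_topic.items()
    }

# Invented raw strengths, chosen to reproduce the article's scores.
raw = {"Game of Thrones": 400, "new iPhone": 312, "US elections": 300}

points = power_points(raw)
for topic, score in sorted(points.items(), key=lambda item: -item[1]):
    print(f"{topic}: {score} points")
```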