Skip to main content
  1. Posts/

How is Spotify search so fast ?

·5 mins
Backend Architecture Search Algorithm
Clément Sauvage
Author
Clément Sauvage
I like to make softwares

I find search issues quite fascinating. We all know how fast computers can be, but we also know they have limits. So how can some service provide such precise and fast search features ?

One instance that comes to my mind would be Spotify. The other day I was searching for a song, and my mind wondered, thinking how Spotify can give me such fast result on a huge catalog of music, and especially such relevant results ?

Note that, despite the title, this article won’t be centered around the exact set of algorithms and technologies used at Spotify for their search, but we will discuss the building of a search engine in general.

The relevance problem #

If we were to create a search engine from scratch, with zero knowledge on information retrieval technologies, we would first ask ourselves what purpose we want our search engine to fulfill:

Given a search query sent from a user, we want to return them the most relevant results. Relevant is subject to discussion, because it’s going to be context-dependent. It’s going to depend on who you are, what music you listen to, etc… Truly whatever defines the user. But most important relevance will be defined by what we wants the system to show the user (think, for an e-commerce website, the most relevant results are the ones that will maximize the user’s total spending maybe)

In the case of Spotify, relevance would be the songs the user is most likely to seek, what they want to listen to, which can be:

  • Songs often searched by the user
  • Songs often searched by the relatives, friends, or similar people to the user
  • Songs that are currently seeing a surge of interest
  • etc…

The efficient search problem #

As of 2021, Spotify has a catalogue of 82 millions tracks.

A naive iteration over a list of all their catalog wouldn’t make it, we cannot just simply do that when searching in a real world application:

  • One single search would end up taking seconds
  • We would only be able to search information by filtering, and in the Spotify case, filtering over the only field that is available: the keywords
  • It would not scale at all

Instead, usually, when we need information retrieval to be fast, in storage systems, we use what are called indices. In a very simplified fashion, they are additional data that we build from our main data (the songs catalogue), and maintain on each addition. These additional data can then be scanned way faster that the main data to find the relevant documents (songs) according to our search criteria.

Making our search results more relevant #

Modern applications search engines cannot rely on plain indices for the search. Otherwise, their results would be easily trick-able, and many people would criticize their results relevance.

We lack multiple parts in order to serve better results for our users, some of them being:

  • Non exact text matching
  • User tailored search results
  • Popularity based search results

Non exact text matching #

When we search for a song, we don’t want to have to type the exact name of the song, or the exact name of the artist. We want to be able to type a part of the song, or a part of the artist name, and still get the song we are looking for.

This is where fuzzy search comes in. Fuzzy search is a technique that allows us to search for a string that is approximately equal to the string we are looking for. It can be used to search for a string that is similar to the string we are looking for, or to search for a string that is close to the string we are looking for.

User tailored search results #

When we search for a song, we want to get the songs that are most relevant to us. We don’t want to get the songs that are most relevant to everyone else. We want to get the songs that are most relevant to us.

A common way to do this is, is through machine learning. We can use machine learning to build a model that can predict the songs that are most relevant to us, based on the songs that we have listened to in the past, the songs that we have liked in the past, and the songs that we have searched for in the past.

Machine learning can also be used to build a model that can predict the songs that are most relevant to us, based on the songs that are most relevant to people who are similar to us. This is called collaborative filtering.

Popularity based search results #

To the mix, we can add popularity based search results. When we search for a song, we want to get the songs that are most popular. By doing so, we will get good results on the majority of the searches, since by definition, the most popular songs are the ones that are most likely to be searched for.

Wrapping up #

Information retrieval can go way deeper than what we have discussed here. But just hovering over the surface, we can see that building a fast and efficient search engine is a multi-faceted problem. It requires multiple iterations to enhance the search results over time, and it requires a lot of data to be able to provide relevant results.

References #