The Information Retrieval & Search Glossary

Information retrieval is a field of its own in the world of IT. Especially when considering how complex it can get, and the depth of knowledge that can be required to operate search systems.

As I am currently in a job working on a Search Platform, I’ve been exposed to a lot of new concepts and terms which is intimidating, but for which I needed to learn a lot.

Here is my attempt at making a non-exhaustive glossary in order to confidently apprehend the domain.

Information retrieval: The process of finding relevant information from a collection of resources based on user queries.
Index: In the context of information retrieval, an index is a data structure that is optimized for fast data retrieval operations. It has to be built on top of data, and maintained on every addition/removal from the data. So it represents an overhead on writes, but a great speed up (often unavoidable) on reads. We call indexing the process of adding new data to the index.
Lexical matching: The process of matching exact terms for the query with terms in the documents.
Stemming:The processof reducing words to their base root form to improve the search results. For instance, we can reduce “running” to “run”, so that any query with either of these words would match.
Semantic Search: A search method that aims to improve the search accuracy by understanding the contextual meaning of the terms in the query
Embedding: The process of mapping words or phrases to vectors (or embeddings) of numbers, capturing the semantic meaning. It can be of a variable number of dimensions.
TF-IDF: TF means Term Frequency, it describes the numbers of times a term appears in a document (Usually we use the logarithm of the term frequency in order to avoid saturation). IDF is a measure of how common or rate a term is across all the documents in the corpus (the rarer a term, the higher its IDF score).
BM25: Best Matching 25, is a ranking function used by search engines to estimate the relevance of documents according to a given search query. It is one of the most used probabilistic information retrieval model. It uses the TF-IDF at is core.
NDCG: Normalized Discounted Cumulative Gain is a measure of ranking quality, used to evaluate the usefulness, or gain, of a document based on its position in the result list.
KNN: K-Nearest Neighbors, is an algorithm used to find the k closest points in a vector space. Usually used to find the closest k document for a query, it requires the pre-computing of the documents embeddings, and the computing of the embedding representing the search query. The proximity of the embeddings can be computed using different similarity metrics, like l2 norm, dot product, cosine, etc…
ANN: Approximate Nearest Neighbors, is an algorithm used to find approximate nearest neighbors. It is usually used in place of a KNN, because computing the similarity of a query with all the documents in the corpus would often be too expensive and unscalable. In high-dimensional spaces, it is a trade-off of some accuracy and space for speed. It can use different kind of algorithms to reduce the number of candidates to the vectors similarity computation.
HNSW: A Hierarchical Navigable Small World is one example of an efficient algorithm for ANNs. It uses a Small-World graph (or network), which is characterized by short paths between pairs of points. Allowing efficient navigation through the network by crossing the multiple layers.

I highly advise anyone interested in the domain to learn more about Information Retrieval in general, and even if possible to read books about it, like Relevant Search, or AI-Powered Search