Why do we use Elasticsearch?
Before I knew anything about Elasticsearch, I thought it was a plug-in search-as-a-service like Algolia that you would plug into your existing data storage. I was confused to learn that it is an actual data storage system. It left me wondering why we would use ES instead of a standard data store like PostgreSQL. In reality, there are many reasons.
Real-time indexing
ES is a search-centric data store, and when it comes to fast search, indexing is a very important concept. ES achieves near real-time search by making newly indexed documents searchable within about one second.
This is possible because ES is built on the Apache Lucene library. Without going into the details, Lucene creates an in-memory segment of the index with the latest published documents, which is a cheap process and therefore does not degrade performance.
The documents in this segment are already visible to search without a full commit to disk, which is a more expensive process. After a specific amount of time or a certain number of changes, the segment is written to disk and merged into the existing index.
The process of opening and writing a new segment is called a refresh. By default, it is performed every second on all the indices that received at least one search request during the last 30 seconds.
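To make the refresh idea concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions: the `ToyIndex` class and its methods are hypothetical names for illustration, not Lucene's or Elasticsearch's actual implementation. It shows that a document sits in a buffer after indexing and only becomes searchable once a refresh turns that buffer into a segment.

```python
class ToyIndex:
    """Toy model of near real-time search: buffer -> refresh -> searchable segment."""

    def __init__(self):
        self.buffer = []    # indexed documents that are not yet visible to search
        self.segments = []  # searchable segments, each one a list of documents

    def index(self, doc):
        # Indexing only appends to the buffer, which is cheap.
        self.buffer.append(doc)

    def refresh(self):
        # A refresh turns the current buffer into a new searchable segment.
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []

    def search(self, word):
        # Search only looks at segments, never at the buffer.
        return [
            doc
            for segment in self.segments
            for doc in segment
            if word in doc.lower().split()
        ]


idx = ToyIndex()
idx.index("Elasticsearch is built on Lucene")
before_refresh = idx.search("lucene")  # the document is still in the buffer
idx.refresh()
after_refresh = idx.search("lucene")   # the document is now in a segment
print(before_refresh, after_refresh)
```

In a real cluster the refresh runs on a timer, which is why a freshly indexed document can take up to about a second to show up in search results.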
Full-text search
Full-text search is possible in most data stores, but Elasticsearch takes it to another level by providing advanced features and optimizations for search performance.
FTS in ES works by indexing text fields with one of several possible analyzers. This means that different processes can be applied to the text of a document before it is indexed.
The analyzer’s job is to break the text into tokens, or terms, that can be searched efficiently.
The standard analyzer splits the text into words, which are then converted to lowercase. At search time, a document matches if it contains an exact match for at least one of the words in the query.
For instance, with the standard analyzer, if the document to index contains the word "Python", it would match a query containing the word "python", since that is just the lowercase form, but it would not be returned for a query containing only a partial match like "pyton".
Note that the standard analyzer performs other processing that helps the relevance of the results, like removing most punctuation. It can also filter out stop words, although that filter is disabled by default.
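The behaviour described above can be sketched in a few lines of Python. This is a rough approximation and not the real standard analyzer (which follows Unicode text segmentation rules): the function names are made up for illustration, and a simple regular expression stands in for the tokenizer.

```python
import re
from collections import defaultdict


def toy_standard_analyze(text):
    # Rough approximation: split on non-word characters, then lowercase.
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in toy_standard_analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    # A document matches if it contains at least one query term exactly.
    hits = set()
    for term in toy_standard_analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "I love Python!", 2: "Elasticsearch, explained."}
inverted = build_inverted_index(docs)

print(search(inverted, "python"))  # matches document 1 thanks to lowercasing
print(search(inverted, "pyton"))   # no exact term match, so no hits
```

The query goes through the same analysis as the documents, which is why "Python" and "python" end up as the same term while "pyton" does not.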
There are many more analyzers you can use to index your documents, and you can choose a different analyzer for each text field of the document. Some are much more complex, like an n-gram analyzer, which (with gram lengths from 1 to 3) would tokenize a field containing the value "tea" into the following tokens: t, te, tea, e, ea, a.
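As an illustration, the n-gram tokenization of that example can be reproduced with a small Python function. This is a sketch, not Elasticsearch's code: the `ngrams` helper is hypothetical, and its `min_gram=1` and `max_gram=3` defaults are chosen here to match the example rather than taken from Elasticsearch's own defaults.

```python
def ngrams(text, min_gram=1, max_gram=3):
    # For every start position, emit each substring whose length
    # is between min_gram and max_gram and that fits in the text.
    tokens = []
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            if start + length <= len(text):
                tokens.append(text[start:start + length])
    return tokens


print(ngrams("tea"))  # ['t', 'te', 'tea', 'e', 'ea', 'a']
```

Because every fragment of the word becomes a term, a query for "ea" can now match "tea", at the cost of a much larger index.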
Additionally, Elasticsearch can run a stemming process on the text during analysis (for example with a language analyzer), which, combined with an autocomplete feature, produces a great experience for a user-facing search engine.
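Choosing an analyzer per field happens when the index is created. The request body below is a hedged sketch written as a Python dictionary: the field names `title` and `body` and the names `my_ngram_tokenizer` and `my_ngram_analyzer` are hypothetical, and the exact settings you need may differ between Elasticsearch versions. It assumes the built-in `english` analyzer for stemming and a custom n-gram analyzer for partial matching.

```python
import json

# Hypothetical index definition: one field analyzed with n-grams,
# another with the built-in English analyzer (which applies stemming).
index_body = {
    "settings": {
        "index": {
            # Assumed to be required because max_gram - min_gram is greater than 1.
            "max_ngram_diff": 2,
            "analysis": {
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "ngram",
                        "min_gram": 1,
                        "max_gram": 3,
                    }
                },
                "analyzer": {
                    "my_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_ngram_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "my_ngram_analyzer"},
            "body": {"type": "text", "analyzer": "english"},
        }
    },
}

# This JSON is what would be sent as the body of the index creation request.
print(json.dumps(index_body, indent=2))
```

Keeping the analyzer choice in the mapping means each field can trade index size for matching flexibility independently.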
Scalability
Elasticsearch’s scalability shines, even when compared to the best database available (imho), PostgreSQL 🐘💖:
Distributed architecture
PostgreSQL, as a more traditional database, is mostly scaled vertically by adding more hardware resources to a server. Elasticsearch, on the other hand, is designed from the ground up as a distributed system, which allows it to scale horizontally by adding more nodes to a cluster.
PostgreSQL’s scalability is therefore largely limited by the capacity of a single server, whereas you can combine the power of multiple servers to increase the capacity of your Elasticsearch cluster (which comes with a few headaches, though).
Sharding
Both Elasticsearch and PostgreSQL support sharding, which is the act of dividing data into smaller subsets that can be managed independently.
However, Elasticsearch’s sharding is more flexible and dynamic because it can rebalance data automatically as nodes are added or removed from a cluster. PostgreSQL’s sharding requires more manual management and may not be as easy to scale as Elasticsearch.
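The idea can be illustrated with a toy Python model. Everything here is an assumption made for the sake of the example: CRC32 is a stand-in for Elasticsearch's own hash function, the round-robin placement ignores the many factors a real cluster weighs, and the function and node names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    # Route a document to a shard by hashing its id.
    # CRC32 is only a stand-in; Elasticsearch uses a different hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def allocate(num_shards, nodes):
    # Spread shards over nodes round-robin.
    # Real shard allocation takes many more factors into account.
    allocation = {node: [] for node in nodes}
    for shard in range(num_shards):
        allocation[nodes[shard % len(nodes)]].append(shard)
    return allocation


two_nodes = allocate(6, ["node-1", "node-2"])
three_nodes = allocate(6, ["node-1", "node-2", "node-3"])

print(two_nodes)    # three shards per node
print(three_nodes)  # two shards per node after adding node-3
print(shard_for("doc-42", 6))
```

The routing of a document to a shard does not change when a node joins; only the placement of whole shards on nodes does, which is what makes automatic rebalancing practical.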
Replication
Both Elasticsearch and PostgreSQL support replication, which is creating copies of data to improve availability and performance.
But Elasticsearch’s replication is more flexible and can be configured to provide real-time updates and automatic failover, whereas PostgreSQL’s replication requires more manual setup and management, and typically relies on external tooling for automatic failover.
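Automatic failover can be pictured with another toy Python sketch. It is not how Elasticsearch implements it: the `promote_replicas` function, the layout dictionary, and the node names are hypothetical, and the model only captures the basic idea that a surviving replica takes over when the node holding a primary shard disappears.

```python
def promote_replicas(shard_copies, failed_node):
    """Return a new layout without the failed node, promoting a surviving
    replica for every shard whose primary was on that node."""
    new_layout = {}
    for shard, copies in shard_copies.items():
        primary = copies["primary"]
        # Drop any replica that lived on the failed node.
        replicas = [node for node in copies["replicas"] if node != failed_node]
        if primary == failed_node:
            if not replicas:
                raise RuntimeError(f"shard {shard} has no surviving copy")
            # Promote the first surviving replica to primary.
            primary, replicas = replicas[0], replicas[1:]
        new_layout[shard] = {"primary": primary, "replicas": replicas}
    return new_layout


layout = {
    0: {"primary": "node-1", "replicas": ["node-2"]},
    1: {"primary": "node-2", "replicas": ["node-3"]},
}
after_failure = promote_replicas(layout, "node-2")
print(after_failure)
```

In this model shard 1 stays available because its replica on node-3 is promoted, while shard 0 keeps its primary but loses its only replica until a new copy is created elsewhere.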