At Veeqo, we've been actively using ElasticSearch for many years. Like many other Ruby developers, we started with the Searchkick gem back in the day. Searchkick makes using ElasticSearch painless and easy. You don't have to know the ElasticSearch query language, analyzers, tokenizers and a bunch of other internals to add full text search to your Ruby on Rails project, perform complex aggregations and make sure data re-indexing just works.
Let's see what the typical flow is when you need to add full text search to your application.
Assuming you have an Order model, you want to give your users the ability to search by order number, product names, product SKU codes, customer name and address. Following the Searchkick README, you would add these lines to your model:
class Order < ActiveRecord::Base
  SEARCHABLE_FIELDS = %i[
    number product_titles product_sku_codes customer_name customer_email customer_address
  ]

  searchkick(
    searchable: SEARCHABLE_FIELDS,
    word_middle: SEARCHABLE_FIELDS
  )
end
Then you reindex your orders either synchronously or asynchronously, and perform text search like this:
Order.search('apple', match: :word_middle)
And everything works like a charm... to start with! Then you start to notice that your ElasticSearch cluster is struggling. Either it starts responding with huge delays - up to 30 seconds instead of the usual split second - or it throws 5xx errors and ends up in denial of service.
You scratch your head. As the business grows, the amount of data increases. Your ElasticSearch cluster just has to be upgraded - the instances become bigger, and the number of nodes grows too. Although this is not a trivial operation and usually requires some DevOps knowledge, you don't see any other option but to upgrade.
Time flies, and after a month, a quarter, a year, the incident repeats. This time you look at your transaction charts in NewRelic and see that the endpoint that uses text search waits seconds for an ElasticSearch response. That is a pretty big deal for a web transaction. And it is only one component of your web request: you also have DB querying, rendering and a bunch of other operations to perform in order to serve the user's request. All the while, your Apdex score starts to drop inexcusably low.
What did we do wrong? Is it because we grew too fast? The solution we used looks pretty straightforward, doesn't it? Let's have a look.
A little theory
To understand what's going on, let me very briefly explain how ElasticSearch text search works.
Text search basically consists of 2 components:
- Search data analysis
- Search query
Search data analysis
To make text search fast, smart and effective - everything we love about ElasticSearch - it relies on several mechanisms.
When you write a text field into an ElasticSearch document, you tell ElasticSearch what to do with the string you provided before it gets written to the index.
This "what to do with the string" process is called analysis. Analysis consists of 2 steps - filtering and tokenization.
Filtering prepares your text for the following step - tokenization. It usually does things like removing non-ASCII characters, lowercasing the text, and so on.
Tokenization is responsible for how the text is split before it is stored in the ElasticSearch index.
There are plenty of tokenizers available in ElasticSearch out of the box. Let's have a look at the two used by Searchkick in particular.
Word Oriented Tokenizers
The simplest tokenizers are word oriented tokenizers, for instance the whitespace tokenizer.
With a whitespace tokenizer the text "Glass of beer" gets split into 3 tokens: ["Glass", "of", "beer"].
This means that when you search for the word "beer", ElasticSearch looks for that exact token associated with any document in the index. It will not find anything if you type "bee" or "lass".
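You can see this for yourself with Searchkick's tokens helper (used later in this post as well), here run against ElasticSearch's built-in whitespace analyzer - a quick console sketch:
Order.search_index.tokens('Glass of beer', analyzer: 'whitespace')
=> ["Glass", "of", "beer"]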
Ngram Tokenizers
But what if you do want to get "Glass of beer" back by typing "bee" or "lass"? This is a pretty unnatural example, so let's have a look at another one.
Given an order with the number N-#103190991313, you want your system to return this order when you type the beginning of its number - 103190991.
The ngram tokenizer comes to the rescue in this case. Unlike word oriented tokenizers, ngram tokenizers create meaningless tokens by slicing the text into differently sized chunks, according to the definition of the specific ngram tokenizer.
Given the phrase redfox, the ngram tokenizer defined by Searchkick with min_ngram = 1 and max_ngram = 50 will slice this text into tokens with a minimum token size of 1 and a maximum token size of 50, in all possible combinations. For this small phrase that gives us 21 tokens:
['r', 'e', 'd', 'f', 'o', 'x', 're', 'ed', 'df', 'fo', 'ox', 'red', 'edf', 'dfo', 'fox', 'redf', 'edfo', 'dfox', 'redfo', 'edfox', 'redfox']
With a certain degree of simplification, if a user queries e or df or edfox, they will get the redfox result. Of course, ElasticSearch is smart enough to check whether those couple of letters occur in a large set of documents - in that case it's unlikely you are searching for this particular document, so it may omit it. But the token is there. It is stored on disk, and it is considered during every query to ElasticSearch.
Search Query
Here we come to the second part of the search process - querying. The ElasticSearch Query DSL is a very powerful tool to query data, and it has lots of components.
The most interesting one for us is the match query. The simplest text query looks like this:
GET /orders/_search
{
  "query": {
    "match": {
      "number": {
        "query": "100230"
      }
    }
  }
}
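You don't have to leave Ruby to experiment with raw Query DSL like this - Searchkick accepts a full request body, the same mechanism we rely on later in this post. A minimal sketch using the field from the example above:
Order.search(
  nil,
  body: {
    query: {
      match: {
        number: { query: '100230' }
      }
    }
  }
)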
Sometimes you want to get fuzzy results, to cover the cases when a user made a typo or they are not sure how to spell the word correctly:
GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "100230",
        "fuzziness": 1,
        "prefix_length": 0,
        "max_expansions": 3,
        "fuzzy_transpositions": true
      }
    }
  }
}
Assuming we have orders with IDs ['N-100230', 'N-100233', 'N-100430', 'N-222230'], ElasticSearch will return not only N-100230, but also N-100233 and N-100430, because they differ from the original query by 1 character (the fuzziness parameter). Fuzziness makes ElasticSearch perform extra work here, calculating the Levenshtein distance between terms that do not exactly match in order to find up to 3 (the max_expansions parameter) similar strings.
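To make the edit-distance intuition concrete, here is a tiny, self-contained Levenshtein implementation - illustrative only, ElasticSearch does this internally and far more efficiently:
# Classic dynamic-programming Levenshtein distance between two strings.
def levenshtein(a, b)
  rows = Array.new(a.length + 1) { |i| [i] + Array.new(b.length, 0) }
  (0..b.length).each { |j| rows[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [
        rows[i - 1][j] + 1,       # deletion
        rows[i][j - 1] + 1,       # insertion
        rows[i - 1][j - 1] + cost # substitution
      ].min
    end
  end

  rows[a.length][b.length]
end

levenshtein('100230', '100233') # => 1, within fuzziness: 1
levenshtein('100230', '100430') # => 1, within fuzziness: 1
levenshtein('100230', '222230') # => 3, too far away to match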
Back to real-world problems
Field mapping
Word Middle Settings
Remember the basic definition of the Searchkick index parameters for the Order model?
class Order < ActiveRecord::Base
  SEARCHABLE_FIELDS = %i[
    number product_titles product_sku_codes customer_name customer_email customer_address
  ]

  searchkick(
    searchable: SEARCHABLE_FIELDS,
    word_middle: SEARCHABLE_FIELDS
  )
end
Let's examine what happens under the hood, starting with the field mapping stored in ElasticSearch. It can be retrieved by running Order.search_index.mapping. I have omitted all but one field, because the rest of the searchable fields have exactly the same mapping.
{
  "orders_development_20201130162432914": {
    "mappings": {
      "order": {
        "properties": {
          "customer_name": {
            "type": "keyword",
            "fields": {
              "analyzed": {
                "type": "text",
                "analyzer": "searchkick_index"
              },
              "word_middle": {
                "type": "text",
                "analyzer": "searchkick_word_middle_index"
              }
            },
            "ignore_above": 30000
          }
        }
      }
    }
  }
}
What do we see here? The field customer_name is in fact 3 fields:
- customer_name - a keyword field
- customer_name.analyzed - a text field with the searchkick_index analyzer applied
- customer_name.word_middle - a text field with the searchkick_word_middle_index analyzer applied
Another thing to mention is the ignore_above setting, which tells ElasticSearch not to index keyword values longer than 30,000 characters.
searchkick_index is an analyzer with a typical word oriented tokenizer inside. Searchkick adds a bunch of bells and whistles to make it more sophisticated, but the basics remain the same - tokens are words.
You can check the definition of searchkick_index by running Order.search_index.settings:
{
  "searchkick_index": {
    "filter": [
      "lowercase",
      "asciifolding",
      "searchkick_index_shingle",
      "searchkick_stemmer"
    ],
    "char_filter": [
      "ampersand"
    ],
    "type": "custom",
    "tokenizer": "standard"
  }
}
And you can play with that analyzer to see what tokens it produces by running:
Order.search_index.tokens('#P-101041901901', analyzer: 'searchkick_index')
=> ["p", "p101041901901", "101041901901"]
More interesting is searchkick_word_middle_index. Let's take a look at its definition:
{
  "searchkick_word_middle_index": {
    "filter": [
      "lowercase",
      "asciifolding",
      "searchkick_ngram"
    ],
    "type": "custom",
    "tokenizer": "standard"
  }
}

{
  "searchkick_ngram": {
    "type": "nGram",
    "min_gram": "1",
    "max_gram": "50"
  }
}
The searchkick_word_middle_index analyzer uses an ngram filter with a range from 1 to 50. Let's check what it produces:
Order.search_index.tokens('101041901901', analyzer: 'searchkick_word_middle_index')
=> ["1", "10", "101", "1010", "10104", "101041", "1010419", "10104190", "101041901", "1010419019", "10104190190", "101041901901", "0", "01", "010", "0104", "01041", "010419", "0104190", "01041901", "010419019", "0104190190", "01041901901", "1", "10", "104", "1041", "10419", "104190", "1041901", "10419019", "104190190", "1041901901", "0", "04", "041", "0419", "04190", "041901", "0419019", "04190190", "041901901", "4", "41", "419", "4190", "41901", "419019", "4190190", "41901901", "1", "19", "190", "1901", "19019", "190190", "1901901", "9", "90", "901", "9019", "90190", "901901", "0", "01", "019", "0190", "01901", "1", "19", "190", "1901", "9", "90", "901", "0", "01", "1"]
78 tokens! Compared to 3 with the searchkick_index analyzer. Every token increases the size of the index and indirectly affects search performance. Of course, that does not mean you should never use word middle search. Let's see if we can improve it.
Now forget about ElasticSearch for a minute, and think about the use cases for text search in your application. At Veeqo, we realised that the UI already validates that a query must be at least 3 characters long (searching by 1 or 2 characters just doesn't make much sense and produces lots of irrelevant results). This means that with an ngram of 1-50 we never receive 1-2 character queries from users, and therefore never use the tokens with 1 or 2 characters. At all.
A reasonable conclusion is to increase the lower limit of the ngram from 1 to 3 to match the business rules. The other question is the upper limit. The search query comes from user input, and a user would rarely type 50 characters into the search bar. They could paste in some text from outside, indeed, but a single word is rarely longer than 14 characters (calculated from the data we have available).
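If you want to sanity-check an upper bound like that against your own data, a rough console sketch could look like this (the column names are assumptions - adjust them to your schema):
# Sample a few thousand orders and measure how long individual "words" really get.
samples = Order.limit(5_000).pluck(:number, :customer_name) # assumed columns
word_lengths = samples.flatten.compact.flat_map { |value| value.to_s.split(/\s+/) }.map(&:length)

word_lengths.max                                    # longest word seen
word_lengths.sort[(word_lengths.size * 0.99).floor] # 99th percentile word length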
One important thing to mention: before the ngram filter comes into play, the sentence gets split into words, so the analysis is applied not to the entire sentence at once but to every word separately, as if each were its own phrase:
Order.search_index.tokens('red fox', analyzer: 'searchkick_word_middle_index')
=> ["r", "re", "red", "e", "ed", "d", "f", "fo", "fox", "o", "ox", "x"]
Bearing this in mind, at Veeqo we came up with the following custom analyzer:
class Order < ActiveRecord::Base
  searchkick(
    settings: {
      analysis: {
        filter: {
          veeqo_ngram_three: {
            type: 'nGram',
            min_gram: 3,
            max_gram: 14
          }
        },
        analyzer: {
          veeqo_word_middle_dictionary_index: {
            type: 'custom',
            tokenizer: 'keyword',
            filter: %w[lowercase asciifolding veeqo_ngram_three],
            char_filter: %w[ampersand]
          }
        }
      }
    },
    mappings: {
      order: {
        properties: {
          'customer_name' => {
            type: 'keyword',
            ignore_above: 512,
            fields: {
              analyzed: {
                type: 'text',
                analyzer: 'searchkick_index'
              },
              word_middle: {
                type: 'text',
                analyzer: 'veeqo_word_middle_dictionary_index'
              }
            }
          }
        }
      }
    }
  )
end
Checking the number of tokens it produces:
Order.search_index.tokens('101041901901', analyzer: 'veeqo_word_middle_dictionary_index')
=> ["101", "1010", "10104", "101041", "1010419", "10104190", "101041901", "1010419019", "10104190190", "101041901901", "010", "0104", "01041", "010419", "0104190", "01041901", "010419019", "0104190190", "01041901901", "104", "1041", "10419", "104190", "1041901", "10419019", "104190190", "1041901901", "041", "0419", "04190", "041901", "0419019", "04190190", "041901901", "419", "4190", "41901", "419019", "4190190", "41901901", "190", "1901", "19019", "190190", "1901901", "901", "9019", "90190", "901901", "019", "0190", "01901", "190", "1901", "901"]
55 tokens. In this example we get 23 fewer tokens to store and search through. Considering this is not the only field with word middle search, the size of the index shrinks even further.
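The token counts above are easy to sanity-check: for a single token of length n, an ngram filter produces (n - k + 1) grams for every size k in its range. A tiny helper, illustrative only:
# Number of ngrams produced for a single token of length n with sizes min..max.
def ngram_count(n, min, max)
  (min..[max, n].min).sum { |k| n - k + 1 }
end

ngram_count(12, 1, 50) # => 78, the Searchkick default applied to '101041901901'
ngram_count(12, 3, 14) # => 55, our 3-14 configuration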
Text Fields That Should Not Be Text Fields
Another hint for optimising the index size comes from the business. Some fields just don't need full text search power - for example, a product SKU in orders search. The use case involves copying the entire SKU number. If we narrow the mapping of the product_sku_codes field down to a plain keyword (which is not analysed at all), we will always be searching for an exact match, which is much faster and also reduces the index size a lot.
class Order < ActiveRecord::Base
  searchkick(
    mappings: {
      order: {
        properties: {
          'product_sku_codes' => {
            type: 'keyword'
          }
        }
      }
    }
  )
end
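With product_sku_codes stored as a plain keyword, an exact-match lookup becomes a simple term query. A sketch using the same raw-body approach as the rest of this post (the SKU value is made up):
Order.search(
  nil,
  body: {
    query: {
      term: {
        'product_sku_codes' => 'SKU-12345' # hypothetical SKU; keyword fields are not analysed, so it must match exactly
      }
    }
  }
)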
Field Mapping Optimisation Results
Applying all these optimisations allowed our indices to shrink in size by 2 to 5 times, depending on the kind of data in those indices. For example, the old orders index was 459.5gb - we were able to reduce it to 150gb.
Search Query
With the mapping adjusted, it turned out that the default Searchkick Order.search('query') no longer worked as expected and was missing results.
The reason is that the mapping and the text search query are tightly coupled: when you change one of them, the other should be adjusted too.
Before you even start optimising the search, make sure you have tests covering all reasonable search edge cases, so that you don't break anything during the rework.
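A minimal sketch of the kind of spec we mean, assuming RSpec and synchronous reindexing in tests (the order attributes are illustrative):
require 'rails_helper'

RSpec.describe 'order text search' do
  it 'finds an order by a middle chunk of its number' do
    order = Order.create!(number: 'N-103190991313')

    Order.reindex              # index synchronously for the test
    Order.search_index.refresh # make the new document searchable immediately

    expect(Order.search('103190991', match: :word_middle).map(&:id)).to include(order.id)
  end
end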
Assuming you've got it covered, let's have a look at a query that matches the mapping we defined above:
# `term` holds the user's search query string
Order.search(
  nil,
  body: {
    query: {
      dis_max: {
        queries: [
          {
            match: {
              "customer_name.analyzed" => {
                query: term,
                boost: 10,
                operator: 'and'
              }
            }
          },
          {
            match: {
              "customer_name.word_middle" => {
                query: term,
                boost: 1,
                operator: 'and',
                analyzer: 'searchkick_word_search'
              }
            }
          },
          {
            match: {
              "customer_name.word_middle" => {
                query: term,
                boost: 10,
                analyzer: 'veeqo_keyword_search'
              }
            }
          }
        ]
      }
    }
  }
)

# veeqo_keyword_search analyzer definition:
# veeqo_keyword_search: {
#   type: 'custom',
#   tokenizer: 'keyword',
#   filter: %w[lowercase asciifolding]
# }
Here we make 3 internal queries to ElasticSearch:
- One with a word match for meaningful text (e.g. "The Red Fox Jumps Over The Lazy Dog"), boosted by 10 to show these matches at the top of the output
- One with word middle matching, so that "Astrophysics" is found when a user types "physics"
- And one that picks up text chunks of 1-2 characters that are filtered out by our updated ngram configuration (3-14 instead of Searchkick's default 1-50)
If you don't need word middle search, you keep only the first query.
Fuzzy Matching
This is a great feature that helps users get proper results when they make a typo (just like when you search in Google), but it has its performance costs, and you'd better know which fields require fuzzy matching and which do not.
Searchkick uses fuzzy matching by default on all searchable text fields.
You can disable it per query with the misspellings: false option like this:
Order.search('fox', misspellings: false)
The thing is, sometimes it is handy - if you type bicicle instead of bicycle, you still get the results. But sometimes it just doesn't make any sense. For instance, if you search for part of an order number, 100501, you will get both 100501 and very weird results like 900501 or 108501, which are not even close to what you're looking for. Searchkick does not provide a way to enable misspellings on a per field basis, so we ended up disabling it completely and got yet another solid performance boost.
Still, since we had to learn how to query ElasticSearch manually and build this query on our own, we now control where to apply fuzzy matching and where not.
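For example, once you build the query yourself, fuzziness becomes just another parameter on the match clauses where it actually helps. A sketch (as before, term is the user's query string; field names and parameter values are illustrative):
# Fuzzy matching only on the human-typed name field; order numbers must match exact terms.
Order.search(
  nil,
  body: {
    query: {
      dis_max: {
        queries: [
          {
            match: {
              'customer_name.analyzed' => {
                query: term,
                operator: 'and',
                fuzziness: 1,      # tolerate a single typo in names
                prefix_length: 1,
                max_expansions: 3
              }
            }
          },
          {
            match: {
              'number.analyzed' => { # no fuzziness: typo'd numbers stay out of the results
                query: term,
                operator: 'and'
              }
            }
          }
        ]
      }
    }
  }
)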
Other optimisations & challenges
If you have a page with filters in your application that is actively used by your users, you may end up with lots of complicated SQL queries to get the filtered results. Those queries have to be well tested to work in combination with the other filtering queries, and a bunch of heavy joins quickly turns into conflicting, slow queries. Now imagine combining those filtering results with the text search results - you'd be in a pretty tough situation.
Here at Veeqo, we've not only optimised ElasticSearch text search but also made use of ElasticSearch filtering and aggregation functionality. Instead of complicated joins that hurt database performance, we ended up building a de-normalised document structure in ElasticSearch to cover all the filtering needs.
That story is worth a separate article, as it has its own pros, cons and challenges.
Everyone loves charts
The chart below demonstrates the outcome of the ElasticSearch text search optimisation. Notice the blue part of the chart - it is the response time from ElasticSearch. Before November 13th, the average response time from ElasticSearch was ~1000ms. After November 14th it became ~35ms. That's 28 times faster!
While working on the index optimisation, we ended up spinning up a brand new ElasticSearch cluster. The reason for this decision was that the old cluster was simply not capable of handling the additional workload required to build the optimised indices on top of it. A side effect of this decision was the opportunity to upgrade ElasticSearch from 5.6 to 6.8, so we killed two birds with one stone.
That being said, we are now able to compare the overall health of both clusters. The new cluster, running the same set of indices (but optimised), has an average response time of about 40ms, compared to the old cluster's unstable 1s, which could at any moment turn into 2s, 5s or even 15s. And all that on a new cluster that is 2 times smaller and cheaper than the old one!