MUG Search in MongoDB
MongoDB - data structure
Data is stored as separate documents, and each document carries its own structure. Documents are written in JSON style as key-value pairs. The structure can be changed at any time as needed, and each document can have different fields, which makes the model polymorphic. Because it is a document database, it scales easily across a cluster. It is also good for performance, since a read or write for one entity touches a single document. MongoDB has idiomatic drivers that make development easy.
MongoDB Atlas
A developer platform for managing MongoDB: running queries interactively, viewing data...
MongoDB University tracks course progress and awards achievement badges through Credly.
Search
Fun facts
82% of users leave after a bad experience with a website's search.
On Amazon, 6 times more users go for search than for browsing.
72% of e-commerce sites fail to deliver a good search user experience.
Search traits
performance
relevancy
cost
flexibility
security
Simple workflow
Raw text -> Analyzer -> Tokenizer -> Inverted Index -> Ranked results (using BM25 relevancy)
The string passed to the pipeline is the starting point. The analyzer takes the string and splits it into separate words. These are then normalized: lowercased, with commas and other special characters removed. The tokenizer produces the tokens from which the inverted index is built. The inverted index maps each token to the documents that contain it, so looking up a term is close to constant time, O(1), rather than a scan over all documents, O(n). At query time, the documents found through the index are scored with BM25 and returned as a ranked list.
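A minimal sketch of the workflow above, in plain Python rather than Lucene. The helper names (analyze, build_index, bm25_search), the sample documents, and the BM25 constants k1=1.2 and b=0.75 are assumptions for illustration; the real analyzers and scoring in Atlas Search are more involved.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to {doc_id: term frequency} and record document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = analyze(text)
        lengths[doc_id] = len(tokens)
        for token, tf in Counter(tokens).items():
            index[token][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Score documents for the query with the standard BM25 formula."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for token in analyze(query):
        postings = index.get(token, {})  # one dictionary lookup per query term
        if not postings:
            continue
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: "Red running shoes, lightweight.",
    2: "Blue running jacket",
    3: "Red wool scarf",
}
index, lengths = build_index(docs)
results = bm25_search("red shoes", index, lengths)
```

Document 1 matches both query terms and ranks first; document 3 matches only "red"; document 2 is never touched because neither term points to it.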
Features
Fuzzy matching
To tolerate some typos, the fuzzy option can be set so that a query term also matches indexed terms that are a few character edits away, which widens the result set. This happens when results are retrieved from the index.
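The idea behind typo tolerance can be sketched with edit distance. This is only an illustration: edit_distance and fuzzy_lookup are hypothetical helpers, the vocabulary is made up, and a real engine matches fuzzy terms far more efficiently than this brute-force comparison.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(term, vocabulary, max_edits=1):
    """Return the indexed terms within max_edits of the query term."""
    return sorted(t for t in vocabulary if edit_distance(term, t) <= max_edits)


vocabulary = {"shoes", "shirt", "short", "scarf"}
matches = fuzzy_lookup("shoez", vocabulary)
wider = fuzzy_lookup("shoez", vocabulary, max_edits=2)
```

Raising max_edits spreads the results wider: with one edit only "shoes" matches, with two edits "short" is included as well.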
Autocomplete
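Built on edge n-grams (edge gram tokenization): each indexed term is broken into its prefixes, so a partially typed word can already match. The minimum and maximum gram lengths determine how early and how precisely the suggestions match.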
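One common way to implement autocomplete is with edge n-grams, where every term is indexed under each of its prefixes. The sketch below assumes that approach; edge_ngrams, build_prefix_index, and the min_grams/max_grams defaults are illustrative choices, not the engine's actual code.

```python
def edge_ngrams(term, min_grams=2, max_grams=5):
    """Return the prefixes of term with lengths from min_grams to max_grams."""
    upper = min(max_grams, len(term))
    return [term[:n] for n in range(min_grams, upper + 1)]


def build_prefix_index(terms, min_grams=2, max_grams=5):
    """Map every edge n-gram to the set of original terms it came from."""
    index = {}
    for term in terms:
        for gram in edge_ngrams(term.lower(), min_grams, max_grams):
            index.setdefault(gram, set()).add(term)
    return index


prefix_index = build_prefix_index(["Shoes", "Shirt", "Scarf"])
suggestions = sorted(prefix_index.get("sh", set()))
```

Typing "sh" already returns both "Shirt" and "Shoes"; a larger min_grams would delay suggestions until more characters are typed.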
Compound Query
For complex, advanced queries, several clauses can be compounded in one retrieval: words that must be included, words that must be excluded, conditions on specific attributes, and matches across multiple key-value pairs.
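A sketch of what such a query could look like as a $search stage built in Python. The compound clauses must, mustNot, and filter and the text and range operators are used as I understand them from Atlas Search; the index name "default" and the fields name and price are hypothetical.

```python
def build_compound_search(include, exclude, min_price, index_name="default"):
    """Build a $search stage: name must match `include`, must not match `exclude`,
    and price must be at least `min_price` (field names are hypothetical)."""
    return {
        "$search": {
            "index": index_name,
            "compound": {
                "must": [{"text": {"query": include, "path": "name"}}],
                "mustNot": [{"text": {"query": exclude, "path": "name"}}],
                "filter": [{"range": {"path": "price", "gte": min_price}}],
            },
        }
    }


stage = build_compound_search("shoes", "kids", 20)
```

The resulting dictionary would be placed as the first stage of an aggregation pipeline.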
Faceting
Groups the search results into categories and counts the matches in each one, so the results can be filtered by those categories.
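What faceting computes can be shown in plain Python. This is a conceptual sketch only: facet_counts is a hypothetical helper and the product documents are made up; in a search engine the counts come from the index rather than from looping over results.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many matching documents fall into each value of `field`."""
    return Counter(doc[field] for doc in results if field in doc)


results = [
    {"name": "Red running shoes", "brand": "Acme", "color": "red"},
    {"name": "Blue running shoes", "brand": "Acme", "color": "blue"},
    {"name": "Red trail shoes", "brand": "Peak", "color": "red"},
]
by_brand = facet_counts(results, "brand")
by_color = facet_counts(results, "color")
```

These counts are what a shop typically shows next to each filter, such as a brand name followed by the number of matching products.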
Lucene
Apache Lucene is the underlying engine that powers Atlas Search.
Code
Search runs as a stage in an aggregation pipeline, where stages are chained like Unix pipes.
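A sketch of such a pipeline as it could be written for PyMongo. The stage names $search, $limit, and $project and the searchScore metadata are used as I understand them; the index name "default", the field name, and the helper build_search_pipeline are assumptions.

```python
def build_search_pipeline(query, limit=10, index_name="default"):
    """Chain stages like a Unix pipe: search, then limit, then shape the output."""
    return [
        {
            "$search": {
                "index": index_name,
                "text": {
                    "query": query,
                    "path": "name",            # hypothetical field to search
                    "fuzzy": {"maxEdits": 1},  # tolerate one typo per term
                },
            }
        },
        {"$limit": limit},
        {"$project": {"_id": 0, "name": 1, "score": {"$meta": "searchScore"}}},
    ]


pipeline = build_search_pipeline("runing shoes", limit=5)
# Against an Atlas cluster this would be run as: collection.aggregate(pipeline)
```

Each stage receives the output of the previous one, so the order matters: the search stage comes first.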
Usage planning
Plan the search setup for the best user experience. The questions to think about come directly from user stories.
What users will look for.
Which document fields to tokenize.
What the optimal fuzziness is.
What kind of advanced search to provide.
How to present the search results.
Re-evaluate all of the above for performance optimization.
Whether to use autocomplete, and with which parameters.
Search types
Lexical vs Semantic
Lexical search looks for the words in the query; semantic search finds things by their characteristics and meaning, even when the words do not match.
Semantic search
What is a vector?
An "arrow" that has a direction and a length.
In the database, a vector is represented as an array of floating-point numbers, one per dimension, which together describe its direction in a multidimensional space.
How close two items are in meaning is determined by a distance metric: cosine (the angle between the vectors), Euclidean (the endpoint-to-endpoint distance), or dot product.
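The three metrics can be written out in a few lines of plain Python. The function names and the sample vectors are for illustration only.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between the endpoints of the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 3.0]  # perpendicular to a
```

Here a and b point the same way, so their cosine similarity is 1 even though their Euclidean distance is 1; a and c are perpendicular, so their cosine similarity and dot product are both 0.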
"Euclidean" vs "Cosine" vs "dot product"
Pipeline
Data -> encoder -> vector -> semantic search -> RAG
Data is loaded and put through the MRI encoder, which produces a vector encoding and saves it in an encoding attribute of the document. With a vector for every document, the whole collection can be searched by meaning, which is what RAG retrieval relies on.
Vector quantization trades accuracy for speed: reducing each value from 32 bits to 16 or 8 bits makes the vectors smaller and simpler to process.
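A minimal sketch of the accuracy vs speed trade-off, using simple scalar quantization to 8-bit integers. This is a generic technique chosen for illustration, not necessarily the scheme the database uses; quantize_int8 and dequantize are hypothetical helpers.

```python
def quantize_int8(vector):
    """Scale floats into the int8 range [-127, 127] and round to integers."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # avoid a zero scale
    return [round(x / scale) for x in vector], scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [code * scale for code in codes]


vector = [0.12, -0.5, 0.33, 0.9]
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)
max_error = max(abs(x - y) for x, y in zip(vector, restored))
```

Each value now fits in one byte instead of four, at the cost of a rounding error of at most half a quantization step per value.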
The encoding model is the component that captures context. An encoding model is provided by MongoDB.
Voyage AI is a company now owned by MongoDB.
Search query in code
.aggregate(vector search...)
parameters:
exact
filter
limit
numCandidates
query
index name
The nearest-neighbor algorithm to use is also chosen through the parameters. ANN (approximate) vs kNN (k-nearest) vs ENN (exact)
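A sketch of how these parameters could fit together in a $vectorSearch stage built in Python. The field names index, path, queryVector, numCandidates, limit, filter, and exact are given as I understand the stage; the index name "vector_index", the fields embedding, name, and category, and the helper build_vector_search are assumptions.

```python
def build_vector_search(query_vector, limit=5, num_candidates=100, exact=False,
                        pre_filter=None, index_name="vector_index", path="embedding"):
    """Build an aggregation pipeline with a $vectorSearch stage."""
    stage = {
        "index": index_name,          # name of the vector index (assumed)
        "path": path,                 # field holding the stored vectors (assumed)
        "queryVector": query_vector,  # the encoded query
        "limit": limit,               # number of results to return
    }
    if exact:
        stage["exact"] = True  # exact search: compare against every vector
    else:
        stage["numCandidates"] = num_candidates  # approximate: candidates to consider
    if pre_filter is not None:
        stage["filter"] = pre_filter  # narrow the documents before comparing vectors
    return [
        {"$vectorSearch": stage},
        {"$project": {"_id": 0, "name": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


approximate = build_vector_search([0.1, 0.2, 0.3], pre_filter={"category": "shoes"})
exhaustive = build_vector_search([0.1, 0.2, 0.3], exact=True)
```

The approximate variant trades a little recall for speed through numCandidates, while the exact variant omits it and checks every vector.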
Hybrid Search
Hybrid search simply means running lexical and semantic search in parallel. There are two ways to combine their scores into the final results.
Relative Score Fusion
Runs both searches and combines the relevance scores of each result into one value.
Reciprocal Rank Fusion
Runs both searches, orders each result list, and combines the results by their rank positions in the lists rather than by their raw scores.
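Both fusion methods can be sketched in plain Python. The constant k=60 in reciprocal rank fusion is a commonly used default and an assumption here, as are the min-max normalization in the score fusion, the function names, and the sample data.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) over every ranked list a document appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def relative_score_fusion(score_maps):
    """Min-max normalize each score set to [0, 1], then add the scores up."""
    fused = {}
    for scores in score_maps:
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid dividing by zero when all scores match
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + (score - low) / span
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


lexical_ranking = ["a", "b", "c"]
semantic_ranking = ["b", "d", "a"]
by_rank = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])

lexical_scores = {"a": 9.0, "b": 6.0, "c": 3.0}
semantic_scores = {"b": 0.9, "d": 0.8, "a": 0.5}
by_score = relative_score_fusion([lexical_scores, semantic_scores])
```

Document "b" wins in both cases because it ranks high in both searches, while "c" and "d" each appear in only one list.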
Real life example
E-commerce company
Context:
product catalog search
MongoDB + regex queries as the search engine
500,000+ promotions a year, 2,000+ stores
Problem definition: performance + zero typo tolerance
MongoDB proposed solution:
Implement fuzzy search using Atlas Search.
