MUG Search in MongoDB

Updated
4 min read

MongoDB - data structure

Data is stored as separate documents, and each document describes its own structure. Documents are written in a JSON-like style as key-value pairs. The structure can be changed at any time as needed, and each document can have different fields, which makes the model polymorphic. Because it is a document database, it scales easily across a cluster. This also helps performance, since related data lives together and there is a single place to read and write. MongoDB has idiomatic drivers that make development easy.
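To make the polymorphic part concrete, here is a minimal sketch using plain Python dictionaries in place of stored documents. The collection contents and field names are made up for illustration.

```python
# Two documents that could live in the same (hypothetical) "products" collection.
# They share some fields but not all: nothing forces them to match.
book = {"_id": 1, "name": "Dune", "price": 9.99, "author": "Frank Herbert"}
laptop = {
    "_id": 2,
    "name": "UltraBook 14",
    "price": 999.0,
    "specs": {"ram_gb": 16, "cpu": "8-core"},  # nested document
}

products = [book, laptop]

# Fields that appear in only one of the two documents.
only_in_one = sorted(set(book) ^ set(laptop))
print(only_in_one)
```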

MongoDB Atlas

A developer platform for managing MongoDB: running queries interactively, browsing data...

MongoDB also runs MongoDB University, which tracks course progress and awards badges through Credly.

Fun facts

  • 82% of users leave after a bad experience with website search in general.

  • 6 times more users search than browse on Amazon.

  • 72% of e-commerce sites fail to deliver a good search user experience.

Search traits

  • performance

  • relevancy

  • cost

  • flexibility

  • security

Simple workflow

Raw text -> Analyzer -> Tokenizer -> Inverted Index -> Ranked results (using BM25 relevancy)

The pipeline starts with the string passed in. The analyzer splits the string into individual words and normalizes them: everything is lowercased, and commas and other special characters are removed. The tokenizer produces the tokens from which the inverted index is built. Looking up a term in the inverted index is close to instant, roughly O(1) per term rather than an O(n) scan over every document. At query time, the documents found through the index are scored with BM25 and returned as a ranked list.
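The workflow above can be sketched in a few lines of plain Python. This is a toy illustration, not how Lucene or Atlas Search is implemented: the `analyze`, `build_index`, and `bm25_search` names are hypothetical, and the BM25 constants `k1=1.2` and `b=0.75` are commonly used defaults assumed here.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to {doc_id: term frequency}; also record document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = analyze(text)
        lengths[doc_id] = len(tokens)
        for token, tf in Counter(tokens).items():
            index[token][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Score every document that contains a query token and rank by BM25."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for token in analyze(query):
        postings = index.get(token, {})  # dictionary lookup, no document scan
        if not postings:
            continue
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    1: "Red running shoes, lightweight.",
    2: "Blue running jacket",
    3: "Red wine glasses",
}
index, lengths = build_index(docs)
results = bm25_search("red shoes", index, lengths)
print(results)
```

Document 1 matches both query words and ranks first, document 3 matches only "red", and document 2 is never touched because none of its tokens are looked up.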

Features

Fuzzy matching

To tolerate typos, a fuzzy matching option can be set on the query so that terms within a small edit distance also match, which broadens the results. This happens when results are retrieved from the index, not when the index is built.
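A rough sketch of the idea behind typo tolerance, assuming plain Levenshtein edit distance and a linear scan over the indexed terms. The real engine works differently and far more efficiently; `edit_distance`, `fuzzy_lookup`, and `max_edits` are illustrative names only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def fuzzy_lookup(term, indexed_terms, max_edits=1):
    """Return the indexed terms within max_edits of the (possibly misspelled) term."""
    return [t for t in indexed_terms if edit_distance(term, t) <= max_edits]


terms = ["shoes", "shirt", "shorts", "socks"]
matches = fuzzy_lookup("shoez", terms)
print(matches)
```

Raising `max_edits` widens the result set, which is the trade-off between tolerance and precision that the fuzziness setting controls.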

Autocomplete

Built on edge n-grams (prefixes) of the indexed terms rather than on whole words. The minimum and maximum n-gram length determine how early and how precisely suggestions match what the user has typed so far.
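One common way to implement autocomplete is to index the prefixes (edge n-grams) of each word. The sketch below assumes that approach; the function names and the `min_grams` / `max_grams` defaults are illustrative choices, not values from the talk.

```python
def edge_ngrams(word, min_grams=2, max_grams=5):
    """Return the prefixes of word with length min_grams..max_grams."""
    word = word.lower()
    longest = min(max_grams, len(word))
    return [word[:n] for n in range(min_grams, longest + 1)]


def build_autocomplete_index(titles, min_grams=2, max_grams=5):
    """Map every prefix of every word to the titles that contain that word."""
    index = {}
    for title in titles:
        for word in title.split():
            for gram in edge_ngrams(word, min_grams, max_grams):
                index.setdefault(gram, set()).add(title)
    return index


titles = ["Running shoes", "Rugby shirt", "Rain jacket"]
index = build_autocomplete_index(titles)

# The user has typed "ru" so far.
suggestions = sorted(index.get("ru", set()))
print(suggestions)
```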

Compound Query

Complex queries can be built by compounding several clauses in one retrieval: words that must be included, words to exclude, constraints on specific attributes, and matches across multiple key-value pairs.
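As a sketch, a compound query can be written as a `$search` stage with `must`, `mustNot`, `should`, and `filter` clauses. The stage is built here as a plain Python dictionary so it runs without a database; the index name `default` and the field names `name`, `description`, and `price` are assumptions for illustration.

```python
compound_stage = {
    "$search": {
        "index": "default",  # assumed index name
        "compound": {
            # Words that have to be present.
            "must": [{"text": {"query": "shoes", "path": "name"}}],
            # Words that exclude a document.
            "mustNot": [{"text": {"query": "kids", "path": "name"}}],
            # Optional words that raise the score when present.
            "should": [{"text": {"query": "running", "path": "description"}}],
            # Constraint on a specific attribute; does not affect the score.
            "filter": [{"range": {"path": "price", "lte": 100}}],
        },
    }
}

pipeline = [compound_stage, {"$limit": 10}]
clause_names = sorted(compound_stage["$search"]["compound"])
print(clause_names)
```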

Faceting

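Faceting groups the results into categories (for example by brand or price range) and counts the matches in each one, so users can filter by them.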
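What a facet computes can be illustrated locally with a counter over a result set. This is only a sketch of the concept, not the Atlas Search facet syntax; the sample results and the `facet_counts` helper are made up.

```python
from collections import Counter

# Pretend these documents came back from a search for "shoes".
results = [
    {"name": "Trail runner", "brand": "Acme", "category": "running"},
    {"name": "Road racer", "brand": "Acme", "category": "running"},
    {"name": "City sneaker", "brand": "Zeta", "category": "casual"},
    {"name": "Court classic", "brand": "Zeta"},  # no category field
]


def facet_counts(docs, field):
    """Count how many result documents fall under each value of field."""
    return Counter(doc[field] for doc in docs if field in doc)


by_brand = facet_counts(results, "brand")
by_category = facet_counts(results, "category")
print(dict(by_brand), dict(by_category))
```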

Lucene

Apache Lucene is the underlying engine that powers Atlas Search.

Code

Search runs as a stage in an aggregation pipeline, where stages are chained like Unix pipes.
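A minimal sketch of such a pipeline, written as a Python list so it can be inspected without a cluster. The index name `default`, the field `name`, and the `collection` object in the comment are assumptions.

```python
pipeline = [
    # Stage 1: full-text search (the search stage comes first in the pipeline).
    {"$search": {"index": "default",
                 "text": {"query": "running shoes", "path": "name"}}},
    # Stage 2: keep only the name and expose the relevance score.
    {"$project": {"_id": 0, "name": 1, "score": {"$meta": "searchScore"}}},
    # Stage 3: return the top five hits.
    {"$limit": 5},
]

# Each stage consumes the output of the previous one, like `a | b | c` in a shell.
stage_names = [next(iter(stage)) for stage in pipeline]
print(" | ".join(stage_names))

# With a live Atlas cluster and PyMongo this would be run as:
#   results = list(collection.aggregate(pipeline))
```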

Usage planning

Plan the search optimization for the best user experience. The things to think about come directly from user stories.

  • What kind of things users look for.

  • Which document fields to tokenize.

  • What the optimal fuzziness is.

  • What kind of advanced search to provide.

  • How to present the search results.

  • Whether to use autocomplete, and with which parameters.

  • Reevaluate all of the above for performance optimization.

Search types

Lexical vs Semantic

Lexical search looks for the words themselves. Semantic search looks for meaning: it matches on characteristics, even when the words differ.

What is a vector?

An "arrow" that has a direction and a length.

In the database, a vector is represented as an array of floating-point numbers that encodes its direction across many dimensions.

How close two items are in meaning is determined by a distance metric: the angle between the vectors (cosine), the distance from endpoint to endpoint (Euclidean), or the dot product.

"Euclidean" vs "Cosine" vs "dot product"

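The three measures can be compared side by side on tiny two-dimensional vectors. This is a plain-Python sketch for intuition; real embeddings have hundreds or thousands of dimensions.

```python
import math


def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between the two endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # perpendicular to a

print(cosine_similarity(a, b), euclidean(a, b), dot(a, b))
print(cosine_similarity(a, c), euclidean(a, c), dot(a, c))
```

Vectors `a` and `b` point the same way, so cosine treats them as identical even though their endpoints are one unit apart.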
Pipeline

Data -> encoder -> vector -> semantic search -> RAG

Data is loaded and put through the MRI encoder, which produces a vector encoding and saves it in an encoding attribute on the document. With a vector for every item, the whole data set is mapped into one space, which is what semantic search and RAG retrieval work from.

Vector quantization trades accuracy for speed. Reducing the values from 32-bit to 16-bit or 8-bit makes the vectors cheaper to store and process.
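To show where the accuracy loss comes from, here is a minimal sketch of scalar quantization to 8-bit integers. It assumes a simple min-max scheme chosen for illustration; it is not necessarily the scheme MongoDB uses, and the function names are hypothetical.

```python
def quantize_int8(vector):
    """Map floats onto the 256 integer codes -128..127 using min-max scaling."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid dividing by zero for constant vectors
    codes = [round((x - lo) / scale) - 128 for x in vector]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate the original floats from the integer codes."""
    return [(c + 128) * scale + lo for c in codes]


vector = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize_int8(vector)
restored = dequantize(codes, lo, scale)

# The rounding step is where precision is lost.
max_error = max(abs(x - y) for x, y in zip(vector, restored))
print(codes, max_error)
```

Each value now fits in one byte instead of four, and the reconstruction error stays within half a quantization step.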

The encoding model is what adds the context. The encoding model is provided by MongoDB.

Voyage AI is a company now owned by MongoDB.

Search query in code

.aggregate(vector search...)

parameters:

  • exact

  • filter

  • limit

  • numCandidates

  • queryVector (the query)

  • index (the index name)

The nearest-neighbor algorithm is also selected through the parameters: ANN (approximate) vs kNN (k-nearest) vs ENN (exact).
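Put together, a vector search stage might look like the sketch below. It is built as a Python dictionary so it runs without a cluster. The index name `vector_index`, the field `embedding`, the filter field `category`, and the three-number query vector are placeholders; a real query vector comes from the same encoding model that produced the stored vectors.

```python
def vector_search_stage(query_vector, limit=10, num_candidates=100,
                        exact=False, category=None):
    """Build a $vectorSearch stage; `exact` switches between ANN and ENN."""
    stage = {
        "index": "vector_index",      # assumed index name
        "path": "embedding",          # assumed field holding the stored vectors
        "queryVector": query_vector,
        "limit": limit,
    }
    if exact:
        stage["exact"] = True         # exact search scans all candidates
    else:
        stage["numCandidates"] = num_candidates  # ANN: candidates to consider
    if category is not None:
        stage["filter"] = {"category": {"$eq": category}}  # pre-filter
    return {"$vectorSearch": stage}


query_vector = [0.12, -0.03, 0.88]  # placeholder, far shorter than a real embedding
ann_pipeline = [
    vector_search_stage(query_vector, category="shoes"),
    {"$project": {"name": 1, "score": {"$meta": "vectorSearchScore"}}},
]
enn_stage = vector_search_stage(query_vector, exact=True)
print(ann_pipeline[0])
```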

Hybrid search simply means running lexical and semantic search in parallel. The combined results can be scored in two ways.

Relative Score Fusion

Runs both searches and combines their proximity (relevance) scores into one score per result.
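A minimal sketch of score-based fusion, assuming min-max normalization and equal weights so that the two score scales become comparable. The normalization choice, the weights, and the sample scores are illustrative assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(lexical, semantic, w_lex=0.5, w_sem=0.5):
    """Combine normalized scores; a document missing from one list gets 0 there."""
    lex, sem = min_max(lexical), min_max(semantic)
    fused = {
        doc_id: w_lex * lex.get(doc_id, 0.0) + w_sem * sem.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


lexical = {"a": 12.0, "b": 7.0, "c": 2.0}     # e.g. BM25 scores
semantic = {"b": 0.91, "c": 0.89, "d": 0.60}  # e.g. cosine similarities
fused = relative_score_fusion(lexical, semantic)
print(fused)
```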

Reciprocal Rank Fusion

Runs both searches, orders each result list, and scores every result by its rank position in the lists.
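Rank-based fusion can be sketched as follows: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant `k = 60` is a commonly used default assumed here, and the sample rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids using reciprocal ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


lexical_ranking = ["a", "b", "c"]   # best lexical match first
semantic_ranking = ["b", "c", "d"]  # best semantic match first
fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print(fused)
```

Only positions matter, so the raw scores of the two searches never need to be on the same scale.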

Real life example

E-commerce company

Context:

  • product catalog search

  • MongoDB + regex as the search engine

  • 500,000+ promotions a year, 2,000+ stores

  • problem definition: poor performance and zero typo tolerance

MongoDB proposed solution:

Implement fuzzy search using Atlas Search.
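A sketch of what such a query could look like in place of a regex match, built as a Python list so it runs without a cluster. The index name `default`, the field `name`, and the helper `fuzzy_product_search` are assumptions, not details from the case study.

```python
def fuzzy_product_search(term, max_edits=1, limit=20):
    """Build a pipeline that searches product names and tolerates typos."""
    if max_edits not in (1, 2):
        raise ValueError("max_edits is expected to be 1 or 2")
    return [
        {
            "$search": {
                "index": "default",  # assumed index name
                "text": {
                    "query": term,
                    "path": "name",  # assumed product-name field
                    "fuzzy": {"maxEdits": max_edits},
                },
            }
        },
        {"$limit": limit},
    ]


pipeline = fuzzy_product_search("choclate")  # misspelled on purpose
print(pipeline)
```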