Kyo Suayan | suayan.com | Text Search in MongoDB

Text Search in MongoDB

Friday, March 10th 2023

MongoDB can be used as a search engine by leveraging its text search capabilities, which are provided through the use of text indexes.

Text indexes allow you to search for text within string fields of documents in a collection. To use text indexes, you must first create a text index on one or more fields in your collection. Here's an example:

db.articles.createIndex( { content: "text" } )

This creates a text index on the content field of the articles collection. Once the index is created, you can use the $text operator to search for documents that contain a given term or phrase:

db.articles.find( { $text: { $search: "mongodb" } } )

This returns all documents in the articles collection that contain the term "mongodb" in the content field.

You can also use the $text operator to perform more complex searches, such as searching for multiple terms or excluding certain terms:

db.articles.find( { $text: { $search: "mongodb tutorial -nosql" } } )

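This returns documents that contain "mongodb" or "tutorial" (or both) in the content field, and excludes any document that contains "nosql". Terms separated by spaces are combined with a logical OR, not an AND; documents that match more of the terms generally score higher.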
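
To make the roles of the pieces in a $search string concrete, here is a rough JavaScript sketch that splits a string into ordinary terms, negated terms, and quoted phrases. The function name describeSearch is hypothetical, and the parsing is deliberately simplified (no stemming, stop words, or language rules), so it illustrates the semantics rather than reproducing MongoDB's tokenizer.

```javascript
// Simplified illustration only: MongoDB's real tokenizer also applies
// stemming, stop-word removal, and language-specific rules.
function describeSearch(search) {
  const phrases = [];
  // Pull out quoted phrases first; a phrase must appear for a document to match.
  const rest = search.replace(/"([^"]+)"/g, (_, phrase) => {
    phrases.push(phrase.toLowerCase());
    return " ";
  });

  const terms = [];
  const negated = [];
  for (const token of rest.split(/\s+/).filter(Boolean)) {
    if (token === "-") continue; // stray hyphen, nothing to negate
    if (token.startsWith("-")) {
      negated.push(token.slice(1).toLowerCase()); // documents with these are excluded
    } else {
      terms.push(token.toLowerCase()); // ordinary terms are ORed together
    }
  }
  return { terms, negated, phrases };
}

const parsed = describeSearch("mongodb tutorial -nosql");
console.log(parsed);
```

For the query above, the sketch reports two ordinary terms, one negated term, and no phrases, which matches the "mongodb OR tutorial, but not nosql" reading.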

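MongoDB also provides options for refining text search. Matching is case-insensitive by default, and terms are stemmed according to the language of the index. The $caseSensitive, $diacriticSensitive, and $language fields of the $text operator change this behavior. To match an exact phrase, wrap it in escaped double quotes inside the $search string.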
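
As a minimal sketch of how these options fit together, the helper below builds a $text filter object. The helper name textFilter and its option names are assumptions made for illustration; the $search, $language, $caseSensitive, and $diacriticSensitive fields are the ones the $text operator accepts.

```javascript
// Hypothetical helper: builds a $text filter and only sets the options
// that the caller actually provides.
function textFilter(search, { language, caseSensitive, diacriticSensitive } = {}) {
  const text = { $search: search };
  if (language !== undefined) text.$language = language;
  if (caseSensitive !== undefined) text.$caseSensitive = caseSensitive;
  if (diacriticSensitive !== undefined) text.$diacriticSensitive = diacriticSensitive;
  return { $text: text };
}

// An exact phrase (note the inner double quotes), matched case-sensitively,
// using English language rules.
const phraseFilter = textFilter('"MongoDB tutorial"', {
  language: "en",
  caseSensitive: true,
});
console.log(JSON.stringify(phraseFilter));

// In mongosh, the filter could be passed straight to find():
//   db.articles.find(phraseFilter)
```

Leaving an option out keeps MongoDB's default for it, which is why the helper skips undefined values instead of sending false.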

Using text indexes in MongoDB can be a powerful way to implement full-text search functionality in your application, without the need for a separate search engine. However, it's important to note that text search in MongoDB is not as full-featured as dedicated search engines like Elasticsearch or Solr, so it may not be suitable for more complex search use cases.

Using Text Search in an Aggregation Pipeline

Here's an example of using a text index in an aggregation pipeline in MongoDB:

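Assume the articles collection has title and content fields. A collection can have only one text index, so if you created the content-only index from the earlier example, drop it first (for example with db.articles.dropIndex("content_text")). Then create a text index that covers both fields: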

db.articles.createIndex({ title: "text", content: "text" })

Now we can use the $text operator in an aggregation pipeline to perform a search on the title and content fields.

Let's say we want to search for articles that contain the word "MongoDB" in either the title or content field. We can use the following pipeline:

db.articles.aggregate([
  { $match: { $text: { $search: "MongoDB" } } },
  { $project: { score: { $meta: "textScore" }, title: 1, content: 1 } },
  { $sort: { score: { $meta: "textScore" } } }
])

In this pipeline, we first use the $match stage to filter the articles that match the search query using the $text operator. The $text operator is used to search for the term "MongoDB" in the title and content fields.

Next, the $project stage keeps the title and content fields and adds a score field populated from the textScore metadata. The score reflects how well each document matches the search query.

Finally, the $sort stage orders the results by that score. Sorting on { $meta: "textScore" } is always descending, so the articles with the highest score (i.e. the best matches) are returned first.

This pipeline will return a list of articles that contain the term "MongoDB" in either the title or content field, sorted by relevance.

Note that in an aggregation pipeline, the $text operator can only appear inside a $match stage, and that $match stage must be the first stage of the pipeline.
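
The pipeline above can be wrapped in a small function so the search string and result count become parameters. This is only a sketch: the function name buildTextSearchPipeline and the limit parameter are assumptions, and the stages mirror the example pipeline with a $limit added at the end.

```javascript
// Hypothetical helper that returns the pipeline as a plain array of stages.
function buildTextSearchPipeline(search, limit = 10) {
  return [
    // $text must sit in a $match that is the first stage of the pipeline.
    { $match: { $text: { $search: search } } },
    // Expose the relevance score alongside the fields we want to return.
    { $project: { score: { $meta: "textScore" }, title: 1, content: 1 } },
    // Best matches first; a textScore sort is always descending.
    { $sort: { score: { $meta: "textScore" } } },
    // Cap the number of results returned to the caller.
    { $limit: limit },
  ];
}

const pipeline = buildTextSearchPipeline("MongoDB", 5);
console.log(JSON.stringify(pipeline, null, 2));

// In mongosh, the result could be run with:
//   db.articles.aggregate(pipeline)
```

Building the pipeline as data keeps the "$match with $text comes first" rule in one place, which makes it harder to break by accident when more stages are added later.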

A slightly more complex set of search criteria

Here's an example of using text search with more complex search criteria:

Assume we have a collection of books with fields title, author, description, genre, and tags, and we want to search for books that match the following criteria:

The title or author contains the word "history"
The description contains the phrase "ancient civilizations"
The genre is either "history" or "archaeology"
The tags contain at least one of the following words: "Egypt", "Rome", "Greece"

We can create a text index on the relevant fields using the following command:

db.books.createIndex({
  title: "text",
  author: "text",
  description: "text",
  genre: "text",
  tags: "text"
})

The $text operator cannot target individual fields: it searches every field in the text index at once. The $search string also has no grouping or field-scoping syntax. The word and phrase criteria therefore go into $search, while the genre and tags criteria are expressed as regular query conditions in the same $match stage:

db.books.aggregate([
  {
    $match: {
      $text: {
        $search: "history \"ancient civilizations\""
      },
      genre: { $in: ["history", "archaeology"] },
      tags: { $in: ["Egypt", "Rome", "Greece"] }
    }
  },
  {
    $project: {
      title: 1,
      author: 1,
      description: 1,
      genre: 1,
      tags: 1,
      score: { $meta: "textScore" }
    }
  },
  {
    $sort: {
      score: { $meta: "textScore" }
    }
  }
])

In this example, the $match stage combines a text search with ordinary filters:

history: a search term matched against all indexed fields, not only title and author
"ancient civilizations": a phrase that must appear in a matching document; like the term, it is checked against all indexed fields, not only description
genre: { $in: [...] }: keeps only books whose genre is exactly "history" or "archaeology" (this comparison is case-sensitive)
tags: { $in: [...] }: keeps only books whose tags include at least one of "Egypt", "Rome", or "Greece"

Because the $search string contains a phrase, only documents that include the phrase can match. If the word and phrase criteria really must be limited to specific fields, $text alone cannot express that; you would need a narrower text index or additional conditions on those fields.

This pipeline returns the books that match the criteria, sorted by relevance. The output documents include the title, author, description, genre, tags, and score fields, where score is the relevance score for the text-search part of the query.

How relevance score is calculated

In the above example, the relevance score is calculated by MongoDB's text search engine based on how well each document matches the search criteria specified in the $text operator.

Text relevance is often described in terms of "term frequency-inverse document frequency" (TF-IDF), but MongoDB's documentation does not describe $text scoring that way, so treat TF-IDF as background rather than as the exact algorithm.

TF-IDF is a measure of how important a particular word is to a document, relative to its importance in the collection as a whole. It takes into account two factors:

Term frequency (TF): the number of times a given word appears in a document. Documents that contain more occurrences of the search terms get a higher score.
Inverse document frequency (IDF): a measure of how rare a word is across the entire collection. Search terms that appear in fewer documents have a higher IDF, so they count for more in the documents that contain them.

What MongoDB documents for $text is frequency-based: for each indexed field, it multiplies the number of matches by that field's weight (1 by default) and sums the results, then calculates the document's score from that sum. The exact formula is not spelled out and may differ between text index versions, so scores are best used only to rank results within a single query.

In the examples above, the { $meta: "textScore" } expression retrieves the relevance score that MongoDB computed for each document. The score field in the output documents contains a non-negative number that is not normalized to a fixed range such as 0 to 1; higher values indicate more relevant matches.
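
One documented way to influence the score is the weights option of a text index. The sketch below shows hypothetical weights for the books collection and a small helper, weightedMatchSum (a name invented here), that reproduces only the documented intermediate step of multiplying matches by field weight and summing. It does not compute MongoDB's final score, and the match counts are made-up inputs.

```javascript
// Illustrative index definition; the weight values and index name are
// assumptions. Fields without an explicit weight default to 1.
const indexKeys = { title: "text", author: "text", description: "text" };
const indexOptions = {
  weights: { title: 10, author: 5 },
  name: "BookTextIndex",
};

// In mongosh (after dropping any existing text index on the collection):
//   db.books.createIndex(indexKeys, indexOptions)

// Documented intermediate step only: matches per field times field weight,
// summed across fields. The final textScore is derived from this sum.
function weightedMatchSum(matchesPerField, fieldWeights) {
  let sum = 0;
  for (const [field, matches] of Object.entries(matchesPerField)) {
    sum += matches * (fieldWeights[field] ?? 1);
  }
  return sum;
}

// One match in title (weight 10) and three in description (default weight 1).
const sum = weightedMatchSum({ title: 1, description: 3 }, indexOptions.weights);
console.log(sum); // 13
```

With these weights, a single match in the title counts for more than several matches in the description, which is usually what you want when titles are short and descriptive.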
