Vector Search and Embeddings in Nrtsearch

What Are Vector Embeddings?

A vector embedding is a simple data structure: a list of N floating-point numbers (N dimensions) that represents complex data such as photos, videos, or text, which traditional databases and datastores cannot store in a searchable way. Search engines that support vector search use this representation to find similar documents with KNN (k-nearest neighbors) or ANN (approximate nearest neighbors) algorithms.

For example, one can search for photos using textual descriptions such as “vegetarian pizza.” To achieve this, we need to first convert a large number of photos into vector embeddings, such as:

[[1.0, 0.3, 2.1], [0.01, 0.02, 1.24], ...]

using an appropriate algorithm and store them in a database that supports vector search. Subsequently, users’ textual queries will be transformed into vector embeddings, such as:

[1.0, 3.2, 1.3]

The search engine then determines the most relevant photos by traversing a graph of the stored vectors and scoring candidates with a similarity function such as cosine similarity or dot product.
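To make the similarity step concrete, here is a minimal Python sketch of an exact (brute-force) nearest-neighbor search using cosine similarity over the example vectors above. It is only an illustration: the function names are hypothetical, and a real engine would use an approximate graph-based search rather than comparing the query against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the indices of the k vectors most similar to the query.

    This is an exact search: every stored vector is compared to the query.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Example vectors taken from the text above.
photo_vectors = [[1.0, 0.3, 2.1], [0.01, 0.02, 1.24]]
query_vector = [1.0, 3.2, 1.3]

nearest = knn(query_vector, photo_vectors, k=1)
print(nearest)
```

The cost of this exact approach grows linearly with the number of stored vectors, which is why large indexes rely on approximate methods.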

You can find more detailed information about vector embeddings in the following articles:

Vector Search Applications

The following is a list of example applications that can benefit from vector search:

  • Photo search

  • Review search

  • Similar businesses search

  • Bot detection

  • Chatbots

Vector Embeddings on Nrtsearch

Nrtsearch is based on Lucene, the popular open-source search library. Vector search support was added to Lucene starting with version 9. Lucene uses a Hierarchical Navigable Small World (HNSW) graph to traverse vector data and find the most relevant results. It supports vector-only search as well as hybrid search (vector + text).

More information about Lucene’s support of vector search can be found at:

In the following sections, we will go through different steps to launch and configure an Nrtsearch cluster with vector search support.

Launching Cluster with Vector Search Support

Estimate Cluster Size

Before launching a cluster, estimate how large the data will be.

As a general guideline, the storage required by embeddings can be estimated with the following formula:

Size in bytes = Total Num docs * Float Size * (Embeddings dimensions + hnsw additional storage)

For example, if your index will have around 390M docs with embeddings of 512 dimensions, and the HNSW additional storage is taken as 12, the vector data in this cluster will need around 817GB:

=> ~390M * 4 * (512 + 12) = ~817GB

Note that the above formula is only for vector fields. Other fields in the index will require additional space.
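The estimate above can be reproduced with a few lines of Python. This is a small sketch of the formula as stated; the function name is hypothetical, and the default values (4-byte floats and an HNSW additional storage of 12) are simply the numbers used in the example, not values read from any configuration.

```python
def estimate_vector_storage_bytes(num_docs, dimensions,
                                  hnsw_additional_storage=12, float_size=4):
    """Size in bytes = num docs * float size * (dimensions + HNSW additional storage)."""
    return num_docs * float_size * (dimensions + hnsw_additional_storage)


# Values from the example: ~390M docs with 512-dimensional embeddings.
size_bytes = estimate_vector_storage_bytes(390_000_000, 512)
size_gb = size_bytes / 1e9  # decimal gigabytes

print(f"{size_gb:.0f} GB")
```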

Configuring Cluster

To use vector search, add a VECTOR type field to an existing index, or create a new index with vector fields. See Vector for more information.

Ingestion

Add the embeddings using AddDocumentRequests. Each embedding is passed in the value field as a JSON-serialized list of floats. The number of floats must match the number of dimensions specified in the vector field definition. Example request in JSON format:

{
  "indexName": "vector_test",
  "fields": {
    "photo_id": {
      "value": ["1"]
    },
    "photo_embeddings": {
      "value": ["[0.188423157, 0.246743672, ...]"]
    }
  }
}
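A request body of this shape can be built programmatically. The following Python sketch only constructs the JSON shown above; the helper name is hypothetical, the two-element embedding is a shortened placeholder (a real one must match the field's dimensions), and sending the request to the cluster is left out because it depends on your client setup.

```python
import json


def build_add_document_payload(index_name, photo_id, embedding):
    """Build a request body in the shape of the example above.

    The embedding is serialized to a JSON string, because the vector
    value is passed as a JSON-serialized list of floats.
    """
    return {
        "indexName": index_name,
        "fields": {
            "photo_id": {"value": [str(photo_id)]},
            "photo_embeddings": {"value": [json.dumps(embedding)]},
        },
    }


# Shortened placeholder embedding, for illustration only.
payload = build_add_document_payload("vector_test", 1, [0.188423157, 0.246743672])
request_json = json.dumps(payload, indent=2)
print(request_json)
```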

You can also use lucene-client to load the documents with vector fields. Example CSV input file:

photo_id,photo_embeddings
1,"[0.188423157, 0.246743672, …]"
2,"[0.188423157, 0.246743672, …]"
3,"[0.188423157, 0.246743672, …]"
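A CSV file in this layout can be generated with Python's standard csv module, which quotes the embedding column automatically because it contains commas. This is a sketch under stated assumptions: the function name and the sample rows are hypothetical, and the embeddings are shortened placeholders.

```python
import csv
import io
import json


def write_vector_csv(rows, out):
    """Write (photo_id, embedding) pairs as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["photo_id", "photo_embeddings"])
    for photo_id, embedding in rows:
        # The embedding becomes a JSON list; csv.writer wraps it in
        # double quotes since the value contains commas.
        writer.writerow([photo_id, json.dumps(embedding)])


# Hypothetical sample rows with shortened embeddings.
rows = [(1, [0.188423157, 0.246743672]), (2, [0.5, 0.25])]

buffer = io.StringIO()
write_vector_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.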

Optimizing Search Queries

The vector hits value represents the number of documents traversed during the vector search. It equals the number of vector comparisons performed, which is the major factor in both search latency and accuracy. Any change that reduces the vector hits number will decrease latency at the expense of accuracy.

A summary of trade-offs for each config:

  • Improve Search Latency

    • Lower num_candidates

      • Lower vector hits

      • Lower accuracy

    • Lower indexing parameter values (hnsw_m, hnsw_ef_construction)

      • Lower vector hits

      • Lower accuracy

      • Lower indexing work

      • Requires reindexing

  • Improve Indexing Throughput

    • Lower indexing parameter values (hnsw_m, hnsw_ef_construction)

      • Lower vector hits

      • Lower accuracy

      • Lower indexing work

      • Requires reindexing

  • Improve Accuracy

    • Higher num_candidates

      • Higher vector hits

      • Higher accuracy

    • Higher indexing parameter values (hnsw_m, hnsw_ef_construction)

      • Higher vector hits

      • Higher accuracy

      • Higher indexing work

      • Requires reindexing