What is indexing?
Indexing as a general concept is used to enhance speed of search and retrieval (ie, lookup) at the expense of storage space. And an index is a data structure that allows us to do this.A
We could scan all our data every time. An index exists so we don’t have to.
Its used in various places like,
- databases
- search engines
- vector stores
- compilers
What does indexing mean in different settings?
So overall with indexing we are trying to do search and retrieval but what that means changes in different settings, as we see below.
For traditional databases, when we do select * from employees where email = "rob@email.com", the point of indexing would be to get those rows of the DB with email as rob@email.com quickly without scanning all rows of the DB.
This is similar to the way books are stored in a library. Instead of looking through the entire library, you go directly to a specific section where the required book is placed.
On the other hand, vector indexes answer: “Which points are closest to this point in high-dimensional space?”
This is called Nearest Neighbor Search (NN) or Approximate Nearest Neighbor (ANN) in practice.
Vector index and vector embedding
Vector indexes are used to store and retrieve vector embeddings. Vector embeddings are a numeric vector associated with an image or text which captures its semantics or characteristics.
Since vectors are mathematical in nature, we can use that property to cluster and store similar vectors in the vector index. When we need to retrieve a vector, we figure out which cluster our search vector could be part of and search for it in that cluster rather than going through all the vectors.
Common indexing techniques
These indexing techinques are used to store similar vectors together. This makes it faster for lookup.
Clustering Based - Inverted File (IVF)
-
Cluster all the available vector embeddings into clusters using some clustering method such as K means clustering
-
For a new query vector, find which cluster this query vector is most similar to and search for other similar vectors in that cluster
-
IVFFLAT (IVF Flat)
- Still figure out clusters. But in each cluster save the vectors in flat format ie, simple list with no further heirarchy
- During search, do a brute force search across all vectors in the closest cluster
-
IVFPQ (IVF Product Quantization)
-
IVFSQ (IVF File System with Scalar Quantization)
Graph Based - Hierarchical Navigable Small World (HNSW) Algorithm
This is a good intro to the topic of vector indexing.