Vector Databases: the Digestive System of LLM Apps

Notes: For more articles on Generative AI, LLMs, RAG etc, one can visit the “Generative AI Series.”
For my full library of such articles on Medium, one can check: <GENERATIVE AI READLIST>

A vector database is designed to store, index and retrieve data points with multiple dimensions, i.e., vectors. Unlike traditional databases that handle data organized in tables, vector databases manage data represented in a multi-dimensional vector space. They use specialized indexing and search algorithms to run similarity searches quickly.

Vector database and traditional database

Traditional databases are structured to handle discrete, scalar data types like numbers and strings. Data is organized in rows and columns. This structure is ideal for transactional data but less efficient for the complex, high-dimensional data typically used in AI.

Vector databases are designed to store and manage vector data in such a way that they become optimized for tasks involving similarity search — which is often an implicitly assumed requirement in AI applications. For this, they use indexing and search algorithms optimized for high-dimensional vector spaces.

Let’s explore the ‘how’ in a little detail.

Rough working of a Vector Database

When a user query is initiated, the vector database performs operations (like similarity searches) to find and retrieve the vectors most similar to the query. This entire process enables the rapid and accurate management of vast and varied data types in applications that require high-speed search and retrieval functions.

Vectors to Embedding Models

The key goal is to find vectors that are similar to the vectorized query.

Now, with traditional keyword search, we run into limitations, mainly because of these issues:

It is often not enough to simply search our data for keywords. We need a way to map the meaning behind words and sentences to find content that’s related to the question.
We also need to make sure that this search is done within milliseconds, not seconds or minutes. So we need a step that allows us to search the Vector Collection as efficiently as possible.

So, we go for Embedding models.

Vector embeddings

Vector embeddings are numerical codes that encapsulate the key characteristics (features) of objects.

Take songs in a music streaming app as an example. By analyzing and extracting crucial features (like tempo and genre), each song is converted into a vector embedding through an embedding model. This process ensures that songs with similar attributes have similar vector codes.

A vector database stores these embeddings and, upon a query, compares these vectors to find and recommend songs with the closest matching features.
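To make the song example concrete, here is a minimal sketch. The three-number "embeddings" and the song names are made up by hand for illustration; a real embedding model would produce far longer vectors. Cosine similarity is just one common way to compare them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made embeddings: [tempo, energy, acousticness], scaled 0..1.
songs = {
    "fast_rock": [0.9, 0.8, 0.1],
    "upbeat_pop": [0.8, 0.7, 0.2],
    "slow_folk": [0.2, 0.2, 0.9],
}

def most_similar(query_name):
    """Return the name of the other song whose vector is closest to the query song."""
    query = songs[query_name]
    scored = [
        (cosine_similarity(query, vector), name)
        for name, vector in songs.items()
        if name != query_name
    ]
    return max(scored)[1]

recommendation = most_similar("fast_rock")
```

With these toy numbers, the rock track is matched with the pop track rather than the folk track, because their feature vectors point in almost the same direction.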

Embedding Models

Embedding models turn words and sentences into vectors with hundreds or thousands of dimensions. Most up-to-date embedding models are highly capable of understanding the semantics/meaning behind words and translating them into vectors.

Pre-trained embedding models from OpenAI, Google, Meta AI, or the open source community help us do this. They learn from a huge corpus of text how words are normally used and in what contexts. They use this extracted knowledge to map words into a multi-dimensional vector space. The location of a new data point in the vector space tells us which words it is related to.

Once we have a collection of vectors, we want to compare them to each other and quantify the similarities between them. Usually we are interested in the k-nearest neighbors, that is, the data points in our vector space that are closest to our query vector. Distance serves as the measure of similarity, but calculating it between the query and every stored vector requires huge computing power once the collection grows large.
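A rough sketch of the naive approach shows where the cost comes from. The function names and the tiny 2-dimensional vectors below are illustrative assumptions; the point is only that every query touches every stored vector.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_brute_force(query, vectors, k):
    """Return the indices of the k stored vectors closest to the query.

    One distance is computed per stored vector, so the work grows
    linearly with the size of the collection.
    """
    distances = [(euclidean(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return [i for _, i in distances[:k]]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [5.0, 5.0]]
nearest = knn_brute_force([0.0, 0.1], vectors, k=2)
```

For four vectors this is instant. For millions of vectors with hundreds of dimensions each, repeating this on every query becomes the bottleneck.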

This is where vector databases come into play. They incorporate several techniques that allow us to efficiently store and search our collection of text content.

There is another reason.

LLMs can never know everything. An LLM only sees a frozen version of the world, which depends on the training set it was trained on. Things LLMs don’t know out of the box:

Data that is too new: articles about current events, recent innovations, and any other content created after the LLM’s training set was collected.
Data that is not public: personal data, internal company data, secret data, etc.

So we need to feed our model additional data, information that the LLM cannot possibly know by itself. All of that needs to happen at the runtime of our application, so we must have a process in place that decides as quickly as possible which additional data to feed our model.

We cannot give the model all the data we have

The short reason is that models have a limit: a token limit.

If we don’t want to train or fine-tune the model, we have no choice but to give the model all the necessary information within the prompt. But at the same time, we have to respect the token limits of models.

LLMs have a token limit for practical and technical reasons. At the time of writing, the models from OpenAI have a token limit of about 4,000–32,000 tokens, while the open source LLM LLaMA has 2,048 tokens (if not fine-tuned).

We can increase the maximum number of tokens by fine-tuning, but more data is not always better. 32,000 token limits allow us to pack even large texts into a prompt at once. Whether this makes sense is another matter.

The quality of data is more important than the sheer amount of data; irrelevant data can have a negative impact on the result.
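One way to respect a token limit while favoring quality is to keep only the most relevant snippets that fit. The sketch below is an assumption-laden illustration: the relevance scores are invented, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def rough_token_count(text):
    # Assumption: words stand in for tokens; real tokenizers count differently.
    return len(text.split())

def pack_context(scored_chunks, token_budget):
    """Greedily keep the highest-scoring chunks that still fit in the budget."""
    selected, used = [], 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, chunk in ranked:
        cost = rough_token_count(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected, used

# Hypothetical (relevance score, text) pairs, e.g. from a similarity search.
scored_chunks = [
    (0.91, "Vector databases index embeddings for fast similarity search."),
    (0.35, "The office cafeteria menu changes every Monday."),
    (0.88, "Approximate nearest neighbor search trades accuracy for speed."),
]
context, used_tokens = pack_context(scored_chunks, token_budget=16)
```

Here the two relevant snippets fill the budget and the unrelated one is left out, which is the behavior we want from the retrieval step.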

Even reorganizing the information within the prompt can make a big difference in how accurately LLMs understand the task. Researchers at Stanford University have found that when important information is placed at the beginning or end of the prompt, the answer is usually more accurate. If the same information is located in the middle of the prompt, accuracy can decrease significantly. It’s important to give careful consideration to what data we are providing our model with and how we structure our prompt.
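One possible heuristic that follows from this observation is to arrange retrieved snippets so that the strongest ones sit at the edges of the prompt. This is an illustrative sketch, not something the study itself prescribes; the function name and the letter placeholders are assumptions.

```python
def order_for_prompt(ranked_chunks):
    """Reorder chunks that arrive sorted most-relevant first.

    Chunks are dealt alternately to the front and the back, so the
    weakest ones end up in the middle of the prompt.
    """
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

# "A" is the most relevant snippet, "E" the least relevant.
ordered = order_for_prompt(["A", "B", "C", "D", "E"])
```

The two best snippets land at the start and the end of the list, and the least relevant one lands in the middle.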

In sum, the quality of the input data is key to the success of our LLM application, so it is essential we implement a process that accurately identifies the relevant content and avoids adding too much unnecessary data. To ensure this, we must use effective search processes to highlight the most relevant information.

Effective Search with Vector Databases

Vector databases are made up of several parts that help us quickly find what we’re looking for. Indexing is the most important part and is done just once, when we insert new data into our dataset. After that, searches are much faster, saving us time and effort.

Approximate Nearest Neighbor algorithms are used to find the closest neighbors, even though they may not always be the exact closest. This trade-off of accuracy for speed is usually acceptable for LLM applications, since speed is more important and often the same information is found in multiple text snippets anyway.

Inverted File Index (IVF) is a popular method for approximate similarity search. The name comes from text search, where an inverted index stores content and connects it to its position in a table or document. For vectors, we divide the whole collection of items into partitions, each represented by its centroid. Each item can only be part of one partition at a time. When we search for similarities, we use the partition centroids to quickly narrow down where the items we are looking for can be.

If we’re looking for nearby points, we usually just search the partition whose centroid is closest to our query point. But if there are close points just across the edge in a neighboring partition, we may miss them.

To reduce this issue, we search multiple partitions instead of just one. However, the underlying problem still remains: we lose some accuracy. That is usually OK, because speed is more important.
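The partition idea and the edge problem can be sketched in a few lines. This is a simplified illustration, not how any particular library implements it: the centroids are assumed to be given (normally they are learned, for example with k-means), and `nprobe` is the name used here for the number of partitions searched.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector index to the partition of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest_c = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        partitions[nearest_c].append(i)
    return partitions

def ivf_search(query, vectors, centroids, partitions, k, nprobe):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    by_centroid = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in by_centroid[:nprobe] for i in partitions[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]

centroids = [[0.0, 0.0], [10.0, 0.0]]  # assumed given for this sketch
vectors = [[1.0, 0.0], [-3.0, 0.0], [5.5, 0.0], [9.0, 0.0]]
partitions = build_ivf(vectors, centroids)

query = [4.5, 0.0]
one_partition = ivf_search(query, vectors, centroids, partitions, k=1, nprobe=1)
two_partitions = ivf_search(query, vectors, centroids, partitions, k=1, nprobe=2)
```

The query sits slightly closer to the first centroid, but its true nearest neighbor (index 2) lives just across the border in the second partition. Searching one partition returns index 0; searching two partitions finds index 2, at the price of scanning more vectors.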

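Several approximate nearest neighbor (ANN) algorithms are in common use, including HNSW, IVFADC, and IVFPQ. Chroma, for example, builds its index with HNSW, while the IVF-based variants are available in libraries such as FAISS.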

Hierarchical Navigable Small World (HNSW): HNSW is an algorithm that creates a hierarchical graph structure to quickly search high-dimensional vectors. The graph is stored alongside the vectors, so it takes some additional memory.
Inverted File with Product Quantization (IVFPQ): IVFPQ uses product quantization to compress vectors before indexing, which keeps memory usage low and lets the search handle massive datasets at the cost of some accuracy.
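The compression step behind product quantization can be sketched as follows. This is a toy illustration under stated assumptions: the codebooks are hand-picked here, whereas in practice they are learned from the data, and the function names are made up for this example.

```python
def nearest_code(subvector, codebook):
    """Index of the codebook entry closest to the subvector (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((x - y) ** 2 for x, y in zip(subvector, codebook[j])),
    )

def pq_encode(vector, codebooks):
    """Split the vector into equal parts and replace each part by a small code."""
    size = len(vector) // len(codebooks)
    return [
        nearest_code(vector[s * size:(s + 1) * size], codebooks[s])
        for s in range(len(codebooks))
    ]

def pq_decode(codes, codebooks):
    """Rebuild an approximation of the original vector from its codes."""
    approx = []
    for s, code in enumerate(codes):
        approx.extend(codebooks[s][code])
    return approx

# Two hand-picked codebooks, one per half of a 4-dimensional vector.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.8, 0.1], codebooks)
approx = pq_decode(codes, codebooks)
```

Four floating-point numbers are stored as two small integers, and the decoded vector is close to the original but not identical. That gap is the accuracy we give up for the smaller footprint.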

The Indexing Step

Anyway, I think all we need to know is that through the indexing step, we store our embeddings in a form that allows us to quickly find “similar” vectors without having to calculate the distance to all the data points each time. By doing that, we trade some accuracy for speed.

The accuracy of the task should still be good enough for most of what we try to do with it. Translating language into embeddings is not an exact science anyway. The same word can have different meanings depending on the context or region of the world in which we use it. Therefore, it is usually okay if we lose a few accuracy points, but what is much more important is the speed of the response.
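If we want to put a number on how much accuracy is lost, a common measure is recall: the share of the true nearest neighbors that the approximate search also returned. A minimal sketch, with made-up result IDs:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact nearest neighbors found by the approximate search."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical result lists for one query, both of length k = 4.
score = recall_at_k(approx_ids=[4, 7, 9, 12], exact_ids=[4, 7, 8, 12])
```

In this example three of the four true neighbors were found, so the recall is 0.75.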

Process flow of a chatbot using LLM models

Our vector store takes care of tokenizing, embedding and indexing the data when it’s loaded. Once the data is in the store, we can query it with new data points.
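The load-then-query flow can be mimicked with a toy in-memory store. Everything here is a stand-in: `toy_embed` counts words from a tiny fixed vocabulary instead of calling a real embedding model, and the "index" is a plain list that is scanned in full.

```python
import math

VOCAB = ["vector", "database", "search", "token", "prompt", "limit"]  # toy vocabulary

def toy_embed(text):
    """Stand-in for an embedding model: tokenize on whitespace, count vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    def __init__(self):
        self.texts, self.vectors = [], []

    def add(self, text):
        """Embed the text and store it next to its vector."""
        self.texts.append(text)
        self.vectors.append(toy_embed(text))

    def query(self, question, k=1):
        """Embed the question and return the k most similar stored texts."""
        q = toy_embed(question)
        ranked = sorted(
            range(len(self.texts)),
            key=lambda i: cosine(q, self.vectors[i]),
            reverse=True,
        )
        return [self.texts[i] for i in ranked[:k]]

store = ToyVectorStore()
store.add("a vector database supports similarity search")
store.add("every prompt must respect the token limit")
hits = store.query("what is the token limit of a prompt")
```

In a chatbot, the returned text would then be placed into the prompt as context for the LLM.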

Types of vector stores

We end this article with a brief overview of vector stores.

Vector databases come in different shapes. We distinguish between:

Pure vector databases
Extended capabilities in SQL, NoSQL or text search databases
Simple vector libraries

Text search databases can search through large amounts of text for specific words or phrases. Recently, some of these databases have begun to use vector search to further improve their ability to find what you’re looking for. Elasticsearch uses both traditional search and vector search to create a ‘Hybrid Scoring’ system, giving us the best possible search results.

Vector search is also gradually being adopted by more and more SQL and NoSQL databases such as Redis, MongoDB or Postgres. Pgvector, for example, is an open source vector similarity search extension for Postgres. It supports:

exact and approximate nearest neighbor search
L2 distance, inner product, and cosine distance
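For reference, the three measures named above can be written out in plain Python. These are the textbook definitions, shown only to illustrate what is being compared; they are not pgvector’s implementation.

```python
import math

def l2_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine similarity; ignores vector length."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norm

a, b = [3.0, 4.0], [4.0, 3.0]
results = (l2_distance(a, b), inner_product(a, b), cosine_distance(a, b))
```

For these two vectors the L2 distance is the square root of 2, the inner product is 24, and the cosine distance is 1 - 24/25 = 0.04.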

For smaller projects, vector libraries are a great option and provide most of the features needed.

FAISS is a library for efficiently searching and clustering dense vectors, and can handle vector sets of any size, even those that don’t fit in memory. It’s written in C++ and comes with a Python wrapper, making it easy for data scientists to integrate into their code.

You can check a project using FAISS by me here: https://github.com/the-ogre/LLM-QAwithQuantizedMistral7bLangchainRAGKidneyDisease

Chroma, Pinecone and Weaviate, on the other hand, are pure vector databases that can store vector data and be searched like any other database.

You can check a project using Chroma by me here: https://github.com/the-ogre/LLM-ChatbotONOnThusSpakeZarathustra

StatusNeo