David J Jaenisch

Vector Embedding Technologies

Introduction

In the rapidly evolving landscape of Artificial Intelligence, the ability to understand and process vast amounts of unstructured data—like text, images, and documents—is paramount. Vector embedding technologies provide the fundamental bridge between complex, human-centric information and the numerical language of machines. By converting any data object into a high-dimensional vector (a series of numbers), we unlock the ability to perform powerful mathematical operations like similarity search, classification, and relationship mapping. This document provides an interactive journey through the core concepts, practical algorithms, and innovative applications of vector embeddings, from the foundational models that create them to the sophisticated techniques that retrieve and analyze them at scale.

Fundamental Concepts

Embeddings Model

An embeddings model takes arbitrary text, an image, or a document and converts it into a series of numbers, otherwise known as a vector embedding. Embeddings models are created as part of the training process for LLMs, which use them to convert inputs into numerical values that can be processed.

Vector Store

Databases can store vector embeddings as columns on data objects, which can then be searched.

Vector Embedding

A vector produced by a particular embeddings model. Vectors carry different embedded meaning depending on the model used to encode them; vector embeddings are therefore specific to the embeddings model that produced them and are not comparable across models.

LLM Summarization

As a pre-processing step, an object can be summarized by an LLM into text of a specific length and format. This dramatically improves the quality of KNN matches and can remove erroneous, unrelated information from the objects being embedded. LLM summarization can also be used to extend insufficient data, not just shrink it, but can introduce hallucinations if this is done improperly.

Non-Embedding Vectors

Many of these techniques work on dimensions that are generated without an embedding space as well. For example, a user could generate a vector store which contains the physical attributes of objects.

Targeted LLM Summarization

A pre-processing step that focuses the summary on specific aspects of the object being analyzed. You can target LLM summarization by specifying which aspects should be summarized. For example, if the user data were cooking blogs, the target could be either the blog information or the cooking recipe itself.

Notes on Speed of Fundamental Operations

Big O notation is useful for understanding problem scaling. However, creating vectors and LLM summarizations scales at best as O(N), and the high cost of each base operation makes even that slow. The value of relying on vector embeddings is that they let us perform simple mathematical operations instead of calling LLMs and embeddings models, which are slow by comparison. Vectors also have an advantage over the LLM context window and LLM fine-tuning in that they can hold thousands or millions of options for retrieval, without hallucination and with consistent effect.

Typical costs of the fundamental operations:

  • Creation of an LLM summarization: 5-60s
  • Creation of a vector: <5s
  • Comparison of two vectors: <1ms

Create a Vector Store

Purpose:

Used to store vectors so that algorithms can be performed on them. The most common algorithm is vector retrieval.

How to:

Convert the data you want to search into vector embeddings using an embeddings model. You can find the most powerful embeddings models available on the Embeddings Leaderboard. Examples from 2025 include Qwen3-Embedding-8B and gemini-embedding-001. This is a simple matter of wrapping one of these models in a Python function call, and it takes at most a few seconds per vector embedding. Many embeddings models can be run on conventional hardware, even your own personal laptop! After you have converted your data into vector embeddings, store each vector in a database alongside the object, or a reference to the data, it was produced from.
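As a minimal sketch of this step: the `embed` function below is a hypothetical stand-in for a real embeddings model (in practice you would call a model such as those named above), and the "vector store" is just a list of records; real systems use a database with a vector column.

```python
import math

def embed(text: str, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for a real embeddings model: hashes
    character trigrams into a fixed-size, L2-normalized vector."""
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Store each vector next to the object it was produced from.
documents = ["grilled salmon with lemon", "roasted chicken", "lemon tart"]
vector_store = [{"text": d, "vector": embed(d)} for d in documents]
```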

Simple Vector Retrieval

Purpose:

This retrieves a list of the most similar objects in the database to the one you are searching with, sorted by similarity. It can also be performed explicitly within a category or group.

How to:

Perform K-Nearest Neighbors (KNN) to find the objects in the database that are most similar to the vectorized version of the object you are retrieving. Similarity depends on how the embeddings model was trained. For example, a dimension of a vector could be trained on the number of times a repeating character occurs, or on the meaning of the words; usually, each dimension represents some general direction of semantic meaning. For example, if your database contains king, queen, and rook and you search for man, the results will be king, followed by queen, followed by rook.
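A minimal brute-force KNN over a small in-memory store, using cosine similarity; the 2-D "semantic" vectors for king, queen, and rook are invented purely for illustration:

```python
import numpy as np

def knn(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> list[int]:
    """Return the indices of the k vectors most similar to the query,
    by cosine similarity, most similar first."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    return list(np.argsort(-sims)[:k])

# Toy 2-D "semantic" space, purely for illustration.
names = ["king", "queen", "rook"]
vectors = np.array([[0.95, 0.30], [0.60, 0.80], [0.10, 0.99]])
man = np.array([1.0, 0.1])
ranked = [names[i] for i in knn(man, vectors)]  # most similar first
```

Real stores replace the exhaustive `argsort` with an approximate index (e.g. HNSW) once the database grows past a few hundred thousand entries.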

Similarity Graph Generation

Purpose:

Creates a graph database based on the relatedness of the given vectors. It can be used to find clusters or general similarity, or, if LLM summarization is used, similarity with a specific purpose in mind. For example, an embeddings vector database could be used to find the relatedness of the types and purposes of furniture, while a numerical vector database could be used to find which furniture is closest in size.

How to:

Find the distance between every pair of entries in the vector store. Add the objects to a graph database, with the pairs above a certain similarity threshold as edges. The user could also place a maximum on the number of edges per item.

The user can also cluster entries together and take the average of each cluster to create a graph of graphs, for example city transit networks which are connected to other cities through transit lines. DBSCAN is a good clustering algorithm to use in this case.
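A sketch of the edge-construction step, assuming cosine similarity and a plain adjacency-list graph (the vectors and threshold are illustrative; a real system would write the edges into a graph database):

```python
import numpy as np

def similarity_graph(vectors: np.ndarray, threshold: float) -> dict[int, set[int]]:
    """Build an adjacency list: an edge joins every pair of entries
    whose cosine similarity exceeds the threshold."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T                      # all pairwise cosine similarities
    n = len(vectors)
    graph: dict[int, set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
graph = similarity_graph(vectors, threshold=0.8)  # entries 0 and 1 are linked
```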

Matchmaker Retrieval

Purpose:

To match every member of one group with exactly one member of another group. This can also be used within the same group, provided that self-matching is excluded.

How To:

Start by making two separate vector stores containing the two groups that you want to match up. For example, you could make one database of main dishes and one database of sides. Next, produce a matrix of distances between each vector in one group and each vector in the other. Finally, run this cost matrix through the Hungarian Algorithm.
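A sketch of the matching step, assuming SciPy's `linear_sum_assignment` (an implementation of the Hungarian method) and toy distance values; in practice the matrix would be computed from the two vector stores:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

mains = ["steak", "salmon", "pasta"]
sides = ["fries", "salad", "garlic bread"]

# Toy pairwise distances between main-dish and side-dish vectors.
distance = np.array([
    [0.2, 0.9, 0.8],   # steak  vs fries, salad, garlic bread
    [0.7, 0.1, 0.9],   # salmon
    [0.8, 0.6, 0.3],   # pasta
])

# The Hungarian algorithm finds the one-to-one assignment
# that minimizes the total distance.
rows, cols = linear_sum_assignment(distance)
pairs = {mains[r]: sides[c] for r, c in zip(rows, cols)}
```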

Vector Retrieval of Groups

Purpose:

To match the query’s vector against pre-defined subgroups within the vector database, instead of single entries, and to compose those groups. A common application is category assignment: a category can be seeded with a few values and then extended over and over again, subject to a maximum distance requirement. The same approach can also be used to categorize the categories, creating groups of groups of groups.

How To:

This can be conducted using KNN, similarly to simple vector retrieval. However, the comparison against a group can be made in one of several ways.

  • The average of all members of the group can be used for the comparison, but members of the group that overlap heavily may then have an outsized influence on the result. This is useful for analyzing databases whose individual members can overlap heavily and where the complete dataset is present. For example, it would be very powerful when analyzing teams in an organization, but problematic if only a random sample of users is present.
  • Define the space by the extrema of the category in each dimension, then determine whether the query falls within that space. If it does not, find the nearest edge to it and retrieve that group.
  • There are a wide variety of techniques for finding the center of a group of points, all with their benefits and drawbacks.
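The first option above, comparing the query to each group's centroid, can be sketched as follows; the group names and vectors are invented for illustration:

```python
import numpy as np

def nearest_group(query: np.ndarray, groups: dict[str, np.ndarray]) -> str:
    """Compare the query against the centroid (mean vector) of each
    group and return the name of the most similar group."""
    best, best_sim = "", -np.inf
    q = query / np.linalg.norm(query)
    for name, members in groups.items():
        centroid = members.mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        sim = float(q @ centroid)
        if sim > best_sim:
            best, best_sim = name, sim
    return best

groups = {
    "chairs": np.array([[0.9, 0.1], [0.8, 0.2]]),
    "tables": np.array([[0.1, 0.9], [0.2, 0.8]]),
}
group = nearest_group(np.array([0.85, 0.15]), groups)  # "chairs"
```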

Notes on Category Assignment:

You can initially set up categories using DBSCAN or similar algorithms. You can also create categories incrementally by assigning a maximum distance to center: add each item to its nearest existing category, and if the item is outside the maximum distance to center, create a new category. This introduces order dependence, however. Also, you should not attempt to create new categories by determining what lies within the convex hull bounds of existing categories without expanding the edges, because that will create a new category for every new entry.
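The incremental maximum-distance scheme above can be sketched as follows (Euclidean distance and the threshold are illustrative choices; note that, as stated, the result depends on insertion order):

```python
import numpy as np

def assign(item: np.ndarray, categories: list[list[np.ndarray]], max_dist: float) -> None:
    """Add the item to the first category whose centroid is within
    max_dist; otherwise start a new category."""
    for members in categories:
        centroid = np.mean(members, axis=0)
        if np.linalg.norm(item - centroid) <= max_dist:
            members.append(item)
            return
    categories.append([item])

categories: list[list[np.ndarray]] = []
for point in [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
              np.array([5.0, 5.0]), np.array([5.1, 5.0])]:
    assign(point, categories, max_dist=1.0)
# Two categories form: one near the origin, one near (5, 5).
```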

Diverse Responses to Query

Purpose:

You have many similar entries in your vector store (for example, thousands of earrings and very few shirts, pants, etc.) but want a diversity of responses.

How To:

Conduct vector search using the Maximum Marginal Relevance (MMR) algorithm. MMR penalizes candidates that are similar to results already selected, trading off relevance to the query against diversity of the result set.
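A minimal MMR sketch: each pick maximizes `lam * sim(query, d) - (1 - lam) * max sim(d, already picked)`. The vectors and the weight `lam` are illustrative; here the low `lam` favors diversity, so the near-duplicate of the first result is skipped.

```python
import numpy as np

def mmr(query: np.ndarray, vectors: np.ndarray, k: int, lam: float = 0.3) -> list[int]:
    """Select k indices by Maximum Marginal Relevance."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    rel = v @ q                               # relevance to the query
    selected: list[int] = []
    while len(selected) < k:
        best, best_score = -1, -np.inf
        for i in range(len(v)):
            if i in selected:
                continue
            # Penalty: similarity to the closest already-selected result.
            penalty = max((float(v[i] @ v[j]) for j in selected), default=0.0)
            score = lam * rel[i] - (1 - lam) * penalty
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

query = np.array([1.0, 0.0])
vectors = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
picks = mmr(query, vectors, k=2)  # [0, 2]: the near-duplicate at index 1 is skipped
```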

Within Category & Region Queries

Within Category Query

Purpose:

Search within a specific category. Categories can also be used to validate results.

How to:

Pre-filter the candidate results by category, then do a simple vector retrieval over the survivors.
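A sketch of pre-filtering followed by retrieval, with invented furniture entries; a real vector database would apply the category filter as part of the query:

```python
import numpy as np

def within_category_knn(query: np.ndarray, entries: list[dict], category: str, k: int = 2) -> list[dict]:
    """Keep only entries in the requested category, then rank the
    survivors by cosine similarity to the query."""
    pool = [e for e in entries if e["category"] == category]

    q = query / np.linalg.norm(query)
    def cos(e: dict) -> float:
        return float(e["vector"] @ q / np.linalg.norm(e["vector"]))

    return sorted(pool, key=cos, reverse=True)[:k]

entries = [
    {"name": "oak desk",   "category": "tables", "vector": np.array([0.9, 0.1])},
    {"name": "pine table", "category": "tables", "vector": np.array([0.8, 0.3])},
    {"name": "armchair",   "category": "chairs", "vector": np.array([0.9, 0.1])},
]
top = within_category_knn(np.array([1.0, 0.0]), entries, "tables")
```

The armchair is excluded before ranking even though its vector is identical to the best match.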

Within Region Query

Purpose:

To limit entries in a search to ones within a specific space, rather than ones that have been categorized.

How to:

Based on a list of entries, define a convex hull containing all points, then expand it slightly in every direction. Alternatively, for small lists, you can expand radially from each position until each region touches another. You may need to remove outliers before trying either of these methods. You can also use a gravity model to define the relevant region, but this will be more constrained. Use the bounds of this new region as a filtering mechanism for queries: only return responses within the bounds.
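A sketch of the convex-hull variant, assuming SciPy; expanding the points away from their centroid by 5% is an illustrative choice, and the Delaunay triangulation doubles as a point-in-hull test:

```python
import numpy as np
from scipy.spatial import Delaunay

def region_filter(entries: np.ndarray, queries: np.ndarray, expand: float = 1.05) -> list[bool]:
    """Build the convex hull of the entries, expanded slightly about
    their centroid, and report which queries fall inside it."""
    centroid = entries.mean(axis=0)
    expanded = centroid + expand * (entries - centroid)
    hull = Delaunay(expanded)
    # find_simplex returns -1 for points outside every simplex of the hull.
    return (hull.find_simplex(queries) >= 0).tolist()

entries = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
queries = np.array([[0.5, 0.5], [2.0, 2.0]])
inside = region_filter(entries, queries)  # [True, False]
```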

Diverse within Fill Query

Purpose:

To retrieve a list of entries which are complementary, rather than simply a list of the nearest responses, and which importantly fit within specific bounds. Examples might include assembling a team, putting together an outfit, or building a machine from compatible parts. A user could then use existing categories, e.g. “garments that cover the groin” and “garments that cover the chest”, as a validation step at the end.

How to:

First, define a region as in Within Region Query. Then, conduct MMR search within that region.

Statement Dissection/Concept Extraction

Purpose:

Categorizing the sentences in a paragraph. Extracting concepts and verbs from a sentence. Building a concept graph from a document at scale, which links the concepts in sentences to categories and to their locations in the document. This can also link subject, object, and verb to create graphs of paragraphs, which then connect to other graphs in the document by word overlap or by paragraph-level similarity.

How to:

Take the paragraph or sentence and split it into words, phrases, or sentences. If working with words or phrases, drop irrelevant common words, and categorize the remaining common words with a dictionary. Test each unit against an existing category, or against type categories (addresses, people’s names, place names, etc.), or create new categories that can be placed into a graph and analyzed for rarer commonalities.
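A toy sketch of the dissection step, with an invented stopword list and type dictionary; a real pipeline would use an NLP library for tokenization and embedding-based tests for the category step:

```python
# Invented stopword list and type dictionary, for illustration only.
STOPWORDS = {"the", "a", "an", "on", "in", "at", "and", "of", "to", "is"}
TYPE_DICT = {"paris": "place name", "alice": "person name"}

def extract_concepts(sentence: str) -> list[tuple[str, str]]:
    """Drop common words, then bucket each survivor by a type lookup."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    concepts = []
    for w in words:
        if not w or w in STOPWORDS:
            continue
        concepts.append((w, TYPE_DICT.get(w, "uncategorized")))
    return concepts

concepts = extract_concepts("Alice opened a bakery in Paris.")
# [('alice', 'person name'), ('opened', 'uncategorized'),
#  ('bakery', 'uncategorized'), ('paris', 'place name')]
```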

Summary

Vector embeddings are a technology powered by applied mathematics. They provide a robust, efficient, and scalable framework for converting complex data into a structured format amenable to computation.

By leveraging simple geometric principles like distance and clustering in high-dimensional space, we replace expensive, non-deterministic LLM calls with predictable, high-speed calculations. This allows for the creation of systems that can search, recommend, and categorize millions of items with consistent effect and without hallucination, forming the operational backbone of modern, data-intensive AI applications.