Embeddings and Vector Stores / Databases
Embeddings are essentially a cheat sheet that helps machines understand the meaning of text, images, and other kinds of data. They store semantic information
- Low-dimensional numerical representations of almost anything, designed to capture the underlying meaning of and relationships between different pieces of data
Why are Embeddings important?
- Efficiency: we can represent almost any kind of data as an embedding
- Capture how similar or different two pieces of data are
- Lower dimensionality → efficient large-scale processing
Applications of Embeddings:
Retrieval and Recommendations: Google Search: first, embeddings are pre-computed for billions of web pages. When we issue a search, the query gets converted into an embedding too. Finally, we find the web pages whose embeddings are the nearest neighbors of the query embedding in the vector space. The closer two embeddings are in the vector space, the more semantically similar they are
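A minimal sketch of this nearest-neighbor lookup with NumPy and cosine similarity (the random vectors below are stand-ins for real pre-computed embeddings):

```python
import numpy as np

# Stand-ins for pre-computed page embeddings (one row per page) and a query.
page_embeddings = np.random.rand(1000, 64).astype("float32")
query_embedding = np.random.rand(64).astype("float32")

# Normalize so the dot product equals cosine similarity.
pages = page_embeddings / np.linalg.norm(page_embeddings, axis=1, keepdims=True)
query = query_embedding / np.linalg.norm(query_embedding)

# Nearest neighbors = pages with the highest cosine similarity to the query.
scores = pages @ query
print("top 5 pages:", np.argsort(-scores)[:5])
```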
Joint Embeddings: These are all about handling multimodal data. Say we want to search for a video based on a text description: joint embeddings map both the text and the images into the same embedding space, making it possible to compare across different types of data
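One way to experiment with this is a CLIP-style model; the sketch below uses the sentence-transformers library (the model name and the ability of encode() to accept PIL images are assumptions about the installed library version, and dog.jpg is a placeholder path):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style model that maps images and text into one shared embedding space.
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode(Image.open("dog.jpg"))      # placeholder image path
text_emb = model.encode("a photo of a dog playing")  # free-text description

# Because both embeddings live in the same space, we can compare them directly.
print("image-text similarity:", util.cos_sim(image_emb, text_emb))
```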
Effectiveness of Embeddings
- Precision: the proportion of retrieved items that are actually relevant
- Recall: the proportion of all relevant items that we actually manage to retrieve
- Normalized Discounted Cumulative Gain (NDCG): gives higher scores when the most relevant items sit at the very top of the results list
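A quick sketch of all three metrics for a single query, assuming we know the ground-truth set of relevant item IDs:

```python
import numpy as np

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved items that are actually relevant.
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant items that appear in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    # DCG rewards relevant items more when they appear near the top;
    # we normalize by the best possible (ideal) DCG.
    gains = [1.0 if item in relevant else 0.0 for item in retrieved[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # ranked search results
relevant = {"d1", "d2", "d4"}                # ground-truth relevant items
print(precision_at_k(retrieved, relevant, 5),
      recall_at_k(retrieved, relevant, 5),
      ndcg_at_k(retrieved, relevant, 5))
```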
- Practical considerations for choosing an embedding model:
- Model Size
- Embedding Dimensionality
- Latency
- Cost
Retrieval Augmented Generation (RAG):
The idea is to use embeddings to find relevant information in a knowledge base and then use that information to augment the prompts we give to the language model. This helps the language model provide more accurate and relevant responses
This involves a two-step process
Step 1: Indexing. We break our documents into chunks, generate an embedding for each chunk using a document encoder, and store those embeddings in a vector database
Step 2: Query processing. When a user asks a question, we turn the question into an embedding using a query encoder, then perform a similarity search in the vector database to find the chunks whose embeddings are closest to the query embedding
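A minimal end-to-end sketch of both steps, with toy in-memory storage and a hypothetical embed() function standing in for the document and query encoders:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real document/query encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.random(64)
    return v / np.linalg.norm(v)

# Step 1: indexing — chunk the documents, embed each chunk, store the vectors.
chunks = [
    "Embeddings map data to vectors.",
    "Vector databases index embeddings for fast similarity search.",
    "RAG augments prompts with retrieved context.",
]
index = np.stack([embed(c) for c in chunks])

# Step 2: query processing — embed the question, retrieve the closest chunks,
# and prepend them to the prompt handed to the language model.
question = "How does RAG improve LLM answers?"
scores = index @ embed(question)
context = [chunks[i] for i in np.argsort(-scores)[:2]]
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```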
Types of Embeddings
Text Embeddings: help us represent words, sentences, paragraphs, and even entire documents
- Tokenization: breaking a text string into smaller meaningful units called tokens. Each token in the vocabulary gets a unique numerical ID, which can be represented as a one-hot vector (one-hot encoding)
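A toy illustration of tokenization into IDs and the corresponding one-hot vectors (a real tokenizer would use subword units rather than whitespace splitting):

```python
import numpy as np

text = "embeddings capture meaning"
tokens = text.split()                      # naive whitespace tokenization
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]           # token -> unique numerical ID

# One-hot encoding: each ID becomes a vector with a single 1 at that index.
one_hot = np.eye(len(vocab))[ids]
print(vocab, ids)
print(one_hot)
```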
Word Embeddings: dense, fixed-length vectors that aim to capture the meaning of individual words and how words relate to each other semantically
Word2Vec technique:
Uses CBOW (predict the target word given the surrounding context) or Skip-gram (predict the surrounding context words given a single target word); see the sketch below
Limitation: captures only local relationships between words within a small window of context
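A small sketch with the gensim library (assuming gensim 4.x, where the vector size parameter is named vector_size; sg=1 selects Skip-gram and sg=0 selects CBOW):

```python
from gensim.models import Word2Vec

sentences = [
    ["embeddings", "capture", "meaning"],
    ["vectors", "capture", "meaning"],
    ["embeddings", "are", "vectors"],
]

# window controls the small local context referred to above; sg=1 -> Skip-gram.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1)
print(model.wv["embeddings"])               # the learned word vector
print(model.wv.most_similar("embeddings"))  # nearest words in the space
```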
GloVe technique: tries to capture more global information about how words co-occur. It builds a co-occurrence matrix that counts how often each word appears in the context of every other word across the entire dataset
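A toy sketch of the co-occurrence counting step GloVe starts from (the GloVe objective then fits word vectors whose dot products approximate the logs of these counts):

```python
import numpy as np

corpus = [["embeddings", "capture", "meaning"],
          ["embeddings", "are", "vectors"]]
window = 2

vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s}))}
counts = np.zeros((len(vocab), len(vocab)))

# Count how often each word appears within `window` positions of another word.
for sentence in corpus:
    for i, w in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                counts[vocab[w], vocab[sentence[j]]] += 1
print(vocab)
print(counts)
```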
Document Embeddings: Representing the meaning of larger chunks of text
- LSA (Latent Semantic Analysis): builds a term-document matrix of word counts and applies dimensionality reduction (truncated SVD)
- LDA (Latent Dirichlet Allocation): a probabilistic topic model that represents each document as a mixture of topics
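A compact LSA sketch with scikit-learn: a TF-IDF term-document matrix followed by truncated SVD gives each document a dense embedding:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "embeddings capture semantic meaning",
    "vector databases store embeddings",
    "databases answer similarity queries",
]

# Term-document matrix (TF-IDF-weighted word counts)...
tfidf = TfidfVectorizer().fit_transform(docs)
# ...reduced to a low-dimensional "latent semantic" space, one vector per doc.
doc_embeddings = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(doc_embeddings)
```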
Image Embeddings: obtained by training CNNs or Vision Transformers on large image datasets, then using the activations of a late layer as the embedding
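A common shortcut is to reuse a pretrained CNN and take its penultimate-layer activations; a sketch with torchvision (the weights enum assumes torchvision 0.13+, and the random tensor stands in for a preprocessed image):

```python
import torch
from torchvision import models

# Pretrained ResNet with its classification head replaced by the identity,
# so the forward pass returns the 512-dim penultimate-layer features.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

# A random tensor stands in for a preprocessed 224x224 RGB image.
image_batch = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(image_batch)
print(embedding.shape)  # torch.Size([1, 512])
```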
Multimodal Embeddings
Structured Data Embeddings
- For a table, we can use dimensionality reduction techniques like PCA to create embeddings for each row
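For instance, with scikit-learn's PCA each numeric row of a toy table can be compressed into a short embedding vector:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy table: 100 rows with 20 numeric columns each.
rows = np.random.rand(100, 20)

# Each row is compressed into a 4-dimensional embedding.
row_embeddings = PCA(n_components=4).fit_transform(rows)
print(row_embeddings.shape)  # (100, 4)
```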
Graph Embeddings
Training of Embedding models
- Many modern embedding models use a dual encoder architecture (one encoder for the query and another for the documents, images, or other data we're working with)
- They are trained with a contrastive loss (pulling embeddings of similar pairs closer together and pushing dissimilar pairs apart); see the sketch after this list
- Pre-training: training the model on a massive dataset to learn general representations
- Fine-tuning: training on a smaller, task-specific dataset for our particular application
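A minimal sketch of a dual encoder trained with an in-batch contrastive (InfoNCE-style) loss in PyTorch; the two linear layers are placeholders for real query and document towers:

```python
import torch
import torch.nn.functional as F

query_encoder = torch.nn.Linear(128, 64)  # placeholder query "tower"
doc_encoder = torch.nn.Linear(128, 64)    # placeholder document "tower"

queries = torch.rand(8, 128)              # a batch of 8 matching query/doc pairs
docs = torch.rand(8, 128)

q = F.normalize(query_encoder(queries), dim=-1)
d = F.normalize(doc_encoder(docs), dim=-1)

# Similarity of every query vs. every doc in the batch: the diagonal holds the
# matching pairs, the off-diagonals act as negatives. Cross-entropy then pulls
# matching embeddings together and pushes the rest apart.
logits = q @ d.T / 0.07                   # 0.07 is the temperature
loss = F.cross_entropy(logits, torch.arange(8))
loss.backward()
print(loss.item())
```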
Vector Search
Searching through our embeddings efficiently at scale
- Finding items based on their meaning
- We first compute embeddings for all our data, store them in a vector database, then embed the query into the same space and find the data items whose embeddings are closest to the query embedding
- Can use Approximate Nearest Neighbor (ANN) search to speed up the search for close matches
- Locality Sensitive Hashing: Uses hash functions to map similar items to the same bucket, so we only have to search within those buckets
- Tree-based methods: recursively partition the search space (e.g., KD-trees)
- HNSW (Hierarchical Navigable Small World): greedy search over a layered proximity graph
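A sketch contrasting exact and approximate search with the FAISS library (assuming faiss-cpu is installed; IndexHNSWFlat implements the HNSW technique above):

```python
import numpy as np
import faiss

d = 64
data = np.random.rand(10_000, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

# Exact (brute-force) search: compares the query against every stored vector.
flat = faiss.IndexFlatL2(d)
flat.add(data)
_, exact_ids = flat.search(query, 5)

# Approximate search over an HNSW graph: far fewer comparisons at scale.
hnsw = faiss.IndexHNSWFlat(d, 32)         # 32 = graph connectivity (M)
hnsw.add(data)
_, ann_ids = hnsw.search(query, 5)

print(exact_ids, ann_ids)
```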
Vector Databases: traditional databases aren't built for this kind of high-dimensional data or similarity-based queries. Vector databases are purpose-built for storing, indexing, and searching embeddings
Applications
- Information Retrieval
- Recommendation Systems
- Semantic Text Similarity
- Classification
- Clustering
- RAG