
Tutorial 3: RAG (Retrieval-Augmented Generation)

Rishab Bahal
#langchain #python #LLMs #agents

YouTube video

GitHub Repository

This blog post will guide you through the process of implementing Retrieval-Augmented Generation (RAG) using the LangChain framework and Python. RAG is a powerful technique that enhances the capabilities of large language models by grounding them in external knowledge, which allows them to provide more accurate and contextually relevant responses. We will explore the key steps involved in building a RAG pipeline, drawing insights from a practical implementation in a Colab notebook [Excerpts from “LangChain_Section_3_RAG_record.ipynb - Colab.pdf”].

Environment Setup

Before diving into the implementation, we need to set up our environment by installing the necessary libraries. The following pip commands install LangChain, the OpenAI integrations, and the PDF processing library [1]:

pip install langchain langchain-openai langchain-community pypdf
# Additionally, we install the Chroma vector store library:
pip install -qU langchain-chroma

To interact with OpenAI models, you’ll need to set your OpenAI API key as an environment variable:

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o")

This sets up the OpenAI chat model that we will use for generation.
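
As a quick sanity check (a suggested test, not part of the notebook), you can send the model a trivial message to confirm that the API key is picked up:

# Optional sanity check: confirm the API key and model are working.
print(model.invoke("Say hello in one short sentence.").content)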

Ingestion

The first crucial step in RAG is to ingest your knowledge base. This involves loading your documents and preparing them for retrieval.

Loading Documents

We start by loading the PDF documents using PyPDFLoader from langchain_community.document_loaders. For example, to load the file jess301.pdf, we use the following code:

from langchain_community.document_loaders import PyPDFLoader
file_path = "jess301.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

The loaded documents are stored in the docs variable. PyPDFLoader returns one Document per page, so len(docs) corresponds to the number of pages in the PDF.
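
If you want to verify what was loaded (a quick inspection, not shown in the notebook), you can check the number of documents and peek at the first page:

# Each Document carries the page text plus metadata such as the source file and page number.
print(len(docs))                   # number of loaded documents (one per page)
print(docs[0].page_content[:200])  # first 200 characters of the first page
print(docs[0].metadata)            # e.g. {'source': 'jess301.pdf', 'page': 0}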

Splitting

Large documents need to be split into smaller chunks to facilitate efficient retrieval. We use RecursiveCharacterTextSplitter from langchain_text_splitters for this purpose:

from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
len(all_splits)

This code splits the loaded documents into smaller chunks of 1000 characters with an overlap of 200 characters. The add_start_index=True parameter keeps track of the starting index of each chunk in the original document, and the final len(all_splits) call reports how many chunks were produced.
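
A quick way to see the effect of add_start_index (an illustrative check, not from the notebook) is to inspect a single chunk and its metadata:

# The start_index field records where this chunk begins within the original page text.
example_chunk = all_splits[0]
print(example_chunk.metadata)            # includes 'start_index'
print(example_chunk.page_content[:200])  # preview of the chunk's text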

Embeddings

To enable semantic search, we need to convert our text chunks into numerical vector representations called embeddings. We use OpenAIEmbeddings from langchain_openai for this:

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Here, we initialize the embeddings model, specifying text-embedding-3-large as the model to use.
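
To get a feel for what an embedding looks like (an optional check, not shown in the notebook), you can embed a single query string; text-embedding-3-large produces 3072-dimensional vectors by default:

# Embed a single string and inspect the resulting vector.
vector = embeddings.embed_query("Who was Giuseppe Mazzini?")
print(len(vector))  # 3072 dimensions by default for text-embedding-3-large
print(vector[:5])   # first few components of the vector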

Storing in Vectorstore

Once we have the embeddings, we store them in a vector database. This allows us to efficiently find the most relevant chunks for a given query. The notebook uses Chroma as the vector store:

from langchain_chroma import Chroma
vector_store = Chroma(embedding_function=embeddings)
ids = vector_store.add_documents(documents=all_splits)

This code initializes a Chroma vector store using the OpenAI embeddings and adds the split documents along with their embeddings to the database.
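
The notebook keeps this index in memory, so it disappears when the session ends. If you want the index to survive restarts, Chroma can also write to disk; the collection name and directory below are illustrative values rather than ones from the notebook:

# Hypothetical persistent setup: the index is written to ./chroma_db on disk.
persistent_store = Chroma(
    collection_name="rag_tutorial",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
persistent_store.add_documents(documents=all_splits)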

Retrieval

When a user poses a query, we need to retrieve the most relevant documents from our vector store. LangChain provides the similarity_search_with_relevance_scores method for this:

query = "Who was Giuseppe Mazzini?"  # or query = "Tell me about Gutenberg press?"
results = vector_store.similarity_search_with_relevance_scores(
    query, k=6, score_threshold=0.4  # or k=7, score_threshold=0.2
)

This code performs a similarity search for the given query, retrieving the top k results based on their relevance scores. You can adjust the score_threshold to filter results based on their similarity to the query. The notebook demonstrates retrieval for queries like “Who was Giuseppe Mazzini?” and “Tell me about Gutenberg press?”. The output of the retrieval step shows the retrieved documents along with their metadata and relevance scores.
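
To examine what came back (a small inspection loop, not the notebook's exact code), you can iterate over the (document, score) pairs and print each chunk's relevance score and source metadata:

# Each result is a (Document, relevance_score) tuple.
for doc, score in results:
    print(f"score={score:.3f}  metadata={doc.metadata}")
    print(doc.page_content[:150], "...\n")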

Generation

Finally, we use a language model to generate an answer based on the retrieved context. This involves creating a prompt that includes the user’s query and the relevant context.

Prompt Engineering

We use PromptTemplate from langchain_core.prompts to create a structured prompt:

from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate(template="""Answer the given query: {query} \nbased on given context: {context}. Do not hallucinate or generate answers based on pretrained data. If you can't find the answer based on the context, simply say 'I don't know because I can't find any relevant context.'""")

This template instructs the language model to answer the query based solely on the provided context and to avoid making up information.
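
To see exactly what the model will receive, you can render the template with placeholder values (the placeholders below are purely illustrative):

# Render the template with dummy values to preview the final prompt text.
print(prompt_template.format(query="Who was Giuseppe Mazzini?", context="<retrieved chunks go here>"))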

Generating Response

We then combine the query and the retrieved context to create the final prompt and invoke the language model:

context = "\n----------------\n".join([doc.page_content for doc, _ in results])
meta_data = "\n----------------\n".join([str(doc.metadata) for doc, _ in results])
prompt = prompt_template.invoke({"query": query, "context": context})
response = model.invoke(prompt)
print(response.content)
print(meta_data)

The retrieved document content is joined to form the context, and the metadata of the retrieved documents is also prepared. The prompt is then passed to the language model (model), and the generated response is printed along with the references (metadata of the source documents). The notebook shows example responses for the “Giuseppe Mazzini” and “Gutenberg press” queries, demonstrating how the model uses the retrieved context to generate informative answers and cite the source.

Combined Ingestion, Retrieval, and Generation

The notebook also demonstrates a streamlined approach that combines the ingestion, retrieval, and generation steps into a cohesive pipeline. This simplifies the process and makes it easier to build RAG applications.
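
The exact cells are not reproduced here, but a minimal sketch of such a pipeline using the LangChain Expression Language (LCEL) could look like the following. It assumes the vector_store, prompt_template, and model defined above are already in scope; the chain structure is illustrative rather than the notebook's verbatim code:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Wrap the vector store as a retriever that returns the top 6 chunks.
retriever = vector_store.as_retriever(search_kwargs={"k": 6})

def format_docs(docs):
    # Join the retrieved chunks with a visible separator, as in the manual example above.
    return "\n----------------\n".join(doc.page_content for doc in docs)

# Retrieval fills the context, the raw question passes through as the query,
# then the prompt, model, and output parser run in sequence.
rag_chain = (
    {"context": retriever | format_docs, "query": RunnablePassthrough()}
    | prompt_template
    | model
    | StrOutputParser()
)

print(rag_chain.invoke("Who was Giuseppe Mazzini?"))

Invoking the chain with a plain question string runs retrieval, prompt construction, and generation in a single call.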

Conclusion

In this blog post, we’ve walked through the fundamental steps of building a Retrieval-Augmented Generation (RAG) system using LangChain and Python. By loading and splitting documents, generating embeddings, storing them in a vector store, retrieving relevant information for user queries, and finally using a language model to generate answers grounded in the retrieved context, we can create applications that leverage external knowledge to provide more accurate and insightful responses. The LangChain framework provides a robust set of tools and integrations that simplify the development of such RAG pipelines.
