Building a Vector Search Engine

📌 Searching Historical Figures using Qdrant and OpenAI Embeddings

This article walks through building a search engine that finds historical figures matching a free-text query. It combines Qdrant[1], a vector database, with OpenAI embeddings[2], so that queries are matched by meaning rather than by exact keywords.

How It Works

The description of each historical figure is represented as an embedding vector and stored in Qdrant. At query time, the user's text is converted into a vector with an OpenAI embedding model, and Qdrant returns the stored vectors that lie closest to the query vector.
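To make "closest" concrete, here is a minimal sketch of similarity ranking in plain Python. It is not how Qdrant is implemented; the three-dimensional vectors and the names `cosine_similarity` and `nearest` are made up for illustration. It uses cosine similarity, the same distance the collection is configured with in Step 6.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, stored, top_k=2):
    """Return the top_k (name, score) pairs ranked by similarity to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy 3-dimensional "embeddings"; real embedding vectors have many more dimensions.
stored = {
    "physicist": [0.9, 0.1, 0.0],
    "painter": [0.0, 0.2, 0.9],
    "chemist": [0.8, 0.3, 0.1],
}
ranking = nearest([1.0, 0.0, 0.0], stored)
print(ranking)
```

A vector database does the same kind of ranking, but with index structures that avoid comparing the query against every stored vector.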

Prerequisites

Before we start, make sure the following are in place:

A local Qdrant server instance running in a Docker container[3]. This will serve as our vector database.
A Python environment with the necessary libraries installed: openai, qdrant-client, and pandas.
A valid OpenAI API key, which is needed to generate embeddings for queries. Once all of these components are in place, we can begin working on our project[4].
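As a quick sanity check of the Python environment, the sketch below reports which of the required libraries cannot be found. The helper name `missing_packages` is hypothetical; note that the qdrant-client package is imported as `qdrant_client`.

```python
import importlib.util

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# Import names for the libraries listed above (qdrant-client is imported as qdrant_client).
required = ["openai", "qdrant_client", "pandas"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed")
```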

Step 1: Launching Qdrant Server

To launch the Qdrant server instance, run the following command in your terminal from the directory that contains the project's docker-compose.yml file:

docker-compose up -d

This starts the server in the background and maps port 6333 to your local machine. To verify that the server is up, run:

curl http://localhost:6333
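The same check can be done from Python with only the standard library. This is a minimal sketch; the function name `qdrant_is_up` is hypothetical, and it only tests that the HTTP endpoint answers with status 200, without inspecting the response body.

```python
import urllib.error
import urllib.request

def qdrant_is_up(base_url="http://localhost:6333", timeout=2.0):
    """Return True if an HTTP GET on base_url answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False

print("Qdrant reachable:", qdrant_is_up())
```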

Step 2: Installing Required Libraries

Run the following command in your terminal to install the required Python libraries:

pip install openai qdrant-client pandas

Step 3: Setting Up Your OpenAI API Key

To acquire an OpenAI API key, please visit this site. Once you have your key, make it available to your code as the OPENAI_API_KEY environment variable. One way to do this from Python:

import os

# Replace the placeholder string with your real key, and keep real keys out of source control.
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

To verify that the key is visible to your Python process, run the following code:

import os

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")
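In a longer script it can be more convenient to fail immediately when the key is missing rather than print a message. A minimal sketch of that idea follows; `require_api_key` is a hypothetical helper and `sk-example` is a dummy value, not a real key.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the tutorial code")
    return key

# Demonstration with an explicit mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_api_key()` with no arguments reads from the real process environment.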

Step 4: Connecting to Qdrant

To connect to the running Qdrant server instance, use the official Python client library[5]:

import qdrant_client

client = qdrant_client.QdrantClient(
    host="localhost",
    # gRPC is served on port 6334 by default, so that port must be reachable as well.
    prefer_grpc=True,
)

You can test the connection by calling any client method. For instance:

response = client.get_collections()
print(response)

On a server with no collections yet, this should output something like:

collections=[]

Step 5: Preparing Data

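We prepared a CSV file with information on some well-known historical figures. It contains their names and short descriptions together with precomputed embeddings. Each embedding is stored in the file as text, so it has to be parsed back into a list of numbers after loading.

import pandas as pd
from ast import literal_eval

df = pd.read_csv('./historical_figures_embeddings.csv')
# Each embedding is read as a string; literal_eval turns it back into a Python list.
df["embeddings"] = df.embeddings.apply(literal_eval)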
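The reason for the literal_eval step is that a CSV cell can only hold text. The sketch below shows this with the standard library alone; the two-row CSV and its column names are invented for the example and are not the tutorial's real data.

```python
import csv
import io
from ast import literal_eval

# A tiny stand-in for the CSV file; the rows and column names are illustrative only.
csv_text = 'name,embeddings\nfigure A,"[0.1, 0.2, 0.3]"\nfigure B,"[0.4, 0.5, 0.6]"\n'

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(type(rows[0]["embeddings"]))  # the raw cell is a str

# literal_eval safely parses the Python-literal text into a real list of floats.
vectors = [literal_eval(row["embeddings"]) for row in rows]
print(vectors[0], len(vectors[0]))
```

Unlike `eval`, `literal_eval` only accepts literals such as lists, numbers, and strings, so it does not execute arbitrary code found in the file.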

Step 6: Indexing Data

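Next, we create the HistoricalFigures collection in Qdrant and populate it with the precomputed embeddings of the historical figures and their attributes.

import qdrant_client
from qdrant_client.http import models as rest

# Note: this replaces the client from Step 4 with an in-process, in-memory instance.
# Remove this line to index into the Docker-hosted server instead.
client = qdrant_client.QdrantClient(":memory:")

# All vectors in the collection share the dimensionality of the first embedding.
vector_size = len(df["embeddings"][0])

# recreate_collection drops any existing collection with the same name first.
client.recreate_collection(
    collection_name="HistoricalFigures",
    vectors_config={
        "content": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

# Store one point per row: the embedding as the "content" vector, the row itself as payload.
client.upsert(
    collection_name="HistoricalFigures",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "content": v["embeddings"],
            },
            payload=v.to_dict(),
        )
        for k, v in df.iterrows()
    ],
)
# Check the collection size to make sure all the points have been stored
print(client.count(collection_name="HistoricalFigures"))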
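The upsert above passes all points in a single call, which is fine for a small dataset. For larger datasets, a common pattern is to send the points in batches. The following is a minimal sketch of the batching part only; `chunked` is a hypothetical helper, not part of qdrant-client, and the integers stand in for real points.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Stand-in for a list of points; with Qdrant, each batch would go to one upsert call.
point_ids = list(range(7))
for batch in chunked(point_ids, 3):
    print(batch)
```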

Step 7: Searching Historical Figures

To search, the query text is converted into an embedding, and Qdrant returns the closest matches in the HistoricalFigures collection.

from openai import OpenAI

# The client reads the key from the OPENAI_API_KEY environment variable.
openai_client = OpenAI()

def query_qdrant(query, collection_name, vector_name="content", top_k=10):
    # Embed the query text. This is the openai>=1.0 interface;
    # the older openai.Embedding.create call was removed in version 1.0.
    embedded_query = openai_client.embeddings.create(
        input=query,
        model="text-embedding-ada-002",
    ).data[0].embedding

    # Search the named vector for the top_k nearest points.
    results = client.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k,
    )

    return results

Step 8: Presenting Results

Finally, we display the results. Each result carries the stored payload and a similarity score, so we can print the name of each matching figure next to its score.

import json

results = query_qdrant("What were the major scientific discoveries of the 20th century?", "HistoricalFigures")
for i, content in enumerate(results):
    # The "properties" payload field holds a JSON string with the figure's attributes.
    json_obj = json.loads(content.payload["properties"])
    name = json_obj["person_name"]
    print(f"{i + 1}. {name} (Score: {round(content.score, 3)})")

1. Albert Einstein (Score: 0.827)
2. Isaac Newton (Score: 0.821)
3. James Watt (Score: 0.819)
4. Wilhelm Conrad Röntgen (Score: 0.817)
5. Charles Robert Darwin (Score: 0.816)
6. Yukawa Hideki (Score: 0.815)
7. Roger Bacon (Score: 0.815)
8. Isidore Auguste Marie François Xavier Comte (Score: 0.814)
9. Antoine-Laurent de Lavoisier (Score: 0.813)
10. Louis Pasteur (Score: 0.812)
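The loop in this step assumes that every payload has a "properties" field containing valid JSON with a "person_name" key. If some records might not, a more defensive lookup can help. This is a minimal sketch; `extract_name` is a hypothetical helper and the payloads below are hand-written to mirror that shape.

```python
import json

def extract_name(payload, default="(unknown)"):
    """Pull person_name out of the JSON string stored under payload["properties"]."""
    try:
        properties = json.loads(payload["properties"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return default
    if not isinstance(properties, dict):
        return default
    return properties.get("person_name", default)

# Hand-written payloads shaped like the ones read in the loop above.
good = {"properties": json.dumps({"person_name": "Albert Einstein"})}
bad = {"properties": "not valid json"}
print(extract_name(good))
print(extract_name(bad))
```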

Conclusion

By combining Qdrant's vector search with OpenAI's text embeddings, we have built a search engine that lets users quickly find the historical figures they want to know about.

From a developer's perspective, we found Qdrant easy to use, with good performance and extensive documentation. The environment was easy to set up with Docker, so building the search engine went smoothly. If you are considering a vector database or building a document search engine, Qdrant is worth trying.

References
