Qdrant Console

Introduction

This year, I have been working on an interesting new generative AI Proof of Concept (POC).

The purpose of this POC is to validate that we can use historical data and ground truth documents stored in Qdrant to generate similar structured data for new documents for one of our customers. The customer needs to automate a manual comment review process that has previously consumed a lot of subject-matter experts’ time.

This will be a two-part blog post series:

  1. Part 1 (this post): Ingestion phase - how we prepare data and load it into Qdrant.
  2. Part 2: How we use Qdrant collections together with an LLM to generate new data for incoming content.

In this first part, I focus on practical setup and ingestion steps that run on a developer machine.

What is Qdrant Vector Database?

A vector database stores data as high-dimensional numeric vectors (embeddings) instead of only plain text fields or relational rows. In practice, text is first converted into vectors by an embedding model, and then those vectors are indexed for fast similarity search. In this POC, I use the Qdrant vector database, running as a Docker container.

In this POC, each comment or document chunk is stored as a vector plus a metadata payload (for example: category, source file, section index). This allows the system to retrieve semantically similar historical items and relevant ground truth snippets, even when the wording is different.

Vector databases are commonly used in RAG systems, semantic search, recommendation engines, and duplicate/near-duplicate detection.

For this POC, Qdrant has worked well because it is easy to run locally, has a clear API, and supports metadata filtering together with vector search. The trade-offs are that collection design and chunking strategy require careful tuning, and running everything locally means resource limits can appear when datasets or query volume grow.

Source Material

The input data in this POC comes from two types of sources:

  • Historical feedback data (for example, comments collected earlier in spreadsheet format).
  • Ground truth documents (authoritative reference documents, later converted to markdown).

The historical feedback acts as the “memory” of previous decisions. The ground truth documents act as the “source of truth” for current recommendations and constraints.

Comment Processing POC Overview

At a high level, the pipeline has two goals:

  1. Comment categorization: classify incoming comments into predefined categories.
  2. Automated response/change generation: suggest structured outputs for those comments.

In this phase, we are not yet generating final outputs with the LLM. We are building the retrieval foundation first: high-quality vector collections in Qdrant.

Setup

Create Python Virtual Environment

I use Python 3.13 in this POC.

In practice, this means creating an isolated Python environment for the project, activating it, upgrading package tooling, and installing the project dependencies from the requirements file.

Main libraries include:

  • qdrant-client
  • boto3
  • pandas
  • openpyxl
  • fastembed
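The setup steps above can be sketched as shell commands (assuming Python 3.13 is available as `python3.13` and the dependencies are listed in a `requirements.txt` file):

```shell
# Create an isolated virtual environment for the project and activate it.
python3.13 -m venv .venv
source .venv/bin/activate

# Upgrade package tooling inside the environment.
pip install --upgrade pip

# Install the project dependencies from the requirements file.
pip install -r requirements.txt
```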

Ingestion

Convert Comments in Excel Files to One JSON File

The first ingestion step is to normalize source spreadsheets into one JSON structure.

After conversion, I also run a statistics step to validate the distribution and basic quality of the generated data.

This gives me a simple sanity check before moving to embeddings and vector storage.
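A minimal sketch of this normalization and statistics step, assuming pandas and openpyxl are installed. In the POC the rows come from `pd.read_excel(...)` over the source spreadsheets; here an inline DataFrame stands in for one spreadsheet, and the column names are illustrative assumptions, not the customer's actual schema:

```python
import json
from collections import Counter

import pandas as pd

# Stand-in for one source spreadsheet; in the POC this DataFrame
# comes from pd.read_excel("some_file.xlsx"). Column names are assumptions.
df = pd.DataFrame(
    {
        "comment": ["Fix section 2.1 wording", "Add missing reference"],
        "author": ["expert-a", "expert-b"],
        "source_file": ["review_round_1.xlsx"] * 2,
    }
)

# Normalize all spreadsheet rows into one JSON structure.
comments = df.to_dict(orient="records")
with open("comments.json", "w", encoding="utf-8") as f:
    json.dump(comments, f, ensure_ascii=False, indent=2)

# Simple statistics step: row distribution per source file as a sanity check.
stats = Counter(c["source_file"] for c in comments)
print(stats)
```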

Convert PDF Documents to Markdown

Ground truth PDFs are converted to markdown for easier chunking and indexing.

Because PDF-to-markdown conversion can be resource intensive, I run that phase on a temporary cloud instance, then copy the resulting markdown files back to the local project.

The result is two markdown documents representing the current draft material and a higher-level reference document.

Enrich Comments JSON File

Next, I enrich the comment JSON with AI-assisted categorization.

This step adds fields such as category and AI reasoning that later help both retrieval and evaluation.

When needed, I also export the enriched dataset into a review-friendly format so domain experts can validate and refine the categorization quality.
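The enrichment step can be sketched as follows. The `classify_comment` function below is a hypothetical stand-in for the AI categorization call; in the POC this is an LLM request that returns a category and its reasoning:

```python
# Hypothetical stand-in for the AI categorization call. In the POC this
# sends the comment to an LLM; here a keyword rule keeps the sketch runnable.
def classify_comment(text: str) -> dict:
    category = "editorial" if "wording" in text.lower() else "technical"
    return {"category": category, "ai_reasoning": f"Keyword-based guess for: {text!r}"}


def enrich(comments: list[dict]) -> list[dict]:
    # Add the generated fields next to the original ones so that both
    # retrieval and later evaluation can use them.
    return [{**c, **classify_comment(c["comment"])} for c in comments]


enriched = enrich([{"comment": "Fix section 2.1 wording"}])
print(enriched[0]["category"])  # editorial
```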

Split Comments into Historical and Test Comments

To evaluate the pipeline realistically, I split enriched comments into two datasets:

  • Historical dataset (80%): uploaded to Qdrant as prior examples.
  • Test dataset (20%): used as unseen input for generation and validation.

The test subset intentionally excludes generated fields (like final category/change/reasoning), because those are produced by the AI pipeline later.
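The split above can be sketched as a deterministic shuffle followed by an 80/20 cut. The generated field names (`category`, `change`, `ai_reasoning`) are assumptions based on this POC's description:

```python
import random


def split_dataset(comments: list[dict], historical_ratio: float = 0.8, seed: int = 42):
    """Shuffle deterministically and split into historical/test subsets."""
    items = comments.copy()
    random.Random(seed).shuffle(items)
    cut = int(len(items) * historical_ratio)
    historical = items[:cut]
    # Strip generated fields from the test subset; the AI pipeline
    # produces them later. Field names here are assumptions.
    generated = {"category", "change", "ai_reasoning"}
    test = [{k: v for k, v in c.items() if k not in generated} for c in items[cut:]]
    return historical, test


comments = [{"id": i, "comment": f"c{i}", "category": "x"} for i in range(10)]
historical, test = split_dataset(comments)
print(len(historical), len(test))  # 8 2
```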

Upload Data to Qdrant Vector Database

In this step, I start the local Qdrant service and run the upload process from the Python environment.

When the upload finishes, Qdrant contains both historical memory and ground truth collections.

Qdrant Collections Overview

Qdrant Collection Architecture

The architecture uses separate collections for distinct retrieval purposes:

  • One collection for historical examples.
  • Two collections for ground truth documents.

This separation keeps prompts focused and retrieval behavior explicit.

1. Historical Comments

This collection is the system memory.

  • Purpose: Few-shot guidance from semantically similar past comments.
  • Typical payload: Various fields in the original Excel files, and new fields related to this POC.
  • Benefit: Helps keep generated outputs consistent with earlier decisions.

2. Ground Truths

These collections store chunked sections from authoritative markdown documents.

  • Purpose: Ground generation in the actual source text.
  • Typical payload: Content, source filename, section index, chunk length.
  • Benefit: Reduces hallucinations and makes decisions easier to justify.
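The payload structure above can be sketched with a simplified chunking function that splits a markdown document on second-level headings (the real POC tunes the chunking strategy more carefully):

```python
import re


def chunk_markdown(text: str, source: str) -> list[dict]:
    """Split a markdown document on second-level headings and build
    one payload dict per section, matching the payload fields above."""
    sections = re.split(r"\n(?=## )", text)
    return [
        {
            "content": section.strip(),
            "source_file": source,
            "section_index": i,
            "chunk_length": len(section.strip()),
        }
        for i, section in enumerate(sections)
        if section.strip()
    ]


doc = "# Spec\nIntro text.\n## Rules\nRule text.\n## Limits\nLimit text."
chunks = chunk_markdown(doc, "spec.md")
print(len(chunks))  # 3
```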

Qdrant Dashboard - Quick Start Guide

1. Access the Dashboard

After Qdrant is running locally, I open the dashboard in a browser to verify collection status and payload quality.

2. Browsing Collections

  1. Open Collections from the left menu.
  2. Verify all expected collections exist.
  3. Open each collection and check point count and vector config.

3. Inspecting Points and Metadata

  1. Open the Points tab.
  2. Expand a point payload.
  3. Validate key metadata fields (source, category, section index, etc.).

4. Running Test Queries (Console)

In Console, I run simple filter-based checks to verify that metadata fields are queryable (for example, retrieving points for one specific category). This is a quick way to confirm that ingestion produced the expected structure for downstream retrieval.
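In the dashboard Console, such a filter check can look like the following scroll request (the collection name, field name, and category value are assumptions from this POC's schema):

```
POST collections/historical_comments/points/scroll
{
  "filter": {
    "must": [
      { "key": "category", "match": { "value": "editorial" } }
    ]
  },
  "limit": 10,
  "with_payload": true
}
```

If the ingestion worked, this returns up to ten points whose payload has the requested category, with their metadata visible for inspection.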

Conclusion

In this first part, I showed how I built the ingestion foundation for a GenAI POC using Qdrant:

  • normalize source files,
  • enrich and split datasets,
  • upload historical and ground truth collections,
  • validate everything in the Qdrant dashboard.

In Part 2, I will show how we use these collections with an LLM to classify new items and generate structured outputs based on historical patterns and ground truth evidence.

The writer is working at a major international IT corporation building cloud infrastructures and implementing genAI applications on top of those infrastructures.

Kari Marttila

Kari Marttila’s Home Page on LinkedIn: https://www.linkedin.com/in/karimarttila/