
qdrant-vector-search

Tags: Meta, RAG, Vector Search, Qdrant, Semantic Search, Embeddings, Similarity Search, HNSW, Production, Distributed

About

The qdrant-vector-search skill provides a high-performance vector similarity search engine for building production RAG systems and semantic search applications. It delivers fast nearest-neighbor search with hybrid metadata filtering and scalable, Rust-powered vector storage for low-latency operations. Use it when you need on-premise deployment with full data control or horizontal scaling with sharding and replication.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/davila7/claude-code-templates

Git Clone (Alternative)
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/qdrant-vector-search

Copy and paste the command into Claude Code to install this skill.

Documentation

Qdrant - Vector Similarity Search Engine

High-performance vector database written in Rust for production RAG and semantic search.

When to use Qdrant

Use Qdrant when:

  • Building production RAG systems requiring low latency
  • Need hybrid search (vectors + metadata filtering)
  • Require horizontal scaling with sharding/replication
  • Want on-premise deployment with full data control
  • Need multi-vector storage per record (dense + sparse)
  • Building real-time recommendation systems

Key features:

  • Rust-powered: Memory-safe, high performance
  • Rich filtering: Filter by any payload field during search
  • Multiple vectors: Dense, sparse, multi-dense per point
  • Quantization: Scalar, product, binary for memory efficiency
  • Distributed: Raft consensus, sharding, replication
  • REST + gRPC: Both APIs with full feature parity

When to use an alternative instead:

  • Chroma: Simpler setup, embedded use cases
  • FAISS: Maximum raw speed, research/batch processing
  • Pinecone: Fully managed, zero ops preferred
  • Weaviate: GraphQL preference, built-in vectorizers

Quick start

Installation

# Python client
pip install qdrant-client

# Docker (recommended for development)
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Docker with persistent storage
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant

Basic usage

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Connect to Qdrant
client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Insert vectors with payload
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],  # 384-dim vector
            payload={"title": "Doc 1", "category": "tech"}
        ),
        PointStruct(
            id=2,
            vector=[0.3, 0.4, ...],
            payload={"title": "Doc 2", "category": "science"}
        )
    ]
)

# Search with filtering
results = client.search(
    collection_name="documents",
    query_vector=[0.15, 0.25, ...],
    query_filter={
        "must": [{"key": "category", "match": {"value": "tech"}}]
    },
    limit=10
)

for point in results:
    print(f"ID: {point.id}, Score: {point.score}, Payload: {point.payload}")

Core concepts

Points - Basic data unit

from qdrant_client.models import PointStruct

# Point = ID + Vector(s) + Payload
point = PointStruct(
    id=123,                              # Integer or UUID string
    vector=[0.1, 0.2, 0.3, ...],        # Dense vector
    payload={                            # Arbitrary JSON metadata
        "title": "Document title",
        "category": "tech",
        "timestamp": 1699900000,
        "tags": ["python", "ml"]
    }
)

# Batch upsert (recommended)
client.upsert(
    collection_name="documents",
    points=[point1, point2, point3],
    wait=True  # Block until the write is applied
)

Collections - Vector containers

from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

# Create with HNSW configuration
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=384,                        # Vector dimensions
        distance=Distance.COSINE         # COSINE, EUCLID, DOT, MANHATTAN
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                            # Connections per node (default 16)
        ef_construct=100,                # Build-time accuracy (default 100)
        full_scan_threshold=10000        # Use full scan below this size (in KB)
    ),
    on_disk_payload=True                 # Store payload on disk
)

# Collection info
info = client.get_collection("documents")
print(f"Points: {info.points_count}, Vectors: {info.vectors_count}")

Distance metrics

Metric      Use Case                              Range
COSINE      Text embeddings, normalized vectors   0 to 2
EUCLID      Spatial data, image features          0 to ∞
DOT         Recommendations, unnormalized         -∞ to ∞
MANHATTAN   Sparse features, discrete data        0 to ∞
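
The metric is fixed when a collection is created; switching metrics means recreating the collection. As a rule of thumb, similarity-style metrics (COSINE, DOT) rank higher scores as closer, while distance-style metrics (EUCLID, MANHATTAN) rank lower scores as closer. A minimal sketch with illustrative collection names:

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient(host="localhost", port=6333)

# The distance metric is part of the collection definition
client.create_collection(
    collection_name="docs_cosine",   # illustrative name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)
client.create_collection(
    collection_name="docs_euclid",   # illustrative name
    vectors_config=VectorParams(size=384, distance=Distance.EUCLID)
)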

Search operations

Basic search

# Simple nearest neighbor search
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.2, ...],
    limit=10,
    with_payload=True,
    with_vectors=False  # Don't return vectors (faster)
)

Filtered search

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Complex filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="tech")),
            FieldCondition(key="timestamp", range=Range(gte=1699000000))
        ],
        must_not=[
            FieldCondition(key="status", match=MatchValue(value="archived"))
        ]
    ),
    limit=10
)

# Shorthand filter syntax
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter={
        "must": [
            {"key": "category", "match": {"value": "tech"}},
            {"key": "price", "range": {"gte": 10, "lte": 100}}
        ]
    },
    limit=10
)

Batch search

from qdrant_client.models import SearchRequest

# Multiple queries in one request
results = client.search_batch(
    collection_name="documents",
    requests=[
        SearchRequest(vector=[0.1, ...], limit=5),
        SearchRequest(vector=[0.2, ...], limit=5, filter={"must": [...]}),
        SearchRequest(vector=[0.3, ...], limit=10)
    ]
)

RAG integration

With sentence-transformers

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Initialize
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Index documents
documents = [
    {"id": 1, "text": "Python is a programming language", "source": "wiki"},
    {"id": 2, "text": "Machine learning uses algorithms", "source": "textbook"},
]

points = [
    PointStruct(
        id=doc["id"],
        vector=encoder.encode(doc["text"]).tolist(),
        payload={"text": doc["text"], "source": doc["source"]}
    )
    for doc in documents
]
client.upsert(collection_name="knowledge_base", points=points)

# RAG retrieval
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_vector = encoder.encode(query).tolist()
    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_vector,
        limit=top_k
    )
    return [{"text": r.payload["text"], "score": r.score} for r in results]

# Use in RAG pipeline
chunks = retrieve("What is Python?")
context = "\n".join(chunk["text"] for chunk in chunks)
prompt = f"Context: {context}\n\nQuestion: What is Python?"

With LangChain

from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Qdrant.from_documents(documents, embeddings, url="http://localhost:6333", collection_name="docs")
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

With LlamaIndex

from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

vector_store = QdrantVectorStore(client=client, collection_name="llama_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()

Multi-vector support

Named vectors (different embedding models)

from qdrant_client.models import VectorParams, Distance

# Collection with multiple vector types
client.create_collection(
    collection_name="hybrid_search",
    vectors_config={
        "dense": VectorParams(size=384, distance=Distance.COSINE),
        "sparse": VectorParams(size=30000, distance=Distance.DOT)
    }
)

# Insert with named vectors
client.upsert(
    collection_name="hybrid_search",
    points=[
        PointStruct(
            id=1,
            vector={
                "dense": dense_embedding,
                "sparse": sparse_embedding
            },
            payload={"text": "document text"}
        )
    ]
)

# Search specific vector
results = client.search(
    collection_name="hybrid_search",
    query_vector=("dense", query_dense),  # Specify which vector
    limit=10
)

Sparse vectors (BM25, SPLADE)

from qdrant_client.models import SparseVectorParams, SparseIndexParams, SparseVector

# Collection with sparse vectors
client.create_collection(
    collection_name="sparse_search",
    vectors_config={},
    sparse_vectors_config={"text": SparseVectorParams(index=SparseIndexParams(on_disk=False))}
)

# Insert sparse vector
client.upsert(
    collection_name="sparse_search",
    points=[PointStruct(id=1, vector={"text": SparseVector(indices=[1, 5, 100], values=[0.5, 0.8, 0.2])}, payload={"text": "document"})]
)
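
Querying mirrors the insert: the query itself is a sparse vector addressed by name. A minimal sketch, assuming the sparse_search collection above and a qdrant-client version that provides NamedSparseVector; the indices and values stand in for a real BM25/SPLADE encoding.

from qdrant_client.models import NamedSparseVector, SparseVector

# Search the "text" sparse vector with a sparse query
results = client.search(
    collection_name="sparse_search",
    query_vector=NamedSparseVector(
        name="text",
        vector=SparseVector(indices=[1, 100], values=[0.4, 0.9])
    ),
    limit=10
)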

Quantization (memory optimization)

from qdrant_client.models import (
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    SearchParams, QuantizationSearchParams
)

# Scalar quantization (4x memory reduction)
client.create_collection(
    collection_name="quantized",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,        # Clip outliers
            always_ram=True      # Keep quantized in RAM
        )
    )
)

# Search with rescoring
results = client.search(
    collection_name="quantized",
    query_vector=query,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(rescore=True)  # Rescore top results with original vectors
    ),
    limit=10
)
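
Besides scalar quantization, binary quantization offers more aggressive compression (roughly 32x for the vector data, since each dimension is reduced to one bit) and generally suits high-dimensional embeddings best. A minimal sketch with an illustrative collection name and dimension:

from qdrant_client.models import BinaryQuantization, BinaryQuantizationConfig

# Binary quantization: 1 bit per dimension, rescoring recommended at query time
client.create_collection(
    collection_name="binary_quantized",   # illustrative name
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True)
    )
)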

Payload indexing

from qdrant_client.models import PayloadSchemaType

# Create payload index for faster filtering
client.create_payload_index(
    collection_name="documents",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD
)

client.create_payload_index(
    collection_name="documents",
    field_name="timestamp",
    field_schema=PayloadSchemaType.INTEGER
)

# Index types: KEYWORD, INTEGER, FLOAT, GEO, TEXT (full-text), BOOL
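
As one illustration of the TEXT index type, a full-text index supports word matching inside filters. A minimal sketch, assuming a text payload field on the documents collection, the MatchText condition available in recent qdrant-client versions, and the query_embedding from earlier examples:

from qdrant_client.models import Filter, FieldCondition, MatchText

# Full-text index on the "text" payload field
client.create_payload_index(
    collection_name="documents",
    field_name="text",
    field_schema=PayloadSchemaType.TEXT
)

# Combine vector search with a full-text match on the indexed field
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="text", match=MatchText(text="machine learning"))]
    ),
    limit=10
)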

Production deployment

Qdrant Cloud

from qdrant_client import QdrantClient

# Connect to Qdrant Cloud
client = QdrantClient(
    url="https://your-cluster.cloud.qdrant.io",
    api_key="your-api-key"
)

Performance tuning

# Optimize for search speed (higher recall)
client.update_collection(
    collection_name="documents",
    hnsw_config=HnswConfigDiff(ef_construct=200, m=32)
)

# Optimize for indexing speed (bulk loads)
from qdrant_client.models import OptimizersConfigDiff

client.update_collection(
    collection_name="documents",
    optimizers_config=OptimizersConfigDiff(indexing_threshold=20000)
)
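
Recall can also be traded against latency per query rather than per collection. A minimal sketch using SearchParams; the hnsw_ef value is illustrative and query_embedding is reused from earlier examples:

from qdrant_client.models import SearchParams

# Larger hnsw_ef explores more of the HNSW graph (better recall, higher latency);
# exact=True bypasses the index entirely for a brute-force search
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    search_params=SearchParams(hnsw_ef=128, exact=False),
    limit=10
)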

Best practices

  1. Batch operations - Use batch upsert/search for efficiency (see the sketch after this list)
  2. Payload indexing - Index fields used in filters
  3. Quantization - Enable for large collections (>1M vectors)
  4. Sharding - Use for collections >10M vectors
  5. On-disk storage - Enable on_disk_payload for large payloads
  6. Connection pooling - Reuse client instances
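
A minimal sketch combining practices 1 and 4: chunked upserts into a sharded collection. The collection name, shard count, and batch size are illustrative, not recommendations.

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(host="localhost", port=6333)  # reuse one client instance

# Sharded collection for large datasets (illustrative settings)
client.create_collection(
    collection_name="large_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    shard_number=4,
    on_disk_payload=True
)

# Upsert in fixed-size batches instead of point-by-point
def upsert_in_batches(points: list[PointStruct], batch_size: int = 256) -> None:
    for i in range(0, len(points), batch_size):
        client.upsert(
            collection_name="large_docs",
            points=points[i:i + batch_size],
            wait=True
        )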

Common issues

Slow search with filters:

# Create payload index for filtered fields
client.create_payload_index(
    collection_name="docs",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD
)

Out of memory:

# Enable quantization and on-disk storage
client.create_collection(
    collection_name="large_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(...),
    on_disk_payload=True
)

Connection issues:

# Use timeout and retry
client = QdrantClient(
    host="localhost",
    port=6333,
    timeout=30,
    prefer_grpc=True  # gRPC for better performance
)

Resources

GitHub Repository

davila7/claude-code-templates
Path: cli-tool/components/skills/ai-research/rag-qdrant
Topics: anthropic, anthropic-claude, claude, claude-code

Related Skills

training-llms-megatron

Design

This skill trains massive language models (2B-462B parameters) using NVIDIA's Megatron-Core framework for maximum GPU efficiency. Use it when training models over 1B parameters, requiring advanced parallelism strategies like tensor or pipeline, or needing production-ready performance. It's a proven framework used for models like Nemotron and LLaMA.

View skill

pinecone

Development

Pinecone is a fully managed vector database for production AI applications, featuring auto-scaling, low latency (<100ms p95), and hybrid search. It's ideal for developers who need a serverless solution for production RAG, semantic search, or recommendation systems without managing infrastructure. Use it when you require metadata filtering, namespaces, and scaling to billions of vectors.

View skill

tensorrt-llm

Other

TensorRT-LLM is an NVIDIA-optimized library for deploying LLMs on NVIDIA GPUs, delivering up to 100x faster inference than PyTorch. Use it for production serving where you need maximum throughput, low latency, and support for features like quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

View skill

langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers and offers key features like tool calling, memory management, and vector store retrieval. Use it for rapid prototyping or deploying production systems like chatbots, autonomous agents, and question-answering tools.

View skill