qdrant-hybrid-search-prefetches
About
This skill helps developers implement hybrid search in Qdrant by combining lexical (keyword/BM25) and semantic (dense vector) retrieval in a single query. It explains how to use the `prefetch` API to run parallel searches over different vector representations or queries, and how to configure payload indexing properly. Use it when you need to merge sparse and dense search methods or optimize multi-field retrieval strategies.
Quick Install
Claude Code
Copy and paste one of these commands to install the skill in Claude Code:
- Recommended: npx skills add qdrant/skills -a claude-code
- /plugin add https://github.com/qdrant/skills
- git clone https://github.com/qdrant/skills.git ~/.claude/skills/qdrant-hybrid-search-prefetches
Skill Documentation
Different Searches in One Query API Request
Each prefetch runs exactly one search for one query.
Determine whether the user wants to run several parallel searches on:
- The same vector representations but different queries or filters.
- Different vector representations but the same raw query.
If the first, help the user design the logic for constructing the queries and/or filters on the application side, then check Combining Searches. Remember to create indexes on filterable payload fields immediately after collection creation, before the HNSW index is built, so that a filterable HNSW graph can be constructed.
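As a minimal sketch of the first case, the request bodies below are written as plain Python dictionaries so they can be inspected without a running server. They assume Qdrant's REST endpoints `PUT /collections/{name}/index` and `POST /collections/{name}/points/query`. The collection name `articles`, the payload field `lang`, the vector name `dense`, the toy query vector, and the `rrf` fusion step are hypothetical choices made only for illustration.

```python
import json

COLLECTION = "articles"  # hypothetical collection name

# Body for PUT /collections/articles/index, sent right after the collection
# is created and before any points are uploaded.
payload_index_body = {"field_name": "lang", "field_schema": "keyword"}


def filtered_prefetch(vector, lang, limit=20):
    """Build one prefetch: one query vector combined with one filter."""
    return {
        "query": vector,
        "using": "dense",  # same vector representation in every prefetch
        "filter": {"must": [{"key": "lang", "match": {"value": lang}}]},
        "limit": limit,
    }


query_vector = [0.12, 0.48, 0.31, 0.07]  # toy embedding, not a real model output

# Body for POST /collections/articles/points/query
query_body = {
    "prefetch": [
        filtered_prefetch(query_vector, "en"),
        filtered_prefetch(query_vector, "de"),
    ],
    # Assumed merging step; how to combine results is a separate decision.
    "query": {"fusion": "rrf"},
    "limit": 10,
}

print(json.dumps(query_body, indent=2))
```

Each entry in `prefetch` is still a single search. The parallelism comes from listing several of them, and the filter construction stays in application code.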
If the second, use named vectors, which allow storing multiple vector types per point in one collection. Beware that named vectors can currently be configured only at collection creation. To choose vectors, check the following recommendations.
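Because named vectors are fixed at creation time, it can help to write the collection definition down before anything else. The sketch below builds a body for `PUT /collections/{name}` as a plain dictionary. The vector names `dense` and `sparse`, the size 384, the `Cosine` distance, and the `idf` modifier are assumptions for illustration, and `check_prefetch_names` is a hypothetical helper, not part of any Qdrant SDK.

```python
# Body for PUT /collections/{name}: one dense and one sparse named vector.
create_collection_body = {
    "vectors": {
        "dense": {"size": 384, "distance": "Cosine"},  # assumed model dimension
    },
    "sparse_vectors": {
        "sparse": {"modifier": "idf"},  # assumed: IDF applied by the server
    },
}

# One point carrying both representations of the same item.
example_point = {
    "id": 1,
    "vector": {
        "dense": [0.0] * 384,  # placeholder values
        "sparse": {"indices": [17, 4242], "values": [1.0, 2.0]},
    },
    "payload": {"title": "example"},
}


def check_prefetch_names(collection_body, prefetches):
    """Return the `using` names that the collection does not define."""
    known = set(collection_body.get("vectors", {}))
    known |= set(collection_body.get("sparse_vectors", {}))
    return [p["using"] for p in prefetches if p["using"] not in known]


missing = check_prefetch_names(
    create_collection_body,
    [{"using": "dense"}, {"using": "sparse"}, {"using": "image"}],
)
print(missing)
```

A check like this catches a prefetch that targets a representation the collection was never created with.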
Missed Keyword Matches
Use when: pure vector search misses exact term or keyword matches and you need lexical retrieval alongside semantic search.
Most likely you need a sparse vector for exact text search alongside the dense one. Qdrant uses sparse vectors for lexical searches, as payload filtering doesn't provide any ranking score.
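A rough sketch of such a hybrid request follows: the same raw query is encoded twice and sent as two prefetches in one Query API body. `toy_sparse` is a stand-in for a real sparse encoder (it only hashes tokens to ids and counts them), the dense vector is a placeholder, and the vector names `dense` and `sparse` plus the `rrf` fusion step are assumptions.

```python
import zlib
from collections import Counter


def toy_sparse(text):
    """Stand-in sparse encoder: term counts keyed by a hashed token id."""
    weights = {}
    for token, count in Counter(text.lower().split()).items():
        index = zlib.crc32(token.encode("utf-8")) % 1_000_000
        # Sum on the (unlikely) collision so indices stay unique.
        weights[index] = weights.get(index, 0.0) + float(count)
    indices = sorted(weights)
    return {"indices": indices, "values": [weights[i] for i in indices]}


def hybrid_query_body(raw_query, dense_vector, limit=10):
    """One raw query, two representations, two prefetches."""
    return {
        "prefetch": [
            {"query": toy_sparse(raw_query), "using": "sparse", "limit": 50},
            {"query": dense_vector, "using": "dense", "limit": 50},
        ],
        "query": {"fusion": "rrf"},  # assumed way of merging both result lists
        "limit": limit,
    }


body = hybrid_query_body("error code E4021", [0.1, 0.2, 0.3, 0.4])
print(body["prefetch"][0]["query"])
```

The sparse prefetch is what supplies a ranking score for exact terms such as `E4021`, which a payload filter alone would not provide.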
Choose a Sparse Vector for Text
- BM25 statistical representations, built into Qdrant core (computed server-side). A good baseline that works out-of-domain, usually best suited to long texts. Can be used for non-English content, but must be configured per language (tokenization, stemming, stop words, etc.) at both indexing and retrieval time. More in the Text Search Guide.
- BM42 learned sparse, based on BM25 but better suited to small chunks of text, with some understanding of meaning. Works only on English. Requires fine-tuning for domain-specific retrieval. Requires FastEmbed (Python/REST only, not available in all SDKs). No longer maintained.
- miniCOIL learned sparse, BM25 with additional understanding of word meaning in context. Works only on English. Requires fine-tuning for domain-specific retrieval. Requires FastEmbed. Usage is shown in the FastEmbed miniCOIL documentation.
- SPLADE++ learned sparse with term expansion. Heavier inference and resource usage, but better retrieval quality thanks to term expansion. Requires fine-tuning for domain-specific retrieval. The versions provided in Qdrant Cloud Inference and FastEmbed work only on English. To use it with FastEmbed, check the FastEmbed SPLADE documentation.
- External learned sparse embeddings, for example BAAI/bge-m3.
What to remember when using sparse vectors for lexical search:
- tokenization and stemming affect exact matches, especially on custom codes, terms, etc.
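To see why this matters, the sketch below compares two deliberately simple tokenizers on a product code. Neither is Qdrant's tokenizer. They are hypothetical stand-ins that show how splitting on punctuation turns one exact code into several generic tokens.

```python
import re


def word_tokenizer(text):
    """Lowercase and split on every non-alphanumeric character."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def whitespace_tokenizer(text):
    """Lowercase and split on whitespace only, keeping codes intact."""
    return text.lower().split()


def overlap(tokenizer, query, document):
    """Tokens shared by the query and the document under a given tokenizer."""
    return set(tokenizer(query)) & set(tokenizer(document))


query = "SKU-4021-B"
relevant = "Replace part SKU-4021-B before restart"
unrelated = "Invoice 4021 for customer B"

for name, tok in [("word", word_tokenizer), ("whitespace", whitespace_tokenizer)]:
    print(name, overlap(tok, query, relevant), overlap(tok, query, unrelated))
```

With the word tokenizer the unrelated invoice shares the tokens `4021` and `b` with the query, so it can score as a lexical match. With whitespace splitting only the document containing the full code matches.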
What to remember when using Qdrant BM25 and miniCOIL (based on BM25):
- avg_len in the formula is not computed server-side; computing it is the user's responsibility, and it is passed as a parameter
- BM25 might not work well for small chunks of text, as the BM25 algorithm was originally created for search over long documents; consider adjusting the document statistics in sparse vectors (TF & IDF, k, b).
- Qdrant BM25 vectors are configured per language, so consider customizing stop words, stemming & tokenization when users' documents mix several languages, or carefully configure vectors per point when documents are monolingual.
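The effect of `avg_len`, `k`, and `b` can be checked with the standard BM25 term-frequency component, sketched below in plain Python. This is the textbook formula, not Qdrant's internal code. The three-chunk corpus is made up, and the final `options` dictionary assumes that server-side BM25 accepts an `avg_len` option alongside the `qdrant/bm25` model name.

```python
def bm25_tf(tf, doc_len, avg_len, k=1.2, b=0.75):
    """Textbook BM25 term-frequency weight with length normalization."""
    norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k + 1.0) / (tf + k * norm)


# avg_len has to be measured on your own corpus (here: whitespace tokens).
chunks = ["reset the router", "router firmware update failed", "update"]
lengths = [len(chunk.split()) for chunk in chunks]
avg_len = sum(lengths) / len(lengths)

# Weight of one term occurrence in the one-token chunk "update".
measured = bm25_tf(tf=1, doc_len=1, avg_len=avg_len)
mismatched = bm25_tf(tf=1, doc_len=1, avg_len=256.0)  # avg_len far above the corpus
no_length_norm = bm25_tf(tf=1, doc_len=1, avg_len=256.0, b=0.0)

print(avg_len, measured, mismatched, no_length_norm)

# Assumed shape of a server-side BM25 document carrying the measured value.
bm25_document = {
    "text": chunks[0],
    "model": "qdrant/bm25",
    "options": {"avg_len": avg_len},
}
```

An `avg_len` far above the real chunk length inflates the weight of every short chunk, and setting `b` to 0 switches length normalization off entirely. Both are worth evaluating when chunks are small.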
More on Sparse Vectors for Text Search
Need to Combine Multiple Representations of the Same Item
Use when: the same item is embedded in multiple ways (e.g. different models, languages, or modalities) and you want to search across those representations in one request (not necessarily all of them; even a single one is fine).
Use multiple named vector prefetches, with each prefetch covering one representation.
- If you have groups and subgroups of representations (document -> chunk, image -> patch), you can use search in groups. To avoid storing identical payloads several times, check Lookup in Groups.
You can also search directly on multivectors (a matrix of dense vectors) in a prefetch.
However, this comes with several caveats: multivectors were designed to support late interaction models using the max similarity metric, so it is impossible to retrieve the individual max similarity scores for each query vector.
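A sketch of how these pieces might fit together in one request is shown below, written as a body for `POST /collections/{name}/points/query/groups`. The collection names `chunks` and `documents`, the vector names `text_en` and `text_multilingual`, the payload fields, the toy vectors, and the `rrf` fusion step are all hypothetical.

```python
def representation_prefetch(vector, vector_name, limit=50):
    """One prefetch per stored representation of the item."""
    return {"query": vector, "using": vector_name, "limit": limit}


# Toy query embeddings, one per model; real ones come from your encoders.
query_en = [0.2, 0.1, 0.7]
query_multilingual = [0.4, 0.4, 0.1, 0.1]

# Body for POST /collections/chunks/points/query/groups
groups_body = {
    "prefetch": [
        representation_prefetch(query_en, "text_en"),
        representation_prefetch(query_multilingual, "text_multilingual"),
    ],
    "query": {"fusion": "rrf"},  # assumed merging step
    "group_by": "document_id",   # payload field linking a chunk to its document
    "group_size": 3,             # best chunks kept per document
    "limit": 5,                  # number of documents (groups) returned
    # Document-level payload lives once in a separate collection and is
    # looked up by the group id instead of being copied onto every chunk.
    "with_lookup": {
        "collection": "documents",
        "with_payload": ["title", "url"],
        "with_vectors": False,
    },
}

print(sorted(groups_body))
```

Keeping shared fields such as `title` in the `documents` collection means each chunk only needs to store `document_id`.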
Moreover, multivectors are rarely a good pick for prefetch:
- the max similarity metric is not symmetric, so using an HNSW index with it can be problematic
- multivector representations are very heavy, and so is the search process over them.
There are ways to make multivector retrieval cheaper (MUVERA, pooling); see "Evaluating Tradeoffs of Multi-stage Multi-vector Search" for more.
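The two properties of max similarity discussed above can be reproduced with a few lines of plain Python. The sketch uses dot product as the inner similarity and tiny hand-made vectors. It illustrates the metric itself, not Qdrant's implementation.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_vectors, doc_vectors):
    """Sum, over query vectors, of the best match among document vectors."""
    per_query_vector = [
        max(dot(q, d) for d in doc_vectors) for q in query_vectors
    ]
    # Only the sum is used as the score; the per-vector maxima are
    # computed here just to show what gets collapsed.
    return sum(per_query_vector), per_query_vector


query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]

score, parts = max_sim(query, document)
reverse_score, _ = max_sim(document, query)

print(score, parts)    # one score per document, built from per-vector maxima
print(reverse_score)   # swapping the roles of query and document changes it
```

Swapping query and document changes the score, which is the asymmetry that makes graph-based indexing awkward. The cost also grows with the number of query vectors times the number of document vectors for every candidate.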
What NOT to Do
- Choose any search method (for example, BM25) without evaluating its quality & resource usage.
- Use any search method (for example, BM25) without paying attention to the specifics of its configuration and its applicability to the use case.
