content-similarity-checker
About
This skill measures how similar documents are using TF-IDF, cosine similarity, and Jaccard index algorithms. It is designed for plagiarism detection, duplicate detection, and content matching. Developers can use it for pairwise comparisons, batch processing, and detailed similarity reports.
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/majiayu000/claude-skill-registry
Or clone the repository directly:
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/content-similarity-checker
Documentation
Content Similarity Checker
Compare documents and text for similarity using multiple algorithms.
Features
- Cosine Similarity: TF-IDF based comparison
- Jaccard Similarity: Set-based comparison
- Levenshtein Distance: Edit distance for short texts
- Batch Comparison: Compare multiple documents
- Similarity Matrix: Pairwise comparison of all documents
- Reports: Detailed similarity reports
Quick Start
from similarity_checker import SimilarityChecker
checker = SimilarityChecker()
# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")
# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
CLI Usage
# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"
# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt
# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv
# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard
# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7
# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
API Reference
SimilarityChecker Class
class SimilarityChecker:
    def __init__(self, method: str = "cosine")
    # Text comparison
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float
    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict
    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list
    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list
    # Report
    def generate_report(self, output: str) -> str
Similarity Methods
Cosine Similarity (Default)
Best for comparing documents of different lengths:
checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
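To illustrate why cosine similarity copes well with documents of different lengths, here is a minimal standard-library sketch. It is not the skill's implementation: it uses raw term counts rather than TF-IDF weights, and the `tokenize` and `cosine_similarity` helpers are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between the two term-count vectors."""
    a, b = Counter(tokenize(text1)), Counter(tokenize(text2))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


short_text = "brown fox"
repeated_text = "brown fox brown fox brown fox"
# Same word proportions, different lengths: the vectors point the same way.
score = cosine_similarity(short_text, repeated_text)
print(f"{score:.2f}")
```

Because the score depends only on the angle between the vectors, repeating a text three times does not lower its similarity to the original.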
Jaccard Similarity
Good for comparing sets of words/tokens:
checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
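As a rough sketch of the set-based idea, the Jaccard index is the size of the intersection of the two word sets divided by the size of their union. The `jaccard_similarity` helper and its tokenization rule are assumptions made for this example, not the skill's actual code.

```python
import re


def jaccard_similarity(text1: str, text2: str) -> float:
    """|A & B| / |A | B| over the sets of distinct lowercase words."""
    set1 = set(re.findall(r"[a-z0-9]+", text1.lower()))
    set2 = set(re.findall(r"[a-z0-9]+", text2.lower()))
    union = set1 | set2
    if not union:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set1 & set2) / len(union)


score = jaccard_similarity(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
)
# 4 shared words (brown, fox, over, dog) out of 12 distinct words
print(f"{score:.2f}")  # 0.33
```

Word order and repetition are ignored, which is why this measure suits token-set comparisons better than it suits long documents with heavy word reuse.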
Levenshtein (Edit Distance)
Best for short texts, typo detection:
checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
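The normalized score can be understood as one minus the edit distance divided by a length. The sketch below assumes normalization by the longer string's length, which is one common choice; the document does not say which normalization the skill uses, and both function names are hypothetical.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s: str, t: str) -> float:
    """Map the distance to a 0.0-1.0 score (assumed normalization)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(s, t) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(f"{levenshtein_similarity('kitten', 'sitting'):.2f}")  # 0.57
```

The dynamic-programming table costs time proportional to the product of the two lengths, which is one reason edit distance is usually reserved for short texts.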
TF-IDF + Cosine
Advanced: considers term importance:
checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
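To show what "considers term importance" can mean in practice, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and fits the IDF on just the two texts being compared; the skill may weight terms differently, and all helper names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Weight each term count by a smoothed inverse document frequency."""
    token_lists = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_similarity(text1: str, text2: str) -> float:
    v1, v2 = tfidf_vectors([text1, text2])
    return cosine(v1, v2)


print(f"{tfidf_similarity('brown fox', 'brown dog'):.2f}")
```

Terms that appear in only one document receive a higher weight than terms shared by both, so a partial overlap scores lower than it would with raw counts. With a larger reference corpus, the IDF would instead reflect how rare each term is across that corpus.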
Batch Comparison
Compare to Corpus
checker = SimilarityChecker()
target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]
results = checker.compare_to_corpus(target, corpus)
# Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]
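The result shape above (original index, score, text, sorted from most to least similar) can be produced with a score-then-sort pass. This is a sketch under assumptions: `rank_against_corpus` is a hypothetical name, and a simple Jaccard scorer stands in for whichever method the checker is configured with, so the scores differ from the illustrative values above.

```python
import re


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: word-set overlap between the two texts."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 1.0


def rank_against_corpus(text: str, corpus: list[str], scorer=jaccard) -> list[dict]:
    """Score the text against every corpus entry, best match first."""
    results = [
        {"index": i, "similarity": scorer(text, doc), "text": doc}
        for i, doc in enumerate(corpus)
    ]
    # Keep the original index so callers can map back to the corpus
    return sorted(results, key=lambda r: r["similarity"], reverse=True)


ranked = rank_against_corpus(
    "red apple pie",
    ["green apple pie", "blue car", "red apple pie recipe"],
)
for r in ranked:
    print(r["index"], f"{r['similarity']:.2f}")
```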
Similarity Matrix
documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]
matrix = checker.similarity_matrix(documents)
# Returns DataFrame:
# doc_0 doc_1 doc_2
# doc_0 1.000 0.750 0.320
# doc_1 0.750 1.000 0.410
# doc_2 0.320 0.410 1.000
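A pairwise matrix like the one above is symmetric with ones on the diagonal, so only the upper triangle needs to be computed. The sketch below returns a plain nested list instead of a DataFrame to stay dependency-free; `build_similarity_matrix` and the Jaccard stand-in scorer are hypothetical and not taken from the skill.

```python
import re


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: word-set overlap between the two texts."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 1.0


def build_similarity_matrix(documents: list[str], scorer=jaccard) -> list[list[float]]:
    """Fill an n x n matrix, scoring each unordered pair exactly once."""
    n = len(documents)
    matrix = [[1.0] * n for _ in range(n)]  # a document fully matches itself
    for i in range(n):
        for j in range(i + 1, n):
            score = scorer(documents[i], documents[j])
            matrix[i][j] = matrix[j][i] = score  # mirror across the diagonal
    return matrix


matrix = build_similarity_matrix(["red apple pie", "green apple pie", "blue car"])
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

Scoring each pair once halves the work, but the number of comparisons still grows quadratically with the number of documents.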
Find Duplicates
documents = [...] # List of texts
duplicates = checker.find_duplicates(documents, threshold=0.85)
# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
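Duplicate detection of this kind amounts to scoring every unordered pair and keeping those at or above the threshold. The following is a sketch under assumptions: `find_duplicate_pairs` is a hypothetical name, a Jaccard scorer stands in for the configured method, and whether the skill treats the threshold as inclusive is not stated in the document.

```python
import re
from itertools import combinations


def jaccard(text1: str, text2: str) -> float:
    """Stand-in scorer: word-set overlap between the two texts."""
    a = set(re.findall(r"[a-z0-9]+", text1.lower()))
    b = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return len(a & b) / len(a | b) if a | b else 1.0


def find_duplicate_pairs(documents: list[str], threshold: float = 0.8, scorer=jaccard) -> list[dict]:
    """Return every pair of documents whose score meets the threshold."""
    pairs = []
    for (i, doc_i), (j, doc_j) in combinations(enumerate(documents), 2):
        score = scorer(doc_i, doc_j)
        if score >= threshold:  # inclusive comparison is an assumption
            pairs.append({"doc1_index": i, "doc2_index": j, "similarity": score})
    return pairs


docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply",
]
print(find_duplicate_pairs(docs, threshold=0.8))
```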
Compare All Methods
Get similarity scores from all algorithms:
checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)
# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}
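The `average` value in the sample output is consistent with the unweighted arithmetic mean of the four method scores. A small sketch of that aggregation, assuming a plain mean rounded to two decimals (the `with_average` helper is hypothetical):

```python
from statistics import mean


def with_average(scores: dict[str, float]) -> dict[str, float]:
    """Return a copy of the scores with their unweighted mean added."""
    return {**scores, "average": round(mean(scores.values()), 2)}


combined = with_average({
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
})
print(combined["average"])  # 0.74
```

The methods measure different things, so the mean is best read as a rough summary rather than a calibrated score.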
Folder Operations
Compare All Files in Folder
checker = SimilarityChecker()
results = checker.compare_folder("./documents/")
# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}
Find Most Similar to Query
query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)
# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]
Output Format
Comparison Result
result = checker.compare_with_details(text1, text2)
# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}
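The word-overlap fields in this result can be derived from simple set operations. The sketch below assumes the counts are over distinct lowercase words; the document does not specify this, nor how the length and interpretation fields are computed, so those are left out. `overlap_stats` is a hypothetical helper.

```python
import re


def overlap_stats(text1: str, text2: str) -> dict[str, int]:
    """Count shared and exclusive distinct words (assumed definition)."""
    words1 = set(re.findall(r"[a-z0-9]+", text1.lower()))
    words2 = set(re.findall(r"[a-z0-9]+", text2.lower()))
    return {
        "common_words": len(words1 & words2),        # in both texts
        "unique_words_text1": len(words1 - words2),  # only in text1
        "unique_words_text2": len(words2 - words1),  # only in text2
    }


stats = overlap_stats("red apple pie", "green apple pie recipe")
print(stats)
```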
Example Workflows
Plagiarism Check
checker = SimilarityChecker()
# Read the submission, closing the file when done
with open("student_paper.txt") as f:
    submission = f.read()
# Rank the source materials by similarity to the submission
results = checker.find_most_similar(submission, "./source_materials/", top_n=10)
suspicious = [r for r in results if r["similarity"] > 0.6]
if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for r in suspicious:
        print(f"  {r['file']}: {r['similarity']:.0%}")
Document Deduplication
from pathlib import Path
checker = SimilarityChecker()
# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()
# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)
print(f"Found {len(duplicates)} duplicate pairs")
Content Matching
checker = SimilarityChecker()
query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)
print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")
Dependencies
- scikit-learn>=1.3.0
- nltk>=3.8.0
- numpy>=1.24.0
- pandas>=2.0.0
