data-anonymizer
About
This Claude skill detects and masks PII like names, emails, and phone numbers in both text strings and CSV files. It offers multiple anonymization strategies, including reversible tokenization for data mapping. Use it to quickly sanitize sensitive data in documents or structured datasets for development or testing.
Quick Install
Claude Code
Recommended (copy and paste this command in Claude Code to install this skill):
/plugin add https://github.com/majiayu000/claude-skill-registry
Alternative:
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/data-anonymizer
Documentation
Data Anonymizer
Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.
Quick Start
from scripts.data_anonymizer import DataAnonymizer
# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at john.smith@example.com or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"
# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")
Features
- PII Detection: Names, emails, phones, SSN, addresses, credit cards, dates
- Multiple Strategies: Mask, redact, hash, fake data replacement
- CSV Processing: Anonymize specific columns or auto-detect
- Reversible Tokens: Optional mapping for de-anonymization
- Custom Patterns: Add your own PII patterns
- Audit Report: List all detected PII with locations
API Reference
Initialization
anonymizer = DataAnonymizer(
    strategy="mask",   # mask, redact, hash, fake
    reversible=False   # Enable token mapping
)
Text Anonymization
# Basic anonymization
result = anonymizer.anonymize(text)
# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])
# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
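The structure of the report returned with `return_report=True` is not described here. As a rough sketch of what a detection report with locations could contain, the standalone example below uses two simplified regexes and records each match's type, value, and character span. The names `PATTERNS` and `detect_pii` and both patterns are assumptions for illustration, not the skill's actual implementation.

```python
import re

# Simplified, illustrative patterns (assumptions); the skill's real patterns are not shown here.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def detect_pii(text, pii_types=None):
    """Return a list of findings sorted by their position in the text."""
    findings = []
    for pii_type, pattern in PATTERNS.items():
        # Skip types the caller did not ask for.
        if pii_types is not None and pii_type not in pii_types:
            continue
        for match in pattern.finditer(text):
            findings.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(findings, key=lambda f: f["start"])

report = detect_pii("Mail ann@example.com or call 555-123-4567")
for finding in report:
    print(finding)
```

Passing `pii_types=["email"]` to this hypothetical helper restricts the scan to email addresses, mirroring the `pii_types` argument shown above.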
Masking Strategies
text = "Email joe@example.com, call 555-1234"
# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
print(anonymizer.anonymize(text))
# "Email [EMAIL], call [PHONE]"
# Redact - replace with asterisks
anonymizer.strategy = "redact"
print(anonymizer.anonymize(text))
# "Email ***************, call ********"
# Hash - replace with a hash (illustrative hex digits)
anonymizer.strategy = "hash"
print(anonymizer.anonymize(text))
# "Email a1b2c3d4, call e5f6a7b8"
# Fake - replace with realistic fake data (illustrative values)
anonymizer.strategy = "fake"
print(anonymizer.anonymize(text))
# "Email kim@example.org, call 555-9876"
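To make the difference between the strategies concrete, here is a minimal standalone sketch of how three of them could transform a single detected value. The function names, the `STRATEGIES` table, and the choice of a truncated SHA-256 digest are assumptions for illustration; the skill's own hashing scheme is not documented here.

```python
import hashlib

def mask(value, pii_type):
    # Replace the value with a type label such as [PHONE].
    return f"[{pii_type.upper()}]"

def redact(value, pii_type):
    # Replace every character with an asterisk, keeping the length.
    return "*" * len(value)

def hash_value(value, pii_type, length=8):
    # Truncated SHA-256 hex digest (an assumed scheme); deterministic for a given input.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]

# Hypothetical dispatch table from strategy name to function.
STRATEGIES = {"mask": mask, "redact": redact, "hash": hash_value}

for name, func in STRATEGIES.items():
    print(name, "->", func("555-1234", "phone"))
```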
CSV Processing
# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")
# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)
# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
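The per-column idea can be sketched without the skill itself. The example below is an illustration under stated assumptions: it uses the standard-library `csv` module rather than pandas so that it stays self-contained, and the helpers `redact`, `short_hash`, and `anonymize_rows` are hypothetical names, not part of the skill's API.

```python
import csv
import hashlib
import io

def redact(value):
    # Same-length run of asterisks.
    return "*" * len(value)

def short_hash(value):
    # Truncated SHA-256 hex digest (an assumed scheme).
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]

def anonymize_rows(reader, column_strategies):
    """Yield rows with the configured columns transformed; other columns pass through."""
    for row in reader:
        for column, func in column_strategies.items():
            if column in row and row[column]:
                row[column] = func(row[column])
        yield row

# In-memory stand-ins for input.csv and output.csv.
source = io.StringIO("id,email,ssn\n1,ann@example.com,123-45-6789\n")
reader = csv.DictReader(source)
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(anonymize_rows(reader, {"email": short_hash, "ssn": redact}))
print(output.getvalue())
```

Columns not named in the mapping, such as `id` here, are written unchanged.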
Reversible Anonymization
anonymizer = DataAnonymizer(reversible=True)
# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john.smith@example.com")
mapping = anonymizer.get_mapping()
# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")
# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
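How the skill builds its token mapping is not shown. One common way to make anonymization reversible is to replace each distinct value with a numbered token and keep a lookup table, sketched below for email addresses only. `TokenMapper`, the `<EMAIL_n>` token format, and the simplified regex are all assumptions for illustration.

```python
import re

# Simplified email pattern (an assumption, not the skill's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class TokenMapper:
    """Hypothetical helper: swaps emails for numbered tokens and remembers the originals."""

    def __init__(self):
        self.token_to_value = {}
        self.value_to_token = {}

    def anonymize(self, text):
        def replace(match):
            value = match.group()
            # Reuse the token if this value has been seen before.
            if value not in self.value_to_token:
                token = f"<EMAIL_{len(self.value_to_token) + 1}>"
                self.value_to_token[value] = token
                self.token_to_value[token] = value
            return self.value_to_token[value]
        return EMAIL_RE.sub(replace, text)

    def deanonymize(self, text):
        for token, value in self.token_to_value.items():
            text = text.replace(token, value)
        return text

mapper = TokenMapper()
original = "Write to ann@example.com and bob@example.org, then ann@example.com again."
masked = mapper.anonymize(original)
print(masked)
print(mapper.deanonymize(masked) == original)
```

Because the lookup table contains the original values, it is as sensitive as the source data, which is why the mapping file above is saved with encryption.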
Custom Patterns
# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
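Before registering a custom pattern, it can help to try the regex on sample text with the standard `re` module. The sketch below applies the same pattern and label directly; `mask_employee_ids` is a hypothetical helper and says nothing about how `add_pattern` works internally.

```python
import re

# Same pattern and label as in the add_pattern call above.
EMPLOYEE_ID = re.compile(r"EMP-\d{6}")

def mask_employee_ids(text, label="[EMPLOYEE_ID]"):
    # Replace every match of the pattern with the label.
    return EMPLOYEE_ID.sub(label, text)

print(mask_employee_ids("Badge EMP-004217 was issued; EMP-42 is not a valid ID."))
```

Without a trailing word boundary, this pattern also matches the first six digits of a longer number, so `EMP-1234567` becomes `[EMPLOYEE_ID]7`. Using `r"EMP-\d{6}\b"` avoids that if IDs are always exactly six digits.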
CLI Usage
# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt
# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv
# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake
# Generate audit report
python data_anonymizer.py --input document.txt --output document_anon.txt --report audit.json
# Specific PII types only
python data_anonymizer.py --input doc.txt --output doc_anon.txt --types email phone ssn
CLI Arguments
| Argument | Description | Default |
|---|---|---|
| --input | Input file | Required |
| --output | Output file | Required |
| --strategy | Masking strategy | mask |
| --types | PII types to detect | all |
| --columns | CSV columns to process | auto |
| --report | Generate audit report | - |
| --reversible | Enable token mapping | False |
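For readers who want to wrap the library in their own script, the table above maps naturally onto `argparse`. The following is a sketch of such a parser, not the actual parser in `data_anonymizer.py`, which is not shown here. Treating `--types` and `--columns` as space-separated lists and `--report` as a file path are assumptions based on the usage examples.

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented arguments and defaults.
    parser = argparse.ArgumentParser(prog="data_anonymizer.py")
    parser.add_argument("--input", required=True, help="Input file")
    parser.add_argument("--output", required=True, help="Output file")
    parser.add_argument("--strategy", default="mask",
                        choices=["mask", "redact", "hash", "fake"],
                        help="Masking strategy")
    parser.add_argument("--types", nargs="+", default=None,
                        help="PII types to detect (default: all)")
    parser.add_argument("--columns", nargs="+", default=None,
                        help="CSV columns to process (default: auto-detect)")
    parser.add_argument("--report", default=None, help="Path for the audit report")
    parser.add_argument("--reversible", action="store_true",
                        help="Enable token mapping")
    return parser

args = build_parser().parse_args(
    ["--input", "data.csv", "--output", "anon.csv", "--types", "email", "phone"]
)
print(args.strategy, args.types, args.reversible)
```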
Supported PII Types
| Type | Examples | Pattern |
|---|---|---|
| name | John Smith, Mary Johnson | NLP-based |
| email | user@example.com | Regex |
| phone | 555-123-4567, (555) 123-4567 | Regex |
| ssn | 123-45-6789 | Regex |
| credit_card | 4111-1111-1111-1111 | Regex + Luhn |
| address | 123 Main St, City, ST 12345 | NLP + Regex |
| date_of_birth | 01/15/1990, January 15, 1990 | Regex |
| ip_address | 192.168.1.1 | Regex |
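The "Regex + Luhn" entry for `credit_card` means a candidate number found by a regex is additionally checked against the Luhn checksum, which filters out most random 16-digit strings. A minimal sketch of that checksum follows; `luhn_valid` is a hypothetical name, and how the skill combines it with its regex is not shown in this document.

```python
import re

def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in re.sub(r"\D", "", number)]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111-1111-1111-1111"))  # True: the example number from the table
print(luhn_valid("4111-1111-1111-1112"))  # False: last digit changed
```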
Examples
Anonymize Customer Support Logs
anonymizer = DataAnonymizer(strategy="mask")
log = """
Ticket #1234: Customer John Doe (john.doe@example.com) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""
result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].
GDPR Compliance for Database Export
anonymizer = DataAnonymizer(strategy="hash")
# Consistent hashing for joins
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)
anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
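Joins across anonymized files only work if the same input always produces the same hash. One way to get that property, sketched below, is a keyed HMAC-SHA256 over a normalized value. The key, the `pseudonymize` helper, and the lowercase and trim normalization are assumptions; this document does not say whether the skill normalizes values or uses a key.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored outside source control.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(value, key=SECRET_KEY, length=16):
    """Keyed, deterministic hash: equal inputs give equal outputs for the same key."""
    normalized = value.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

users_email = pseudonymize("Ann@Example.com")    # value as it appears in users.csv
orders_email = pseudonymize("ann@example.com ")  # same address in orders.csv
print(users_email == orders_email)
```

A keyed hash is harder to reverse than a plain hash of a guessable value such as an email address or phone number, because an attacker without the key cannot precompute candidate hashes.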
Generate Test Data from Production
anonymizer = DataAnonymizer(strategy="fake")
# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)
# Test data has same structure but fake PII
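Test data is often easier to work with when the same real value always maps to the same fake value, so that relationships between rows survive. The sketch below shows one way to do that by seeding a random generator from a hash of the real value. It uses only the standard library and tiny hard-coded name lists for illustration; `fake_name` is a hypothetical helper and is not how the skill's `fake` strategy is known to work.

```python
import hashlib
import random

# Tiny illustrative pools; a real generator would draw from a much larger set.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Reed", "Morgan", "Lee", "Casey"]

def fake_name(real_name):
    """Return a fake name that is stable for a given real name."""
    digest = hashlib.sha256(real_name.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # private generator, does not touch global random state
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

print(fake_name("John Doe") == fake_name("John Doe"))
```

With pools this small, different real names will often collide on the same fake name; larger pools reduce that.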
Dependencies
pandas>=2.0.0
faker>=18.0.0
Limitations
- Name detection may miss unusual names
- Address detection works best for US formats
- Custom patterns may be needed for domain-specific PII
- Fake data replacement doesn't preserve exact format
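Where the exact format matters, for example when downstream code validates the shape of a phone number, one workaround is a character-class-preserving replacement applied as a custom step. The sketch below is an assumption-labeled illustration, not part of the skill: `fake_preserving_format` is a hypothetical helper that swaps digits for digits and ASCII letters for letters of the same case, leaving separators in place.

```python
import random

def fake_preserving_format(value, rng=None):
    """Swap each digit for a random digit and each ASCII letter for a random letter of the same case."""
    rng = rng or random.Random()
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice("0123456789"))
        elif ch.isascii() and ch.isalpha():
            alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if ch.isupper() else "abcdefghijklmnopqrstuvwxyz"
            out.append(rng.choice(alphabet))
        else:
            # Separators, spaces, and other characters are kept as-is.
            out.append(ch)
    return "".join(out)

print(fake_preserving_format("555-867-5309", random.Random(0)))
```

The result keeps the length and separator positions of the input but does not guarantee a semantically valid value, such as a real area code or a Luhn-valid card number.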
