merge-duplicate-companies
About
This skill identifies duplicate company records in HubSpot by domain and name, then exports audit CSVs to guide the merging process. It uses the HubSpot API for discovery but requires third-party tools or manual UI merges since HubSpot lacks a bulk merge API. Use it for database hygiene to consolidate fragmented contact and deal data.
Quick Install
Claude Code
Run one of the following commands to install the skill:
- Recommended: npx skills add TomGranot/hubspot-admin-skills -a claude-code
- Alternatively: /plugin add https://github.com/TomGranot/hubspot-admin-skills
- Alternatively: git clone https://github.com/TomGranot/hubspot-admin-skills.git ~/.claude/skills/merge-duplicate-companies
Skill Documentation
Merge Duplicate Companies
Purpose
Duplicate company records fragment contacts, deals, and engagement history across multiple records for the same real-world company. This leads to inaccurate reporting, broken associations, sales confusion, and workflow failures. This skill identifies duplicates by domain and by name, exports prioritized audit CSVs, and guides the user through merging.
Prerequisites
- A HubSpot private app access token with `crm.objects.companies.read` scope
- Python 3.10+ with `uv` for package management
- A `.env` file containing `HUBSPOT_ACCESS_TOKEN`
- Super Admin permissions for merging in the HubSpot UI
Key Constraint
HubSpot has no bulk merge API. Merging must happen one pair at a time through the HubSpot UI or via third-party tools. The API is used for discovery, analysis, and audit trail generation.
HubSpot's built-in Duplicates tool is NOT available on all plan tiers. Check whether the account has access to Settings > Data Management > Duplicates before relying on it.
Execution Pattern
This skill follows a 4-stage execution pattern: Plan -> Before State -> Execute -> After State.
Stage 1: Plan
Before writing any code, confirm with the user:
- Confirm intentional duplicates: Ask whether separate records for regional offices of the same company are intentional. If so, exclude those from merging.
- Merging is irreversible. Once two company records are merged, they cannot be un-merged. The surviving record inherits all associations, but property values from the deleted record may be lost if both have the same property filled in.
- Prioritization strategy: Recommend merging Customer-stage companies first, then Opportunity-stage, then everything else.
- Time estimate: This is the most time-consuming process. Budget 2-4 hours for critical duplicates, 8-12 hours total for full cleanup.
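If the user confirms that some records are intentionally separate (regional offices, for example), those records can be filtered out of the duplicate groups before any CSV is reviewed. The following is a minimal sketch, assuming the groups have the same shape as `dup_domain_groups` in the Before State script (a dict mapping a key to a list of company dicts with an `id` field). The function name `exclude_intentional` and the `keep_separate_ids` argument are hypothetical.

```python
def exclude_intentional(dup_groups, keep_separate_ids):
    """Drop records the user wants kept separate.

    Groups left with fewer than two records are no longer duplicates,
    so they are removed entirely.
    """
    keep_separate = {str(i) for i in keep_separate_ids}
    filtered = {}
    for key, companies in dup_groups.items():
        remaining = [c for c in companies if str(c["id"]) not in keep_separate]
        if len(remaining) > 1:
            filtered[key] = remaining
    return filtered
```

The IDs to exclude would come from the user's answer during planning; nothing in this sketch decides on its own which records are intentional.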
Stage 2: Before State
Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.
```python
"""
Before State: Identify duplicate companies by domain and by name.
Creates CSV audit logs for review before merging.
"""
import os
import csv
import time
import requests
from collections import defaultdict
from dotenv import load_dotenv

load_dotenv()
TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
BASE = "https://api.hubapi.com"
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# --- Step 1: Fetch all companies ---
print("Fetching all companies...")
all_companies = []
after = None
while True:
    params = {
        "limit": 100,
        "properties": "name,domain,lifecyclestage,num_associated_contacts,"
                      "num_associated_deals,hubspot_owner_id,createdate",
    }
    if after:
        params["after"] = after
    resp = requests.get(
        f"{BASE}/crm/v3/objects/companies",
        headers=headers, params=params, timeout=30,
    )
    if resp.status_code != 200:
        # The fetch is incomplete at this point; treat the counts below as partial.
        print(f"Stopped at {len(all_companies)} (status {resp.status_code})")
        break
    data = resp.json()
    for company in data.get("results", []):
        props = company.get("properties", {})
        # HubSpot returns null for unset properties, so fall back explicitly.
        all_companies.append({
            "id": company["id"],
            "name": (props.get("name") or "").strip(),
            "domain": (props.get("domain") or "").strip().lower(),
            "lifecycle_stage": props.get("lifecyclestage") or "",
            "associated_contacts": props.get("num_associated_contacts") or "0",
            "associated_deals": props.get("num_associated_deals") or "0",
            "owner_id": props.get("hubspot_owner_id") or "",
            "createdate": props.get("createdate") or "",
        })
    paging = data.get("paging", {})
    after = paging.get("next", {}).get("after")
    if not after:
        break
    time.sleep(0.05)
print(f"Total companies fetched: {len(all_companies)}")

# --- Step 2: Find duplicates by domain ---
print("\nAnalyzing duplicates by domain...")
domain_groups = defaultdict(list)
for c in all_companies:
    if c["domain"]:
        domain_groups[c["domain"]].append(c)
dup_domain_groups = {d: cs for d, cs in domain_groups.items() if len(cs) > 1}
dup_domain_records = sum(len(cs) for cs in dup_domain_groups.values())
print(f"Unique domains with duplicates: {len(dup_domain_groups)}")
print(f"Total records in duplicate domain groups: {dup_domain_records}")

# Top offenders
sorted_domains = sorted(dup_domain_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate domains:")
for domain, companies in sorted_domains[:15]:
    print(f"  {domain}: {len(companies)} records")

# --- Step 3: Find duplicates by name ---
print("\nAnalyzing duplicates by name...")
name_groups = defaultdict(list)
for c in all_companies:
    if c["name"]:
        name_groups[c["name"].lower()].append(c)
dup_name_groups = {n: cs for n, cs in name_groups.items() if len(cs) > 1}
dup_name_records = sum(len(cs) for cs in dup_name_groups.values())
print(f"Unique names with duplicates: {len(dup_name_groups)}")
print(f"Total records in duplicate name groups: {dup_name_records}")
sorted_names = sorted(dup_name_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate names:")
for name_lower, companies in sorted_names[:15]:
    print(f"  {companies[0]['name']}: {len(companies)} records")

# --- Step 4: Save CSV audit logs ---
os.makedirs("data/audit-logs", exist_ok=True)

# Domain duplicates CSV
domain_csv = "data/audit-logs/duplicate-companies-by-domain.csv"
with open(domain_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "domain", "duplicate_count", "id", "name", "lifecycle_stage",
        "associated_contacts", "associated_deals", "owner_id", "createdate",
    ])
    writer.writeheader()
    for domain, companies in sorted_domains:
        for c in companies:
            writer.writerow({
                "domain": domain,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "lifecycle_stage", "associated_contacts",
                    "associated_deals", "owner_id", "createdate",
                ]},
            })
print(f"\nDomain duplicates CSV: {domain_csv}")

# Name duplicates CSV
name_csv = "data/audit-logs/duplicate-companies-by-name.csv"
with open(name_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "duplicate_name", "duplicate_count", "id", "name", "domain",
        "lifecycle_stage", "associated_contacts", "associated_deals",
        "owner_id", "createdate",
    ])
    writer.writeheader()
    for name_lower, companies in sorted_names:
        for c in companies:
            writer.writerow({
                "duplicate_name": name_lower,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "domain", "lifecycle_stage",
                    "associated_contacts", "associated_deals",
                    "owner_id", "createdate",
                ]},
            })
print(f"Name duplicates CSV: {name_csv}")
```
Present findings to the user. Key data points:
- Total duplicate domain groups and affected records
- Total duplicate name groups and affected records
- Top offenders by domain and name
- CSVs for manual review
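The fetch loop in the Before State script stops on any non-200 response, so a single rate-limit reply (HTTP 429) on a large portal would leave the audit incomplete. The following is a minimal sketch of a retry wrapper, not part of the original skill. The name `fetch_with_backoff` is hypothetical, and reading a `Retry-After` header as a number of seconds is an assumption; when the header is absent the sketch falls back to exponential backoff.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out.

    fetch is any zero-argument callable that returns an object with
    .status_code and .headers (a requests.Response, for example).
    """
    resp = fetch()
    for attempt in range(max_retries):
        if resp.status_code != 429:
            break
        # Assumption: Retry-After, when present, is a number of seconds.
        # Otherwise back off exponentially: 1s, 2s, 4s, ...
        hint = resp.headers.get("Retry-After")
        delay = float(hint) if hint else base_delay * (2 ** attempt)
        sleep(delay)
        resp = fetch()
    return resp
```

In the script, the `requests.get(...)` call could be wrapped as `fetch_with_backoff(lambda: requests.get(url, headers=headers, params=params))`. Taking a callable keeps the helper independent of the HTTP library and easy to exercise without network access.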
Stage 3: Execute
This stage is primarily manual. Guide the user through the merging process.
Option A: HubSpot Built-In Duplicates Tool (if available)
- Navigate to Settings > Data Management > Duplicates > Companies
- HubSpot shows suggested duplicate pairs ranked by confidence
- For each pair, click Review to see side-by-side comparison
- Select the "primary" (surviving) record based on:
  - More associated contacts
  - More associated deals
  - More recent activity
  - Has a company owner
  - More complete property data
- Click Merge
- Process ~50 pairs at a time; HubSpot loads the next batch automatically
Prioritization order:
- Customer-stage company duplicates (highest value data)
- Opportunity-stage company duplicates
- Everything else (Leads, Subscribers)
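The prioritization order above can also be applied to the duplicate groups from Stage 2 so the review starts with the highest-value records. The following is a minimal sketch, assuming each company dict carries the `lifecycle_stage` field from the Before State script and that the portal uses HubSpot's default internal stage values `customer` and `opportunity`. Custom stages fall into the lowest tier. The names `STAGE_RANK`, `group_priority`, and `prioritize_groups` are hypothetical.

```python
# Lower rank = merge sooner. Any stage not listed here shares the last tier.
STAGE_RANK = {"customer": 0, "opportunity": 1}
DEFAULT_RANK = 2


def group_priority(companies):
    """Rank a duplicate group by its highest-value member."""
    return min(
        STAGE_RANK.get((c.get("lifecycle_stage") or "").lower(), DEFAULT_RANK)
        for c in companies
    )


def prioritize_groups(dup_groups):
    """Return (key, companies) pairs: best stage first, then larger groups."""
    return sorted(
        dup_groups.items(),
        key=lambda kv: (group_priority(kv[1]), -len(kv[1])),
    )
```

A group counts as Customer-stage if any one of its records is a customer, since merging it affects that customer's data regardless of which record survives.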
Option B: Manual search-and-merge for top offenders
For companies with many duplicates (4+ records):
- Search for the company by name in Contacts > Companies
- Identify the "winner" record (most associations, deals, activity)
- Open the winner record > Actions > Merge
- Search for the duplicate > select it > choose property values > Merge
- Repeat until only one record remains
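Picking the "winner" in a large group is easier with a suggested candidate to start from. The following is a minimal sketch, assuming company dicts shaped like those in the Before State script. It ranks by associated contacts, then associated deals, then whether an owner is set, and breaks ties in favor of the oldest record. That ordering and the tie-break are assumptions, and recent activity is not considered because the script does not fetch it. The function names are hypothetical, and the result is only a suggestion for the human reviewer.

```python
def _to_int(value):
    """HubSpot returns numeric properties as strings, or None when unset."""
    try:
        return int(value or 0)
    except (TypeError, ValueError):
        return 0


def suggest_primary(companies):
    """Suggest a surviving record for one duplicate group."""
    def score(c):
        return (
            _to_int(c.get("associated_contacts")),
            _to_int(c.get("associated_deals")),
            1 if c.get("owner_id") else 0,
        )

    # Sort oldest first so max() returns the oldest record on a tie.
    # Assumption: createdate values are ISO-8601 strings in one format,
    # so string order matches chronological order.
    oldest_first = sorted(companies, key=lambda c: c.get("createdate") or "")
    return max(oldest_first, key=score)
```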
Option C: Third-party deduplication tools
For large-scale merging, recommend:
- Dedupely (dedupely.com) -- HubSpot-native integration, bulk merge
- Insycle (insycle.com) -- Data management platform with dedup
- Koalify (koalify.com) -- HubSpot duplicate management
These tools can automate bulk merges that would take hours manually.
Prevention: Configure auto-association after merging
- Navigate to Settings > Data Management > Companies (or Settings > Objects > Companies)
- Enable: "Create and associate companies with contacts"
- Set unique identifier: Company domain name
This prevents future duplicates by using domain-based matching instead of name-based.
Stage 4: After State
Re-run the Before State analysis and compare duplicate counts.
```python
"""
After State: Verify duplicate reduction.
"""
# Re-fetch all companies and re-run duplicate analysis
# Compare:
# - Number of duplicate domain groups (should decrease)
# - Number of duplicate name groups (should decrease)
# - Top offenders (should be resolved)

# Also verify merged records:
# For each known duplicate that was merged, search for the company
# and confirm only one record exists with all expected associations.
```
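The comparison described above can be done directly on the audit CSVs. The following is a minimal sketch, assuming the Stage 2 CSV was copied aside before re-running the analysis (the script overwrites its output files). The function names and any file paths passed in are hypothetical. The `key_column` is `domain` for the domain CSV and `duplicate_name` for the name CSV.

```python
import csv


def count_groups(csv_path, key_column):
    """Count distinct duplicate groups and total rows in an audit CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        keys = [row[key_column] for row in csv.DictReader(f)]
    return {"groups": len(set(keys)), "records": len(keys)}


def compare_runs(before_csv, after_csv, key_column):
    """Report before/after counts; a negative change means fewer duplicates."""
    before = count_groups(before_csv, key_column)
    after = count_groups(after_csv, key_column)
    return {
        name: {
            "before": before[name],
            "after": after[name],
            "change": after[name] - before[name],
        }
        for name in before
    }
```

A change of zero after a merging session most likely means the "after" CSV was not regenerated.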
Manual verification:
- Search for top offenders by name (should show only 1 record each)
- Open merged records and verify contacts and deals from both originals appear
- Check Settings > Data Management > Duplicates -- count should be significantly lower
Safety Mechanisms
| Mechanism | Detail |
|---|---|
| CSV audit trail | Complete export of all companies with duplicate group annotations before any merging. |
| Prioritized approach | Customer and Opportunity companies merged first to protect highest-value data. |
| Review before merge | CSVs enable team review before any irreversible merges happen. |
| Confirmation prompt | Present duplicate analysis to the user and wait for explicit confirmation before instructing merges. |
| No auto-merge | This skill never merges automatically. All merges require manual human decision. |
Technical Gotchas
- HubSpot has no bulk merge API. The CRM API offers only a single-pair merge endpoint (POST /crm/v3/objects/companies/merge), which this skill deliberately does not call. All merges happen through the UI or third-party tools.
- Merging is irreversible. Once merged, records cannot be split apart. When in doubt, skip a pair and revisit later.
- Property conflicts: When both records have a value for the same property, HubSpot keeps the value from the "primary" record. Review important properties (phone, address, industry) before confirming.
- Companies endpoint uses GET, not POST/search. To list all companies, use `GET /crm/v3/objects/companies` with pagination, not the Search API. The Search API works too but is slower for full exports and returns at most 10,000 results per query.
- Domain normalization: Always lowercase and strip whitespace from domains before grouping. `Example.com` and `example.com` are the same company.
- Name-based duplicates have higher false-positive rates. "State University" might match multiple genuinely different institutions. Domain-based duplicates are more reliable.
- Contact reassociation: After merging, verify that contacts from both original records appear under the surviving record. HubSpot should handle this automatically, but spot-check.
- The Duplicates tool is plan-tier dependent. Not all HubSpot plans include it. Check availability before instructing the user to navigate there.
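The domain normalization gotcha can be taken a step further than lowercasing and trimming. The following is a minimal sketch, assuming the `domain` property sometimes holds a full URL or a `www.` prefix. Treating `www.example.com` and `example.com` as the same company is an assumption that should be confirmed with the user, and the function name `normalize_domain` is hypothetical.

```python
from urllib.parse import urlsplit


def normalize_domain(raw):
    """Reduce a domain-like string to a lowercase bare hostname."""
    value = (raw or "").strip().lower()
    if not value:
        return ""
    # urlsplit only fills .hostname when a scheme or "//" prefix is present.
    if "//" not in value:
        value = "//" + value
    try:
        host = urlsplit(value).hostname or ""
    except ValueError:
        # Malformed input (for example unbalanced brackets): leave it ungrouped.
        return ""
    # Assumption: a leading "www." does not distinguish companies.
    if host.startswith("www."):
        host = host[4:]
    return host.rstrip(".")
```

For example, `normalize_domain("https://www.Example.com/about")` and `normalize_domain("  Example.com ")` both return `example.com`, so the two records would land in the same duplicate group.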
Package Setup
```bash
uv init hubspot-cleanup
cd hubspot-cleanup
uv add requests python-dotenv
```
Create a .env file:
```
HUBSPOT_ACCESS_TOKEN=pat-na1-xxxxxxxx
```
