How a Q&A Agent Uses Hierarchical Clustering to Route Customer Questions Without Retraining
1. Background
Recently, we needed to develop a B-end Agent Q&A system. When working on user question identification, we faced the following problem:
The phrasing of each user question varies greatly, but the problems they want to solve can often be summarized into a limited number of categories.
For example:
| User's Original Question | Semantic Category |
|---|---|
| "How do I apply for a refund?" | → Refund Process |
| "I don't want it anymore, can I get my money back?" | → Refund Process |
| "How long does it take to get money back after canceling an order?" | → Refund Process |
| "How do I exchange a broken item for a new one?" | → After-Sales Repair/Exchange |
| "The received product has quality issues" | → After-Sales Repair/Exchange |
| "How long is the warranty period?" | → After-Sales Repair/Exchange |
| "How many days does shipping usually take?" | → Logistics Inquiry |
| "Where is my package?" | → Logistics Inquiry |
| "Why haven't I received the goods yet?" | → Logistics Inquiry |
As you can see, on the surface: the 8 questions are phrased completely differently, with no duplicates. In essence: they belong to only 3 topic categories (Refund, After-Sales, Logistics).
If a system can automatically complete this "User Question → Semantic Cluster" mapping, it can bring huge efficiency gains, such as:
- Intelligent Routing: When a user asks "Can I get my money back?", the Agent directly identifies it as belonging to the 'Refund Process' cluster and can call the corresponding processing capability without multiple rounds of follow-up questions.
- Batch Processing: Questions within the same cluster can reuse the same solutions or templates, reducing repetitive work.
- Cold-Start Knowledge Base Construction: Automatically cluster high-frequency question groups from historical conversations, quickly consolidating them into FAQs or standard Q&A pairs.
- New Problem Discovery: When a user's question cannot match any existing cluster, it indicates a new type of problem has been encountered, which can trigger manual intervention or the creation of a new processing flow.
To implement this process, hierarchical clustering is needed.
2. What is Hierarchical Clustering
Hierarchical Clustering is a method for grouping data by building a 'hierarchical tree structure of clusters'. It does not require pre-specifying the number of clusters but generates a Dendrogram, which can be cut at different levels to obtain grouping results of varying granularity.
Simply put, it first "cleans and chops" historical text, converts it into digital fingerprints, then merges them layer by layer according to similarity into several semantic groups and labels them with keywords. Afterwards, when a new question comes in, it just needs to be scored against the "representatives" of each group—if similar enough, it is assigned to that group and directly enters the corresponding processing flow.
Comparison with Other Text Clustering Methods
| Method | Advantages | Disadvantages | Applicable Scenarios |
|---|---|---|---|
| K-Means | Fast, simple to implement | Requires pre-specifying K value; sensitive to initial centers; assumes clusters are convex | Known number of clusters, uniform distribution |
| DBSCAN | No need to specify cluster count; can identify noise | Poor performance on high-dimensional data; parameter sensitive | Spatial data, uneven density |
| Hierarchical Clustering | No need to pre-set cluster count; generates tree structure; highly interpretable results | Higher time complexity O(n²) | Small to medium text sets; need to understand hierarchical relationships |
| LDA Topic Model | Good interpretability; outputs topic-word distribution | Requires pre-setting topic count; high computational cost | Long text topic discovery |
Reasons for Choosing Hierarchical Clustering
- No need to pre-specify the number of clusters: The true number of groups in text data is usually unknown. Hierarchical clustering allows cutting at different granularities via the Dendrogram.
- Results have a hierarchical structure: You can intuitively see which texts are "closer" and which groups are farther apart.
- Friendly to text data: Combined with TF-IDF vectorization and cosine similarity, it can effectively capture text semantics.
- Ward's linkage method: Minimizes within-cluster variance, tending to produce compact clusters of similar sizes.
3. Solution Design
3.1 Overall Architecture
The entire system consists of the following core modules:
┌─────────────────────────────────────────────────────────┐
│ Text Clustering System │
├─────────────┬─────────────┬─────────────┬───────────────┤
│ Text Preprocessing │ Vectorization │ Hierarchical Clustering │ Incremental Prediction │
│ │ │ │ │
│ · Cleaning │ · TF-IDF │ · Ward Linkage │ · Similarity Matching │
│ · Tokenization │ Feature Extraction │ · Agglomerative Merging │ · Threshold Judgment │
│ · Stop Word Filtering │ │ · UUID Identification │ · Cluster Center Update│
└─────────────┴─────────────┴─────────────┴───────────────┘
│
┌────────┴────────┐
│ Auxiliary Capabilities │
├─────────────────┤
│ · Keyword Extraction │
│ · Cluster Info Query │
│ · Result Visualization │
└─────────────────┘
3.2 Core Process
Phase 1: Initial Clustering (fit)
Raw Text → Preprocessing → TF-IDF Vectorization → Hierarchical Clustering → UUID Identification → Calculate Cluster Centers → Extract Keywords
Detailed Steps:
Text Preprocessing
- Remove special characters and punctuation.
- Use a tokenization tool (like jieba) for Chinese word segmentation.
- Filter out stop words and meaningless short words.
TF-IDF Vectorization
- Convert the preprocessed text into TF-IDF feature vectors.
- TF-IDF effectively reflects the importance of words in a document, reducing the weight of high-frequency common words.
Hierarchical Clustering
- Use
AgglomerativeClusteringwith the Ward linkage method. - Ward's method merges clusters by minimizing the within-cluster variance, resulting in more compact clusters.
- The input parameter
n_clusterscontrols the number of clusters for the final cut.
- Use
UUID Identification Mechanism
- The numeric indices (0, 1, 2...) output by the clustering algorithm are mapped to globally unique UUIDs.
- The benefit: avoids ambiguity of numeric indices and supports unique identification in distributed scenarios.
Cluster Center Calculation
- Average all TF-IDF vectors within each cluster to obtain the cluster's center vector.
- The center vector represents the "typical semantic direction" of that cluster.
Keyword Extraction
- Use the TextRank algorithm to extract Top-K keywords from the aggregated text of each cluster.
- Keywords are used to intuitively understand and label the semantic theme of each cluster.
Phase 2: Incremental Prediction (predict)
When new text arrives, the system does not need to re-cluster all texts but uses an incremental approach:
New Text → Preprocessing → TF-IDF Vectorization → Calculate Cosine Similarity with Cluster Centers
│
┌─────────┴──────────┐
│ max_sim ≥ threshold? │
├─────────┬──────────┤
│ Yes │ No │
│ Assign to existing cluster │ Create new cluster │
│ Update center │ Assign UUID │
└─────────┴──────────┘
Key Design:
- Cosine Similarity as Metric: Measures the semantic closeness between the new text vector and each cluster center.
- Threshold Control: When the maximum similarity is below a set threshold (e.g., 0.3), the new text is considered not to belong to any existing cluster, and a new cluster is automatically created.
- Online Cluster Center Update: When new text is assigned to a cluster, the cluster's center vector is updated using a moving average, allowing the center to gradually "drift" to adapt to new data.
4. Core Code Analysis
4.1 Data Structure and Initialization
class TextClusteringSystem:
def __init__(self, stopwords_path=None, n_clusters=5):
self.stopwords = self._load_stopwords(stopwords_path)
self.vectorizer = TfidfVectorizer() # TF-IDF vectorizer
self.cluster_model = AgglomerativeClustering(
n_clusters=n_clusters, linkage="ward" # Hierarchical clustering with Ward linkage
)
self.clusters = None # Cluster UUID for each text
self.cluster_centers = None # Center vectors for each cluster
self.cluster_index_to_uuid = {} # Numeric index → UUID mapping
self.cluster_uuid_to_index = {} # UUID → Numeric index mapping
self.corpus = [] # Preprocessed corpus
self.tfidf_matrix = None # TF-IDF matrix
self.keywords_per_cluster = None # Keywords for each cluster
4.2 Text Preprocessing Pipeline
def _preprocess_text(self, text):
# Step 1: Remove punctuation and special characters
text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
text = re.sub(r"\s+", " ", text).strip()
# Step 2: Chinese word segmentation
words = jieba.cut(text)
# Step 3: Stop word filtering + length filtering
words = [word for word in words
if word not in self.stopwords and len(word) > 1]
return " ".join(words)
Preprocessing is the foundation of the entire pipeline. For Chinese text, the quality of word segmentation directly affects the clustering results. In practical applications, custom dictionaries can be supplemented according to the domain.
4.3 Core Logic for Initial Clustering
def fit(self, texts):
# 1. Batch preprocessing
self.corpus = [self._preprocess_text(text) for text in texts]
# 2. TF-IDF vectorization (fit_transform learns the vocabulary and transforms)
self.tfidf_matrix = self.vectorizer.fit_transform(self.corpus)
# 3. Hierarchical clustering, get numeric cluster labels for each sample
numeric_clusters = self.cluster_model.fit_predict(
self.tfidf_matrix.toarray()
)
# 4. Numeric label → UUID mapping (ensures global uniqueness)
for numeric_id in sorted(set(numeric_clusters)):
cluster_uuid = str(uuid.uuid4())
self.cluster_index_to_uuid[numeric_id] = cluster_uuid
self.cluster_uuid_to_index[cluster_uuid] = numeric_id
self.clusters = [
self.cluster_index_to_uuid[nid] for nid in numeric_clusters
]
# 5. Post-processing: calculate centers + extract keywords
self._calculate_cluster_centers()
self._extract_keywords_per_cluster()
return self.clusters
An important design decision here is: using UUIDs instead of numeric indices as cluster identifiers. This avoids the problem of numeric index conflicts between results from different clustering batches and facilitates use in persistence and distributed environments.
4.4 Core Logic for Incremental Prediction
def predict(self, new_texts, threshold=0.3):
processed_texts = [self._preprocess_text(t) for t in new_texts]
new_tfidf = self.vectorizer.transform(processed_texts) # Note: use transform, not fit_transform
new_clusters = []
sorted_uuids = sorted(set(self.clusters), key=...)
for _, tfidf in enumerate(new_tfidf.toarray()):
# Calculate cosine similarity with all existing cluster centers
similarities = [
cosine_similarity([tfidf], [center])[0][0]
for center in self.cluster_centers
]
max_sim = max(similarities)
best_idx = similarities.index(max_sim)
if max_sim >= threshold:
# Assign to existing cluster and update center online
new_clusters.append(sorted_uuids[best_idx])
self.cluster_centers[best_idx] = np.mean(
[self.cluster_centers[best_idx], tfidf], axis=0
)
else:
# Create a brand new cluster
new_uuid = str(uuid.uuid4())
new_clusters.append(new_uuid)
self.cluster_centers.append(tfidf)
# ... update mapping relationships ...
# Synchronously update corpus and cluster list
self.corpus.extend(processed_texts)
self.clusters.extend(new_clusters)
self._extract_keywords_per_cluster() # Refresh keywords
return new_clusters
Key Points of Incremental Prediction:
- Use
transforminstead offit_transform: New text must use the same vocabulary learned during initial clustering, otherwise the vector space will be inconsistent. - Choice of Threshold: A lower threshold makes it easier for new text to be assigned to an existing cluster (potentially introducing noise); a higher threshold makes it easier to create new clusters (potentially causing fragmentation). 0.3 is an empirical starting point and needs tuning based on actual data.
- Online Update of Cluster Centers: Uses a simple mean update method, similar to the moving average concept in online learning.
4.5 Cluster Center Calculation and Keyword Extraction
def _calculate_cluster_centers(self):
"""Averages all vectors within each cluster"""
for cluster_uuid in sorted_uuids:
mask = [uuid == cluster_uuid for uuid in self.clusters]
cluster_vectors = self.tfidf_matrix.toarray()[mask]
center = np.mean(cluster_vectors, axis=0)
self.cluster_centers.append(center)
def _extract_keywords_per_cluster(self, top_n=5):
"""Uses TextRank to extract representative keywords for each cluster"""
for cluster_uuid in set(self.clusters):
# Aggregate all texts in this cluster
cluster_text = " ".join(
self.corpus[i] for i in range(len(self.corpus))
if self.clusters[i] == cluster_uuid
)
keywords = jieba.analyse.textrank(cluster_text, topK=top_n)
self.keywords_per_cluster[cluster_uuid] = keywords
5. Key Technical Details
5.1 The Combination of TF-IDF + Cosine Similarity
TF-IDF converts text into sparse high-dimensional vectors, where:
- TF (Term Frequency): Reflects the importance of a word in the current document.
- IDF (Inverse Document Frequency): Reduces the weight of words that are common across all documents.
Paired with Cosine Similarity, which measures the cosine of the angle between two vectors (range [-1, 1], higher values indicate closer semantics), this combination has been widely validated as effective in text scenarios.
5.2 Mathematical Intuition of Ward's Linkage Method
When merging two clusters, Ward's linkage selects the pair that results in the minimum increase in within-cluster variance after merging. The formula is as follows:
$$\Delta(A, B) = \frac{n_A \cdot n_B}{n_A + n_B} | \mu_A - \mu_B |^2$$
Where $n_A$, $n_B$ are the sample counts of the two clusters, and $\mu_A$, $\mu_B$ are their centroids. This means Ward tends to merge clusters of similar size and close distance, typically producing more balanced results than single linkage or complete linkage.
5.3 Design Considerations for the UUID Identification System
Clustering algorithm output: [0, 0, 1, 1, 1, 2, 2, ...] (Numeric indices)
↓ Mapping
Internal system usage: ["a1b2c3...", "a1b2c3...", "d4e5f6...", "d4e5f6...", ...] (UUIDs)
Benefits of this approach:
- Globally Unique: Avoids confusion caused by two clustering runs both starting numbering from 0.
- Serializable: UUIDs are strings, convenient for JSON serialization and database storage.
- No Information Leakage: External parties cannot infer the number or order of clusters from the UUID.
6. Effects and Applications
6.1 Typical Output Example
After clustering, the system can output grouping information similar to the following:
| Cluster ID | Text Count | Representative Keywords | Semantic Interpretation |
|---|---|---|---|
| Cluster A | 5 | Credit card, apply, limit, repayment, bill | Credit card related inquiries |
| Cluster B | 3 | Mobile banking, transfer, activation, security | Mobile banking services |
| Cluster C | 4 | Wealth management, fund, stock, product, risk | Investment and wealth management related |
| Cluster D (New) | 2 | Tomato, egg, eat, evening | (Newly discovered independent topic) |
6.2 Complete Execution Flow for a New User Question
After the clustering system completes the initial grouping, when a new user asks a question again, the entire system operates according to the following flow:
┌─────────────────────────────────────────────────────────────────────┐
│ User asks a question │
│ "How do I replace a lost credit card?" │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Step 1: Text Preprocessing │
│ ───────────────── │
│ Raw text: "How do I replace a lost credit card?" │
│ ↓ Remove punctuation, tokenize, filter stop words │
│ Processed result: "credit card lost replacement" │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Step 2: TF-IDF Vectorization (reuse existing vocabulary) │
│ ─────────────────────────────────────── │
│ Use the same vocabulary learned during initial clustering for transform│
│ → Get a sparse feature vector in the same vector space as all historical texts│
│ → Vector dimension = vocabulary size (potentially thousands to tens of thousands)│
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Step 3: Calculate Cosine Similarity with Cluster Centers │
│ ─────────────────────────────────── │
│ │
│ New text vector · │
│ ╲ │
│ ╲ sim=0.82 ──→ Cluster A Center [credit card, apply, limit, repayment, bill]│
│ ╲ │
│ ╲ sim=0.31 ──→ Cluster B Center [mobile banking, transfer, activation, security]│
│ ╲ │
│ ╲ sim=0.18 ──→ Cluster C Center [wealth management, fund, stock, product, risk]│
│ ╲ │
│ ╲ sim=0.05 ──→ Cluster D Center [tomato, egg, eat, evening]│
│ │
│ → Max Similarity = 0.82 (Cluster A) │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌──────────────────────┐
│ max_sim ≥ threshold(0.3)? │
└──────────┬───────────┘
│
┌────────────────┴────────────────┐
│ ✅ Yes: 0.82 ≥ 0.3 │
│ → Assign to existing Cluster A (Credit Card Related) │
└────────────────┬────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Step 4: Execute Subsequent Actions │
│ ───────────────── │
│ ① Return clustering result: This question belongs to 'Cluster A' │
│ ② Online update Cluster A's center (moving average) │
│ new_center = mean(old_center, new_vector) │
│ ③ Refresh Cluster A's keyword list │
│ ④ Append new text to the corpus │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Step 5: Downstream Business Consumption │
│ ───────────────── │
│ After the Agent obtains the 'Cluster A' identifier: │
│ · Query the processing plan / FAQ template / automation tool for Cluster A│
│ · Directly execute the corresponding operation or return a standardized answer│
│ · No need for multiple rounds of follow-up, achieving "one-sentence direct access"│
└─────────────────────────────────────────────────────────────────────┘
Another Scenario: When encountering a completely new question
If the user asks "What's the weather like today?":
Step 1~2: Same as above (preprocessing + vectorization)
Step 3: Calculate similarity
→ With Cluster A (Credit Card): sim=0.02
→ With Cluster B (Mobile Banking): sim=0.01
→ With Cluster C (Wealth Management): sim=0.03
→ With Cluster D (Tomato): sim=0.08
→ Max Similarity = 0.08
Judgment: max_sim(0.08) < threshold(0.3)
→ ❌ Does not belong to any existing cluster!
Actions:
① Create a brand new Cluster E, assign a new UUID
② Use this text as the first member and initial center of Cluster E
③ Trigger "New Problem Discovery" notification → Can be transferred to manual processing or consolidated as a new type
6.3 Limitations and Improvement Directions
| Limitation | Description | Improvement Ideas |
|---|---|---|
| TF-IDF ignores semantics | Based on lexical matching, synonyms cannot be identified | Use pre-trained Embeddings (like sentence-transformers) instead of TF-IDF |
| Hierarchical clustering O(n²) complexity | Slow for large-scale data (tens of thousands and above) | Use approximate methods (like MinHash LSH) for coarse screening first, then perform precise clustering on candidate sets |
| Threshold requires manual setting | Optimal threshold differs for different data distributions | Introduce adaptive thresholds or statistically-based methods for automatic determination |
| Cluster center drift | Online updates may cause center shift | Set freezing conditions or periodically re-cluster for correction |
7. Summary
This article shared a solution for building a complete automatic text grouping system through the combination of TF-IDF vectorization + Ward hierarchical clustering + incremental prediction. If you have similar needs, feel free to discuss in the comments section.
Top 1 from juejin.cn, machine-translated. The original thread is authoritative.
How is this data stored?