How a Q&A Agent Uses Hierarchical Clustering to Route Customer Questions Without Retraining

1. Background

Recently, we needed to develop a B-end Agent Q&A system. When working on user question identification, we faced the following problem:

The phrasing of each user question varies greatly, but the problems they want to solve can often be summarized into a limited number of categories.

For example:

User's Original Question	Semantic Category
"How do I apply for a refund?"	→ Refund Process
"I don't want it anymore, can I get my money back?"	→ Refund Process
"How long does it take to get money back after canceling an order?"	→ Refund Process
"How do I exchange a broken item for a new one?"	→ After-Sales Repair/Exchange
"The received product has quality issues"	→ After-Sales Repair/Exchange
"How long is the warranty period?"	→ After-Sales Repair/Exchange
"How many days does shipping usually take?"	→ Logistics Inquiry
"Where is my package?"	→ Logistics Inquiry
"Why haven't I received the goods yet?"	→ Logistics Inquiry

As you can see, on the surface: the 8 questions are phrased completely differently, with no duplicates. In essence: they belong to only 3 topic categories (Refund, After-Sales, Logistics).

If a system can automatically complete this "User Question → Semantic Cluster" mapping, it can bring huge efficiency gains, such as:

Intelligent Routing: When a user asks "Can I get my money back?", the Agent directly identifies it as belonging to the 'Refund Process' cluster and can call the corresponding processing capability without multiple rounds of follow-up questions.
Batch Processing: Questions within the same cluster can reuse the same solutions or templates, reducing repetitive work.
Cold-Start Knowledge Base Construction: Automatically cluster high-frequency question groups from historical conversations, quickly consolidating them into FAQs or standard Q&A pairs.
New Problem Discovery: When a user's question cannot match any existing cluster, it indicates a new type of problem has been encountered, which can trigger manual intervention or the creation of a new processing flow.

To implement this process, hierarchical clustering is needed.

2. What is Hierarchical Clustering

Hierarchical Clustering is a method for grouping data by building a 'hierarchical tree structure of clusters'. It does not require pre-specifying the number of clusters but generates a Dendrogram, which can be cut at different levels to obtain grouping results of varying granularity.

Simply put, it first "cleans and chops" historical text, converts it into digital fingerprints, then merges them layer by layer according to similarity into several semantic groups and labels them with keywords. Afterwards, when a new question comes in, it just needs to be scored against the "representatives" of each group—if similar enough, it is assigned to that group and directly enters the corresponding processing flow.

Comparison with Other Text Clustering Methods

Method	Advantages	Disadvantages	Applicable Scenarios
K-Means	Fast, simple to implement	Requires pre-specifying K value; sensitive to initial centers; assumes clusters are convex	Known number of clusters, uniform distribution
DBSCAN	No need to specify cluster count; can identify noise	Poor performance on high-dimensional data; parameter sensitive	Spatial data, uneven density
Hierarchical Clustering	No need to pre-set cluster count; generates tree structure; highly interpretable results	Higher time complexity O(n²)	Small to medium text sets; need to understand hierarchical relationships
LDA Topic Model	Good interpretability; outputs topic-word distribution	Requires pre-setting topic count; high computational cost	Long text topic discovery

Reasons for Choosing Hierarchical Clustering

No need to pre-specify the number of clusters: The true number of groups in text data is usually unknown. Hierarchical clustering allows cutting at different granularities via the Dendrogram.
Results have a hierarchical structure: You can intuitively see which texts are "closer" and which groups are farther apart.
Friendly to text data: Combined with TF-IDF vectorization and cosine similarity, it can effectively capture text semantics.
Ward's linkage method: Minimizes within-cluster variance, tending to produce compact clusters of similar sizes.

3. Solution Design

3.1 Overall Architecture

The entire system consists of the following core modules:

┌─────────────────────────────────────────────────────────┐
│                    Text Clustering System                │
├─────────────┬─────────────┬─────────────┬───────────────┤
│  Text Preprocessing  │   Vectorization  │  Hierarchical Clustering  │  Incremental Prediction  │
│             │             │             │               │
│ · Cleaning  │ · TF-IDF    │ · Ward Linkage │ · Similarity Matching  │
│ · Tokenization  │   Feature Extraction  │ · Agglomerative Merging  │ · Threshold Judgment    │
│ · Stop Word Filtering │             │ · UUID Identification  │ · Cluster Center Update│
└─────────────┴─────────────┴─────────────┴───────────────┘
                              │
                     ┌────────┴────────┐
                     │   Auxiliary Capabilities       │
                     ├─────────────────┤
                     │ · Keyword Extraction     │
                     │ · Cluster Info Query   │
                     │ · Result Visualization     │
                     └─────────────────┘

3.2 Core Process

Phase 1: Initial Clustering (fit)

Raw Text → Preprocessing → TF-IDF Vectorization → Hierarchical Clustering → UUID Identification → Calculate Cluster Centers → Extract Keywords

Detailed Steps:

Text Preprocessing
- Remove special characters and punctuation.
- Use a tokenization tool (like jieba) for Chinese word segmentation.
- Filter out stop words and meaningless short words.
TF-IDF Vectorization
- Convert the preprocessed text into TF-IDF feature vectors.
- TF-IDF effectively reflects the importance of words in a document, reducing the weight of high-frequency common words.
Hierarchical Clustering
- Use AgglomerativeClustering with the Ward linkage method.
- Ward's method merges clusters by minimizing the within-cluster variance, resulting in more compact clusters.
- The input parameter n_clusters controls the number of clusters for the final cut.
UUID Identification Mechanism
- The numeric indices (0, 1, 2...) output by the clustering algorithm are mapped to globally unique UUIDs.
- The benefit: avoids ambiguity of numeric indices and supports unique identification in distributed scenarios.
Cluster Center Calculation
- Average all TF-IDF vectors within each cluster to obtain the cluster's center vector.
- The center vector represents the "typical semantic direction" of that cluster.
Keyword Extraction
- Use the TextRank algorithm to extract Top-K keywords from the aggregated text of each cluster.
- Keywords are used to intuitively understand and label the semantic theme of each cluster.

Phase 2: Incremental Prediction (predict)

When new text arrives, the system does not need to re-cluster all texts but uses an incremental approach:

New Text → Preprocessing → TF-IDF Vectorization → Calculate Cosine Similarity with Cluster Centers
                                              │
                                    ┌─────────┴──────────┐
                                    │  max_sim ≥ threshold?    │
                                    ├─────────┬──────────┤
                                    │   Yes    │    No    │
                                    │ Assign to existing cluster │ Create new cluster │
                                    │ Update center   │ Assign UUID │
                                    └─────────┴──────────┘

Key Design:

Cosine Similarity as Metric: Measures the semantic closeness between the new text vector and each cluster center.
Threshold Control: When the maximum similarity is below a set threshold (e.g., 0.3), the new text is considered not to belong to any existing cluster, and a new cluster is automatically created.
Online Cluster Center Update: When new text is assigned to a cluster, the cluster's center vector is updated using a moving average, allowing the center to gradually "drift" to adapt to new data.

4. Core Code Analysis

4.1 Data Structure and Initialization

class TextClusteringSystem:
    def __init__(self, stopwords_path=None, n_clusters=5):
        self.stopwords = self._load_stopwords(stopwords_path)
        self.vectorizer = TfidfVectorizer()          # TF-IDF vectorizer
        self.cluster_model = AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward"     # Hierarchical clustering with Ward linkage
        )
        self.clusters = None                          # Cluster UUID for each text
        self.cluster_centers = None                   # Center vectors for each cluster
        self.cluster_index_to_uuid = {}              # Numeric index → UUID mapping
        self.cluster_uuid_to_index = {}              # UUID → Numeric index mapping
        self.corpus = []                             # Preprocessed corpus
        self.tfidf_matrix = None                     # TF-IDF matrix
        self.keywords_per_cluster = None             # Keywords for each cluster

4.2 Text Preprocessing Pipeline

def _preprocess_text(self, text):
    # Step 1: Remove punctuation and special characters
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    # Step 2: Chinese word segmentation
    words = jieba.cut(text)

    # Step 3: Stop word filtering + length filtering
    words = [word for word in words 
             if word not in self.stopwords and len(word) > 1]

    return " ".join(words)

Preprocessing is the foundation of the entire pipeline. For Chinese text, the quality of word segmentation directly affects the clustering results. In practical applications, custom dictionaries can be supplemented according to the domain.

4.3 Core Logic for Initial Clustering

def fit(self, texts):
    # 1. Batch preprocessing
    self.corpus = [self._preprocess_text(text) for text in texts]

    # 2. TF-IDF vectorization (fit_transform learns the vocabulary and transforms)
    self.tfidf_matrix = self.vectorizer.fit_transform(self.corpus)

    # 3. Hierarchical clustering, get numeric cluster labels for each sample
    numeric_clusters = self.cluster_model.fit_predict(
        self.tfidf_matrix.toarray()
    )

    # 4. Numeric label → UUID mapping (ensures global uniqueness)
    for numeric_id in sorted(set(numeric_clusters)):
        cluster_uuid = str(uuid.uuid4())
        self.cluster_index_to_uuid[numeric_id] = cluster_uuid
        self.cluster_uuid_to_index[cluster_uuid] = numeric_id

    self.clusters = [
        self.cluster_index_to_uuid[nid] for nid in numeric_clusters
    ]

    # 5. Post-processing: calculate centers + extract keywords
    self._calculate_cluster_centers()
    self._extract_keywords_per_cluster()

    return self.clusters

An important design decision here is: using UUIDs instead of numeric indices as cluster identifiers. This avoids the problem of numeric index conflicts between results from different clustering batches and facilitates use in persistence and distributed environments.

4.4 Core Logic for Incremental Prediction

def predict(self, new_texts, threshold=0.3):
    processed_texts = [self._preprocess_text(t) for t in new_texts]
    new_tfidf = self.vectorizer.transform(processed_texts)  # Note: use transform, not fit_transform

    new_clusters = []
    sorted_uuids = sorted(set(self.clusters), key=...)

    for _, tfidf in enumerate(new_tfidf.toarray()):
        # Calculate cosine similarity with all existing cluster centers
        similarities = [
            cosine_similarity([tfidf], [center])[0][0]
            for center in self.cluster_centers
        ]
        max_sim = max(similarities)
        best_idx = similarities.index(max_sim)

        if max_sim >= threshold:
            # Assign to existing cluster and update center online
            new_clusters.append(sorted_uuids[best_idx])
            self.cluster_centers[best_idx] = np.mean(
                [self.cluster_centers[best_idx], tfidf], axis=0
            )
        else:
            # Create a brand new cluster
            new_uuid = str(uuid.uuid4())
            new_clusters.append(new_uuid)
            self.cluster_centers.append(tfidf)
            # ... update mapping relationships ...

    # Synchronously update corpus and cluster list
    self.corpus.extend(processed_texts)
    self.clusters.extend(new_clusters)
    self._extract_keywords_per_cluster()  # Refresh keywords

    return new_clusters

Key Points of Incremental Prediction:

Use transform instead of fit_transform: New text must use the same vocabulary learned during initial clustering, otherwise the vector space will be inconsistent.
Choice of Threshold: A lower threshold makes it easier for new text to be assigned to an existing cluster (potentially introducing noise); a higher threshold makes it easier to create new clusters (potentially causing fragmentation). 0.3 is an empirical starting point and needs tuning based on actual data.
Online Update of Cluster Centers: Uses a simple mean update method, similar to the moving average concept in online learning.

4.5 Cluster Center Calculation and Keyword Extraction

def _calculate_cluster_centers(self):
    """Averages all vectors within each cluster"""
    for cluster_uuid in sorted_uuids:
        mask = [uuid == cluster_uuid for uuid in self.clusters]
        cluster_vectors = self.tfidf_matrix.toarray()[mask]
        center = np.mean(cluster_vectors, axis=0)
        self.cluster_centers.append(center)

def _extract_keywords_per_cluster(self, top_n=5):
    """Uses TextRank to extract representative keywords for each cluster"""
    for cluster_uuid in set(self.clusters):
        # Aggregate all texts in this cluster
        cluster_text = " ".join(
            self.corpus[i] for i in range(len(self.corpus))
            if self.clusters[i] == cluster_uuid
        )
        keywords = jieba.analyse.textrank(cluster_text, topK=top_n)
        self.keywords_per_cluster[cluster_uuid] = keywords

5. Key Technical Details

5.1 The Combination of TF-IDF + Cosine Similarity

TF-IDF converts text into sparse high-dimensional vectors, where:

TF (Term Frequency): Reflects the importance of a word in the current document.
IDF (Inverse Document Frequency): Reduces the weight of words that are common across all documents.

Paired with Cosine Similarity, which measures the cosine of the angle between two vectors (range [-1, 1], higher values indicate closer semantics), this combination has been widely validated as effective in text scenarios.

5.2 Mathematical Intuition of Ward's Linkage Method

When merging two clusters, Ward's linkage selects the pair that results in the minimum increase in within-cluster variance after merging. The formula is as follows:

$$\Delta(A, B) = \frac{n_A \cdot n_B}{n_A + n_B} | \mu_A - \mu_B |^2$$

Where $n_A$, $n_B$ are the sample counts of the two clusters, and $\mu_A$, $\mu_B$ are their centroids. This means Ward tends to merge clusters of similar size and close distance, typically producing more balanced results than single linkage or complete linkage.

5.3 Design Considerations for the UUID Identification System

Clustering algorithm output:  [0, 0, 1, 1, 1, 2, 2, ...]    (Numeric indices)
                    ↓ Mapping
Internal system usage:  ["a1b2c3...", "a1b2c3...", "d4e5f6...", "d4e5f6...", ...]  (UUIDs)

Benefits of this approach:

Globally Unique: Avoids confusion caused by two clustering runs both starting numbering from 0.
Serializable: UUIDs are strings, convenient for JSON serialization and database storage.
No Information Leakage: External parties cannot infer the number or order of clusters from the UUID.

6. Effects and Applications

6.1 Typical Output Example

After clustering, the system can output grouping information similar to the following:

Cluster ID	Text Count	Representative Keywords	Semantic Interpretation
Cluster A	5	Credit card, apply, limit, repayment, bill	Credit card related inquiries
Cluster B	3	Mobile banking, transfer, activation, security	Mobile banking services
Cluster C	4	Wealth management, fund, stock, product, risk	Investment and wealth management related
Cluster D (New)	2	Tomato, egg, eat, evening	(Newly discovered independent topic)

6.2 Complete Execution Flow for a New User Question

After the clustering system completes the initial grouping, when a new user asks a question again, the entire system operates according to the following flow:

┌─────────────────────────────────────────────────────────────────────┐
│                        User asks a question                         │
│                   "How do I replace a lost credit card?"              │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 1: Text Preprocessing                                         │
│  ─────────────────                                                   │
│  Raw text: "How do I replace a lost credit card?"                    │
│       ↓ Remove punctuation, tokenize, filter stop words              │
│  Processed result: "credit card lost replacement"                    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 2: TF-IDF Vectorization (reuse existing vocabulary)            │
│  ───────────────────────────────────────                             │
│  Use the same vocabulary learned during initial clustering for transform│
│  → Get a sparse feature vector in the same vector space as all historical texts│
│  → Vector dimension = vocabulary size (potentially thousands to tens of thousands)│
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 3: Calculate Cosine Similarity with Cluster Centers            │
│  ───────────────────────────────────                                 │
│                                                                     │
│   New text vector ·                                                 │
│          ╲                                                          │
│           ╲  sim=0.82 ──→ Cluster A Center [credit card, apply, limit, repayment, bill]│
│            ╲                                                         │
│             ╲ sim=0.31 ──→ Cluster B Center [mobile banking, transfer, activation, security]│
│              ╲                                                        │
│               ╲ sim=0.18 ──→ Cluster C Center [wealth management, fund, stock, product, risk]│
│                ╲                                                       │
│                 ╲ sim=0.05 ──→ Cluster D Center [tomato, egg, eat, evening]│
│                                                                     │
│   → Max Similarity = 0.82 (Cluster A)                               │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  max_sim ≥ threshold(0.3)? │
                    └──────────┬───────────┘
                               │
              ┌────────────────┴────────────────┐
              │  ✅ Yes: 0.82 ≥ 0.3              │
              │  → Assign to existing Cluster A (Credit Card Related) │
              └────────────────┬────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 4: Execute Subsequent Actions                                 │
│  ─────────────────                                                   │
│  ① Return clustering result: This question belongs to 'Cluster A'   │
│  ② Online update Cluster A's center (moving average)                │
│     new_center = mean(old_center, new_vector)                        │
│  ③ Refresh Cluster A's keyword list                                 │
│  ④ Append new text to the corpus                                    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 5: Downstream Business Consumption                            │
│  ─────────────────                                                   │
│  After the Agent obtains the 'Cluster A' identifier:                 │
│  · Query the processing plan / FAQ template / automation tool for Cluster A│
│  · Directly execute the corresponding operation or return a standardized answer│
│  · No need for multiple rounds of follow-up, achieving "one-sentence direct access"│
└─────────────────────────────────────────────────────────────────────┘

Another Scenario: When encountering a completely new question

If the user asks "What's the weather like today?":

Step 1~2: Same as above (preprocessing + vectorization)

Step 3: Calculate similarity
  → With Cluster A (Credit Card): sim=0.02
  → With Cluster B (Mobile Banking): sim=0.01
  → With Cluster C (Wealth Management): sim=0.03
  → With Cluster D (Tomato): sim=0.08
  → Max Similarity = 0.08

Judgment: max_sim(0.08) < threshold(0.3)
  → ❌ Does not belong to any existing cluster!

Actions:
  ① Create a brand new Cluster E, assign a new UUID
  ② Use this text as the first member and initial center of Cluster E
  ③ Trigger "New Problem Discovery" notification → Can be transferred to manual processing or consolidated as a new type

6.3 Limitations and Improvement Directions

Limitation	Description	Improvement Ideas
TF-IDF ignores semantics	Based on lexical matching, synonyms cannot be identified	Use pre-trained Embeddings (like sentence-transformers) instead of TF-IDF
Hierarchical clustering O(n²) complexity	Slow for large-scale data (tens of thousands and above)	Use approximate methods (like MinHash LSH) for coarse screening first, then perform precise clustering on candidate sets
Threshold requires manual setting	Optimal threshold differs for different data distributions	Introduce adaptive thresholds or statistically-based methods for automatic determination
Cluster center drift	Online updates may cause center shift	Set freezing conditions or periodically re-cluster for correction

7. Summary

This article shared a solution for building a complete automatic text grouping system through the combination of TF-IDF vectorization + Ward hierarchical clustering + incremental prediction. If you have similar needs, feel free to discuss in the comments section.

Comments

Top 1 from juejin.cn, machine-translated. The original thread is authoritative.

用户560514818461

How is this data stored?