跪拜 Guibai
← Back to the summary

How a Q&A Agent Uses Hierarchical Clustering to Route Customer Questions Without Retraining

1. Background

Recently, we needed to develop a B-end Agent Q&A system. When working on user question identification, we faced the following problem:

The phrasing of each user question varies greatly, but the problems they want to solve can often be summarized into a limited number of categories.

For example:

User's Original Question Semantic Category
"How do I apply for a refund?" Refund Process
"I don't want it anymore, can I get my money back?" Refund Process
"How long does it take to get money back after canceling an order?" Refund Process
"How do I exchange a broken item for a new one?" After-Sales Repair/Exchange
"The received product has quality issues" After-Sales Repair/Exchange
"How long is the warranty period?" After-Sales Repair/Exchange
"How many days does shipping usually take?" Logistics Inquiry
"Where is my package?" Logistics Inquiry
"Why haven't I received the goods yet?" Logistics Inquiry

As you can see, on the surface: the 8 questions are phrased completely differently, with no duplicates. In essence: they belong to only 3 topic categories (Refund, After-Sales, Logistics).

If a system can automatically complete this "User Question → Semantic Cluster" mapping, it can bring huge efficiency gains, such as:

To implement this process, hierarchical clustering is needed.


2. What is Hierarchical Clustering

Hierarchical Clustering is a method for grouping data by building a 'hierarchical tree structure of clusters'. It does not require pre-specifying the number of clusters but generates a Dendrogram, which can be cut at different levels to obtain grouping results of varying granularity.

Simply put, it first "cleans and chops" historical text, converts it into digital fingerprints, then merges them layer by layer according to similarity into several semantic groups and labels them with keywords. Afterwards, when a new question comes in, it just needs to be scored against the "representatives" of each group—if similar enough, it is assigned to that group and directly enters the corresponding processing flow.

Comparison with Other Text Clustering Methods

Method Advantages Disadvantages Applicable Scenarios
K-Means Fast, simple to implement Requires pre-specifying K value; sensitive to initial centers; assumes clusters are convex Known number of clusters, uniform distribution
DBSCAN No need to specify cluster count; can identify noise Poor performance on high-dimensional data; parameter sensitive Spatial data, uneven density
Hierarchical Clustering No need to pre-set cluster count; generates tree structure; highly interpretable results Higher time complexity O(n²) Small to medium text sets; need to understand hierarchical relationships
LDA Topic Model Good interpretability; outputs topic-word distribution Requires pre-setting topic count; high computational cost Long text topic discovery

Reasons for Choosing Hierarchical Clustering

  1. No need to pre-specify the number of clusters: The true number of groups in text data is usually unknown. Hierarchical clustering allows cutting at different granularities via the Dendrogram.
  2. Results have a hierarchical structure: You can intuitively see which texts are "closer" and which groups are farther apart.
  3. Friendly to text data: Combined with TF-IDF vectorization and cosine similarity, it can effectively capture text semantics.
  4. Ward's linkage method: Minimizes within-cluster variance, tending to produce compact clusters of similar sizes.

3. Solution Design

3.1 Overall Architecture

The entire system consists of the following core modules:

┌─────────────────────────────────────────────────────────┐
│                    Text Clustering System                │
├─────────────┬─────────────┬─────────────┬───────────────┤
│  Text Preprocessing  │   Vectorization  │  Hierarchical Clustering  │  Incremental Prediction  │
│             │             │             │               │
│ · Cleaning  │ · TF-IDF    │ · Ward Linkage │ · Similarity Matching  │
│ · Tokenization  │   Feature Extraction  │ · Agglomerative Merging  │ · Threshold Judgment    │
│ · Stop Word Filtering │             │ · UUID Identification  │ · Cluster Center Update│
└─────────────┴─────────────┴─────────────┴───────────────┘
                              │
                     ┌────────┴────────┐
                     │   Auxiliary Capabilities       │
                     ├─────────────────┤
                     │ · Keyword Extraction     │
                     │ · Cluster Info Query   │
                     │ · Result Visualization     │
                     └─────────────────┘

3.2 Core Process

Phase 1: Initial Clustering (fit)

Raw Text → Preprocessing → TF-IDF Vectorization → Hierarchical Clustering → UUID Identification → Calculate Cluster Centers → Extract Keywords

Detailed Steps:

  1. Text Preprocessing

    • Remove special characters and punctuation.
    • Use a tokenization tool (like jieba) for Chinese word segmentation.
    • Filter out stop words and meaningless short words.
  2. TF-IDF Vectorization

    • Convert the preprocessed text into TF-IDF feature vectors.
    • TF-IDF effectively reflects the importance of words in a document, reducing the weight of high-frequency common words.
  3. Hierarchical Clustering

    • Use AgglomerativeClustering with the Ward linkage method.
    • Ward's method merges clusters by minimizing the within-cluster variance, resulting in more compact clusters.
    • The input parameter n_clusters controls the number of clusters for the final cut.
  4. UUID Identification Mechanism

    • The numeric indices (0, 1, 2...) output by the clustering algorithm are mapped to globally unique UUIDs.
    • The benefit: avoids ambiguity of numeric indices and supports unique identification in distributed scenarios.
  5. Cluster Center Calculation

    • Average all TF-IDF vectors within each cluster to obtain the cluster's center vector.
    • The center vector represents the "typical semantic direction" of that cluster.
  6. Keyword Extraction

    • Use the TextRank algorithm to extract Top-K keywords from the aggregated text of each cluster.
    • Keywords are used to intuitively understand and label the semantic theme of each cluster.

Phase 2: Incremental Prediction (predict)

When new text arrives, the system does not need to re-cluster all texts but uses an incremental approach:

New Text → Preprocessing → TF-IDF Vectorization → Calculate Cosine Similarity with Cluster Centers
                                              │
                                    ┌─────────┴──────────┐
                                    │  max_sim ≥ threshold?    │
                                    ├─────────┬──────────┤
                                    │   Yes    │    No    │
                                    │ Assign to existing cluster │ Create new cluster │
                                    │ Update center   │ Assign UUID │
                                    └─────────┴──────────┘

Key Design:


4. Core Code Analysis

4.1 Data Structure and Initialization

class TextClusteringSystem:
    def __init__(self, stopwords_path=None, n_clusters=5):
        self.stopwords = self._load_stopwords(stopwords_path)
        self.vectorizer = TfidfVectorizer()          # TF-IDF vectorizer
        self.cluster_model = AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward"     # Hierarchical clustering with Ward linkage
        )
        self.clusters = None                          # Cluster UUID for each text
        self.cluster_centers = None                   # Center vectors for each cluster
        self.cluster_index_to_uuid = {}              # Numeric index → UUID mapping
        self.cluster_uuid_to_index = {}              # UUID → Numeric index mapping
        self.corpus = []                             # Preprocessed corpus
        self.tfidf_matrix = None                     # TF-IDF matrix
        self.keywords_per_cluster = None             # Keywords for each cluster

4.2 Text Preprocessing Pipeline

def _preprocess_text(self, text):
    # Step 1: Remove punctuation and special characters
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    # Step 2: Chinese word segmentation
    words = jieba.cut(text)

    # Step 3: Stop word filtering + length filtering
    words = [word for word in words 
             if word not in self.stopwords and len(word) > 1]

    return " ".join(words)

Preprocessing is the foundation of the entire pipeline. For Chinese text, the quality of word segmentation directly affects the clustering results. In practical applications, custom dictionaries can be supplemented according to the domain.

4.3 Core Logic for Initial Clustering

def fit(self, texts):
    # 1. Batch preprocessing
    self.corpus = [self._preprocess_text(text) for text in texts]

    # 2. TF-IDF vectorization (fit_transform learns the vocabulary and transforms)
    self.tfidf_matrix = self.vectorizer.fit_transform(self.corpus)

    # 3. Hierarchical clustering, get numeric cluster labels for each sample
    numeric_clusters = self.cluster_model.fit_predict(
        self.tfidf_matrix.toarray()
    )

    # 4. Numeric label → UUID mapping (ensures global uniqueness)
    for numeric_id in sorted(set(numeric_clusters)):
        cluster_uuid = str(uuid.uuid4())
        self.cluster_index_to_uuid[numeric_id] = cluster_uuid
        self.cluster_uuid_to_index[cluster_uuid] = numeric_id

    self.clusters = [
        self.cluster_index_to_uuid[nid] for nid in numeric_clusters
    ]

    # 5. Post-processing: calculate centers + extract keywords
    self._calculate_cluster_centers()
    self._extract_keywords_per_cluster()

    return self.clusters

An important design decision here is: using UUIDs instead of numeric indices as cluster identifiers. This avoids the problem of numeric index conflicts between results from different clustering batches and facilitates use in persistence and distributed environments.

4.4 Core Logic for Incremental Prediction

def predict(self, new_texts, threshold=0.3):
    processed_texts = [self._preprocess_text(t) for t in new_texts]
    new_tfidf = self.vectorizer.transform(processed_texts)  # Note: use transform, not fit_transform

    new_clusters = []
    sorted_uuids = sorted(set(self.clusters), key=...)

    for _, tfidf in enumerate(new_tfidf.toarray()):
        # Calculate cosine similarity with all existing cluster centers
        similarities = [
            cosine_similarity([tfidf], [center])[0][0]
            for center in self.cluster_centers
        ]
        max_sim = max(similarities)
        best_idx = similarities.index(max_sim)

        if max_sim >= threshold:
            # Assign to existing cluster and update center online
            new_clusters.append(sorted_uuids[best_idx])
            self.cluster_centers[best_idx] = np.mean(
                [self.cluster_centers[best_idx], tfidf], axis=0
            )
        else:
            # Create a brand new cluster
            new_uuid = str(uuid.uuid4())
            new_clusters.append(new_uuid)
            self.cluster_centers.append(tfidf)
            # ... update mapping relationships ...

    # Synchronously update corpus and cluster list
    self.corpus.extend(processed_texts)
    self.clusters.extend(new_clusters)
    self._extract_keywords_per_cluster()  # Refresh keywords

    return new_clusters

Key Points of Incremental Prediction:

  1. Use transform instead of fit_transform: New text must use the same vocabulary learned during initial clustering, otherwise the vector space will be inconsistent.
  2. Choice of Threshold: A lower threshold makes it easier for new text to be assigned to an existing cluster (potentially introducing noise); a higher threshold makes it easier to create new clusters (potentially causing fragmentation). 0.3 is an empirical starting point and needs tuning based on actual data.
  3. Online Update of Cluster Centers: Uses a simple mean update method, similar to the moving average concept in online learning.

4.5 Cluster Center Calculation and Keyword Extraction

def _calculate_cluster_centers(self):
    """Averages all vectors within each cluster"""
    for cluster_uuid in sorted_uuids:
        mask = [uuid == cluster_uuid for uuid in self.clusters]
        cluster_vectors = self.tfidf_matrix.toarray()[mask]
        center = np.mean(cluster_vectors, axis=0)
        self.cluster_centers.append(center)

def _extract_keywords_per_cluster(self, top_n=5):
    """Uses TextRank to extract representative keywords for each cluster"""
    for cluster_uuid in set(self.clusters):
        # Aggregate all texts in this cluster
        cluster_text = " ".join(
            self.corpus[i] for i in range(len(self.corpus))
            if self.clusters[i] == cluster_uuid
        )
        keywords = jieba.analyse.textrank(cluster_text, topK=top_n)
        self.keywords_per_cluster[cluster_uuid] = keywords

5. Key Technical Details

5.1 The Combination of TF-IDF + Cosine Similarity

TF-IDF converts text into sparse high-dimensional vectors, where:

Paired with Cosine Similarity, which measures the cosine of the angle between two vectors (range [-1, 1], higher values indicate closer semantics), this combination has been widely validated as effective in text scenarios.

5.2 Mathematical Intuition of Ward's Linkage Method

When merging two clusters, Ward's linkage selects the pair that results in the minimum increase in within-cluster variance after merging. The formula is as follows:

$$\Delta(A, B) = \frac{n_A \cdot n_B}{n_A + n_B} | \mu_A - \mu_B |^2$$

Where $n_A$, $n_B$ are the sample counts of the two clusters, and $\mu_A$, $\mu_B$ are their centroids. This means Ward tends to merge clusters of similar size and close distance, typically producing more balanced results than single linkage or complete linkage.

5.3 Design Considerations for the UUID Identification System

Clustering algorithm output:  [0, 0, 1, 1, 1, 2, 2, ...]    (Numeric indices)
                    ↓ Mapping
Internal system usage:  ["a1b2c3...", "a1b2c3...", "d4e5f6...", "d4e5f6...", ...]  (UUIDs)

Benefits of this approach:


6. Effects and Applications

6.1 Typical Output Example

After clustering, the system can output grouping information similar to the following:

Cluster ID Text Count Representative Keywords Semantic Interpretation
Cluster A 5 Credit card, apply, limit, repayment, bill Credit card related inquiries
Cluster B 3 Mobile banking, transfer, activation, security Mobile banking services
Cluster C 4 Wealth management, fund, stock, product, risk Investment and wealth management related
Cluster D (New) 2 Tomato, egg, eat, evening (Newly discovered independent topic)

6.2 Complete Execution Flow for a New User Question

After the clustering system completes the initial grouping, when a new user asks a question again, the entire system operates according to the following flow:

┌─────────────────────────────────────────────────────────────────────┐
│                        User asks a question                         │
│                   "How do I replace a lost credit card?"              │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 1: Text Preprocessing                                         │
│  ─────────────────                                                   │
│  Raw text: "How do I replace a lost credit card?"                    │
│       ↓ Remove punctuation, tokenize, filter stop words              │
│  Processed result: "credit card lost replacement"                    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 2: TF-IDF Vectorization (reuse existing vocabulary)            │
│  ───────────────────────────────────────                             │
│  Use the same vocabulary learned during initial clustering for transform│
│  → Get a sparse feature vector in the same vector space as all historical texts│
│  → Vector dimension = vocabulary size (potentially thousands to tens of thousands)│
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 3: Calculate Cosine Similarity with Cluster Centers            │
│  ───────────────────────────────────                                 │
│                                                                     │
│   New text vector ·                                                 │
│          ╲                                                          │
│           ╲  sim=0.82 ──→ Cluster A Center [credit card, apply, limit, repayment, bill]│
│            ╲                                                         │
│             ╲ sim=0.31 ──→ Cluster B Center [mobile banking, transfer, activation, security]│
│              ╲                                                        │
│               ╲ sim=0.18 ──→ Cluster C Center [wealth management, fund, stock, product, risk]│
│                ╲                                                       │
│                 ╲ sim=0.05 ──→ Cluster D Center [tomato, egg, eat, evening]│
│                                                                     │
│   → Max Similarity = 0.82 (Cluster A)                               │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  max_sim ≥ threshold(0.3)? │
                    └──────────┬───────────┘
                               │
              ┌────────────────┴────────────────┐
              │  ✅ Yes: 0.82 ≥ 0.3              │
              │  → Assign to existing Cluster A (Credit Card Related) │
              └────────────────┬────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 4: Execute Subsequent Actions                                 │
│  ─────────────────                                                   │
│  ① Return clustering result: This question belongs to 'Cluster A'   │
│  ② Online update Cluster A's center (moving average)                │
│     new_center = mean(old_center, new_vector)                        │
│  ③ Refresh Cluster A's keyword list                                 │
│  ④ Append new text to the corpus                                    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Step 5: Downstream Business Consumption                            │
│  ─────────────────                                                   │
│  After the Agent obtains the 'Cluster A' identifier:                 │
│  · Query the processing plan / FAQ template / automation tool for Cluster A│
│  · Directly execute the corresponding operation or return a standardized answer│
│  · No need for multiple rounds of follow-up, achieving "one-sentence direct access"│
└─────────────────────────────────────────────────────────────────────┘

Another Scenario: When encountering a completely new question

If the user asks "What's the weather like today?":

Step 1~2: Same as above (preprocessing + vectorization)

Step 3: Calculate similarity
  → With Cluster A (Credit Card): sim=0.02
  → With Cluster B (Mobile Banking): sim=0.01
  → With Cluster C (Wealth Management): sim=0.03
  → With Cluster D (Tomato): sim=0.08
  → Max Similarity = 0.08

Judgment: max_sim(0.08) < threshold(0.3)
  → ❌ Does not belong to any existing cluster!

Actions:
  ① Create a brand new Cluster E, assign a new UUID
  ② Use this text as the first member and initial center of Cluster E
  ③ Trigger "New Problem Discovery" notification → Can be transferred to manual processing or consolidated as a new type

6.3 Limitations and Improvement Directions

Limitation Description Improvement Ideas
TF-IDF ignores semantics Based on lexical matching, synonyms cannot be identified Use pre-trained Embeddings (like sentence-transformers) instead of TF-IDF
Hierarchical clustering O(n²) complexity Slow for large-scale data (tens of thousands and above) Use approximate methods (like MinHash LSH) for coarse screening first, then perform precise clustering on candidate sets
Threshold requires manual setting Optimal threshold differs for different data distributions Introduce adaptive thresholds or statistically-based methods for automatic determination
Cluster center drift Online updates may cause center shift Set freezing conditions or periodically re-cluster for correction

7. Summary

This article shared a solution for building a complete automatic text grouping system through the combination of TF-IDF vectorization + Ward hierarchical clustering + incremental prediction. If you have similar needs, feel free to discuss in the comments section.

Comments

Top 1 from juejin.cn, machine-translated. The original thread is authoritative.

用户560514818461

How is this data stored?