A Junior Engineer's Alibaba Agent Interview: LangGraph4j, MCP, and When the Tables Turned

On Lao Wang's desk sat a bottle of Nongfu Spring and a bottle of C'estbon.

Before the interview started, he unscrewed the Nongfu Spring, took a sip, then unscrewed the C'estbon and took a sip. He then asked me, "Do you know why I'm drinking from two bottles at the same time?"

I was completely baffled.

Lao Wang laughed: "Because our department is working on Agents, so we're used to handling everything in parallel."

I couldn't hold it in. I really couldn't. It was the first time I'd met such a hilarious interviewer. 😄

Fine, I give this opening a perfect score.

Lao Wang glanced at my resume: "PaiAgent, LangGraph4j + Spring AI, workflow orchestration... and a RAG knowledge base project. Looks like you've seriously worked on Agents."

I said, "Brother Wang, of course. I researched this position in advance. This time, I must succeed; failure is not an option."

Here are the questions that came up:

The architecture of LangChain
How many ways there are to develop an Agent
The core modules of LangChain
Multi-Agent in my project
Communication protocols between Agents
What I have done with MCP
Whether I understand Transformers
What the self-attention mechanism is
How the large models on the market compare, for example which is best at generating audio
Which editor I use
Whether I understand the internal algorithms of models

Fasten your seatbelt, here we go~ The full text is quite intense, you can bookmark it to read (memorize) later.


01. Let's Talk About LangChain's Architecture

Lao Wang started directly: "Let's start with a basic one. Do you understand the LangChain architecture? How is it designed overall?"

It can be divided into three layers overall.

The bottom layer is the foundational abstraction layer, defining core interfaces like LLM, ChatModel, Prompt, and OutputParser. All upper-layer functions revolve around these interfaces; switching models only requires changing the implementation class.

The middle layer is the capability layer, including modules like Chain (chained calls), Agent (autonomous decision-making), Memory (conversation memory), and Retriever (retrieval). Chain is responsible for stringing multiple steps together, Agent is responsible for autonomously selecting tools based on goals, and Memory is responsible for maintaining context.

The top layer is the application layer: LangServe for deployment, LangSmith for debugging and monitoring, and LangGraph for complex state graph orchestration.

Lao Wang kept nodding, looking very satisfied with my answer: "So what is the relationship between LangChain and LangGraph?"

I said: "LangChain's Chain is linear, A→B→C, going straight down one path. But real Agent scenarios often require conditional branching, loops, and parallel execution, which Chain can't handle."

"LangGraph can solve this problem. It upgrades the workflow from a chain to a graph—nodes process steps, edges can carry conditions, and it also supports loops. The LangGraph4j we use in our project is its Java version. The underlying layer is StateGraph, which uses state to drive the entire execution process."

Lao Wang followed up: "Why did you choose LangGraph4j instead of using the Python version directly?"

I said: "Because our entire tech stack is Java: Spring Boot + Spring AI. There aren't many choices for Agent orchestration in the Java ecosystem, and LangGraph4j is currently the most mature. Plus, it works well with Spring AI; ChatModel can be directly reused."
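To make the chain-versus-graph point concrete, here is a minimal sketch of the state-graph idea, written in Python for brevity even though our stack is Java. It is not the LangGraph4j API: `MiniStateGraph`, the node names, and the step limit are all made up for illustration. It only shows nodes updating a shared state, a conditional edge, and a loop.

```python
from typing import Callable, Dict

State = Dict[str, object]
END = "__end__"


class MiniStateGraph:
    """Toy state graph: nodes update a shared state, routers pick the next node."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[State], State]] = {}
        self.routers: Dict[str, Callable[[State], str]] = {}

    def add_node(self, name: str, fn: Callable[[State], State]) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: str) -> None:
        # An unconditional edge is a router that always returns the same target.
        self.routers[src] = lambda _state: dst

    def add_conditional_edge(self, src: str, router: Callable[[State], str]) -> None:
        self.routers[src] = router

    def run(self, entry: str, state: State, max_steps: int = 50) -> State:
        current = entry
        for _ in range(max_steps):
            if current == END:
                return state
            # Merge the node's partial update into the shared state.
            state = {**state, **self.nodes[current](state)}
            current = self.routers[current](state)
        raise RuntimeError("step limit reached; possible infinite loop")


def draft(state: State) -> State:
    attempts = state["attempts"] + 1
    return {"attempts": attempts, "text": "draft v%d" % attempts}


def review(state: State) -> State:
    # Hypothetical rule: approve from the second attempt onwards.
    return {"approved": state["attempts"] >= 2}


graph = MiniStateGraph()
graph.add_node("draft", draft)
graph.add_node("review", review)
graph.add_edge("draft", "review")
# Conditional edge: loop back to "draft" until the review passes.
graph.add_conditional_edge("review", lambda s: END if s["approved"] else "draft")

result = graph.run("draft", {"attempts": 0})
print(result)
```

The run goes draft, review, draft, review, and then stops, which is exactly the kind of loop a straight A→B→C chain cannot express.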

02. How Many Ways Are There to Develop an Agent?

Lao Wang took a sip of Nongfu Spring and continued questioning: "Not limited to LangChain, how many ways do you know to develop an Agent?"

"Four. I'll go through them one by one."

The first is the ReAct pattern. After receiving a task, the model alternates between Reasoning and Acting. It first figures out what to do in this step, then calls a specific tool to execute, takes the result back to think about the next step, and repeats until the task is complete. LangChain's Agent is this pattern.

The second is Plan-and-Execute. The model first plans out the entire execution plan in one go, then executes step by step according to the plan. The advantage is it's less likely to go down the wrong path. The disadvantage is that once the plan is set, it's hard to adjust.

The third is Multi-Agent collaboration. Multiple Agents each have their own responsibilities: one writes code, one reviews, one tests. They coordinate through message passing. AutoGen and CrewAI follow this approach.

The fourth is state graph orchestration, which is what LangGraph does. The developer draws the workflow as a graph, defines nodes, edges, and conditional branches, and the Agent executes along the path on the graph.
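Of the four approaches, the ReAct loop is the easiest to show in a few lines. The sketch below is only an illustration under assumptions: `scripted_model` is a hard-coded stand-in for a real LLM call, and the `calculator` tool and the history format are invented for this example.

```python
from typing import Callable, Dict, List, Tuple


def calculator(expression: str) -> str:
    # Toy tool: only handles "a + b" with integers.
    left, right = expression.split("+")
    return str(int(left) + int(right))


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}


def scripted_model(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM. Returns (thought, action, action_input)."""
    observations = [line for line in history if line.startswith("Observation:")]
    if not observations:
        return ("I need to add the numbers.", "calculator", "17 + 25")
    answer = observations[-1].split(":", 1)[1].strip()
    return ("I have the result.", "finish", answer)


def react(task: str, model, tools, max_steps: int = 5):
    history = ["Task: " + task]
    for _ in range(max_steps):
        # Reason: the model decides what to do next from the history so far.
        thought, action, action_input = model(history)
        history.append("Thought: " + thought)
        if action == "finish":
            return action_input, history
        # Act: call the chosen tool and feed the observation back in.
        observation = tools[action](action_input)
        history.append("Action: %s[%s]" % (action, action_input))
        history.append("Observation: " + observation)
    raise RuntimeError("no answer within step limit")


answer, trace = react("What is 17 + 25?", scripted_model, TOOLS)
print(answer)
```

The step limit matters in practice: without it, a model that never emits a finishing action would loop forever.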

Lao Wang leaned back in his chair: "Your project chose the fourth type?"

"Yes, because our scenario isn't simple Q&A; it's a multi-step workflow—users drag and drop nodes to connect them on the front end, and the backend executes according to the graph. We needed a more controllable execution path."

03. What Are the Core Modules of LangChain?

(Inner monologue: Luckily I did my homework beforehand, otherwise these first two questions would have finished me.)

Lao Wang listened very attentively and didn't interrupt me; he was really friendly.

After I finished answering, he continued asking: "LangChain's core modules, can you elaborate on them specifically?"

There are six in total.

Models: A unified wrapper for various large models. Whether it's OpenAI, DeepSeek, or Tongyi Qianwen, the caller goes through one chat-model interface. Although we didn't directly use LangChain in our project, Spring AI's ChatModel follows the same idea: every provider sits behind ChatModel.call().

Prompts: Prompt management. Supports template variables, Few-Shot examples, and dynamic assembly. The PromptTemplateService in our PaiAgent does this job, supporting double curly brace {{variable}} syntax for variable substitution.

Indexes: Document indexing, mainly used for RAG. Includes DocumentLoader, TextSplitter, VectorStore, and Retriever. Our PaiCongming project used Elasticsearch + Alibaba Embedding for hybrid retrieval: BM25 keyword + KNN vector recall.

Memory: Memory management. Short-term memory uses BufferMemory to directly store the most recent rounds of conversation; long-term memory can connect to VectorStore for semantic retrieval.

Chains: Stringing multiple steps together. The most commonly used LLMChain is a three-step series of Prompt + Model + OutputParser.

Agents: The autonomous decision-making module. The model chooses which tool to call and what parameters to pass based on the current goal. This is the core module that distinguishes LangChain from ordinary LLM calls.
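As an example of the Prompts module, the double curly brace substitution mentioned above can be sketched roughly like this. It is not the actual PromptTemplateService code; the `render` function, the regex, and the error on missing variables are my assumptions about one reasonable way to do it.

```python
import re

# Matches {{name}} and tolerates spaces inside the braces, e.g. {{ name }}.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    def replace(match: "re.Match") -> str:
        name = match.group(1)
        if name not in variables:
            # Fail loudly instead of sending a half-filled prompt to the model.
            raise KeyError("missing template variable: " + name)
        return str(variables[name])

    return _PLACEHOLDER.sub(replace, template)


prompt = render(
    "Summarize {{topic}} in {{ count }} bullet points.",
    {"topic": "MCP", "count": 3},
)
print(prompt)  # Summarize MCP in 3 bullet points.
```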

Lao Wang looked pleased; he seemed to approve of my answer.

So he continued asking: "Of these six modules, how many did you use in your project?"

"All of them except Chains." I paused. "But we didn't use LangChain's implementations; we used Spring AI plus our own code. We skipped Chains because we adopted LangGraph4j's StateGraph directly, which is more flexible than a Chain."

04. Introduce the Multi-Agent Situation in Your Project

Lao Wang had clearly dug deep into Agents, so he moved on to multi-agent systems: "Is there Multi-Agent in your project?"

I smiled sheepishly: "Brother Wang, I'll be honest with you—strictly speaking, PaiAgent is not a traditional Multi-Agent system."

Traditional Multi-Agent involves multiple independent Agents, each with their own goals and memory, collaborating through message passing. For example, in AutoGen, a Coder Agent writes code, a Critic Agent reviews code, and they have back-and-forth interactions.

PaiAgent does workflow orchestration. Multiple nodes—LLM nodes, TTS nodes, Input/Output nodes—are connected via a directed graph and executed in topological order. Each node is not an independent Agent but a processing step within the workflow.

But we have one design that aligns with the Multi-Agent concept: EngineSelector dual-engine routing.

Simple linear workflows go through the DAG engine (topological sort + DFS cycle detection); complex workflows with conditional branches and loops go through the LangGraph engine. Both engines share the same set of NodeExecutors, adapted via NodeAdapter.
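The DAG engine's core idea, topological sort with DFS cycle detection, can be sketched as follows. This is a generic textbook version, not PaiAgent's code; the adjacency-dict format and the node names are assumptions for illustration.

```python
from typing import Dict, List


def topological_order(graph: Dict[str, List[str]]) -> List[str]:
    """Return nodes in execution order; raise ValueError if the graph has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    for targets in graph.values():
        for target in targets:
            color.setdefault(target, WHITE)
    order: List[str] = []

    def visit(node: str) -> None:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Reaching a node that is still on the current path means a back edge.
                raise ValueError("cycle detected at edge %s -> %s" % (node, nxt))
            if color[nxt] == WHITE:
                visit(nxt)
        color[node] = BLACK
        order.append(node)  # post-order: a node is appended after its successors

    for node in list(color):
        if color[node] == WHITE:
            visit(node)
    order.reverse()
    return order


workflow = {"input": ["llm"], "llm": ["tts"], "tts": ["output"], "output": []}
order = topological_order(workflow)
print(order)  # ['input', 'llm', 'tts', 'output']
```

A workflow that fails this check contains a loop, which is the kind of graph the selector would hand to the LangGraph engine instead.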

Lao Wang's eyes lit up: "You're very candid. So if you were to design a true Multi-Agent system, how would you do it?"

"Give each Agent its own independent StateGraph, and have Agents communicate through a message bus." I grew more confident as I spoke, "Each Agent subscribes to the message types it cares about, processes them, and publishes the results back to the bus. Low coupling; adding a new Agent doesn't affect the others."

Seeing that Lao Wang was in good spirits, I seized the chance to ask back: "Brother Wang, how is Multi-Agent architected in your internal Alibaba Agent projects? How do you handle state synchronization between Agents?"

Lao Wang froze for a moment, wiped the sweat from his forehead: "Uh... our team is still mainly focused on single Agent + tool invocation. Multi-Agent is still in the exploration phase."

(Inner monologue: Hehehe, Lao Wang, I've got you now 🤣)
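The message-bus design I described to Lao Wang can be sketched in a few lines. This is an in-process toy under assumptions: the `MessageBus` class, the topic names, and the two agents are hypothetical, and a real system would more likely sit on a message queue with each Agent running its own state graph.

```python
from collections import defaultdict, deque


class MessageBus:
    """Minimal in-memory publish/subscribe bus."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)
        self._queue = deque()
        self.log = []  # every delivered (topic, payload), kept for inspection

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: str) -> None:
        self._queue.append((topic, payload))

    def run(self, max_messages: int = 100) -> None:
        delivered = 0
        while self._queue and delivered < max_messages:
            topic, payload = self._queue.popleft()
            self.log.append((topic, payload))
            for handler in self._subscribers[topic]:
                handler(self, payload)
            delivered += 1


def coder_agent(bus: MessageBus, task: str) -> None:
    # Stand-in for an Agent that would call a model to write code.
    bus.publish("code.written", "def add(a, b): return a + b  # for: " + task)


def reviewer_agent(bus: MessageBus, code: str) -> None:
    verdict = "approved" if "return" in code else "rejected"
    bus.publish("review.done", verdict)


bus = MessageBus()
bus.subscribe("task.created", coder_agent)
bus.subscribe("code.written", reviewer_agent)
bus.publish("task.created", "add two numbers")
bus.run()

topics = [topic for topic, _ in bus.log]
print(topics)  # ['task.created', 'code.written', 'review.done']
```

Neither agent knows the other exists; adding a tester only takes one more `subscribe` call, which is the low coupling I was talking about.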

05. Talk About Communication Protocols Between Agents

Lao Wang coughed and quickly pulled the topic back: "So what communication protocols between Agents do you know about?"

The mainstream one currently is the A2A protocol proposed by Google.

Its core idea is to give each Agent a "capability card" (Agent Card), describing in JSON what the Agent can do, what inputs it accepts, and what outputs it returns. Agents communicate via standard HTTP APIs, using Tasks as collaboration units, supporting both synchronous and asynchronous modes.

A2A solves the problem of Agent interoperability across teams and organizations. For example, if an e-commerce platform's Order Agent and a logistics company's Delivery Agent need to collaborate, they might use different frameworks and different models, but as long as both follow the A2A protocol, they can communicate.

What's the difference between MCP and A2A?

MCP solves a slightly different problem; it mainly allows Agents to call external tools and services.

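An MCP Server exposes its capabilities as tools, each described by a name, a description, and a JSON Schema for its input parameters. The Agent discovers and calls these tools through an MCP Client, with messages exchanged as JSON-RPC 2.0.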

The difference between the two: A2A is Agent-to-Agent, MCP is Agent-to-Tool. One solves Agent collaboration, the other solves Agent capability expansion.
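To show what "Agent-to-Tool" looks like on the wire, here is a rough sketch of an MCP tool description and the JSON-RPC 2.0 request a client would send to call it. The `tools/call` method and the `name` / `description` / `inputSchema` fields follow the MCP specification as I understand it; the `generate_pdf` tool and its parameters are hypothetical, and the `tools_call_request` helper is mine, not part of any SDK.

```python
import json

# Hypothetical tool description, shaped like an entry a server returns from tools/list.
pdf_tool = {
    "name": "generate_pdf",
    "description": "Render report content into a PDF file.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "content": {"type": "string"},
            "template": {"type": "string"},
        },
        "required": ["content"],
    },
}


def tools_call_request(request_id: int, tool: dict, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 tools/call request after a basic required-field check."""
    required = tool["inputSchema"].get("required", [])
    missing = [key for key in required if key not in arguments]
    if missing:
        raise ValueError("missing required arguments: " + ", ".join(missing))
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    }


request = tools_call_request(1, pdf_tool, {"content": "Q3 summary", "template": "report"})
wire = json.dumps(request)
print(wire)
```

The Agent side only ever sees the description and the request shape; nothing here reveals how the server actually renders the PDF.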

Lao Wang tilted his head: "So which one have you used in actual projects?"

"We haven't directly used A2A. But we implemented MCP in PaiCongming—we packaged three Servers: local file operations, PDF generation, and database queries."

Then I couldn't help but ask another question: "Brother Wang, what protocol do you use for communication between your internal Agents? Self-developed or open-source?"

Lao Wang wiped his sweat again: "We... use an internal RPC framework, and we're also looking at A2A."

06. What Have You Done with MCP?

(Inner monologue: After the first five questions, it felt like Brother Wang was quite interested in my project. The rhythm was good.)

Lao Wang unscrewed the C'estbon and took another sip: "What specifically did you do with MCP?"

"In the PaiCongming RAG project, we packaged three MCP Servers."

The first was a File Operation Server. It encapsulated capabilities like reading/writing local files and directory traversal into MCP tools. When the Agent needed to read a user-uploaded document, it called it via MCP instead of directly operating the file system.

The second was a PDF Generation Server. When the Agent needed to generate a PDF report from analysis results, it called the MCP tool, passing in content and a template. The server side used iText to render it into a PDF and stored it in MinIO.

The third was a Database Query Server. When the Agent needed to query business data, it initiated an SQL query via MCP. On the server side, we also implemented SQL injection detection and query timeout limits.

The benefit of MCP is that it standardizes tool capabilities.

The Agent doesn't need to know whether the PDF is generated using iText or wkhtmltopdf; it just needs to know the MCP tool description, fill in the parameters, and call it. If the underlying implementation changes, the Agent side doesn't need to change a single line of code.

Lao Wang leaned forward: "How did you handle MCP Server registration and discovery?"

"Currently, it's static registration via configuration files. The MCP config specifies each Server's address and port." Don't panic, don't panic. At this point my strategy was mainly bluffing, er, I mean confidence. "We didn't implement dynamic discovery. With only three Servers, static configuration is sufficient. If more are added later, we can connect to a registry: each Server registers its tool descriptions on startup, and the Agent pulls the list of available tools from the registry."

07. Do You Understand Transformers?

Lao Wang suddenly changed tack and asked: "Do you understand the Transformer architecture?"

"Absolutely." I sat up straighter. "Google's 2017 paper 'Attention Is All You Need'. Basically, all large models today are based on it or its variants."

The overall structure is divided into two parts: Encoder and Decoder.

The Encoder is responsible for understanding the input, encoding the text into a set of vector representations. The Decoder is responsible for generating the output, emitting one token at a time. Many current large language models, including the GPT family, keep only the Decoder stack.

Before Transformer, people used RNNs and LSTMs. The biggest pain point was that information decays when processing long sequences, and they couldn't be parallelized—they had to process words sequentially, one by one.

Transformer replaced the recurrent structure with the Self-Attention mechanism. Each position can directly "see" all other positions in the sequence, and it can be computed in parallel. This is the foundation for its ability to handle contexts of tens of thousands or even hundreds of thousands of tokens.

Lao Wang was quite interested in this area and continued asking: "Besides Attention, what other key components are in a Transformer?"

"Three."

Positional Encoding: Attention itself doesn't know the order of words. Positional encoding adds position information to each token. The original paper used sine and cosine functions; many models now use learnable positional encodings, or RoPE (Rotary Position Embedding).

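Layer Normalization: Each sub-layer is wrapped in a residual connection plus a LayerNorm to stabilize training. The original paper applies the LayerNorm after the sub-layer (Post-LN); many newer models apply it before the sub-layer (Pre-LN).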

Feed-Forward Network: A two-layer fully connected network follows each Attention layer for non-linear transformation. Actually, most of the model's parameters are concentrated here; the parameter proportion of the Attention layer isn't that large.
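The claim about where the parameters sit can be checked with quick arithmetic. The sketch below assumes the original paper's layout, where the feed-forward hidden size is 4 times the model dimension, and it ignores biases, LayerNorm, and embeddings. Models with gated feed-forward layers or grouped-query attention will give different ratios.

```python
def block_parameters(d_model: int, ffn_multiplier: int = 4):
    """Weight counts for one Transformer block, biases ignored."""
    # Attention: four d_model x d_model matrices (W_Q, W_K, W_V, W_O).
    attention = 4 * d_model * d_model
    # Feed-forward: d_model -> d_ff up-projection and d_ff -> d_model down-projection.
    d_ff = ffn_multiplier * d_model
    feed_forward = 2 * d_model * d_ff
    return attention, feed_forward


attention, feed_forward = block_parameters(4096)
ffn_share = feed_forward / (attention + feed_forward)
print(attention, feed_forward, round(ffn_share, 3))
```

Under these assumptions the feed-forward network holds twice as many weights as attention, about two thirds of each block.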

08. What is the Self-Attention Mechanism?

Lao Wang didn't give me a chance to catch my breath and immediately followed up: "How is Self-Attention specifically calculated?"

(Inner monologue: He's going to make me strip the Transformer down from head to toe.)

"The core is three matrices: Query, Key, Value."

The input sequence undergoes three different linear transformations to get the Q, K, V vector sets. Then the dot product of Q and K is calculated as a similarity score and divided by √d_k (where d_k is the dimension of the key vectors; without this scaling, large dot products push Softmax into a region with very small gradients). The result is passed through a Softmax to get the attention weights, and finally a weighted sum of V is calculated using those weights.

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
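The formula translates almost line by line into code. Here is a dependency-free sketch with plain Python lists; real implementations use batched matrix operations on a GPU, and the tiny Q, K, V values are made up so the result is easy to read.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(K[0]))  # square root of the key dimension
    output, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights[0], out[0])
```

Each query puts more weight on the key that points in the same direction, so the first output row leans towards the first value vector.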

For an analogy, imagine we're reading an article titled "Why Can Sows Climb Trees?". When we read the word "it", our brain automatically looks for what "it" refers to—that sow. Self-Attention does exactly this: each token "looks" at all other tokens in the sequence, calculates which ones it's most closely related to, and then aggregates the relevant information.

Lao Wang nodded and dug deeper: "What about Multi-Head Attention?"

"A single Head can only capture one type of association pattern." I gestured in two directions with my hand. "For example, one Head focuses on grammatical relationships—subject-verb-object, while another Head focuses on semantic relationships—synonym substitution."

Multi-Head runs multiple sets of Attention in parallel, each using different Q, K, V projection matrices, then concatenates the outputs of all Heads and passes them through a final linear projection.

This allows the model to simultaneously understand the relationships between tokens from multiple dimensions, making it much more expressive than a single Attention.

09. Comparison of Large Models

Lao Wang stretched and adopted a more relaxed posture: "There are so many large models now. Do you have any understanding of how they compare? For example, which is the best for audio generation?"

"I've actually researched this." I perked up. "In PaiAgent, we integrated several models and learned a few lessons the hard way."

Let's start with text generation.

Currently, the Claude series is the strongest overall. Among domestic models, DeepSeek V3 offers the best price-performance ratio. Tongyi Qianwen Qwen3 performs well in Chinese scenarios, especially for long-text understanding. GLM-5.1's programming capability, especially for long tasks, is still top-tier among domestic models.

For code generation, Claude Opus and GPT-5.4 are in the first tier.

For audio generation, in the TTS field, I've tested several. Alibaba Bailian's qwen-tts series has relatively natural voice quality; we use qwen3-tts-flash in PaiAgent.

For image generation, Nano Banana 2 is the benchmark. Among domestic ones, Tongyi Wanxiang and Jimeng (ByteDance) are improving rapidly.

10. Which AI Coding Tool Do You Use?

Lao Wang suddenly changed the subject: "What AI coding tool do you use for daily development?"

"Previously, my main tool was IDEA. Now, my main tools are Claude Code + Codex."

"For reading code, modifying code, running tests, and checking logs, I mainly use Codex; it handles a large volume well." I thought for a moment and added, "If I need to investigate a solution, or when Codex can't solve it, I switch to Claude Code, currently configured with Opus 4.6."

But I haven't completely abandoned IDEA; the debugger and code navigation are still irreplaceable. Now I use them together: the AI tools for writing code, IDEA for debugging.

11. Do You Understand the Internal Algorithms of Models?

(Inner monologue: An hour has passed. Lao Wang's stamina is really good; he can still ask more. I'm almost at my limit.)

Lao Wang looked at the time and made a final sprinting gesture: "Last one, do you understand the internal training algorithms of models?"

"I know the general idea, but not in great depth." I told the truth.

Large model training is roughly divided into three stages.

The first stage is Pre-training. It uses massive amounts of text data for self-supervised learning, with the goal of predicting the next token. This stage consumes the most computing power, often running thousands of GPUs for months. The model mainly learns the basic structure of language and world knowledge during this stage.

The second stage is SFT (Supervised Fine-Tuning). It uses human-annotated instruction-answer pairs for supervised fine-tuning, teaching the model to follow instructions. This stage transforms a model that "can talk but misses the point" into an assistant that "can understand instructions and give useful answers."

The third stage is RLHF (Reinforcement Learning from Human Feedback). First, a Reward Model is trained using human preference data to tell it what a good answer looks like. Then, reinforcement learning (PPO algorithm) is used to steer the main model's answers towards higher scores from the reward model. This stage makes the model's output more aligned with human expectations.
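For the Reward Model step, one common formulation is a pairwise loss over a preferred and a rejected answer: the loss is small when the preferred answer scores higher. I'm assuming that formulation here, and the scores are made-up numbers rather than real model outputs.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-diff))


good_ranking = preference_loss(3.0, 1.0)  # reward model agrees with the human label
bad_ranking = preference_loss(1.0, 3.0)   # reward model disagrees
print(round(good_ranking, 4), round(bad_ranking, 4))
```

Training pushes the reward model towards the first case, and PPO then uses its scores as the signal for the main model.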

Lao Wang stared at me: "What's the difference between RLHF and DPO?"

"RLHF requires separately training a reward model and then using PPO for reinforcement learning; the process is quite heavy." I made a chopping motion with my hand. "DPO cuts out the reward model step entirely, using preference data pairs to optimize the policy model directly, combining two steps into one. Training is simpler, more stable, and the results are not bad either."

Lao Wang was silent for a few seconds. It looked like the rain was finally about to stop.

"Okay, do you have any questions for me?"

I had been waiting for this sentence.

"Brother Wang, you said your department is working on Agent projects. So I want to ask—in areas like LangChain, Multi-Agent, and A2A, how are you doing it specifically?"

Lao Wang nearly spat out his water and started wiping his sweat again: "We... um... mainly use... Multi-Agent is still in the POC stage, and we indeed haven't integrated A2A yet..."

I smiled.

Lao Wang also smiled, screwed the caps back on both bottles: "When can you start work?"

"Next Monday. I'll go back and treat my roommates to a meal 🎉"