A Chinese Business Data Site Has Zero Anti-Scraping — Here's the Full Batch Extraction Code

Preface

Recently I was working on a small tool for batch collection of company information, originally intending to find a ready-made API. Before officially integrating the API, I casually searched for a few companies on the official website to test data quality.

After a few searches I glanced at the URL, saw that its structure was quite tidy, and used requests to pull one page. It returned HTML directly: no CAPTCHA, no redirect, nothing blocked.

I later went a few levels deeper and was still not blocked.

Since anti-scraping was essentially nonexistent, I figured I might as well see what data I could extract from the page.


1. What does the site look like? What can you find?

The site is here: Jinghai Data

<https://www.kqdaas.com/?utm_source=csdn&utm_medium=blog&utm_campaign=202606_brand_launch&utm_content=csdn_005>

The homepage has a very prominent search box, supporting input of company name, legal representative name, or product keywords.

The results return basic business registration information: company name, unified social credit code, legal representative, registered capital, establishment date, registered address, operating status, industry category, business scope — all these fields are present.

For scenarios like data cleaning, customer screening, and preliminary market analysis, these fields are already sufficient.

2. Scraping in practice: code you can run directly

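Since the site has no defenses, let's not be polite. The script below uses requests plus the standard json module to batch-collect basic company information. No HTML parser is needed, because the search endpoint returns its records as JSON inside a Next.js RSC stream.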

import csv
import json
import re
import time
from urllib.parse import quote

import requests

# Jinghai Data (kqdaas.com) uses a Next.js Server Action interface:
#   - Not a normal GET returning HTML, but POST /search with next-action header,
#     the keywords in the body are the actual search terms, returning RSC stream text.
#   - The real data is in the JSON on the line starting with "1:" (data.records).
#   - NEXT_ACTION / COOKIE (especially hh-token login state) come from browser capture; if expired, recapture and replace.
BASE_URL = "https://www.kqdaas.com/search"

# List of search keywords
KEYWORDS = ["科技", "信息", "数据"]

OUTPUT_CSV = "company_data.csv"

HEADERS = {
    "accept": "text/x-component",
    "content-type": "text/plain;charset=UTF-8",
    "origin": "https://www.kqdaas.com",
    "referer": BASE_URL,
    "next-action": "7f4db410c3cc3eabe9c3ae6dd4b83ade5bd1c26d8d",
    "next-router-state-tree": (
        "%5B%22%22%2C%7B%22children%22%3A%5B%22(local)%22%2C%7B%22children%22%3A%5B%22search%22"
        "%2C%7B%22children%22%3A%5B%22__PAGE__%22%2C%7B%7D%2Cnull%2Cnull%5D%7D%2Cnull%2Cnull%5D%7D"
        "%2Cnull%2Cnull%2Ctrue%5D"
    ),
    "user-agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/149.0.0.0 Safari/537.36"
    ),
    "Cookie": (
        "__bid_n=19eba682f8b6864d065fd9; "
        "hh-token=hanhai-local-32960d96fc2ee62271c544a21f95c3de; "
        "nb-referrer-hostname=www.kqdaas.com"
    ),
}


def parse_records(text):
    """Extract data.records from the RSC stream response.

    Each line of the response is shaped like <id>:<json>. Only one line contains JSON with data.records.
    Must split by \\n only; splitlines() would truncate JSON at Unicode line separators (\\x85 etc.) inside records.
    """
    for line in text.split("\n"):
        _, sep, payload = line.partition(":")
        if not sep or not payload.startswith("{"):
            continue
        try:
            data = json.loads(payload).get("data")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and "records" in data:
            return data
    return None


def search(keyword):
    """Search for a keyword, return data dict (with records / totalRecords), or None on failure."""
    # Server Action input structure is fixed; adding extra fields like pageIndex/pageSize causes the server to return data:null.
    # Must exactly match the captured body. The keyword in the URL is only for routing; the actual search term is in body.keywords.
    body = json.dumps([{
        "type": "filter",
        "keywords": keyword,
        "businessLocation": "$undefined",
        "companyIndustry": "$undefined",
        "companyType": "$undefined",
        "foundTime": "$undefined",
        "registCapital": "$undefined",
        "operateState": "$undefined",
        "contactType": "$undefined",
        "companyScale": "$undefined",
    }], ensure_ascii=False)

    resp = requests.post(f"{BASE_URL}?keyword={quote(keyword)}",
                         headers=HEADERS, data=body.encode("utf-8"), timeout=15)
    resp.raise_for_status()
    resp.encoding = "utf-8"  # requests may guess wrong encoding, causing garbled Chinese and JSON parse failure
    return parse_records(resp.text)


def main():
    results = []
    for keyword in KEYWORDS:
        try:
            data = search(keyword)
        except Exception as e:
            print(f"Keyword '{keyword}' error: {e}")
            continue
        if not data:
            print(f"Keyword '{keyword}' no data parsed; next-action / cookie may be expired, recapture needed")
            continue

        for r in data["records"]:
            results.append({
                "公司名称": re.sub(r"</?strong>", "", r.get("companyName") or ""),  # remove highlight tags; tolerate a null name
                "统一社会信用代码": r.get("creditNumber", ""),
                "法定代表人": r.get("juridicalPerson", ""),
            })
        print(f"Keyword '{keyword}' scrape complete: {len(data['records'])} records this page, "
              f"total searchable: {data.get('totalRecords', 0)}")
        time.sleep(1)  # polite delay

    with open(OUTPUT_CSV, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["公司名称", "统一社会信用代码", "法定代表人"])
        writer.writeheader()
        writer.writerows(results)
    print(f"Total scraped: {len(results)} records, written to {OUTPUT_CSV}")


if __name__ == '__main__':
    main()
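
The docstring of parse_records warns against splitlines(). Here is a minimal sketch of why, using a synthetic response. The stream contents and the helper name find_records are made up for illustration; only the <id>:<json> line shape comes from the script above.

```python
import json


def find_records(lines):
    """Return the first data dict containing 'records' among <id>:<json> lines."""
    for line in lines:
        _, sep, payload = line.partition(":")
        if not sep or not payload.startswith("{"):
            continue
        try:
            data = json.loads(payload).get("data")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and "records" in data:
            return data
    return None


# Synthetic response: the company name contains U+2028 (LINE SEPARATOR),
# which json.dumps leaves unescaped when ensure_ascii=False.
record_line = "1:" + json.dumps(
    {"data": {"records": [{"companyName": "A\u2028B"}], "totalRecords": 1}},
    ensure_ascii=False,
)
stream = '0:{"b":"development"}\n' + record_line + "\n"

by_newline = find_records(stream.split("\n"))      # splits on "\n" only
by_splitlines = find_records(stream.splitlines())  # also splits on U+2028

print("split('\\n'):", by_newline is not None)
print("splitlines():", by_splitlines is not None)
```

With split("\n") the record line stays intact and parses. With splitlines() the same line is cut in two at the separator, neither half is valid JSON, and the function finds nothing.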

I ran a large batch of queries continuously and hit almost no restrictions. That level of laxness is rare among today's enterprise data sites.

Of course, you will need to capture the request details yourself (the next-action header, the cookie, and the body shape), and other sites are structured differently. This is just a framework.

3. Two risk reminders

The above approach works, but two points need to be noted.

One is timeliness. This nearly defenseless state likely won't last forever. One day traffic spikes, or an admin casually adds rate limiting, and it could be gone. If you have a data collection need right now, act early.

The other is compliance. Before scraping, it's best to check robots.txt, control request frequency (adding a time.sleep(1) in the code is easy), and only use it for lawful and compliant purposes. Respect data copyright; for large-scale commercial use, it's recommended to go through the official API.
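
The robots.txt check can be automated with Python's standard urllib.robotparser. The sketch below parses a made-up robots.txt held in a string. The rules and the example.com URLs are assumptions for illustration, not the real file of any site; in practice you would fetch the target site's actual /robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/search?keyword=test"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/admin/users"))
```

Under these sample rules the search URL is permitted and the /admin/ URL is not. A scraper could call such a check once per path before its request loop and skip anything that is disallowed.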

4. Some data can't be scraped; you still need the API

If you need deeper dimensions — such as judicial risks, bidding records, intellectual property layout, qualification certificates, public opinion information — scraping can't get those; you still need the API.

I also tested Jinghai Data's API on the side. The 1,000 free credits given upon registration are universal across the entire platform, not a gimped version that only opens one or two endpoints. Friends with deep query needs can also claim the API quota.

The API dimensions are also quite broad; basically all data related to enterprise services is available. I cross-tested several dimensions, and accuracy and timeliness were decent.

The API follows standard RESTful style — just a GET request with header authentication, a few lines of Python code. Each endpoint also has an online debugging feature and detailed JSON response examples, making it easy to verify data structure, field completeness, and parameter logic.
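
As a rough sketch of what "a GET request with header authentication" looks like in Python, the snippet below builds such a request without sending it. The endpoint URL, the header name, the token, and the parameter names are all placeholders, not the platform's real values; take the real ones from the API documentation. It uses only the standard library so it runs without extra dependencies.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(endpoint, token, params):
    """Build (but do not send) a GET request that authenticates via a header."""
    url = f"{endpoint}?{urlencode(params)}"
    # The header name is an assumption; the real API may use a different one.
    return Request(url, headers={"Authorization": token}, method="GET")


# Placeholder endpoint, token, and parameters, for illustration only.
req = build_api_request(
    "https://api.example.com/bidding/list",
    "YOUR_API_TOKEN",
    {"keyword": "科技", "page": 1},
)
print(req.get_method(), req.full_url)
```

To actually send it, pass req to urllib.request.urlopen with a timeout and decode the response body with json.load.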

The "Bidding Information List" endpoint, for example, comes with a JSON response example in its documentation.

That's about it. For friends with data needs in this area, while anti-scraping hasn't been implemented yet and the API has free credits, go grab a spot early. Portal:

<https://www.kqdaas.com/?utm_source=csdn&utm_medium=blog&utm_campaign=202606_brand_launch&utm_content=csdn_005>