How to scrape PChome Taiwan product data
Scrape PChome Taiwan and you are scraping one of the dominant general-merchandise marketplaces in Taiwan, famous for its 24-hour delivery service across the island and operated by PChome Online Inc. The platform is one of the largest players in Taiwanese B2C ecommerce by GMV, competing head-to-head with Momoshop and Yahoo Shopping Taiwan, and it has a relatively scrape-friendly architecture for a marketplace of its size. The scraping landscape is shaped by three things: a publicly available JSON API that powers the front end, a category structure that maps cleanly onto URL paths, and a Cloudflare front end that profiles non-Taiwanese traffic less aggressively than most Asian marketplaces.
This guide covers PChome 24h (the express-delivery property at 24h.pchome.com.tw) and notes where the patterns extend to PChome Shopping Mall (shopping.pchome.com.tw).
Mapping PChome URL and JSON structure
PChome 24h product URLs follow the pattern https://24h.pchome.com.tw/prod/<productCode>. The productCode is a 12-character alphanumeric identifier that serves as the canonical product ID. Behind every product page sits a JSON endpoint at https://ecapi.pchome.com.tw/ecshop/prodapi/v2/prod/<productCode>. The endpoint returns price, stock, full description, images, and category path in a single response.
import httpx

API = "https://ecapi.pchome.com.tw/ecshop/prodapi/v2/prod"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
    "Accept-Language": "zh-TW,zh;q=0.9",
}

async def fetch_pchome(product_code: str, proxy: str) -> dict | None:
    """Fetch the product JSON for a single productCode; None on failure."""
    url = f"{API}/{product_code}"
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
        r = await c.get(url)
        if r.status_code == 200:
            return r.json()
    return None
The response includes Id (productCode), Name, Brand, Price.P (current price in TWD), Price.M (member price), Stock, Slogan, Description, Pic (image filenames), and Cate (category path as a list). For most analytical use cases the API is sufficient and you do not need to render the HTML.
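If you want to normalize that payload before storage, a minimal sketch follows, using the field names listed above. Because the endpoint is unofficial, every access is defensive, and the exact shape should be verified against live responses.

def parse_product(payload: dict) -> dict:
    """Flatten one product response into a storage-ready record.
    Field names follow the response shape described above; all access is
    defensive because the endpoint is unofficial and can drift."""
    price = payload.get("Price") or {}
    return {
        "product_code": payload.get("Id"),
        "name": payload.get("Name"),
        "brand": payload.get("Brand"),
        "price_twd": price.get("P"),
        "member_price_twd": price.get("M"),
        "stock": payload.get("Stock"),
        "images": payload.get("Pic") or [],
        "category_path": payload.get("Cate") or [],
    }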
Taiwanese proxy strategy
PChome’s bot detection is moderate compared to other regional marketplaces. The site profiles visitor IP at the country level but does not aggressively block non-Taiwan traffic for the API endpoints. For light scraping (under 5,000 product reads per day), a clean datacenter IP from a Tokyo or Singapore region works. For higher volumes, Taiwanese residential or mobile IPs through Chunghwa Telecom or Far EasTone produce dramatically better success rates and avoid the occasional Cloudflare interstitial.
For full catalogue sweeps that touch hundreds of thousands of SKUs per day, dedicated Taiwan inventory pays for itself. The cost differential against generic Asian residential is meaningful but the success rate differential at scale is larger.
Crawling the category tree
PChome exposes the category tree at https://ecapi.pchome.com.tw/cdn/ecshop/prodapi/v2/cateinfo/<rootCateId>/category. Categories are organized in a hierarchy up to six levels deep. The API returns child nodes with their own categoryId values, which you can walk recursively to enumerate the full tree.
async def fetch_category(cate_id: str, proxy: str) -> list:
    """Fetch the child nodes of one category; returns [] on failure."""
    url = f"https://ecapi.pchome.com.tw/cdn/ecshop/prodapi/v2/cateinfo/{cate_id}/category"
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
        r = await c.get(url)
        if r.status_code == 200:
            return r.json()
    return []
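A minimal recursive walker over that endpoint might look like the sketch below. The child-node key name (Id here) and the list-of-dicts response shape are assumptions to check against real responses.

async def walk_categories(root_id: str, proxy: str) -> list[str]:
    """Breadth-first walk of the category tree, collecting leaf IDs.
    Assumes the endpoint returns a list of child dicts keyed by Id;
    adjust the key names to the actual response shape."""
    leaves, queue = [], [root_id]
    while queue:
        cate_id = queue.pop(0)
        children = await fetch_category(cate_id, proxy)
        if not children:
            leaves.append(cate_id)  # no children: treat as a leaf
            continue
        queue.extend(child["Id"] for child in children)
    return leaves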
For each leaf category, the listing endpoint at https://ecapi.pchome.com.tw/ecshop/prodapi/v2/cateprod/<cateId>/prod returns the products in that category. Pagination uses start and rows parameters, with a practical cap of about 1,000 products per category. For broader categories, decompose the crawl by brand or price-band facet.
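A paginated fetch for one leaf category, assuming the start/rows semantics and the ~1,000-product cap described above, could look like this sketch:

LISTING = "https://ecapi.pchome.com.tw/ecshop/prodapi/v2/cateprod/{cate_id}/prod"

async def fetch_listing(cate_id: str, proxy: str, rows: int = 100) -> list:
    """Page through one leaf category until an empty page or the cap.
    Assumes the endpoint returns a JSON list of products per page."""
    products = []
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
        for start in range(0, 1000, rows):
            r = await c.get(LISTING.format(cate_id=cate_id),
                            params={"start": start, "rows": rows})
            if r.status_code != 200:
                break
            page = r.json()
            if not page:
                break  # empty page means we ran off the end
            products.extend(page)
    return products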
Working with Traditional Chinese text
PChome catalogue text is in Traditional Chinese. If your downstream pipeline expects Simplified Chinese, plan a conversion step using the OpenCC library. Conversion is generally lossless from Traditional to Simplified for product catalogue text, though edge cases (region-specific brand names, technical jargon) sometimes need a dictionary override.
from opencc import OpenCC
cc = OpenCC("t2s") # traditional to simplified
simplified_title = cc.convert(traditional_title)
For full-text search across the dataset, both Postgres and Elasticsearch handle Chinese text well with the right analyzer. Use a Chinese-aware parser such as zhparser or pg_jieba in Postgres, or the IK analyzer in Elasticsearch. Avoid space-tokenized full-text search because Chinese has no word boundaries and naive tokenization fails badly.
Cross-checking PChome pricing against Momoshop
The most analytically interesting Taiwan ecommerce signal is the price differential between PChome and Momoshop, the two dominant marketplaces, for the same SKU. They overlap heavily on electronics, home appliances, and beauty, and the price gap on a given SKU often signals which platform is running a promotion at any given time.
To match SKUs across platforms, group by EAN or by normalized title plus brand. The match rate is roughly 60% on EAN and an additional 20% on title-plus-brand, leaving a long tail of platform-exclusive SKUs.
| Field | PChome 24h | Momoshop | Notes |
|---|---|---|---|
| Canonical ID | productCode (12 char) | i_code (numeric) | Different schemes |
| EAN coverage | ~50% | ~45% | Voluntary by seller |
| Update cadence | Hourly | Hourly | Both update prices throughout the day |
| Promotion model | Site-wide and member | Site-wide and category | Different promotion engines |
For brand monitoring use cases, build a master SKU table that joins PChome productCode and Momoshop i_code via EAN where available. For SKUs without EAN matches, run a daily fuzzy-matching job and surface the matches for human review.
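As a sketch of that daily fuzzy-matching job, here is one way to generate review candidates using the rapidfuzz library (one choice among many string-similarity libraries). The title, brand, and sku keys are assumed to be normalized upstream.

from rapidfuzz import fuzz

def title_brand_candidates(pchome_rows, momo_rows, threshold: int = 90):
    """Yield likely cross-platform SKU matches for human review.
    Rows are assumed to be dicts with normalized title/brand/sku keys."""
    for p in pchome_rows:
        for m in momo_rows:
            if p["brand"] != m["brand"]:
                continue  # require exact brand match before fuzzy title match
            score = fuzz.token_sort_ratio(p["title"], m["title"])
            if score >= threshold:
                yield p["sku"], m["sku"], score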
Detecting and routing around CAPTCHA challenges on PChome
When PChome flags your traffic, the response is usually a Cloudflare challenge page (or a similar vendor interstitial) rather than a clean HTTP error. Your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, __cf_chl_ cookies in Set-Cookie, or HTML containing Just a moment.... Treat any of these as a soft block.
def is_challenged(response) -> bool:
    """Detect a Cloudflare challenge masquerading as a normal response."""
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet. For pages that absolutely must be fetched, have a fallback path that uses a headless browser with a real Taiwan residential IP. The browser path costs more per page but solves the small percentage of challenges that the API path cannot handle. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
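A minimal cooling-pool sketch that implements the 30-minute rule might look like this:

import time

class CoolingPool:
    """Challenge-aware IP routing: a flagged IP sits out for 30 minutes
    while subsequent requests route to the remaining pool."""
    def __init__(self, ips: list[str], cooldown_seconds: int = 1800):
        self.ips = ips
        self.cooldown = cooldown_seconds
        self.cooling_until = {}  # ip -> epoch seconds when it may return

    def mark_challenged(self, ip: str):
        self.cooling_until[ip] = time.time() + self.cooldown

    def next_ip(self) -> str | None:
        now = time.time()
        available = [ip for ip in self.ips
                     if self.cooling_until.get(ip, 0) <= now]
        return available[0] if available else None

Call mark_challenged whenever is_challenged returns True, and pull each request's proxy from next_ip.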
Working with TWD pricing and FX normalization
Pricing on PChome is denominated in TWD, and any cross-market analysis requires careful FX normalization. The naive approach of converting at scrape time using a live FX feed introduces noise into your trend lines because exchange rate movements get conflated with real price changes. Store the price in local TWD and apply FX conversion at query time using a daily reference rate.
CREATE TABLE fx_rates (
    rate_date DATE NOT NULL,
    base_ccy  VARCHAR(3) NOT NULL,
    quote_ccy VARCHAR(3) NOT NULL,
    rate      DECIMAL(18, 8) NOT NULL,
    PRIMARY KEY (rate_date, base_ccy, quote_ccy)
);
Source the daily rates from a reliable feed such as the European Central Bank reference rates or your bank's wholesale feed. Avoid scraping retail FX rates because they include the bank's spread and produce inconsistent comparisons. For analyses that span multiple years, also account for long-run exchange-rate drift, which can dwarf individual price movements.
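At analysis time, conversion then becomes a pure lookup against the stored daily rates. A minimal sketch, assuming the rates have been loaded from fx_rates into a dict of Decimals keyed by date and currency:

from datetime import date
from decimal import Decimal

def convert_twd(price_twd: Decimal, snapshot_date: date,
                quote_ccy: str, rates: dict) -> Decimal:
    """Convert a stored TWD price at query time using the daily reference
    rate, so FX noise never contaminates the stored local-currency series.
    rates maps (rate_date, quote_ccy) -> Decimal TWD->quote rate."""
    rate = rates[(snapshot_date, quote_ccy)]
    return (price_twd * rate).quantize(Decimal("0.01"))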
Comparing PChome to other regional marketplaces
| Marketplace | Country focus | Catalogue scale | Bot strictness |
|---|---|---|---|
| PChome 24h | Taiwan | Large | Moderate |
| Momoshop | Taiwan | Medium | Medium |
| Yahoo Shopping TW | Taiwan | Smaller | Lower |
Cross-marketplace analyses help separate platform-specific dynamics from genuine market trends. If a price drops on PChome 24h but stays flat across the comparable competitors, that is a platform-driven event rather than a market-wide signal. Your scraping pipeline should ingest from at least three platforms in any market where you intend to publish category insights.
Operational monitoring and alerting
Every production scraper needs three monitoring layers regardless of target. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
import time
from collections import deque

class IPHealthTracker:
    """Tracks per-IP success rate over a sliding time window."""
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}  # ip -> deque of (timestamp, success) pairs

    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        # Evict events that have aged out of the window
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if not bucket:
            return 1.0  # no data yet: assume healthy
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails. For long-running operations against PChome, IP rotation triggered by the health tracker is more reliable than fixed rotation schedules.
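Exporting the tracker into Prometheus takes a few lines with the official prometheus_client library; the metric name below is a placeholder to fit your naming scheme.

from prometheus_client import Gauge

ip_success_rate = Gauge("scraper_ip_success_rate",
                        "Rolling 5-minute success rate per proxy IP", ["ip"])

def export_health(tracker: IPHealthTracker, ips: list[str]):
    # Push each IP's rolling success rate into the gauge so alerts can
    # fire on the sub-80% threshold described above.
    for ip in ips:
        ip_success_rate.labels(ip=ip).set(tracker.success_rate(ip))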
Legal and compliance considerations for Taiwan
Public product, price, and availability data are generally treated as fair to scrape in most jurisdictions, but Taiwan has its own consumer protection and personal data frameworks. Confine your collection to non-personal data: SKU identifiers, prices, descriptions, ratings as aggregates, and seller display names. Avoid collecting individual buyer reviews with names, phone numbers, or email addresses attached, and avoid pulling any data behind a login.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. Most data protection regimes treat scraped public data more favorably when there is a clear lawful basis and the data is not used for direct marketing to identified individuals. Published data-ethics and scraping-compliance frameworks are useful starting points for documenting your approach.
Pipeline orchestration and scheduling
For any non-trivial scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster. Both handle the patterns you need: DAG dependencies, retries, observability, secret management, and dynamic fan-out across IPs and categories.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_category_page(category_id: int, page: int):
    # crawl_one_page is your own fetch-and-parse helper
    return crawl_one_page(category_id, page)

@task
def store_pages(pages: list):
    # write_to_db is your own persistence helper
    write_to_db(pages)

@flow(name="pchome-daily-sweep")
def daily_sweep(category_ids: list):
    futures = []
    for cid in category_ids:
        for page in range(1, 50):
            futures.append(fetch_category_page.submit(cid, page))
    pages = [f.result() for f in futures]
    store_pages(pages)
Run the flow on a 6-hour or 24-hour schedule depending on how dynamic the underlying catalogue is. For seasonal markets like apparel where pricing changes daily, a 6-hour cadence catches the meaningful movements without driving up proxy costs unnecessarily. For long-tail categories like books or industrial supplies, daily is sufficient and the cost saving is meaningful.
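As a sketch, Prefect's serve() can attach the cron cadence directly to the flow; the schedule and category IDs below are placeholders, and the exact serve() signature should be checked against the Prefect version you run.

if __name__ == "__main__":
    # Every 6 hours, per the cadence discussion above; the category ID
    # list is a placeholder for your enumerated leaf categories.
    daily_sweep.serve(
        name="pchome-6h-sweep",
        cron="0 */6 * * *",
        parameters={"category_ids": [1234, 5678]},
    )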
Sample analytics queries on the collected dataset
Once your snapshots are landing reliably, the analytics layer is where the value materializes. A few queries that consistently come up across PChome datasets:
-- Top 50 SKUs by price spread over the last 7 days
-- (min minus max is a spread proxy: it ignores whether the low came
-- before or after the high; use window functions for directional drops)
SELECT sku, MIN(selling_price) - MAX(selling_price) AS price_drop
FROM snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY sku
ORDER BY price_drop ASC
LIMIT 50;
-- Stock-out frequency per category
SELECT category_id,
SUM(CASE WHEN in_stock = 0 THEN 1 ELSE 0 END)::float / COUNT(*) AS oos_rate
FROM snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY category_id
ORDER BY oos_rate DESC;
-- New SKUs first seen in the last 14 days
SELECT sku, MIN(snapshot_at) AS first_seen
FROM snapshot
GROUP BY sku
HAVING MIN(snapshot_at) > now() - interval '14 days'
ORDER BY first_seen DESC;
These three queries alone power most of the dashboards a category manager wants. Add a brand share view, a seller concentration view, and a campaign-frequency view and you have a competitive intelligence product. The collection layer is the prerequisite; the analytics layer is where you create defensible value.
Building robust deduplication across noisy listings
When you scrape any ecommerce marketplace at scale, the long-tail catalogue is full of near-duplicate listings. The same physical product appears under different titles, different sellers, slight variations in pack size, and slightly different image sets. For analytics that try to compute brand share or category trends, deduplication is mandatory and it is harder than it looks.
The standard approach uses a three-pass funnel. The first pass groups by exact match on EAN or GTIN where present. The second pass groups by normalized title plus brand using a TF-IDF cosine similarity threshold of 0.85. The third pass groups by image hash similarity using perceptual hashing.
import imagehash
from PIL import Image

def perceptual_hash(image_path: str) -> str:
    """16x16 pHash; near-duplicate images produce hashes with a small
    Hamming distance, which drives the third-pass grouping."""
    img = Image.open(image_path)
    return str(imagehash.phash(img, hash_size=16))
Tune the similarity thresholds against a hand-labeled gold set of 500 to 1,000 known duplicate clusters. Without a gold set, you will either over-merge (collapsing distinct variants) or under-merge (leaving the same product in many groups). Both failure modes break downstream analytics in subtle ways that take weeks to detect.
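For the second pass, here is a minimal sketch using scikit-learn; character n-grams are one way to sidestep the word-boundary problem in Chinese titles, and the analyzer choice is an assumption to validate against your gold set.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def title_similarity_matrix(titles: list[str]):
    """Pairwise TF-IDF cosine similarity for the 0.85-threshold pass."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    matrix = vec.fit_transform(titles)
    return cosine_similarity(matrix)  # pairs >= 0.85 become merge candidates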
Versioning your scraper for catalogue evolution
Every ecommerce site evolves its catalogue structure regularly. Stamp every snapshot row with the scraper version that produced it. When you deploy a new version of the parser, increment the version number. Downstream analytics can filter by version when they need consistent semantics across a time range, or join across versions when they want long-running trend analysis.
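A minimal version-stamping sketch, with the version string as a placeholder for your own scheme:

SCRAPER_VERSION = "2026.02.1"  # bump on every parser deploy

def snapshot_row(parsed: dict) -> dict:
    # Stamp each row so downstream queries can filter or join by version.
    return {**parsed, "scraper_version": SCRAPER_VERSION}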
Caching strategy and incremental crawls
Full daily snapshots scale linearly with catalogue size, which becomes expensive at multi-million SKU scale. Most production deployments shift from full snapshots to incremental refreshes after the initial ramp. The pattern uses three signals to decide what to refetch on each cycle. The first signal is freshness deadline. Every SKU has a maximum staleness budget, and any SKU older than its budget gets refreshed. The second signal is volatility. SKUs that have changed price recently get higher refresh priority. The third signal is business priority. SKUs that downstream users actually query get higher refresh priority than dormant SKUs that nobody has looked at in months.
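One way to turn those three signals into a scheduler is a simple priority score; the weights and field names below are illustrative, not prescriptive.

import heapq
import time

def refresh_priority(sku: dict, now: float) -> float:
    """Combine the three signals into one score (higher = fetch sooner)."""
    staleness = (now - sku["last_fetched"]) / sku["staleness_budget"]
    volatility = sku["price_changes_30d"] / 30.0
    demand = 1.0 if sku["queried_recently"] else 0.1
    return staleness + 2.0 * volatility + demand

def next_batch(skus: list[dict], batch_size: int) -> list[dict]:
    """Pick the highest-priority SKUs for the next refresh cycle."""
    now = time.time()
    return heapq.nlargest(batch_size, skus,
                          key=lambda s: refresh_priority(s, now))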
This kind of priority-driven scheduler reduces total request volume by 60-80% compared to blind full snapshots, while keeping the data fresh on the SKUs that actually matter to the business.
Common pitfalls when scraping PChome 24h
Three issues dominate PChome scraping. The first is the prod ID vs SKU ID confusion. PChome uses a prod_id for the product page and a separate sku_id for variants like color and capacity. The same prod_id can have 3-10 sku_ids with different prices. Naive scrapers join on prod_id and report the cheapest variant as the product price, which understates the average sale price by 8-20% for electronics.
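A small sketch of variant-aware price reporting, assuming you have already enumerated each prod_id's variants into dicts with sku_id and price keys (the enumeration step itself depends on the variant endpoint your crawl uses):

from statistics import median

def variant_price_summary(variants: list[dict]) -> dict:
    """Report the full price picture for one prod_id instead of silently
    taking the cheapest sku_id."""
    prices = [v["price"] for v in variants]
    return {
        "min_price": min(prices),
        "median_price": median(prices),
        "max_price": max(prices),
        "variant_count": len(prices),
    }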
The second is 24-hour delivery vs marketplace mixing. PChome 24h’s flagship promise is Taiwan-wide same-day or next-day delivery, but the marketplace (PChome Mall) sits on the same domain with longer fulfillment. The is_24h flag separates them. Aggregating both into the same dataset distorts delivery-speed analytics.
The third is Traditional vs Simplified Chinese normalization. PChome stores product titles in Traditional Chinese. Cross-referencing with Mainland data sources requires a TC->SC conversion via OpenCC or similar. Decoding the bytes as UTF-8 changes nothing about the script itself; without an explicit conversion, Traditional titles will not string-match Simplified-Chinese sources.
FAQ
Is the PChome API officially documented?
The ecapi.pchome.com.tw endpoints are the same endpoints used by the PChome web front end. They have been stable for several years but are not contractually supported as a public API. Build defensively with schema drift alerts.
Does PChome offer an affiliate API I should use instead?
PChome operates an affiliate program but the affiliate feed lacks real-time pricing and does not include the full catalogue. For competitive intelligence purposes, scraping the public API remains the higher-fidelity path.
Can I scrape PChome from China mainland IPs?
PChome does not specifically block China mainland IPs but the success rate is meaningfully lower because the cross-strait routing introduces high latency and connection instability. Taiwan or Hong Kong IPs are strongly preferred.
How does PChome handle the difference between PChome 24h and PChome Shopping Mall?
24h is the express-delivery property with a smaller curated catalogue and PChome’s own logistics. Shopping Mall is the larger marketplace with third-party sellers and longer delivery times. The APIs share patterns but use different endpoint paths and product code schemes. Plan for separate code paths if your project covers both.
Are Taiwan-specific holidays a meaningful factor for snapshot scheduling?
Taiwan’s major shopping events (Double 11, Double 12, Lunar New Year, 618) drive significant pricing and promotion activity. During these windows, increase your snapshot frequency to capture the rapid price changes. The default 24-hour cadence misses important intraday movements during peak promotional periods.
Does PChome block non-Taiwan IPs aggressively?
Casual lookups succeed from most regions. Sustained scraping above 100 requests per hour from non-TW IPs triggers rate limits. Taiwan residential IPs are the safe path for production.
How do I track PChome Double 11 promotions?
Capture daily baselines from October 1 onwards. PChome staggers promotions across two weeks leading into 11/11 with daily reveals, so a single pre-event baseline misses the early discount waves.
To build a broader Taiwan ecommerce intelligence stack, browse the ecommerce scraping category for tooling reviews, proxy comparisons, and framework deep dives that pair with the patterns above.