How to scrape Tiki Vietnam ecommerce

Scrape Tiki Vietnam and you tap into one of the three dominant ecommerce platforms in Vietnam, alongside Shopee Vietnam and Lazada Vietnam. Tiki was founded in 2010 and has built a strong reputation around fast 2-hour delivery (TikiNow) in major cities and a curated catalogue that leans into electronics, books, and household goods. The scraping landscape is shaped by three things: a publicly accessible JSON API that powers most of the site, Vietnamese-language content with Latin characters but heavy diacritics, and a moderate Cloudflare front end that profiles non-Vietnamese traffic.

This guide focuses on Tiki at tiki.vn as the canonical example. The patterns transfer to TikiNow city-specific catalogues with minor adjustments.

Mapping Tiki URL and JSON structure

Tiki product URLs follow the pattern https://tiki.vn/<product-slug>-p<productId>.html. The trailing productId (an integer) is the canonical SKU identifier. Behind every product page sits a JSON endpoint at https://tiki.vn/api/v2/products/<productId>. The endpoint accepts query parameters for store_id and platform and returns price, stock, full description, images, and seller information.
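
As a minimal sketch of the URL pattern above, the product ID can be pulled out of a product URL with a regular expression. The function name product_id_from_url is an illustrative assumption, and the pattern only covers the -p<productId>.html form described here.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches the trailing "-p<digits>.html" segment of a product path.
_PRODUCT_ID_RE = re.compile(r"-p(\d+)\.html$")


def product_id_from_url(url: str) -> Optional[int]:
    """Return the integer productId from a product URL, or None if absent."""
    path = urlparse(url).path  # drops any query string or fragment
    match = _PRODUCT_ID_RE.search(path)
    return int(match.group(1)) if match else None
```

Parsing the path rather than the raw string keeps tracking parameters in the query string from interfering with the match.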

import httpx

API = "https://tiki.vn/api/v2/products"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Accept": "application/json",
    "Accept-Language": "vi-VN,vi;q=0.9,en;q=0.8",
}

async def fetch_tiki(product_id: int, proxy: str):
    url = f"{API}/{product_id}"
    params = {"platform": "web", "spid": ""}
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
        r = await c.get(url, params=params)
        if r.status_code == 200:
            return r.json()
        return None

The response includes id, name, price, original_price, discount_rate, stock_item.qty, inventory_status, seller.id, seller.name, brand.name, categories, and rating_average. For most analytical use cases the API alone is sufficient.
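
The nested response can be flattened into one row per product before storage. The sketch below assumes the field paths listed above (stock_item.qty, seller.id, brand.name) and tolerates missing blocks; the output key names are my own choice, not part of the API.

```python
def flatten_product(data: dict) -> dict:
    """Pick the analytical fields out of a product payload, tolerating gaps."""
    seller = data.get("seller") or {}
    brand = data.get("brand") or {}
    stock = data.get("stock_item") or {}
    return {
        "product_id": data.get("id"),
        "name": data.get("name"),
        "price": data.get("price"),
        "original_price": data.get("original_price"),
        "discount_rate": data.get("discount_rate"),
        "stock_qty": stock.get("qty"),
        "inventory_status": data.get("inventory_status"),
        "seller_id": seller.get("id"),
        "seller_name": seller.get("name"),
        "brand_name": brand.get("name"),
        "rating_average": data.get("rating_average"),
    }
```

Missing fields come back as None rather than raising, so a schema change shows up as a spike in nulls that a quality check can catch.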

Vietnamese proxy strategy

Tiki's bot detection profiles visitor IPs at the country level. For light scraping under 5,000 product reads per day, clean Asian datacenter IPs from Singapore or Tokyo work. For higher volumes, Vietnamese residential IPs through Viettel, VNPT, or FPT Telecom dramatically improve success rates and avoid Cloudflare interstitials.

For full catalogue sweeps, dedicated Vietnamese mobile proxy inventory pays for itself. It costs meaningfully more than pan-Asian residential inventory, but the success rate at scale is significantly higher.

Crawling the category tree

Tiki exposes a category tree at https://api.tiki.vn/raiden/v2/menu-config. Each category has a url_key and an id that you can use to query the listing API at https://tiki.vn/api/personalish/v1/blocks/listings. Pagination uses page and limit query parameters, with practical limits of 50 pages of 50 products each.

async def crawl_category(category_id: int, proxy_pool, max_pages: int = 50):
    results = []
    for page in range(1, max_pages + 1):
        proxy = proxy_pool.next()
        url = "https://tiki.vn/api/personalish/v1/blocks/listings"
        params = {
            "limit": 50,
            "page": page,
            "category": category_id,
            "aggregations": 2,
        }
        async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
            r = await c.get(url, params=params)
            if r.status_code != 200:
                break
            items = r.json().get("data", [])
            if not items:
                break
            results.extend(items)
    return results

For deeper coverage of large categories, decompose by brand or price band facets. The aggregations response includes the available facets and counts per facet.
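
One way to decompose by price band is to split a price range recursively until every band fits under the pagination ceiling (50 pages of 50 products, so 2,500 listings). This is a sketch of the splitting logic only: count_in_band is a hypothetical callable that you would back with the facet counts from the aggregations response, and no listing API parameter names are assumed here.

```python
from typing import Callable


def split_price_bands(
    lo: int,
    hi: int,
    count_in_band: Callable[[int, int], int],
    cap: int = 2500,
) -> list[tuple[int, int]]:
    """Split the half-open price range [lo, hi) into bands of at most cap listings."""
    # Stop when the band fits under the cap or cannot be narrowed further.
    if hi - lo <= 1 or count_in_band(lo, hi) <= cap:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (
        split_price_bands(lo, mid, count_in_band, cap)
        + split_price_bands(mid, hi, count_in_band, cap)
    )
```

Each returned band can then be crawled as its own paginated query. A band that still exceeds the cap at width 1 needs a second facet, such as brand.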

Handling Vietnamese diacritics

Vietnamese product names use Latin characters with extensive diacritics. Always store the original UTF-8 text without normalizing diacritics, because diacritic stripping changes the meaning of many words and breaks brand matches. For full-text search, keep the stored text intact and handle accent-insensitive matching in the index instead: Postgres with the unaccent extension, or Elasticsearch with an ASCII-folding token filter or a community Vietnamese analysis plugin.
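
As a small illustration of keeping the original text while still supporting accent-insensitive lookup, the sketch below derives a separate folded key using only the standard library. The name fold_diacritics and the name_folded column are assumptions for illustration.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return an accent-free, lowercased search key for a Vietnamese string."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (tone and vowel-quality marks) left by NFD.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # đ/Đ have no canonical decomposition, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D").lower()


# Store both: the original for display and matching, the folded key for search.
record = {"name": "Điện thoại", "name_folded": fold_diacritics("Điện thoại")}
```

The folded key collapses distinct words such as "ma" and "má" into one value, which is exactly why it should live next to the original rather than replace it.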

Cross-checking Tiki pricing against Shopee Vietnam

The most analytically interesting Vietnamese ecommerce signal is the price differential between Tiki and Shopee for the same SKU. They overlap heavily on electronics, beauty, and household goods. For brand monitoring, the price gap on a given SKU often signals which platform is running a flash promotion at any given time.

| Field | Tiki | Shopee VN | Notes |
| --- | --- | --- | --- |
| Canonical ID | productId (int) | itemid (int) | Different schemes |
| EAN coverage | ~40% | ~30% | Voluntary by seller |
| Update cadence | Hourly | Sub-hourly | Both update frequently |
| Promotion model | Flash sale + voucher | Flash sale + coin + voucher | Shopee promotion stack is more complex |

For SKU matching across platforms, group by EAN where available and by normalized title plus brand for the long tail.
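
The grouping rule above can be sketched as a single key function. The input field names (ean, brand, title) are assumptions about your own normalized records, not fields guaranteed by either platform.

```python
import re


def match_key(item: dict) -> tuple:
    """Build a cross-platform grouping key: EAN first, else brand plus title."""
    ean = (item.get("ean") or "").strip()
    if ean:
        return ("ean", ean)
    brand = (item.get("brand") or "").strip().casefold()
    title = (item.get("title") or "").casefold()
    # Collapse punctuation and repeated whitespace; letters with diacritics are kept.
    title = re.sub(r"[^\w]+", " ", title).strip()
    return ("title", brand, title)
```

Records from both platforms that produce the same key fall into one candidate group. Exact title keys only catch trivial variations, so the long tail still needs fuzzy matching on top.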

Detecting and routing around CAPTCHA challenges on Tiki

When Tiki flags your traffic, the response is usually a Cloudflare or vendor interrogation page rather than a clean HTTP error. Your scraper needs to detect this content-type swap explicitly. Look for the signature cf-mitigated header, the presence of __cf_chl_ cookies, or HTML containing Just a moment.... Treat any of these as a soft block.

def is_challenged(response) -> bool:
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body

When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet. For pages that absolutely must be fetched, have a fallback path that uses a headless browser with a real Vietnam residential IP. The browser path costs more per page but solves the small percentage of challenges that the API path cannot handle.
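
The cooling rule can be sketched as a small round-robin pool that skips any proxy still inside its cooldown window. The class and method names are illustrative assumptions; the next() method mirrors the proxy_pool.next() call used in the crawl code earlier, and the 1,800-second default reflects the 30-minute guidance above.

```python
import time


class CoolingProxyPool:
    """Round-robin proxy pool that skips proxies marked as challenged."""

    def __init__(self, proxies, cooldown_seconds=1800, clock=time.monotonic):
        self._proxies = list(proxies)
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be tested
        self._cooling_until = {}
        self._index = 0

    def mark_challenged(self, proxy):
        """Put a proxy into cooldown after a detected challenge."""
        self._cooling_until[proxy] = self._clock() + self._cooldown

    def next(self):
        """Return the next proxy that is not cooling, in round-robin order."""
        now = self._clock()
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._index]
            self._index = (self._index + 1) % len(self._proxies)
            until = self._cooling_until.get(proxy)
            if until is None or until <= now:
                return proxy
        raise RuntimeError("all proxies are cooling")
```

Raising when every proxy is cooling is a deliberate choice in this sketch: it forces the caller to pause or fall back to the browser path instead of hammering flagged IPs.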

Working with VND pricing and FX normalization

Pricing on Tiki is denominated in VND, and any cross-market analysis requires careful FX normalization. The naive approach of converting at scrape time using a live FX feed introduces noise into your trend lines because exchange rate movements get conflated with real price changes. Store the price in local VND and apply FX conversion at query time using a daily reference rate.

CREATE TABLE fx_rates (
    rate_date DATE NOT NULL,
    base_ccy VARCHAR(3) NOT NULL,
    quote_ccy VARCHAR(3) NOT NULL,
    rate DECIMAL(18,8) NOT NULL,
    PRIMARY KEY (rate_date, base_ccy, quote_ccy)
);

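Source the daily rates from a reliable feed such as the State Bank of Vietnam's daily central rate or your bank's wholesale feed. The European Central Bank reference rates are a common default for other currencies, but they do not include VND. Avoid scraping retail FX rates because they include the bank spread.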
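
To illustrate conversion at query time, the sketch below looks up the reference rate for the snapshot date and falls back to the most recent earlier date when no rate was published, for example on weekends. It assumes rates are quoted as VND per one unit of the target currency. The class name, the fallback rule, and the sample rates in the usage lines are illustrative assumptions, not real market data.

```python
from bisect import bisect_right
from datetime import date
from decimal import Decimal


class FxTable:
    """Daily reference rates, quoted as VND per one unit of the target currency."""

    def __init__(self, rates: dict[date, Decimal]):
        self._rates = dict(rates)
        self._dates = sorted(self._rates)

    def rate_on(self, day: date) -> Decimal:
        """Return the rate for day, or the latest earlier rate if none exists."""
        i = bisect_right(self._dates, day)
        if i == 0:
            raise KeyError(f"no rate on or before {day}")
        return self._rates[self._dates[i - 1]]


def vnd_to_target(price_vnd: int, day: date, table: FxTable) -> Decimal:
    """Convert a stored VND price using the reference rate for its snapshot date."""
    return (Decimal(price_vnd) / table.rate_on(day)).quantize(Decimal("0.01"))


# Made-up round rates, for illustration only.
table = FxTable({date(2026, 1, 5): Decimal("25000"), date(2026, 1, 6): Decimal("25100")})
converted = vnd_to_target(2_500_000, date(2026, 1, 5), table)
```

Because the stored price never changes, re-running a historical query with a corrected rate table needs no re-scrape.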

Comparing Tiki to other regional marketplaces

| Marketplace | Country focus | Catalogue scale | Bot strictness |
| --- | --- | --- | --- |
| Tiki | Vietnam | Large | High |
| Shopee Vietnam | Adjacent markets | Medium | Medium |
| Lazada Vietnam | Adjacent markets | Smaller | Lower |

Cross-marketplace analyses help separate platform-specific dynamics from genuine market trends. If a price drops on Tiki but stays flat across the comparable competitors, that is a platform-driven event rather than a market-wide signal.

Operational monitoring and alerting

Every production scraper needs three monitoring layers. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.

import time
from collections import deque

class IPHealthTracker:
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}

    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if not bucket:
            return 1.0
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)

Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails.
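
The second and third monitoring layers reduce to two threshold checks. The sketch below uses the 1% and 24-hour thresholds stated above; the function names and the idea of passing in pre-computed counters are assumptions about how your pipeline exposes its metrics.

```python
from datetime import datetime, timedelta, timezone


def parser_error_alert(parsed_ok: int, parsed_failed: int, threshold: float = 0.01) -> bool:
    """Alert when more than threshold of fetched pages failed to parse."""
    total = parsed_ok + parsed_failed
    return total > 0 and parsed_failed / total > threshold


def freshness_alert(
    latest_snapshot_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> bool:
    """Alert when the newest snapshot visible to consumers is older than max_age."""
    return now - latest_snapshot_at > max_age


# Example: a snapshot taken 25 hours ago is stale under the 24-hour rule.
now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
stale = freshness_alert(now - timedelta(hours=25), now)
```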

Legal and compliance considerations for Vietnam

Public product, price, and availability data are generally treated as fair to scrape in most jurisdictions, but Vietnam has its own consumer protection and personal data frameworks. Confine your collection to non-personal data: SKU identifiers, prices, descriptions, ratings as aggregates, and seller display names. Avoid collecting individual buyer reviews with names, phone numbers, or email addresses attached, and avoid pulling any data behind a login.

For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. Vietnam's personal data protection rules, starting with Decree 13/2023/ND-CP, are the relevant reference when documenting your approach.

Pipeline orchestration and scheduling

For any non-trivial scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster.

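from prefect import flow, task

# crawl_one_page is a placeholder for your own single-page fetcher (for example,
# one iteration of the crawl_category loop shown earlier); it is not defined here.
@task(retries=3, retry_delay_seconds=60)
def fetch_category(category_id: int, page: int):
    return crawl_one_page(category_id, page)

@flow(name="tiki-daily-sweep")
def daily_sweep(category_ids: list[int]):
    futures = []
    for cid in category_ids:
        # Pages 1 through 50 inclusive, matching the practical pagination limit.
        for page in range(1, 51):
            futures.append(fetch_category.submit(cid, page))
    return [f.result() for f in futures]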

Run the flow on a 6-hour or 24-hour schedule depending on how dynamic the underlying catalogue is.

Sample analytics queries

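-- Top 50 SKUs by price drop in the last 7 days
-- (latest price minus earliest price in the window; most negative first)
SELECT sku,
       (array_agg(selling_price ORDER BY snapshot_at DESC))[1]
     - (array_agg(selling_price ORDER BY snapshot_at ASC))[1] AS price_change
FROM snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY sku
ORDER BY price_change ASC
LIMIT 50;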

-- Stock-out frequency per category
SELECT category_id,
       SUM(CASE WHEN in_stock = 0 THEN 1 ELSE 0 END)::float / COUNT(*) AS oos_rate
FROM snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY category_id
ORDER BY oos_rate DESC;

-- New SKUs first seen in the last 14 days
SELECT sku, MIN(snapshot_at) AS first_seen
FROM snapshot
GROUP BY sku
HAVING MIN(snapshot_at) > now() - interval '14 days';

These queries power most of the dashboards a category manager wants. Add a brand share view, a seller concentration view, and a campaign-frequency view and you have a competitive intelligence product.

Building robust deduplication across noisy listings

The long-tail catalogue is full of near-duplicate listings. The standard deduplication approach uses a three-pass funnel: exact match on EAN, normalized title plus brand TF-IDF similarity, then perceptual image hash similarity.

import imagehash
from PIL import Image

def perceptual_hash(image_path: str) -> str:
    img = Image.open(image_path)
    return str(imagehash.phash(img, hash_size=16))

Tune the similarity thresholds against a hand-labeled gold set of 500 to 1,000 known duplicate clusters. Without a gold set, you will either over-merge or under-merge.
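
One common way to score a threshold against the gold set is pairwise precision and recall: every pair of listings placed in the same cluster counts as a predicted duplicate pair. The sketch below assumes clusters are given as sets of listing IDs; the function names are illustrative.

```python
from itertools import combinations


def cluster_pairs(clusters) -> set:
    """Expand clusters of listing IDs into the set of within-cluster pairs."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pair_precision_recall(predicted, gold) -> tuple:
    """Score predicted clusters against gold clusters at the pair level."""
    pred, true = cluster_pairs(predicted), cluster_pairs(gold)
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(true) if true else 1.0
    return precision, recall
```

Low precision means the threshold over-merges, and low recall means it under-merges, so sweeping the threshold and watching both numbers shows where the trade-off sits.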

Caching strategy and incremental crawls

Full daily snapshots scale linearly with catalogue size, which becomes expensive at multi-million SKU scale. After the initial ramp, most production deployments shift from full snapshots to incremental refreshes, using freshness deadline, volatility, and business priority signals to decide what to refetch on each cycle. Priority-driven scheduling can reduce total request volume by 60-80% compared to blind full snapshots, while keeping the data fresh on the SKUs that actually matter to the business.
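
A minimal sketch of priority-driven selection follows. The scoring formula (how overdue a SKU is, scaled by volatility and business priority) is an assumption chosen for illustration, as are the record field names; real deployments tune both.

```python
def refresh_score(age_hours: float, deadline_hours: float,
                  volatility: float, priority: float) -> float:
    """Higher score means refetch sooner."""
    overdue = age_hours / deadline_hours  # 1.0 means exactly at the deadline
    return overdue * (1.0 + volatility) * priority


def pick_refresh_batch(skus: list[dict], budget: int) -> list[str]:
    """Return the SKU IDs to refetch this cycle, limited to the request budget."""
    ranked = sorted(
        skus,
        key=lambda s: refresh_score(
            s["age_hours"], s["deadline_hours"], s["volatility"], s["priority"]
        ),
        reverse=True,
    )
    return [s["sku"] for s in ranked[:budget]]


catalogue = [
    {"sku": "A", "age_hours": 30, "deadline_hours": 24, "volatility": 0.5, "priority": 1.0},
    {"sku": "B", "age_hours": 6, "deadline_hours": 24, "volatility": 0.1, "priority": 1.0},
    {"sku": "C", "age_hours": 12, "deadline_hours": 6, "volatility": 0.0, "priority": 2.0},
]
batch = pick_refresh_batch(catalogue, budget=2)
```

In this example the fresh, low-volatility SKU "B" is skipped for the cycle, which is where the request savings come from.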

End-to-end pipeline architecture

A production-grade scraping pipeline has four layers that work together: collection, parsing, storage, and serving. Each layer has its own failure modes and its own scaling characteristics, and treating them as a single monolith is the most common architectural mistake teams make when scaling beyond hobby workloads.

The collection layer handles the network conversation: HTTP requests, proxy assignment, retry logic, and rate limit enforcement. It should know nothing about the data shape and nothing about how the data will eventually be queried. Its only job is to fetch raw bytes reliably and hand them off to the next layer with metadata about which IP, which user agent, and which timestamp produced them.

The parsing layer transforms raw bytes into structured records. It owns the schema, the field normalization, and the validation rules. When the upstream HTML or JSON structure changes, only the parsing layer needs to adapt. Keep parsers idempotent and version them aggressively so old raw bytes can be re-parsed when you discover bugs.

The storage layer holds the canonical snapshots in a query-optimized format. For most ecommerce datasets, a column-oriented store like DuckDB, ClickHouse, or BigQuery outperforms row-oriented Postgres at analytical scale. The trade-off is write latency and update support; column stores prefer append-only and bulk loads, which fits the snapshot model naturally.

The serving layer exposes the data to consumers, whether that is a BI dashboard, an API for downstream systems, or an alerting pipeline. Keep the serving layer denormalized and pre-aggregated where possible. Recomputing complex analytics on every dashboard load wastes resources and hurts responsiveness.

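# Sketch of the four-layer split; RawFetch, Snapshot, and the db object are illustrative placeholders.
import json
from collections import namedtuple
from datetime import datetime, timezone
import httpx

RawFetch = namedtuple("RawFetch", "url body fetched_at proxy")
Snapshot = namedtuple("Snapshot", "sku price fetched_at")

async def collect(url: str, proxy_pool) -> RawFetch:
    proxy = proxy_pool.next()
    async with httpx.AsyncClient(proxy=proxy, timeout=20) as c:
        response = await c.get(url)
    return RawFetch(url, response.text, datetime.now(timezone.utc), proxy)

def parse(raw: RawFetch) -> Snapshot:
    data = json.loads(raw.body)
    return Snapshot(sku=data["id"], price=data["price"], fetched_at=raw.fetched_at)

def store(snapshot: Snapshot, db) -> None:
    db.append("snapshots", snapshot)

def serve(query: str, db) -> list:
    return db.query(query)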

Decoupling these layers also enables independent scaling. The collection layer is bound by proxy capacity and network bandwidth. The parsing layer is CPU-bound. The storage layer is bound by I/O and disk capacity. The serving layer is bound by query concurrency. Each layer can scale horizontally without coupling to the others.

Data quality monitoring patterns

Beyond per-IP success rate, every snapshot should pass a small battery of data quality checks before being considered authoritative. The checks fall into three categories: structural, distributional, and semantic.

Structural checks verify that every required field is present and of the expected type. A snapshot row missing the price field is not a real snapshot. A row with a negative price is not a real price.

Distributional checks compare the current snapshot against recent history. If today’s snapshot has 30% fewer SKUs than yesterday, something broke either in collection or in the upstream catalogue. Either way, the on-call engineer needs to investigate before downstream consumers see broken data.

Semantic checks compare related fields for consistency. If a SKU shows in_stock = true but stock_quantity = 0, one of the fields is wrong. If the discount percentage is computed from list_price and selling_price, the computed value should match the stated discount field.

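def quality_check(snapshot: list[dict]) -> list[str]:
    """Return a list of data quality errors; an empty list means the snapshot passes."""
    errors = []
    if not snapshot:
        errors.append("empty snapshot")
        return errors
    # Distributional check. get_yesterday_avg_size() is a placeholder for your own
    # lookup of the recent snapshot size; it is not defined here.
    avg_yesterday = get_yesterday_avg_size()
    if len(snapshot) < avg_yesterday * 0.7:
        errors.append(f"snapshot size {len(snapshot)} is more than 30% below yesterday")
    # Structural check: price must be present, numeric, and non-negative.
    invalid = [
        r for r in snapshot
        if not isinstance(r.get("price"), (int, float)) or r["price"] < 0
    ]
    if invalid:
        errors.append(f"{len(invalid)} rows have a missing or invalid price")
    return errors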

Run quality checks as a separate flow that gates promotion of the snapshot from staging to production. A snapshot that fails quality checks should be quarantined for human review, not silently published.
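
The semantic checks described above can be sketched per row as follows. The field names in_stock, stock_quantity, list_price, and selling_price come from the text; discount_pct and the one-percentage-point tolerance are assumptions for illustration.

```python
def semantic_errors(row: dict, tolerance_pct: float = 1.0) -> list[str]:
    """Return consistency errors between related fields of one snapshot row."""
    errors = []
    # Stock flag and stock quantity must agree.
    if row.get("in_stock") and row.get("stock_quantity") == 0:
        errors.append("in_stock is true but stock_quantity is 0")
    list_price = row.get("list_price")
    selling_price = row.get("selling_price")
    stated = row.get("discount_pct")
    # Only compare discounts when all three inputs exist and list_price is non-zero.
    if list_price and selling_price is not None and stated is not None:
        computed = (list_price - selling_price) / list_price * 100
        if abs(computed - stated) > tolerance_pct:
            errors.append(f"stated discount {stated}% vs computed {computed:.1f}%")
    return errors
```

A small tolerance avoids false alarms from rounding in the stated discount.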

Cost optimization strategies

Proxy bandwidth is usually the dominant cost in a production scraping operation. Three optimization patterns consistently reduce cost without hurting data quality. The first is request deduplication: if two consumers ask for the same SKU within the same hour, the system should serve the cached response rather than refetching. The second is conditional GET: when the upstream supports ETag or If-Modified-Since headers, conditional requests transfer no body when the resource has not changed. The third is selective field hydration: when the upstream API supports field selection, requesting only the fields you need reduces payload size dramatically.

For workloads above 100 GB of monthly proxy bandwidth, these three optimizations together reduce cost by 40-60% without changing the analytical output. The engineering effort to implement them is modest and the payback period is usually under a month at production volume.
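
Request deduplication, the first of the three patterns, can be sketched as a small time-to-live cache in front of the fetcher. The class name and the one-hour default are assumptions based on the "within the same hour" wording above; fetch stands for whatever function performs the real request.

```python
import time


class TTLCache:
    """Serve repeated requests for the same key from memory until the TTL expires."""

    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}

    def get_or_fetch(self, key, fetch):
        """Return a cached value for key, calling fetch(key) only on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = fetch(key)
        self._entries[key] = (now, value)
        return value
```

This in-memory version never evicts expired entries and only deduplicates within one process; a shared store would be needed across workers.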

Common pitfalls when scraping Tiki

Three issues recur on Tiki scrapers. The first is TikiNow vs Marketplace separation. TikiNow (2-hour delivery in HCMC and Hanoi) and Tiki Marketplace (multi-day fulfillment elsewhere) coexist on the same product page. The is_tikinow flag and seller_id distinguish them. For delivery-speed analytics, segment by both fields.

The second is Vietnamese diacritic normalization. Product titles include Vietnamese diacritics (e.g., điện thoại). Some pipelines strip diacritics for ASCII compatibility, which collapses distinct Vietnamese words into the same key and corrupts brand-level aggregates. Preserve diacritics end-to-end and only strip them at the final analytics layer if your downstream BI tool cannot handle UTF-8.

The third is VAT inclusion drift. Tiki shows VAT-inclusive prices on most SKUs but VAT-exclusive prices for some B2B-flagged SKUs. The vat_included field on the API response tells you which is which. Naive scrapers store the raw price and downstream analytics compare apples to oranges.
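
To avoid the apples-to-oranges comparison, prices can be normalized to a single VAT basis at parse time. The sketch below assumes the vat_included flag described above and a 10% rate passed as a parameter; the function name and the rounding to whole VND are my own choices.

```python
from decimal import Decimal, ROUND_HALF_UP


def vat_inclusive_price(price: int, vat_included: bool,
                        vat_rate: Decimal = Decimal("0.10")) -> int:
    """Return the price in VND on a VAT-inclusive basis."""
    if vat_included:
        return price
    gross = Decimal(price) * (Decimal(1) + vat_rate)
    # VND has no minor unit in practice, so round to a whole number.
    return int(gross.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Keeping the raw price and the flag alongside the normalized value makes the normalization reversible if the applicable rate turns out to differ.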

FAQ

Is the Tiki API officially documented?
The endpoints described here are the unauthenticated APIs that the public web site uses. They have been stable for several years but they are not contractually supported. Build defensively with schema drift alerts.

Can I scrape Tiki from Singapore or Hong Kong residential IPs?
Yes for light loads under 5,000 product reads per day. For sustained scraping at higher volumes, Vietnamese IPs are strongly preferred because they sustain higher request rates without challenges.

Does Tiki distinguish between Tiki Trading and third-party sellers?
Yes. The seller.id field identifies the seller; sellers with id 1 are Tiki’s own first-party Trading. Third-party sellers have their own IDs. For analytics, separating first-party from third-party listings is essential because the price dynamics and turnover characteristics differ significantly.

How does TikiNow city-specific delivery affect the data?
TikiNow availability depends on the destination postcode. The API returns a generic stock value but the actual same-day delivery promise is computed against the buyer location. For most analytical use cases, treat the stock value as the canonical inventory signal.

What about Vietnamese tax and VAT in pricing?
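Tiki prices are displayed VAT-inclusive, and the displayed price is what the buyer pays. The standard Vietnamese VAT rate is 10%, though temporary reductions to 8% have applied to many goods, so confirm the rate in force for the category and date. For brand teams comparing against MAP policies set in net-of-tax terms, you need to back out the VAT to align with the brand reference price.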

Does Tiki publish an official API?
Tiki Open Platform serves sellers, not analysts. For market-intelligence work, public-page scraping is the operational path. The endpoints under tiki.vn/api/v2/ are stable in practice.

How do I track Tiki’s flash sales accurately?
The flash_sale block carries start_time and end_time epochs. Sample at 5-15 minute intervals during the active window to capture stock burndown.
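
A sampling schedule for one flash sale can be derived directly from the two epochs. The sketch below assumes start_time and end_time are Unix timestamps in seconds and uses a 10-minute default interval from the 5-15 minute range above; the function name is illustrative.

```python
def flash_sale_sample_times(start_time: int, end_time: int,
                            interval_seconds: int = 600) -> list[int]:
    """Return the epoch seconds at which to sample an active flash sale."""
    if end_time < start_time or interval_seconds <= 0:
        raise ValueError("invalid flash sale window or interval")
    times = list(range(start_time, end_time, interval_seconds))
    times.append(end_time)  # always take a final reading at the close
    return times
```

The final reading at end_time captures the closing stock level even when the window length is not a multiple of the interval.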

To build a broader Vietnam ecommerce intelligence stack, browse the ecommerce scraping category for tooling reviews, proxy comparisons, and framework deep dives.
