How to scrape Rakuten Japan in 2026
Scrape Rakuten Japan and you encounter a marketplace structure that exists nowhere else. Rakuten Ichiba is a federation of more than 50,000 individual seller storefronts hosted under the Rakuten umbrella, each with its own brand, layout, and pricing logic. Unlike Amazon Japan, where every listing is normalized into a single product page, Rakuten lets every merchant publish its own product page for the same item, so a single EAN can surface in hundreds of listings across hundreds of sellers. The scraping landscape is shaped by three things: the official Rakuten Ichiba Item API, which is generous but rate-limited; the per-merchant storefronts, which require HTML scraping for the long tail; and a Japanese-language catalogue that demands careful encoding handling.
This guide focuses on Rakuten Ichiba (the marketplace), which is the dominant property under rakuten.co.jp. Rakuten Travel, Rakuten Books, and Rakuten Mobile use related but distinct APIs.
The official Rakuten Ichiba Item API
Rakuten publishes a free public API at https://app.rakuten.co.jp/services/api/IchibaItem/Search/20220601 that searches the entire Ichiba catalogue. Registration is required to obtain an applicationId, which serves as the rate-limit token. The free tier allows roughly 1 request per second per applicationId, which sounds restrictive but is enough for substantial workloads if you parallelize across multiple registered application IDs.
import httpx
import asyncio

API_URL = "https://app.rakuten.co.jp/services/api/IchibaItem/Search/20220601"

async def search_ichiba(keyword: str, app_id: str, page: int = 1, hits: int = 30):
    params = {
        "applicationId": app_id,
        "keyword": keyword,
        "page": page,
        "hits": hits,
        "sort": "-updateTimestamp",
        "format": "json",
    }
    async with httpx.AsyncClient(timeout=20) as c:
        r = await c.get(API_URL, params=params)
        if r.status_code == 200:
            return r.json().get("Items", [])
        return []
The API returns up to 100 hits per request, paginated across up to 100 pages, for a maximum of 10,000 hits per query. To exceed that, decompose the query by genre, price band, or shop. The genreId parameter accepts values from the Rakuten genre tree, which is published separately via the IchibaGenre Search API.
Each item response includes itemCode (the canonical Rakuten SKU as <shopcode>:<itemnumber>), itemName, itemPrice in JPY, shopCode, shopName, reviewCount, reviewAverage, imageFlag, and a list of small/medium/large image URLs. For most analytical use cases, the API alone is sufficient and you do not need to scrape HTML.
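To get past the 10,000-hit cap on a broad query, a price-band sweep is usually the simplest decomposition. A minimal sketch, assuming the API's minPrice and maxPrice parameters (worth confirming against the current Item Search documentation) and reusing API_URL and the imports from above:

async def search_price_band(keyword: str, app_id: str, min_price: int, max_price: int):
    # Collect every page within one price band; bands narrow enough to stay
    # under ~10,000 hits let you cover queries that would otherwise be truncated.
    items = []
    for page in range(1, 101):
        params = {
            "applicationId": app_id,
            "keyword": keyword,
            "minPrice": min_price,
            "maxPrice": max_price,
            "hits": 100,
            "page": page,
            "format": "json",
        }
        async with httpx.AsyncClient(timeout=20) as c:
            r = await c.get(API_URL, params=params)
        if r.status_code != 200:
            break
        batch = r.json().get("Items", [])
        if not batch:
            break
        items.extend(batch)
        await asyncio.sleep(1)  # stay near the ~1 request/second free-tier limit
    return items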
When you still need HTML scraping
The official API does not expose seller storefront banners, custom seller categorizations, or the freeform descriptions that Rakuten merchants write on their own pages. For brand monitoring use cases that need to verify how a specific seller is presenting a product, you have to scrape the merchant storefront HTML.
Storefront URLs look like https://item.rakuten.co.jp/<shopcode>/<itemnumber>/. The HTML is encoded as Shift-JIS or UTF-8 depending on the merchant template, so always read the Content-Type header and decode accordingly. Encoding bugs are the most common silent failure in Rakuten scraping projects.
import httpx
from bs4 import BeautifulSoup

async def fetch_storefront(shop_code: str, item_number: str, proxy: str):
    url = f"https://item.rakuten.co.jp/{shop_code}/{item_number}/"
    async with httpx.AsyncClient(proxy=proxy, timeout=20) as c:
        r = await c.get(url)
        if r.status_code != 200:
            return None
        ctype = r.headers.get("content-type", "").lower()
        if "shift_jis" in ctype or "shift-jis" in ctype:
            text = r.content.decode("shift_jis", errors="replace")
        else:
            text = r.text
        soup = BeautifulSoup(text, "lxml")
        return {
            "title": soup.select_one("title").get_text(strip=True) if soup.select_one("title") else None,
            "html_size": len(text),
        }
Storefront scraping requires Japanese residential proxies. Rakuten’s CDN serves significantly slower paths to non-Japanese visitors, and merchant storefronts often refuse non-JP traffic outright if they have enabled the country lock setting in their seller dashboard.
Genre-based catalogue discovery
Rakuten’s genre tree is the most reliable way to enumerate the catalogue. The IchibaGenre Search API returns the genre hierarchy starting from a root node. By recursively walking the tree, you can map every leaf genre and then issue Item Search queries against each leaf to collect SKUs.
GENRE_API = "https://app.rakuten.co.jp/services/api/IchibaGenre/Search/20140222"

async def fetch_children(genre_id: int, app_id: str):
    params = {"applicationId": app_id, "genreId": genre_id, "format": "json"}
    async with httpx.AsyncClient(timeout=20) as c:
        r = await c.get(GENRE_API, params=params)
        if r.status_code == 200:
            return r.json().get("children", [])
        return []
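The function above fetches only one level. A minimal recursive walk that collects leaf genre IDs might look like the sketch below; it assumes each entry in children nests its payload under a child key (falling back to the entry itself if not) and that genreId 0 is the root node.

async def collect_leaf_genres(genre_id: int, app_id: str, leaves: list) -> list:
    # Depth-first walk: a genre with no children is treated as a leaf.
    children = await fetch_children(genre_id, app_id)
    if not children:
        leaves.append(genre_id)
        return leaves
    for entry in children:
        child = entry.get("child", entry)  # entries are usually wrapped in a "child" key
        await asyncio.sleep(1)             # stay near the 1 request/second limit
        await collect_leaf_genres(child["genreId"], app_id, leaves)
    return leaves

# leaf_ids = asyncio.run(collect_leaf_genres(0, "your-app-id", []))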
The genre tree has roughly 50,000 leaf nodes. Walking it once takes about 14 hours at 1 request per second. Cache the result in a database and refresh quarterly. The structure changes slowly enough that quarterly is sufficient for most use cases.
Multi-tenant pricing and seller analytics
Because the same EAN can be sold by multiple Rakuten merchants, the same physical product often appears at very different prices across the marketplace. For brand owners, this matters because it surfaces parallel imports, gray market listings, and unauthorized resellers. For competitive intelligence, it surfaces the price floor that the most aggressive seller is willing to defend.
The pattern is to group items by EAN (when published in the API response) or by a fuzzy match on itemName and merchant brand. Grouping by EAN is cleaner but only works for SKUs where the merchant published it. For the long tail, fuzzy match by normalized title and brand is the practical fallback.
| Field | Source | Notes |
|---|---|---|
| itemCode | API | Canonical SKU per merchant |
| EAN/JAN | API attributes | About 60% coverage |
| itemName | API | Always present, often noisy |
| itemPrice | API | Authoritative |
| shopCode | API | Canonical merchant ID |
For grouping across merchants, build a normalization pass that strips merchant tags from the title (sellers add [Free shipping] and [Same day ship] markers liberally), normalizes character widths, and removes common Japanese boilerplate.
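A sketch of that normalization pass; the boilerplate tokens below are illustrative, and in practice you grow the list from your own catalogue:

import re
import unicodedata

# Illustrative boilerplate markers commonly added by sellers; extend from your data.
BOILERPLATE = ["送料無料", "あす楽", "翌日配送", "ポイント10倍"]

def normalize_for_grouping(title: str) -> str:
    # NFKC folds full-width ASCII and half-width katakana into canonical forms.
    title = unicodedata.normalize("NFKC", title).lower()
    # Strip bracketed merchant tags such as 【送料無料】 or [SALE].
    title = re.sub(r"[【\[][^】\]]*[】\]]", " ", title)
    for token in BOILERPLATE:
        title = title.replace(token, " ")
    return " ".join(title.split())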
Detecting and routing around CAPTCHA challenges on Rakuten
When Rakuten flags your traffic, the response is usually a Cloudflare or vendor challenge page rather than a clean HTTP error. Your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, the presence of __cf_chl_ cookies, or HTML containing "Just a moment...". Treat any of these as a soft block.
def is_challenged(response) -> bool:
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet.
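A minimal in-process sketch of that routing logic; production pools usually keep this state in Redis or a similar shared store so every worker sees the same cooling windows:

import random
import time

class ProxyPool:
    """Sketch: cool a flagged IP for 30 minutes before reusing it."""

    def __init__(self, ips: list, cooldown_seconds: int = 1800):
        self.ips = ips
        self.cooldown = cooldown_seconds
        self.cooling_until = {}

    def mark_challenged(self, ip: str):
        self.cooling_until[ip] = time.time() + self.cooldown

    def pick(self) -> str:
        now = time.time()
        healthy = [ip for ip in self.ips if self.cooling_until.get(ip, 0) <= now]
        if not healthy:
            raise RuntimeError("all proxies cooling; back off the whole job")
        return random.choice(healthy)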
For pages that absolutely must be fetched, have a fallback path that uses a headless browser with a real Japan residential IP. The browser path costs more per page but solves the small percentage of challenges that the API path cannot handle. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
Working with JPY pricing and FX normalization
Pricing in Japan is denominated in JPY, and any cross-market analysis requires careful FX normalization. The naive approach of converting at scrape time using a live FX feed introduces noise into your trend lines because exchange rate movements get conflated with real price changes. Store the price in local JPY and apply FX conversion at query time using a daily reference rate.
CREATE TABLE fx_rates (
    rate_date DATE NOT NULL,
    base_ccy VARCHAR(3) NOT NULL,
    quote_ccy VARCHAR(3) NOT NULL,
    rate DECIMAL(18,8) NOT NULL,
    PRIMARY KEY (rate_date, base_ccy, quote_ccy)
);
Source the daily rates from a reliable feed such as the European Central Bank reference rates or your bank’s wholesale feed. Avoid scraping retail FX rates because they include the bank’s spread and produce inconsistent comparisons.
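The conversion itself is then a lookup plus a multiplication at query or report time. A sketch, where get_rate stands in for whatever reads the stored daily reference rate out of the fx_rates table above:

from datetime import date
from decimal import Decimal

def jpy_to_quote(price_jpy: int, snapshot_date: date, get_rate) -> Decimal:
    # get_rate is a hypothetical lookup returning the JPY -> quote reference
    # rate for the snapshot date, e.g. a query against fx_rates.
    rate = Decimal(str(get_rate(snapshot_date, "JPY", "USD")))
    return (Decimal(price_jpy) * rate).quantize(Decimal("0.0001"))

# Example: 12800 JPY at a 0.0067 reference rate -> Decimal('85.7600')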
Comparing Rakuten to other regional marketplaces
| Marketplace | Country focus | Catalogue scale | Bot strictness |
|---|---|---|---|
| Rakuten Ichiba | Japan | Large | High |
| Amazon Japan | Japan | Medium | Medium |
| Yahoo! Shopping | Japan | Smaller | Lower |
Cross-marketplace analyses help separate platform-specific dynamics from genuine market trends. If a price drops on Rakuten Ichiba but stays flat across the comparable competitors, that is a platform-driven event rather than a market-wide signal. Your scraping pipeline should ingest from at least three platforms in any market where you intend to publish category insights.
Operational monitoring and alerting
Every production scraper needs three monitoring layers regardless of target. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
import time
from collections import deque

class IPHealthTracker:
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}

    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if not bucket:
            return 1.0
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails.
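One way to do that wiring, as a sketch using the prometheus_client package; the metric name and port are placeholders you would adapt to your naming conventions:

from prometheus_client import Gauge, start_http_server

ip_success_rate = Gauge(
    "scraper_ip_success_rate",
    "Rolling success rate per proxy IP over the tracker window",
    ["ip"],
)

def export_health(tracker: IPHealthTracker, ips: list):
    # Call this on a short timer so Prometheus scrapes fresh values.
    for ip in ips:
        ip_success_rate.labels(ip=ip).set(tracker.success_rate(ip))

# start_http_server(9108)  # exposes /metrics for Prometheus to scrape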
Legal and compliance considerations for Japan
Public product, price, and availability data are generally treated as fair to scrape in most jurisdictions, but Japan has its own consumer protection and personal data frameworks. Confine your collection to non-personal data: SKU identifiers, prices, descriptions, ratings as aggregates, and seller display names. Avoid collecting individual buyer reviews with names, phone numbers, or email addresses attached, and avoid pulling any data behind a login.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. Most data protection regimes treat scraped public data more favorably when there is a clear lawful basis and the data is not used for direct marketing to identified individuals. Published frameworks for responsible data collection are useful starting points for documenting your approach.
Pipeline orchestration and scheduling
For any non-trivial scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster. Both handle the patterns you need: DAG dependencies, retries, observability, secret management, and dynamic fan-out across IPs and categories.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_category(category_id: int, page: int):
    # crawl_one_page is your own fetch helper (API or storefront path)
    return crawl_one_page(category_id, page)

@task
def store_pages(pages: list):
    # write_to_db is your own persistence helper
    write_to_db(pages)

@flow(name="rakuten-daily-sweep")
def daily_sweep(category_ids: list):
    futures = []
    for cid in category_ids:
        for page in range(1, 50):
            futures.append(fetch_category.submit(cid, page))
    pages = [f.result() for f in futures]
    store_pages(pages)
Run the flow on a 6-hour or 24-hour schedule depending on how dynamic the underlying catalogue is. For seasonal markets like apparel, a 6-hour cadence catches the meaningful movements without driving up proxy costs unnecessarily.
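A sketch of the scheduling wire-up, assuming Prefect 3's flow.serve API; the category IDs below are placeholders and the interval is the knob you tune per market:

from datetime import timedelta

if __name__ == "__main__":
    # Serve the flow as a long-running deployment on a fixed interval;
    # cron schedules are also supported if you prefer wall-clock alignment.
    daily_sweep.serve(
        name="rakuten-daily-sweep-6h",
        interval=timedelta(hours=6),
        parameters={"category_ids": [100227, 551177]},  # hypothetical genre IDs
    )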
Sample analytics queries on the collected dataset
Once your snapshots are landing reliably, the analytics layer is where the value materializes. A few queries that consistently come up across Rakuten datasets:
-- Top 50 SKUs by price drop in the last 7 days
SELECT sku, MIN(selling_price) - MAX(selling_price) AS price_drop
FROM snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY sku
ORDER BY price_drop ASC
LIMIT 50;
-- Stock-out frequency per category
SELECT category_id,
SUM(CASE WHEN in_stock = 0 THEN 1 ELSE 0 END)::float / COUNT(*) AS oos_rate
FROM snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY category_id
ORDER BY oos_rate DESC;
Add a brand share view, a seller concentration view, and a campaign-frequency view and you have a competitive intelligence product. The collection layer is the prerequisite; the analytics layer is where you create defensible value.
Building robust deduplication across noisy listings
When you scrape any ecommerce marketplace at scale, the long-tail catalogue is full of near-duplicate listings. The same physical product appears under different titles, different sellers, slight variations in pack size, and slightly different image sets. For analytics that try to compute brand share or category trends, deduplication is mandatory and it is harder than it looks.
The standard approach uses a three-pass funnel. The first pass groups by exact match on EAN or GTIN where present. The second pass groups by normalized title plus brand using a TF-IDF cosine similarity threshold of 0.85. The third pass groups by image hash similarity using perceptual hashing. Each pass merges groups produced by the previous pass.
from PIL import Image
import imagehash

def perceptual_hash(image_path: str) -> str:
    img = Image.open(image_path)
    return str(imagehash.phash(img, hash_size=16))

def normalize_title(title: str) -> str:
    title = title.lower()
    for token in ["[free shipping]", "[same day]", "(new)", "*sale*"]:
        title = title.replace(token, "")
    return " ".join(title.split())
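The helpers above feed passes two and three. The pairwise comparison for the title pass might look like the following sketch, which assumes candidates have already been blocked by brand or genre so the similarity matrix stays small:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pass_two_pairs(titles: list, threshold: float = 0.85):
    # Character n-grams sidestep Japanese word segmentation issues.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    sims = cosine_similarity(vec.fit_transform(titles))
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if sims[i][j] >= threshold:
                pairs.append((i, j, float(sims[i][j])))
    return pairs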
Tune the similarity thresholds against a hand-labeled gold set of 500 to 1,000 known duplicate clusters. Without a gold set, you will either over-merge (collapsing distinct variants) or under-merge (leaving the same product in many groups). Both failure modes break downstream analytics.
Versioning your scraper for catalogue evolution
Every ecommerce site evolves its catalogue structure regularly. New attribute fields appear, old fields are deprecated, category trees are reorganized, and pricing display logic changes. Your scraper code has to evolve with these changes, and a versioning pattern that keeps old data interpretable is critical.
The pattern that works best is to stamp every snapshot row with the scraper version that produced it. When you deploy a new version of the parser, increment the version number. Downstream analytics can filter by version when they need consistent semantics across a time range, or join across versions when they want long-running trend analysis.
ALTER TABLE snapshot ADD COLUMN scraper_version VARCHAR(16);
CREATE INDEX scraper_version_idx ON snapshot(scraper_version);
Pair this with a small registry table that documents what each scraper version did differently. When a downstream user asks why a particular metric jumped on a specific date, the version registry usually has the answer.
Caching strategy and incremental crawls
Full daily snapshots scale linearly with catalogue size, which becomes expensive at multi-million SKU scale. Most production deployments shift from full snapshots to incremental refreshes after the initial ramp. The pattern uses three signals to decide what to refetch on each cycle.
The first signal is freshness deadline. Every SKU has a maximum staleness budget (say 24 hours), and any SKU older than its budget gets refreshed.
The second signal is volatility. SKUs that have changed price recently get higher refresh priority because they are more likely to change again. SKUs that have been stable for weeks can drop to a longer refresh interval.
The third signal is business priority. SKUs that downstream users actually query (tracked by query logs) get higher refresh priority than dormant SKUs that nobody has looked at in months.
from datetime import datetime, timezone

def schedule_refresh(sku_id: int, last_changed_at, last_queried_at, last_fetched_at) -> float:
    """Returns a priority score; higher = refresh sooner."""
    now = datetime.now(timezone.utc)
    age = (now - last_fetched_at).total_seconds() / 3600
    volatility = 10 if (now - last_changed_at).days < 7 else 1
    relevance = 5 if (now - last_queried_at).days < 1 else 1
    return age * volatility * relevance
This kind of priority-driven scheduler reduces total request volume by 60-80% compared to blind full snapshots, while keeping the data fresh on the SKUs that actually matter to the business.
Common pitfalls when scraping Rakuten Ichiba
Three issues dominate Rakuten Ichiba scraping. The first is shop-vs-product attribution. Rakuten is fundamentally a marketplace of independent shops, each with its own URL pattern, stock policy, and shipping terms. The same JAN code can appear in 50+ shops at different prices. Aggregating to the JAN level without preserving shop_id loses the price-dispersion signal that makes Rakuten data interesting.
The second is point-multiplier inflation. Rakuten campaigns layer point bonuses (5x, 10x, even 20x on Super Sale days). The visible price is in JPY but the effective price after points can be 10-15% lower for SPU members. A scraper that ignores the point multiplier field misreports the realized cost to the buyer. Capture both price and point_multiplier and compute the effective price downstream.
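A sketch of that downstream computation, assuming the standard base rate of 1 point per 100 JPY, so an Nx multiplier returns roughly N% of the price as points (actual SPU math depends on the buyer's membership tier and campaign stacking):

def effective_price(item_price: int, point_multiplier: int, base_rate: float = 0.01) -> float:
    # Assumption: points are worth 1 JPY each and the base rate is 1%, so a
    # 10x campaign rebates roughly 10% of the visible price.
    return round(item_price * (1 - base_rate * point_multiplier), 2)

# effective_price(10000, 10) -> 9000.0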
The third is character encoding edge cases. Some legacy shops still serve Shift-JIS or EUC-JP rather than UTF-8. Modern Rakuten infrastructure normalizes most pages to UTF-8 but shop-hosted product descriptions can leak through with the original encoding. Detect encoding per response rather than assuming UTF-8 globally.
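A per-response decoder sketch, honouring a declared charset first and falling back to statistical detection; it assumes the charset_normalizer package is available for the fallback:

from charset_normalizer import from_bytes

def decode_body(raw: bytes, content_type: str) -> str:
    declared = content_type.lower()
    # If the header declares a Japanese legacy encoding (or UTF-8), trust it.
    for name in ("shift_jis", "shift-jis", "euc-jp", "utf-8"):
        if name in declared:
            return raw.decode(name, errors="replace")
    # Otherwise let charset_normalizer guess from the byte payload.
    best = from_bytes(raw).best()
    return str(best) if best else raw.decode("utf-8", errors="replace")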
FAQ
Do I need a Japanese IP to use the official API?
No. The IchibaItem Search API is accessible from anywhere in the world with a valid applicationId. Only the merchant storefront HTML scraping requires Japanese IP addresses for reliable access.
How many applicationIds can I register?
Rakuten allows multiple applicationIds per developer account. The terms of service require each applicationId to correspond to a real application, and aggressive parallelization across many IDs can trigger account-level throttling. For most production use, 3 to 5 IDs is sufficient and avoids the account-level scrutiny.
Does the API include stock counts?
No. The IchibaItem response includes an availability flag but not numerical stock. For stock-level data you have to scrape the storefront HTML, where merchants sometimes publish real-time stock indicators.
What about Rakuten Books and Rakuten Travel?
Rakuten publishes separate APIs for each major property: BooksTotal Search, TravelHotel Search, and so on. They follow the same registration model and similar rate limits. The applicationId is shared across all Rakuten APIs, so a single registration unlocks the whole family.
Can I scrape Rakuten reviews via the API?
The official API does not expose review text, only aggregate scores and counts. To collect review text you have to scrape the storefront HTML, which raises personal data considerations because reviewer display names are visible. Limit your collection to anonymized aggregates or work with explicit consent if reviewer identity matters.
Is the official Rakuten Ichiba API still useful in 2026?
Yes for catalog discovery and basic price data, but rate limits make it impractical for minute-level monitoring. Most production scrapers blend the official API for breadth with targeted scraping for depth.
How do I track Rakuten Super Sale price drops accurately?
Take baseline snapshots 14 days before the event and compare against intra-event hourly snapshots. Many sellers raise prices in the week before to inflate the apparent discount.
To pair this with a broader Asia ecommerce intelligence stack, browse the ecommerce scraping category for tooling reviews, proxy comparisons, and framework deep dives that build on the patterns above.