How to Scrape FDA and BPOM Regulatory Filings with Proxies
Regulatory filings are among the most valuable data sources in pharmaceutical intelligence. The US Food and Drug Administration (FDA) and Indonesia’s Badan Pengawas Obat dan Makanan (BPOM) publish extensive databases of drug approvals, safety alerts, inspection reports, and regulatory actions that provide critical competitive intelligence.
Monitoring these regulatory databases manually is impractical for organizations tracking hundreds or thousands of products across multiple markets. Automated data collection using proxies enables systematic, real-time monitoring of regulatory activity that supports informed decision-making.
This guide provides practical instructions for scraping both FDA and BPOM regulatory databases using mobile proxies, with strategies applicable to other regulatory agencies across Southeast Asia.
Understanding the Regulatory Data Landscape
FDA Data Sources
The FDA publishes regulatory data across multiple platforms:
Drugs@FDA
- Approved drug products with labeling
- New Drug Applications (NDAs) and Abbreviated New Drug Applications (ANDAs)
- Approval history and supplemental approvals
Orange Book
- Lists of approved drug products with therapeutic equivalence evaluations
- Patent information and exclusivity data
- Critical for generic drug timing
FAERS (FDA Adverse Event Reporting System)
- Adverse event reports from healthcare professionals and consumers
- Quarterly data extracts available for download
- Valuable for pharmacovigilance
FDA Warning Letters
- Regulatory actions against companies for compliance violations
- Publicly searchable database
- Valuable for risk assessment and due diligence
FDA Inspections Database
- Inspection observations (Form 483s)
- Establishment inspection reports
- Compliance history by facility
openFDA API
- Structured API access to multiple FDA datasets
- Rate-limited (higher limits with a free API key) but well-documented
- Covers drugs, devices, food, and more
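openFDA queries are plain `field:"term"` expressions joined with `AND`. Before wiring up any requests, it helps to build those expressions with a small helper; the function below is our own illustrative sketch of the openFDA query syntax, not part of any FDA library.

```python
def build_openfda_search(**field_terms):
    """Join field:"term" pairs into an openFDA search expression.

    openFDA field names contain dots (e.g. openfda.brand_name), which
    are not valid Python keyword names, so double underscores in the
    keyword are translated back to dots. Terms are always quoted.
    """
    clauses = []
    for field, term in field_terms.items():
        field = field.replace("__", ".")
        clauses.append(f'{field}:"{term}"')
    return " AND ".join(clauses)


query = build_openfda_search(
    openfda__brand_name="Lipitor",
    sponsor_name="Pfizer",
)
# → 'openfda.brand_name:"Lipitor" AND sponsor_name:"Pfizer"'
```

The same expression works across openFDA endpoints (drugsfda, event, enforcement), which keeps query construction in one place.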
BPOM Data Sources
BPOM, Indonesia’s food and drug regulatory agency, provides:
Product Registration Database (Cek BPOM)
- Registered pharmaceuticals, food products, cosmetics, and supplements
- Registration numbers and product details
- Search functionality for product verification
Safety Alerts and Recalls
- Public warnings about unsafe or unregistered products
- Product recall notices
- Regulatory enforcement actions
E-Registration System
- Online registration portal for product submissions
- Status tracking for pending registrations
BPOM Regulations and Guidelines
- Published regulatory guidance documents
- Updated requirements and standards
Other SEA Regulatory Sources
- HSA Singapore: Product registration, safety alerts, guidance documents
- Thai FDA: Drug registration database, safety notifications
- Philippine FDA: Product registration, post-market surveillance
- NPRA Malaysia: Product registration database, regulatory updates
- DAV Vietnam: Drug registration, regulatory guidance
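To extend collection to these agencies without hard-coding each one, it helps to describe every source with a uniform record. The structure below is an illustrative sketch: the agency names are real, but the field choices and the `has_api` values are our assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegulatorySource:
    agency: str    # e.g. "HSA", "BPOM"
    country: str   # ISO country code; also selects the proxy region
    language: str  # Accept-Language header to send, e.g. "id-ID"
    has_api: bool  # True if an official public API exists


SOURCES = [
    RegulatorySource("FDA", "US", "en-US", True),   # openFDA
    RegulatorySource("BPOM", "ID", "id-ID", False),
    RegulatorySource("HSA", "SG", "en-SG", False),
    RegulatorySource("NPRA", "MY", "ms-MY", False),
]

# Sources without an API must be scraped from their web interfaces
scrape_only = [s.agency for s in SOURCES if not s.has_api]
```

A collector can then iterate over `SOURCES`, pick the matching proxy and `Accept-Language`, and dispatch to an API client or a scraper per source.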
Technical Challenges
FDA Scraping Challenges
- Rate limiting: FDA websites enforce request limits, especially on Drugs@FDA and FAERS
- Dynamic content: Some FDA interfaces use JavaScript rendering
- Anti-bot measures: Increasing use of bot detection on FDA web properties
- Data volume: FAERS alone contains millions of adverse event reports
- Format inconsistency: Different FDA databases use different data formats
BPOM Scraping Challenges
- Geo-restrictions: BPOM databases may perform differently when accessed from outside Indonesia
- Language: Content primarily in Bahasa Indonesia
- Interface changes: BPOM website undergoes periodic redesigns
- CAPTCHA protection: Search functions may require CAPTCHA solving
- Limited API access: No official public API for most databases
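Both sets of challenges call for retries with growing delays rather than hammering a rate-limited or flaky endpoint. A minimal sketch of exponential backoff with full jitter (the base and cap values here are arbitrary choices, not recommendations from either agency):

```python
import random
import time


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff with full jitter: the ceiling grows as
    base * 2**attempt (capped), and a uniform random fraction of it
    is used so that concurrent scrapers do not retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def fetch_with_retries(fetch, max_attempts=5):
    """Call fetch() until it returns a truthy result or attempts run
    out. fetch is any zero-argument callable that returns None or
    False on a retryable failure (timeout, 429, empty page)."""
    for attempt in range(max_attempts):
        result = fetch()
        if result:
            return result
        time.sleep(backoff_delay(attempt))
    return None
```

Wrapping each request in `fetch_with_retries` keeps the retry policy in one place instead of scattering `time.sleep` calls through every collector.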
Setting Up Regulatory Filing Collection
Proxy Configuration
```python
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import json
import re


class RegulatoryFilingScraper:
    def __init__(self, proxy_user, proxy_pass):
        # One mobile proxy gateway per target market; the country code
        # doubles as the lookup key used by the collectors below.
        gateways = {"US": "us", "ID": "id", "SG": "sg", "TH": "th"}
        self.proxies = {
            country: {
                scheme: (
                    f"http://{proxy_user}:{proxy_pass}"
                    f"@{prefix}-mobile.dataresearchtools.com:8080"
                )
                for scheme in ("http", "https")
            }
            for country, prefix in gateways.items()
        }

    def get_mobile_headers(self, lang="en-US"):
        return {
            "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-S918B) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0.0.0 Mobile Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
            "Accept-Language": f"{lang},{lang.split('-')[0]};q=0.9",
        }
```

FDA Data Collection
Using openFDA API
```python
class FDACollector:
    def __init__(self, scraper):
        self.scraper = scraper
        self.openfda_base = "https://api.fda.gov"

    def search_drug_approvals(self, drug_name, limit=100):
        """Search for drug approvals via the openFDA API."""
        results = []
        skip = 0
        while skip < limit:
            try:
                response = requests.get(
                    f"{self.openfda_base}/drug/drugsfda.json",
                    params={
                        "search": f'openfda.brand_name:"{drug_name}"',
                        "limit": min(100, limit - skip),
                        "skip": skip,
                    },
                    proxies=self.scraper.proxies["US"],
                    timeout=30,
                )
                if response.status_code == 200:
                    data = response.json()
                    batch = data.get("results", [])
                    if not batch:
                        break
                    results.extend(batch)
                    skip += len(batch)
                elif response.status_code == 429:
                    # Rate limited: back off before retrying
                    time.sleep(10)
                    continue
                else:
                    break
                time.sleep(1)
            except Exception as e:
                print(f"Error querying openFDA: {e}")
                break
        return results

    def search_adverse_events(self, drug_name, date_range=None):
        """Search FAERS for adverse event reports."""
        search_query = f'patient.drug.openfda.brand_name:"{drug_name}"'
        if date_range:
            search_query += (
                f' AND receivedate:[{date_range["start"]} TO '
                f'{date_range["end"]}]'
            )
        try:
            response = requests.get(
                f"{self.openfda_base}/drug/event.json",
                params={"search": search_query, "limit": 100},
                proxies=self.scraper.proxies["US"],
                timeout=30,
            )
            if response.status_code == 200:
                return response.json().get("results", [])
        except Exception as e:
            print(f"Error querying FAERS: {e}")
        return []

    def get_enforcement_reports(self, company_name=None, date_from=None):
        """Collect FDA drug enforcement (recall) reports.

        Note: openFDA has no warning-letter endpoint; warning letters
        must be scraped from the fda.gov search pages directly. This
        method queries the drug enforcement dataset instead.
        """
        # recalling_firm and report_date are openFDA enforcement fields
        clauses = []
        if company_name:
            clauses.append(f'recalling_firm:"{company_name}"')
        if date_from:
            today = datetime.utcnow().strftime("%Y%m%d")
            clauses.append(f"report_date:[{date_from} TO {today}]")
        params = {"limit": 100}
        if clauses:
            params["search"] = " AND ".join(clauses)
        try:
            response = requests.get(
                f"{self.openfda_base}/drug/enforcement.json",
                params=params,
                proxies=self.scraper.proxies["US"],
                timeout=30,
            )
            if response.status_code == 200:
                return response.json().get("results", [])
        except Exception as e:
            print(f"Error fetching enforcement reports: {e}")
        return []
```

Scraping Drugs@FDA
```python
# Additional FDACollector methods

def scrape_drugs_at_fda(self, application_number):
    """Scrape detailed drug information from Drugs@FDA."""
    url = (
        "https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm"
        f"?event=overview.process&ApplNo={application_number}"
    )
    try:
        response = requests.get(
            url,
            proxies=self.scraper.proxies["US"],
            headers=self.scraper.get_mobile_headers(),
            timeout=30,
        )
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            drug_info = {
                "application_number": application_number,
                "drug_name": self.extract_field(soup, "Drug Name"),
                "active_ingredient": self.extract_field(
                    soup, "Active Ingredient"
                ),
                "dosage_form": self.extract_field(soup, "Dosage Form/Route"),
                "company": self.extract_field(soup, "Company"),
                "approval_date": self.extract_field(soup, "Approval Date"),
                "application_type": self.extract_field(
                    soup, "Application Type"
                ),
                "collected_at": datetime.utcnow().isoformat(),
            }
            # Approval-history parser (get_approval_history) not shown here
            drug_info["approval_history"] = self.get_approval_history(
                soup, application_number
            )
            return drug_info
    except Exception as e:
        print(f"Error scraping Drugs@FDA: {e}")
    return None


def extract_field(self, soup, field_name):
    """Extract a labeled field from the page."""
    label = soup.find(string=re.compile(field_name, re.IGNORECASE))
    if label:
        parent = label.find_parent("tr") or label.find_parent("div")
        if parent:
            cells = parent.find_all("td")
            if len(cells) >= 2:
                return cells[1].get_text(strip=True)
    return None
```

BPOM Data Collection
```python
class BPOMCollector:
    def __init__(self, scraper):
        self.scraper = scraper

    def search_registered_products(self, product_name, category="obat"):
        """Search the BPOM registered product database (Cek BPOM)."""
        proxy = self.scraper.proxies["ID"]
        try:
            response = requests.get(
                "https://cekbpom.pom.go.id/search",
                params={"query": product_name, "kategori": category},
                proxies=proxy,
                headers=self.scraper.get_mobile_headers("id-ID"),
                timeout=30,
            )
            if response.status_code == 200:
                return self.parse_bpom_search_results(response.text)
        except Exception as e:
            print(f"Error searching BPOM: {e}")
        return []

    def parse_bpom_search_results(self, html):
        """Parse a BPOM search results page.

        Selectors cover both English and Indonesian class names, since
        the site markup changes between redesigns.
        """
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for item in soup.select(".product-item, .search-result-item"):
            registration_no = item.select_one(
                ".registration-number, .nomor-registrasi"
            )
            product_name = item.select_one(".product-name, .nama-produk")
            manufacturer = item.select_one(".manufacturer, .produsen")
            category = item.select_one(".category, .kategori")
            if product_name:
                results.append({
                    "registration_number": (
                        registration_no.get_text(strip=True)
                        if registration_no else None
                    ),
                    "product_name": product_name.get_text(strip=True),
                    "manufacturer": (
                        manufacturer.get_text(strip=True)
                        if manufacturer else None
                    ),
                    "category": (
                        category.get_text(strip=True) if category else None
                    ),
                    "source": "BPOM",
                    "country": "ID",
                    "collected_at": datetime.utcnow().isoformat(),
                })
        return results

    def get_product_details(self, registration_number):
        """Get detailed product information from BPOM."""
        proxy = self.scraper.proxies["ID"]
        try:
            response = requests.get(
                f"https://cekbpom.pom.go.id/product/{registration_number}",
                proxies=proxy,
                headers=self.scraper.get_mobile_headers("id-ID"),
                timeout=30,
            )
            if response.status_code == 200:
                # Detail-page parser not shown; it mirrors
                # parse_bpom_search_results
                return self.parse_bpom_product_detail(response.text)
        except Exception as e:
            print(f"Error getting BPOM product details: {e}")
        return None

    def monitor_safety_alerts(self):
        """Monitor BPOM safety alerts and recalls (penarikan = recall)."""
        proxy = self.scraper.proxies["ID"]
        try:
            response = requests.get(
                "https://www.pom.go.id/new/view/more/penarikan/obat",
                proxies=proxy,
                headers=self.scraper.get_mobile_headers("id-ID"),
                timeout=30,
            )
            if response.status_code == 200:
                # Alert-list parser not shown
                return self.parse_safety_alerts(response.text)
        except Exception as e:
            print(f"Error monitoring BPOM alerts: {e}")
        return []
```

Change Detection and Alerting
Monitoring for New Approvals
```python
class ApprovalMonitor:
    def __init__(self, fda_collector, bpom_collector, db):
        self.fda = fda_collector
        self.bpom = bpom_collector
        self.db = db

    def check_for_new_approvals(self, watched_drugs):
        """Check for new drug approvals across FDA and BPOM."""
        alerts = []
        for drug in watched_drugs:
            # Check FDA
            fda_results = self.fda.search_drug_approvals(drug["name"])
            for result in fda_results:
                if not self.db.is_known_approval("FDA", result):
                    alerts.append({
                        "agency": "FDA",
                        "drug": drug["name"],
                        "event": "new_approval",
                        "details": result,
                        "detected_at": datetime.utcnow().isoformat(),
                    })
                    self.db.record_approval("FDA", result)
            # Check BPOM
            bpom_results = self.bpom.search_registered_products(drug["name"])
            for result in bpom_results:
                if not self.db.is_known_approval("BPOM", result):
                    alerts.append({
                        "agency": "BPOM",
                        "drug": drug["name"],
                        "event": "new_registration",
                        "details": result,
                        "detected_at": datetime.utcnow().isoformat(),
                    })
                    self.db.record_approval("BPOM", result)
        return alerts
```

Cross-Agency Comparison
```python
def compare_registration_status(drug_name, collectors):
    """Compare drug registration status across agencies."""
    status = {}
    agencies = {
        "FDA": collectors["fda"].search_drug_approvals,
        "BPOM": collectors["bpom"].search_registered_products,
    }
    for agency_name, search_func in agencies.items():
        results = search_func(drug_name)
        status[agency_name] = {
            "registered": len(results) > 0,
            "registration_count": len(results),
            "details": results[:5],
            "checked_at": datetime.utcnow().isoformat(),
        }
    return status
```

Data Normalization
Standardize data from different agencies:
```python
class RegulatoryDataNormalizer:
    # Field names below match the records produced by the collectors
    # above; adjust the mapping if you pull from a different FDA dataset.

    def normalize_fda_data(self, fda_record):
        return {
            "source_agency": "FDA",
            "country": "US",
            "product_name": fda_record.get("brand_name", ""),
            "active_ingredient": fda_record.get("active_ingredient", ""),
            "registration_type": fda_record.get("application_type", ""),
            "registration_id": fda_record.get("application_number", ""),
            "status": "approved",
            "approval_date": fda_record.get("approval_date"),
            "company": fda_record.get("sponsor_name", ""),
            "normalized_at": datetime.utcnow().isoformat(),
        }

    def normalize_bpom_data(self, bpom_record):
        return {
            "source_agency": "BPOM",
            "country": "ID",
            "product_name": bpom_record.get("product_name", ""),
            # "kandungan" is Indonesian for composition/content
            "active_ingredient": bpom_record.get("kandungan", ""),
            "registration_type": bpom_record.get("category", ""),
            "registration_id": bpom_record.get("registration_number", ""),
            "status": "registered",
            "approval_date": bpom_record.get("registration_date"),
            "company": bpom_record.get("manufacturer", ""),
            "normalized_at": datetime.utcnow().isoformat(),
        }
```

Best Practices
- Use the openFDA API first: For FDA data, start with the official API and supplement with web scraping only when the API is insufficient.
- Use Indonesian mobile proxies for BPOM: DataResearchTools mobile proxies in Indonesia provide the most reliable access to BPOM databases, which may serve different content to international visitors.
- Implement robust error handling: Regulatory websites experience downtime and changes. Build resilient scrapers that log errors and retry intelligently.
- Cache extensively: Regulatory data rarely changes once published. Cache approved drug records to reduce unnecessary requests.
- Monitor for website changes: Regulatory agencies periodically redesign their websites. Set up alerts when your parsers start returning unexpected results.
- Cross-reference sources: Validate data collected from one agency against other sources to ensure accuracy.
- Respect rate limits: Even with proxy rotation through DataResearchTools, implement reasonable delays. Sustainable access is more valuable than maximum speed.
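The caching advice above can start as simply as a keyed in-memory store with a revisit interval; swap in SQLite or Redis for production. This sketch keys records by agency and registration id (the id shown in the usage example is made up) and treats entries older than the TTL as stale:

```python
import time


class ApprovalCache:
    """Tiny TTL cache for regulatory records, keyed by
    (agency, registration_id). Approved records rarely change,
    so a long TTL (here 30 days) is reasonable."""

    def __init__(self, ttl_seconds=30 * 24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, agency, reg_id):
        entry = self._store.get((agency, reg_id))
        if entry is None:
            return None
        record, stored_at = entry
        if time.time() - stored_at > self.ttl:
            return None  # stale: caller should refetch
        return record

    def put(self, agency, reg_id, record):
        self._store[(agency, reg_id)] = (record, time.time())


cache = ApprovalCache()
cache.put("FDA", "NDA000000", {"status": "approved"})  # illustrative id
hit = cache.get("FDA", "NDA000000")
# A hit within the TTL avoids a network request entirely
```

Checking the cache before calling any collector method turns repeated monitoring runs into mostly local lookups.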
Conclusion
Scraping FDA and BPOM regulatory filings provides essential intelligence for pharmaceutical companies operating across US and Indonesian markets. DataResearchTools mobile proxies in the US and Indonesia ensure reliable access to these critical databases, enabling automated monitoring of drug approvals, safety alerts, and regulatory actions.
By extending this approach to other SEA regulatory agencies using DataResearchTools mobile proxies in Singapore, Thailand, the Philippines, Malaysia, and Vietnam, you can build a comprehensive regulatory intelligence system covering all major markets in the region.
Start monitoring regulatory filings with DataResearchTools today and stay ahead of regulatory developments that affect your pharmaceutical business.
Related Reading
- How AI + Proxies Are Transforming Drug Discovery Data Pipelines
- Best Proxies for Healthcare Data Collection in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix