How to Scrape FDA and BPOM Regulatory Filings with Proxies

Regulatory filings are among the most valuable data sources in pharmaceutical intelligence. The US Food and Drug Administration (FDA) and Indonesia’s Badan Pengawas Obat dan Makanan (BPOM) publish extensive databases of drug approvals, safety alerts, inspection reports, and regulatory actions that provide critical competitive intelligence.

Monitoring these regulatory databases manually is impractical for organizations tracking hundreds or thousands of products across multiple markets. Automated data collection using proxies enables systematic, real-time monitoring of regulatory activity that supports informed decision-making.

This guide provides practical instructions for scraping both FDA and BPOM regulatory databases using mobile proxies, with strategies applicable to other regulatory agencies across Southeast Asia.

Understanding the Regulatory Data Landscape

FDA Data Sources

The FDA publishes regulatory data across multiple platforms:

Drugs@FDA

  • Approved drug products with labeling
  • New Drug Applications (NDAs) and Abbreviated New Drug Applications (ANDAs)
  • Approval history and supplemental approvals

Orange Book

  • Lists of approved drug products with therapeutic equivalence evaluations
  • Patent information and exclusivity data
  • Critical for generic drug timing

FAERS (FDA Adverse Event Reporting System)

  • Adverse event reports from healthcare professionals and consumers
  • Quarterly data extracts available for download
  • Valuable for pharmacovigilance

FDA Warning Letters

  • Regulatory actions against companies for compliance violations
  • Publicly searchable database
  • Valuable for risk assessment and due diligence

FDA Inspections Database

  • Inspection observations (Form 483s)
  • Establishment inspection reports
  • Compliance history by facility

openFDA API

  • Structured API access to multiple FDA datasets
  • Rate-limited but well-documented
  • Covers drugs, devices, food, and more
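To make the API bullet concrete: openFDA takes a Lucene-style `search` parameter built from `field:"value"` clauses joined with AND/OR. A minimal sketch of constructing such a query URL (the endpoint and field name are real openFDA identifiers; "Lipitor" is just a placeholder):

```python
from urllib.parse import urlencode

# openFDA accepts a Lucene-style "search" parameter:
# field:"value" clauses, AND/OR operators, bracketed date ranges.
params = {
    "search": 'openfda.brand_name:"Lipitor"',
    "limit": 10,
}
url = "https://api.fda.gov/drug/drugsfda.json?" + urlencode(params)
```

As of this writing, openFDA allows 240 requests per minute; without an API key the daily cap is 1,000 requests per IP, and a free key raises it to 120,000 per day.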

BPOM Data Sources

BPOM, Indonesia’s food and drug regulatory agency, provides:

Product Registration Database (Cek BPOM)

  • Registered pharmaceuticals, food products, cosmetics, and supplements
  • Registration numbers and product details
  • Search functionality for product verification

Safety Alerts and Recalls

  • Public warnings about unsafe or unregistered products
  • Product recall notices
  • Regulatory enforcement actions

E-Registration System

  • Online registration portal for product submissions
  • Status tracking for pending registrations

BPOM Regulations and Guidelines

  • Published regulatory guidance documents
  • Updated requirements and standards

Other SEA Regulatory Sources

  • HSA Singapore: Product registration, safety alerts, guidance documents
  • Thai FDA: Drug registration database, safety notifications
  • Philippine FDA: Product registration, post-market surveillance
  • NPRA Malaysia: Product registration database, regulatory updates
  • DAV Vietnam: Drug registration, regulatory guidance

Technical Challenges

FDA Scraping Challenges

  • Rate limiting: FDA websites enforce request limits, especially on Drugs@FDA and FAERS
  • Dynamic content: Some FDA interfaces use JavaScript rendering
  • Anti-bot measures: Increasing use of bot detection on FDA web properties
  • Data volume: FAERS alone contains millions of adverse event reports
  • Format inconsistency: Different FDA databases use different data formats

BPOM Scraping Challenges

  • Geo-restrictions: BPOM databases may load slowly or serve different content when accessed from outside Indonesia
  • Language: Content primarily in Bahasa Indonesia
  • Interface changes: BPOM website undergoes periodic redesigns
  • CAPTCHA protection: Search functions may require CAPTCHA solving
  • Limited API access: No official public API for most databases

Setting Up Regulatory Filing Collection

Proxy Configuration

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import json
import re

class RegulatoryFilingScraper:
    def __init__(self, proxy_user, proxy_pass):
        self.proxies = {
            "US": {
                "http": f"http://{proxy_user}:{proxy_pass}@us-mobile.dataresearchtools.com:8080",
                "https": f"http://{proxy_user}:{proxy_pass}@us-mobile.dataresearchtools.com:8080"
            },
            "ID": {
                "http": f"http://{proxy_user}:{proxy_pass}@id-mobile.dataresearchtools.com:8080",
                "https": f"http://{proxy_user}:{proxy_pass}@id-mobile.dataresearchtools.com:8080"
            },
            "SG": {
                "http": f"http://{proxy_user}:{proxy_pass}@sg-mobile.dataresearchtools.com:8080",
                "https": f"http://{proxy_user}:{proxy_pass}@sg-mobile.dataresearchtools.com:8080"
            },
            "TH": {
                "http": f"http://{proxy_user}:{proxy_pass}@th-mobile.dataresearchtools.com:8080",
                "https": f"http://{proxy_user}:{proxy_pass}@th-mobile.dataresearchtools.com:8080"
            }
        }

    def get_mobile_headers(self, lang="en-US"):
        return {
            "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-S918B) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0.0.0 Mobile Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
            "Accept-Language": f"{lang},{lang.split('-')[0]};q=0.9"
        }

FDA Data Collection

Using openFDA API

class FDACollector:
    def __init__(self, scraper):
        self.scraper = scraper
        self.openfda_base = "https://api.fda.gov"

    def search_drug_approvals(self, drug_name, limit=100):
        """Search for drug approvals via openFDA API"""
        results = []
        skip = 0

        while skip < limit:
            try:
                response = requests.get(
                    f"{self.openfda_base}/drug/drugsfda.json",
                    params={
                        "search": f'openfda.brand_name:"{drug_name}"',
                        "limit": min(100, limit - skip),
                        "skip": skip
                    },
                    proxies=self.scraper.proxies["US"],
                    timeout=30
                )

                if response.status_code == 200:
                    data = response.json()
                    batch = data.get("results", [])
                    if not batch:
                        break
                    results.extend(batch)
                    skip += len(batch)
                elif response.status_code == 429:
                    time.sleep(10)
                    continue
                else:
                    break

                time.sleep(1)
            except Exception as e:
                print(f"Error querying openFDA: {e}")
                break

        return results

    def search_adverse_events(self, drug_name, date_range=None):
        """Search FAERS for adverse event reports"""
        search_query = f'patient.drug.openfda.brand_name:"{drug_name}"'

        if date_range:
            search_query += (
                f' AND receivedate:[{date_range["start"]} TO '
                f'{date_range["end"]}]'
            )

        try:
            response = requests.get(
                f"{self.openfda_base}/drug/event.json",
                params={
                    "search": search_query,
                    "limit": 100
                },
                proxies=self.scraper.proxies["US"],
                timeout=30
            )

            if response.status_code == 200:
                return response.json().get("results", [])
        except Exception as e:
            print(f"Error querying FAERS: {e}")
        return []

    def get_enforcement_reports(self, company_name=None, date_from=None):
        """Collect FDA drug enforcement (recall) reports.

        Note: openFDA exposes recall enforcement reports, not FDA
        warning letters; warning letters live in a separate
        searchable database on fda.gov.
        """
        clauses = []
        if company_name:
            clauses.append(f'recalling_firm:"{company_name}"')
        if date_from:
            clauses.append(f'report_date:[{date_from} TO *]')

        params = {"limit": 100}
        if clauses:
            params["search"] = " AND ".join(clauses)

        try:
            response = requests.get(
                f"{self.openfda_base}/drug/enforcement.json",
                params=params,
                proxies=self.scraper.proxies["US"],
                timeout=30
            )
            if response.status_code == 200:
                return response.json().get("results", [])
        except Exception as e:
            print(f"Error fetching enforcement reports: {e}")
        return []

Scraping Drugs@FDA

# The following methods belong to the FDACollector class defined above;
# get_approval_history (referenced below) is left to the reader.
def scrape_drugs_at_fda(self, application_number):
    """Scrape detailed drug information from Drugs@FDA"""
    url = (f"https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm"
           f"?event=overview.process&ApplNo={application_number}")

    try:
        response = requests.get(
            url,
            proxies=self.scraper.proxies["US"],
            headers=self.scraper.get_mobile_headers(),
            timeout=30
        )

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")

            drug_info = {
                "application_number": application_number,
                "drug_name": self.extract_field(soup, "Drug Name"),
                "active_ingredient": self.extract_field(
                    soup, "Active Ingredient"
                ),
                "dosage_form": self.extract_field(soup, "Dosage Form/Route"),
                "company": self.extract_field(soup, "Company"),
                "approval_date": self.extract_field(soup, "Approval Date"),
                "application_type": self.extract_field(
                    soup, "Application Type"
                ),
                "collected_at": datetime.utcnow().isoformat()
            }

            # Get approval history
            drug_info["approval_history"] = self.get_approval_history(
                soup, application_number
            )

            return drug_info
    except Exception as e:
        print(f"Error scraping Drugs@FDA: {e}")
    return None

def extract_field(self, soup, field_name):
    """Extract a labeled field from the page"""
    label = soup.find(string=re.compile(field_name, re.IGNORECASE))
    if label:
        parent = label.find_parent("tr") or label.find_parent("div")
        if parent:
            cells = parent.find_all("td")
            if len(cells) >= 2:
                return cells[1].get_text(strip=True)
    return None

BPOM Data Collection

class BPOMCollector:
    def __init__(self, scraper):
        self.scraper = scraper

    def search_registered_products(self, product_name, category="obat"):
        """Search BPOM registered product database"""
        proxy = self.scraper.proxies["ID"]

        try:
            response = requests.get(
                "https://cekbpom.pom.go.id/search",
                params={
                    "query": product_name,
                    "kategori": category
                },
                proxies=proxy,
                headers=self.scraper.get_mobile_headers("id-ID"),
                timeout=30
            )

            if response.status_code == 200:
                return self.parse_bpom_search_results(response.text)
        except Exception as e:
            print(f"Error searching BPOM: {e}")
        return []

    def parse_bpom_search_results(self, html):
        """Parse BPOM search results page"""
        soup = BeautifulSoup(html, "html.parser")
        results = []

        product_items = soup.select(".product-item, .search-result-item")
        for item in product_items:
            registration_no = item.select_one(
                ".registration-number, .nomor-registrasi"
            )
            product_name = item.select_one(
                ".product-name, .nama-produk"
            )
            manufacturer = item.select_one(
                ".manufacturer, .produsen"
            )
            category = item.select_one(
                ".category, .kategori"
            )

            if product_name:
                results.append({
                    "registration_number": (
                        registration_no.get_text(strip=True)
                        if registration_no else None
                    ),
                    "product_name": product_name.get_text(strip=True),
                    "manufacturer": (
                        manufacturer.get_text(strip=True)
                        if manufacturer else None
                    ),
                    "category": (
                        category.get_text(strip=True) if category else None
                    ),
                    "source": "BPOM",
                    "country": "ID",
                    "collected_at": datetime.utcnow().isoformat()
                })

        return results

    def get_product_details(self, registration_number):
        """Get detailed product information from BPOM"""
        proxy = self.scraper.proxies["ID"]

        try:
            response = requests.get(
                f"https://cekbpom.pom.go.id/product/{registration_number}",
                proxies=proxy,
                headers=self.scraper.get_mobile_headers("id-ID"),
                timeout=30
            )

            if response.status_code == 200:
                return self.parse_bpom_product_detail(response.text)
        except Exception as e:
            print(f"Error getting BPOM product details: {e}")
        return None

    def monitor_safety_alerts(self):
        """Monitor BPOM safety alerts and recalls"""
        proxy = self.scraper.proxies["ID"]

        try:
            response = requests.get(
                "https://www.pom.go.id/new/view/more/penarikan/obat",
                proxies=proxy,
                headers=self.scraper.get_mobile_headers("id-ID"),
                timeout=30
            )

            if response.status_code == 200:
                return self.parse_safety_alerts(response.text)
        except Exception as e:
            print(f"Error monitoring BPOM alerts: {e}")
        return []

Change Detection and Alerting

Monitoring for New Approvals

class ApprovalMonitor:
    def __init__(self, fda_collector, bpom_collector, db):
        self.fda = fda_collector
        self.bpom = bpom_collector
        self.db = db

    def check_for_new_approvals(self, watched_drugs):
        """Check for new drug approvals across FDA and BPOM"""
        alerts = []

        for drug in watched_drugs:
            # Check FDA
            fda_results = self.fda.search_drug_approvals(drug["name"])
            for result in fda_results:
                if not self.db.is_known_approval("FDA", result):
                    alerts.append({
                        "agency": "FDA",
                        "drug": drug["name"],
                        "event": "new_approval",
                        "details": result,
                        "detected_at": datetime.utcnow().isoformat()
                    })
                    self.db.record_approval("FDA", result)

            # Check BPOM
            bpom_results = self.bpom.search_registered_products(drug["name"])
            for result in bpom_results:
                if not self.db.is_known_approval("BPOM", result):
                    alerts.append({
                        "agency": "BPOM",
                        "drug": drug["name"],
                        "event": "new_registration",
                        "details": result,
                        "detected_at": datetime.utcnow().isoformat()
                    })
                    self.db.record_approval("BPOM", result)

        return alerts

Cross-Agency Comparison

def compare_registration_status(drug_name, collectors):
    """Compare drug registration status across agencies"""
    status = {}

    agencies = {
        "FDA": collectors["fda"].search_drug_approvals,
        "BPOM": collectors["bpom"].search_registered_products,
    }

    for agency_name, search_func in agencies.items():
        results = search_func(drug_name)
        status[agency_name] = {
            "registered": len(results) > 0,
            "registration_count": len(results),
            "details": results[:5],
            "checked_at": datetime.utcnow().isoformat()
        }

    return status

Data Normalization

Standardize data from different agencies:

class RegulatoryDataNormalizer:
    def normalize_fda_data(self, fda_record):
        return {
            "source_agency": "FDA",
            "country": "US",
            "product_name": fda_record.get("brand_name", ""),
            "active_ingredient": fda_record.get("active_ingredient", ""),
            "registration_type": fda_record.get("application_type", ""),
            "registration_id": fda_record.get("application_number", ""),
            "status": "approved",
            "approval_date": fda_record.get("approval_date"),
            "company": fda_record.get("sponsor_name", ""),
            "normalized_at": datetime.utcnow().isoformat()
        }

    def normalize_bpom_data(self, bpom_record):
        return {
            "source_agency": "BPOM",
            "country": "ID",
            "product_name": bpom_record.get("product_name", ""),
            "active_ingredient": bpom_record.get("kandungan", ""),
            "registration_type": bpom_record.get("category", ""),
            "registration_id": bpom_record.get("registration_number", ""),
            "status": "registered",
            "approval_date": bpom_record.get("registration_date"),
            "company": bpom_record.get("manufacturer", ""),
            "normalized_at": datetime.utcnow().isoformat()
        }

Best Practices

  1. Use the openFDA API first: For FDA data, start with the official API and supplement with web scraping only when the API is insufficient.
  2. Use Indonesian mobile proxies for BPOM: DataResearchTools mobile proxies in Indonesia provide the most reliable access to BPOM databases, which may serve different content to international visitors.
  3. Implement robust error handling: Regulatory websites experience downtime and changes. Build resilient scrapers that log errors and retry intelligently.
  4. Cache extensively: Regulatory data does not change once published. Cache approved drug records to reduce unnecessary requests.
  5. Monitor for website changes: Regulatory agencies periodically redesign their websites. Set up alerts when your parsers start returning unexpected results.
  6. Cross-reference sources: Validate data collected from one agency against other sources to ensure accuracy.
  7. Respect rate limits: Even with proxy rotation through DataResearchTools, implement reasonable delays. Sustainable access is more valuable than maximum speed.
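The caching practice can be as simple as a file-backed store keyed by URL. A sketch using the stdlib `shelve` module, assuming published records never change (so entries never expire):

```python
import hashlib
import shelve

def cached_fetch(url, fetch, cache_path="regulatory_cache"):
    """Return the response body for url, fetching at most once.

    fetch(url) should return the response text. Because regulatory
    records are immutable once published, entries here never expire;
    add a TTL if you also cache search-result pages, which do change.
    """
    key = hashlib.sha256(url.encode()).hexdigest()
    with shelve.open(cache_path) as cache:
        if key not in cache:
            cache[key] = fetch(url)
        return cache[key]
```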

Conclusion

Scraping FDA and BPOM regulatory filings provides essential intelligence for pharmaceutical companies operating across US and Indonesian markets. DataResearchTools mobile proxies in the US and Indonesia ensure reliable access to these critical databases, enabling automated monitoring of drug approvals, safety alerts, and regulatory actions.

By extending this approach to other SEA regulatory agencies using DataResearchTools mobile proxies in Singapore, Thailand, the Philippines, Malaysia, and Vietnam, you can build a comprehensive regulatory intelligence system covering all major markets in the region.

Start monitoring regulatory filings with DataResearchTools today and stay ahead of regulatory developments that affect your pharmaceutical business.

