Scraping Court Records and Legal Databases with Proxies

Court records and legal databases contain information critical to due diligence, litigation research, compliance monitoring, and risk assessment. Across Southeast Asia, court systems are increasingly digitizing their records, making automated extraction both possible and valuable.

However, legal databases present unique scraping challenges, including strict access controls, sensitive data considerations, and complex navigation structures. This guide covers how to approach court record scraping with the right proxy infrastructure and technical strategies.

Why Court Records Matter for Business

Due Diligence

Before entering business partnerships, making acquisitions, or extending credit, organizations need to know if counterparties have pending litigation, judgments against them, or histories of legal disputes.

Litigation Research

Law firms and legal departments use court records to research case precedents, track opposing counsel strategies, and monitor relevant legal developments.

Compliance and Risk Management

Companies in regulated industries must monitor legal proceedings that could affect their operations, supply chain partners, or industry regulations.

Investigative Research

Journalists, researchers, and NGOs use court records to investigate corruption, corporate misconduct, and governance issues.

Real Estate and Property

Court records reveal liens, foreclosures, and property disputes that affect real estate transactions and investment decisions.

Court Record Systems in Southeast Asia

Singapore

  • Supreme Court (supremecourt.gov.sg): High Court and Court of Appeal decisions
  • State Courts (statecourts.gov.sg): District and Magistrate Court records
  • LawNet (lawnet.sg): Comprehensive legal database (subscription required for full access)
  • Singapore Law Watch: Free access to selected judgments and legal news

Singapore’s judiciary is highly digitized, with many judgments available online. The eLitigation system manages electronic filing and case management.

Indonesia

  • Direktori Putusan (putusan3.mahkamahagung.go.id): Supreme Court decision directory
  • SIPP (sipp.pn-jakartapusat.go.id): Court information system (per court)
  • Mahkamah Konstitusi (mkri.id): Constitutional Court decisions

Indonesia publishes court decisions through the Supreme Court’s decision directory, which contains millions of decisions across all court levels.

Philippines

  • Supreme Court E-Library (elibrary.judiciary.gov.ph): Supreme Court decisions
  • Court of Appeals decisions published selectively online
  • eCourts system for case status tracking

Thailand

  • Supreme Court (deka.supremecourt.or.th): Published decisions
  • Administrative Court: Published rulings on administrative matters

Malaysia

  • eCourts (efiling.kehakiman.gov.my): Electronic filing and case management
  • CLJ (cljlaw.com): Current Law Journal (subscription database)
  • MLJ (mlj.com.my): Malayan Law Journal (subscription database)

Technical Challenges of Court Record Scraping

Access Controls

Court record systems often employ sophisticated access controls:

  • User registration and authentication requirements
  • CAPTCHA challenges on search pages
  • Rate limiting and IP-based restrictions
  • Session timeouts after short periods of inactivity
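Rate limits and short session timeouts are best handled with polite retries. The sketch below uses hypothetical helper names (`backoff_delay`, `fetch_with_backoff`): it retries a GET with exponential backoff plus jitter and honors a `Retry-After` header when the server sends one, assuming a requests-style session object.

```python
import random
import time

def backoff_delay(attempt, base_delay=5.0, jitter=0.5):
    """Exponential backoff: base_delay * 2^attempt, plus up to
    jitter * delay of random noise to avoid synchronized retries."""
    delay = base_delay * (2 ** attempt)
    return delay + random.uniform(0, delay * jitter)

def fetch_with_backoff(session, url, max_retries=4, base_delay=5.0, **kwargs):
    """Retry a GET on 429/503 responses, honoring Retry-After when present.

    `session` is any object with a requests-style .get(url, ...) method.
    """
    response = None
    for attempt in range(max_retries):
        response = session.get(url, timeout=30, **kwargs)
        if response.status_code not in (429, 503):
            return response
        retry_after = response.headers.get('Retry-After')
        delay = float(retry_after) if retry_after else backoff_delay(attempt, base_delay)
        time.sleep(delay)
    return response
```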

Complex Search Interfaces

Legal databases use multi-criteria search forms that require specific parameters:

  • Case numbers in specific formats
  • Date ranges with particular formatting
  • Category and court type selections
  • Party name matching with exact or fuzzy options
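These form parameters can be normalized in one place before each search. A minimal sketch, assuming hypothetical parameter names (`q`, `date_from`, `date_to`) and a day-month-year date format; inspect each system's search form for the names and formats it actually expects.

```python
from datetime import date

def build_search_params(keyword=None, date_from=None, date_to=None,
                        date_format='%d-%m-%Y'):
    """Build a search query dict, formatting dates the way the target form expects.

    Parameter names and the default date format are illustrative only.
    """
    params = {}
    if keyword:
        params['q'] = keyword.strip()
    if date_from:
        params['date_from'] = date_from.strftime(date_format)
    if date_to:
        params['date_to'] = date_to.strftime(date_format)
    return params
```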

Dynamic Content

Modern court record systems use JavaScript frameworks that load data dynamically:

  • AJAX-powered search results
  • Lazy-loaded case details
  • Paginated results with server-side rendering
  • Pop-up windows for document viewing
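Before reaching for a headless browser on every page, it can be cheaper to detect which pages actually need JavaScript rendering. The heuristic below is a sketch: the result-marker class names and framework hints are assumptions, not selectors from any specific court system.

```python
def needs_browser_rendering(html, result_markers=('result-item', 'case-row')):
    """Heuristic: if the raw HTML contains none of the expected result markers
    but does reference a JS framework bundle, the results are likely loaded
    via AJAX and a headless browser (e.g. Playwright) is needed.

    Marker names and framework hints are illustrative placeholders.
    """
    has_results = any(marker in html for marker in result_markers)
    has_js_app = any(hint in html
                     for hint in ('<app-root', 'data-reactroot', '__NEXT_DATA__'))
    return (not has_results) and has_js_app
```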

Document Formats

Court documents come in various formats:

  • HTML-rendered decisions
  • PDF judgments (both text and scanned)
  • Image-based documents requiring OCR
  • Protected PDFs with copy restrictions
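A processing pipeline can route each document based on how much text a PDF library recovers: scanned, image-only judgments yield little or none and go to OCR instead. The function below is a sketch with an assumed per-page threshold; it takes already-extracted text rather than calling any particular PDF library.

```python
def choose_processing_route(extracted_text, min_chars_per_page=100, page_count=1):
    """Decide how to process a downloaded judgment based on the text a PDF
    library (e.g. pypdf's extract_text) recovered.

    The threshold is a heuristic; tune it per source.
    """
    chars_per_page = len(extracted_text.strip()) / max(page_count, 1)
    if chars_per_page >= min_chars_per_page:
        return 'text'   # embedded text layer is usable as-is
    if extracted_text.strip():
        return 'mixed'  # partial text; OCR may recover the rest
    return 'ocr'        # image-only scan; requires OCR
```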

Proxy Strategy for Legal Database Scraping

Mobile Proxies for Court Systems

Court record systems are public access portals that must remain accessible to citizens. Mobile proxies from DataResearchTools mimic legitimate citizen access, significantly reducing block rates.

Country-Specific Proxy Selection

from urllib.parse import urlparse


class LegalProxyManager:
    """Manage proxies for court record scraping across ASEAN."""

    COURT_DOMAINS = {
        'supremecourt.gov.sg': {'country': 'SG', 'session_type': 'sticky'},
        'statecourts.gov.sg': {'country': 'SG', 'session_type': 'sticky'},
        'putusan3.mahkamahagung.go.id': {'country': 'ID', 'session_type': 'rotating'},
        'elibrary.judiciary.gov.ph': {'country': 'PH', 'session_type': 'sticky'},
        'deka.supremecourt.or.th': {'country': 'TH', 'session_type': 'rotating'},
        'efiling.kehakiman.gov.my': {'country': 'MY', 'session_type': 'sticky'},
    }

    def __init__(self, proxy_base_url):
        self.proxy_base = proxy_base_url

    def get_proxy(self, target_url):
        """Get the appropriate proxy configuration for a court record URL."""
        domain = urlparse(target_url).netloc

        config = self.COURT_DOMAINS.get(domain, {
            'country': 'SG', 'session_type': 'rotating'
        })

        return self._build_proxy(
            country=config['country'],
            session_type=config['session_type']
        )

    def _build_proxy(self, country, session_type):
        """Build a requests-style proxies dict.

        The query-string format below is illustrative; substitute the
        gateway URL scheme your proxy provider actually uses.
        """
        proxy_url = f"{self.proxy_base}?country={country}&session={session_type}"
        return {'http': proxy_url, 'https': proxy_url}

Session Management for Case Research

Court record research often involves navigating from search results to case details to documents. Maintain session continuity:

import random
import time

import requests


class CourtSession:
    """Maintain a consistent session for court record research."""

    def __init__(self, proxy_manager, court_url):
        self.proxy_manager = proxy_manager
        self.court_url = court_url
        self.session = requests.Session()
        self.proxy = proxy_manager.get_proxy(court_url)
        self.session.proxies = self.proxy
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Encoding': 'gzip, deflate, br'
        })

    def search_cases(self, query_params):
        """Search for cases matching criteria."""
        response = self.session.get(
            f"{self.court_url}/search",
            params=query_params,
            timeout=30
        )
        return response

    def get_case_detail(self, case_id):
        """Fetch full case details."""
        time.sleep(random.uniform(3, 6))
        response = self.session.get(
            f"{self.court_url}/case/{case_id}",
            timeout=30
        )
        return response

    def download_document(self, doc_url):
        """Download a case document."""
        time.sleep(random.uniform(2, 4))
        response = self.session.get(doc_url, timeout=120, stream=True)
        return response.content

Building Court Record Scrapers

Indonesia Supreme Court Decisions

The Indonesian Supreme Court decision directory is one of the largest legal databases in ASEAN:

import requests


class IndonesiaCourtScraper:
    """Scraper for Indonesian Supreme Court decision directory."""

    BASE_URL = "https://putusan3.mahkamahagung.go.id"

    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def search_decisions(self, keyword=None, court_type=None,
                         date_from=None, date_to=None, page=1):
        """Search for court decisions."""
        proxy = self.proxy_manager.get_proxy(self.BASE_URL)
        session = requests.Session()
        session.proxies = proxy

        params = {'page': page}
        if keyword:
            params['q'] = keyword
        if court_type:
            params['cat'] = court_type

        response = session.get(
            f"{self.BASE_URL}/search",
            params=params,
            headers={
                'User-Agent': 'Mozilla/5.0 (Linux; Android 13)',
                'Accept-Language': 'id-ID,id;q=0.9'
            },
            timeout=30
        )

        return self.parse_search_results(response.text)

    @staticmethod
    def _text(entry, selector):
        """Return stripped text for the first match of selector, or ''."""
        node = entry.select_one(selector)
        return node.get_text(strip=True) if node else ''

    def parse_search_results(self, html):
        """Parse decision search results."""
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')

        decisions = []
        for entry in soup.select('.result-item'):
            link = entry.select_one('a')
            decisions.append({
                'title': self._text(entry, '.title'),
                'case_number': self._text(entry, '.case-number'),
                'court': self._text(entry, '.court'),
                'date': self._text(entry, '.date'),
                'link': link['href'] if link else ''
            })

        return decisions

    def fetch_decision_detail(self, decision_url):
        """Fetch full decision details."""
        proxy = self.proxy_manager.get_proxy(decision_url)
        session = requests.Session()
        session.proxies = proxy

        response = session.get(decision_url, timeout=30)
        return self.parse_decision(response.text)

    def parse_decision(self, html):
        """Parse a court decision page."""
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')

        detail = {
            'case_number': '',
            'court_level': '',
            'judges': [],
            'parties': {'plaintiff': '', 'defendant': ''},
            'decision_date': '',
            'decision_type': '',
            'summary': '',
            'full_text_url': ''
        }

        # Extract structured fields from the detail page
        info_table = soup.find('table', class_='table')
        if info_table:
            for row in info_table.find_all('tr'):
                cells = row.find_all('td')
                if len(cells) >= 2:
                    key = cells[0].get_text(strip=True).lower()
                    value = cells[1].get_text(strip=True)

                    if 'nomor' in key:
                        detail['case_number'] = value
                    elif 'tingkat' in key:
                        detail['court_level'] = value
                    elif 'hakim' in key:
                        detail['judges'].append(value)
                    elif 'tanggal' in key:
                        detail['decision_date'] = value

        # Look for PDF download link
        pdf_link = soup.find('a', href=lambda x: x and '.pdf' in str(x).lower())
        if pdf_link:
            detail['full_text_url'] = pdf_link['href']

        return detail

Singapore Supreme Court Judgments

import requests
from bs4 import BeautifulSoup


class SingaporeCourtScraper:
    """Scraper for Singapore court judgments."""

    BASE_URL = "https://www.supremecourt.gov.sg"

    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def fetch_recent_judgments(self, year=None, page=1):
        """Fetch recently published judgments."""
        proxy = self.proxy_manager.get_proxy(self.BASE_URL)
        session = requests.Session()
        session.proxies = proxy

        response = session.get(
            f"{self.BASE_URL}/news/supreme-court-judgments",
            params={'year': year, 'page': page},
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
                'Accept-Language': 'en-SG,en;q=0.9'
            },
            timeout=30
        )

        return self.parse_judgment_list(response.text)

    def parse_judgment_list(self, html):
        """Parse the judgment listing page.

        The selectors below are placeholders; inspect the live page
        and adjust them to match its actual markup.
        """
        soup = BeautifulSoup(html, 'html.parser')
        judgments = []
        for entry in soup.select('.judgment-item'):  # placeholder selector
            link = entry.select_one('a')
            judgments.append({
                'title': entry.get_text(strip=True),
                'link': link['href'] if link else ''
            })
        return judgments

Data Processing Pipeline

Entity Extraction

Extract key entities from court records:

class LegalEntityExtractor:
    """Extract entities from court record text."""

    def extract_parties(self, text):
        """Extract plaintiff and defendant names."""
        parties = {'plaintiffs': [], 'defendants': []}

        # Pattern matching for common formats
        import re

        plaintiff_pattern = r'(?:Plaintiff|Penggugat|Pemohon)[\s:]+([^\n]+)'
        defendant_pattern = r'(?:Defendant|Tergugat|Termohon)[\s:]+([^\n]+)'

        for match in re.finditer(plaintiff_pattern, text, re.IGNORECASE):
            parties['plaintiffs'].append(match.group(1).strip())

        for match in re.finditer(defendant_pattern, text, re.IGNORECASE):
            parties['defendants'].append(match.group(1).strip())

        return parties

    def extract_case_references(self, text):
        """Extract references to other cases cited."""
        import re
        patterns = [
            r'\[\d{4}\]\s+\d+\s+SLR\s+\d+',          # Singapore Law Reports
            r'\[\d{4}\]\s+SGHC\s+\d+',                 # Singapore High Court
            r'\[\d{4}\]\s+SGCA\s+\d+',                 # Singapore Court of Appeal
            r'No\.\s+\d+/Pdt\.G/\d{4}/PN\s+\w+',     # Indonesian civil case
            r'G\.R\.\s+No\.\s+\d+',                    # Philippine Supreme Court
        ]

        references = []
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                references.append(match.group())

        return references
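The citation patterns above can be exercised standalone. The passage below is invented for illustration; it cites a Singapore High Court decision and a Philippine Supreme Court case.

```python
import re

# Hypothetical passage (sample text, not from a real judgment).
text = ("Following [2020] SGHC 123 and the reasoning in G.R. No. 183591, "
        "the court held that the clause was enforceable.")

patterns = [
    r'\[\d{4}\]\s+SGHC\s+\d+',   # Singapore High Court
    r'G\.R\.\s+No\.\s+\d+',      # Philippine Supreme Court
]

references = [m.group() for p in patterns for m in re.finditer(p, text)]
print(references)  # ['[2020] SGHC 123', 'G.R. No. 183591']
```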

Case Classification

class CaseClassifier:
    """Classify court cases by subject matter."""

    CATEGORIES = {
        'commercial': ['contract', 'breach', 'commercial', 'business', 'perdagangan'],
        'property': ['property', 'land', 'lease', 'tenant', 'tanah', 'properti'],
        'employment': ['employment', 'termination', 'wages', 'labor', 'ketenagakerjaan'],
        'intellectual_property': ['patent', 'trademark', 'copyright', 'IP', 'merek'],
        'tax': ['tax', 'revenue', 'assessment', 'pajak'],
        'criminal': ['criminal', 'fraud', 'corruption', 'pidana', 'korupsi'],
        'family': ['divorce', 'custody', 'maintenance', 'perceraian'],
        'constitutional': ['constitutional', 'fundamental rights', 'konstitusi'],
    }

    def classify(self, case_text):
        """Classify a case by subject matter."""
        text_lower = case_text.lower()
        scores = {}

        for category, keywords in self.CATEGORIES.items():
            score = sum(text_lower.count(kw) for kw in keywords)
            if score > 0:
                scores[category] = score

        if scores:
            return sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [('unclassified', 0)]
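The scoring logic can be demonstrated on an invented sentence. This is a stripped-down version of the classifier above, with shortened keyword lists for brevity:

```python
# Sample text, invented for illustration.
text = "claim for unpaid wages after wrongful termination of employment"

categories = {
    'commercial': ['contract', 'breach', 'commercial'],
    'employment': ['employment', 'termination', 'wages'],
}

# Count keyword hits per category, keep only non-zero scores, rank descending.
scores = {cat: sum(text.count(kw) for kw in kws)
          for cat, kws in categories.items()}
ranked = sorted(((c, s) for c, s in scores.items() if s > 0),
                key=lambda x: x[1], reverse=True)
print(ranked)  # [('employment', 3)]
```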

Ethical and Legal Considerations

Court record scraping requires careful attention to legal and ethical boundaries:

Public Records Principle

Court records are generally public documents, but access rules vary by jurisdiction:

  • Singapore: Published judgments are freely accessible, but some case details may be restricted
  • Indonesia: The Supreme Court actively publishes decisions online for public access
  • Philippines: Supreme Court decisions are public, but lower court records may have restrictions
  • Malaysia: Some records are behind authentication

Data Sensitivity

Court records may contain sensitive personal information. Handle data responsibly:

  • Do not extract or store unnecessary personal data
  • Implement access controls on your database
  • Comply with data protection laws (PDPA, UU PDP, DPA)
  • Consider anonymizing personal details in your datasets
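For the anonymization step, keyed hashing gives stable pseudonyms without storing names. A minimal sketch; `pseudonymize` is an illustrative helper, and managing the secret key securely is up to you.

```python
import hashlib
import hmac

def pseudonymize(name, secret_key):
    """Replace a personal name with a stable pseudonym.

    HMAC-SHA256 with a secret key yields the same token for the same name,
    so records can still be joined, without the name itself appearing in the
    dataset. Rotating the key breaks linkability by design.
    """
    digest = hmac.new(secret_key, name.strip().lower().encode('utf-8'),
                      hashlib.sha256).hexdigest()
    return f"PERSON_{digest[:12]}"
```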

Terms of Service

Review and respect the terms of service for each court record system. Some explicitly prohibit automated access, while others welcome it.

Scraping Etiquette

  • Use conservative request rates (5-10 second delays)
  • Avoid scraping during business hours when court staff rely on these systems
  • Do not attempt to access restricted or sealed records
  • Cache aggressively to minimize repeat requests
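Aggressive caching is straightforward to bolt on: key each fetched page by its URL and check the cache before touching the court system again. A minimal disk-backed sketch; the class name and file layout are illustrative.

```python
import hashlib
import json
from pathlib import Path

class ResponseCache:
    """Cache fetched pages on disk so repeat runs never re-hit the court system."""

    def __init__(self, cache_dir='court_cache'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # Hash the URL so it is safe to use as a filename.
        key = hashlib.sha256(url.encode('utf-8')).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get(self, url):
        """Return the cached body for a URL, or None on a cache miss."""
        path = self._path(url)
        if path.exists():
            return json.loads(path.read_text())['body']
        return None

    def put(self, url, body):
        """Store a fetched body under its URL."""
        self._path(url).write_text(json.dumps({'url': url, 'body': body}))
```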

DataResearchTools for Legal Research

DataResearchTools provides the proxy infrastructure needed for court record research across Southeast Asia:

  • Multi-country mobile proxies for accessing court systems in every ASEAN jurisdiction
  • Sticky sessions for navigating complex court record interfaces
  • High-trust IPs that are less likely to be flagged by automated security systems
  • Reliable connectivity for consistent data collection

Our proxy network supports the sensitive nature of legal research while providing the technical capabilities needed for effective data extraction.

Conclusion

Court record scraping is a specialized but high-value application of proxy-powered data collection. The legal databases across Southeast Asia contain intelligence that supports due diligence, litigation research, compliance monitoring, and investigative journalism.

With DataResearchTools providing reliable proxy infrastructure across all ASEAN markets, organizations can build systematic court record monitoring capabilities. Start with the jurisdictions most relevant to your needs, build purpose-specific parsers, and always prioritize ethical data handling practices. The legal intelligence you extract will become an invaluable asset for risk management and decision-making.
