Scraping Court Records and Legal Databases with Proxies
Court records and legal databases contain information critical to due diligence, litigation research, compliance monitoring, and risk assessment. Across Southeast Asia, court systems are increasingly digitizing their records, making automated extraction both possible and valuable.
However, legal databases present unique scraping challenges: strict access controls, sensitive data considerations, and complex navigation structures. This guide covers how to approach court record scraping with the right proxy infrastructure and technical strategies.
Why Court Records Matter for Business
Due Diligence
Before entering business partnerships, making acquisitions, or extending credit, organizations need to know if counterparties have pending litigation, judgments against them, or histories of legal disputes.
Litigation Research
Law firms and legal departments use court records to research case precedents, track opposing counsel strategies, and monitor relevant legal developments.
Compliance and Risk Management
Companies in regulated industries must monitor legal proceedings that could affect their operations, supply chain partners, or industry regulations.
Investigative Research
Journalists, researchers, and NGOs use court records to investigate corruption, corporate misconduct, and governance issues.
Real Estate and Property
Court records reveal liens, foreclosures, and property disputes that affect real estate transactions and investment decisions.
Court Record Systems in Southeast Asia
Singapore
- Supreme Court (supremecourt.gov.sg): High Court and Court of Appeal decisions
- State Courts (statecourts.gov.sg): District and Magistrate Court records
- LawNet (lawnet.sg): Comprehensive legal database (subscription required for full access)
- Singapore Law Watch: Free access to selected judgments and legal news
Singapore’s judiciary is highly digitized, with many judgments available online. The eLitigation system manages electronic filing and case management.
Indonesia
- Direktori Putusan (putusan3.mahkamahagung.go.id): Supreme Court decision directory
- SIPP (sipp.pn-jakartapusat.go.id): Court information system (per court)
- Mahkamah Konstitusi (mkri.id): Constitutional Court decisions
Indonesia publishes court decisions through the Supreme Court’s decision directory, which contains millions of decisions across all court levels.
Philippines
- Supreme Court E-Library (elibrary.judiciary.gov.ph): Supreme Court decisions
- Court of Appeals decisions published selectively online
- eCourts system for case status tracking
Thailand
- Supreme Court (deka.supremecourt.or.th): Published decisions
- Administrative Court: Published rulings on administrative matters
Malaysia
- eCourts (efiling.kehakiman.gov.my): Electronic filing and case management
- CLJ (cljlaw.com): Current Law Journal (subscription database)
- MLJ (mlj.com.my): Malayan Law Journal (subscription database)
Technical Challenges of Court Record Scraping
Access Controls
Court record systems often employ sophisticated access controls:
- User registration and authentication requirements
- CAPTCHA challenges on search pages
- Rate limiting and IP-based restrictions
- Session timeouts after short periods of inactivity
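Rate limiting and session timeouts typically surface as HTTP 429 or 5xx responses. A retry policy with exponential backoff and jitter keeps your scraper polite when they do. A minimal sketch (the status codes, base delay, and cap are assumptions, not documented behavior of any particular court system):

```python
import random


def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff with full jitter: ~5s, ~10s, ~20s, ... capped at 5 min."""
    delay = min(cap, base * (2 ** attempt))
    # Full jitter spreads retries so parallel workers don't retry in lockstep
    return random.uniform(0, delay)


def should_retry(status_code, attempt, max_attempts=5):
    """Retry only transient failures, and only a bounded number of times."""
    return status_code in (429, 500, 502, 503) and attempt < max_attempts
```

On a non-retryable status such as 404, fail fast and move on rather than hammering the same URL.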
Complex Search Interfaces
Legal databases use multi-criteria search forms that require specific parameters:
- Case numbers in specific formats
- Date ranges with particular formatting
- Category and court type selections
- Party name matching with exact or fuzzy options
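Before submitting a search, it pays to validate and normalize these parameters locally so malformed queries never reach the court system. A sketch with hypothetical validators — the case-number patterns and the dd-mm-yyyy date format are assumptions that must be confirmed against each system's actual search form:

```python
import re
from datetime import date

# Hypothetical format checks; real formats vary by court and by year.
CASE_NUMBER_PATTERNS = {
    'id_civil': re.compile(r'^\d+/Pdt\.G/\d{4}/PN\s+\w+'),    # e.g. 123/Pdt.G/2020/PN Jkt
    'sg_high_court': re.compile(r'^\[\d{4}\]\s+SGHC\s+\d+$'),  # e.g. [2021] SGHC 45
}


def build_search_params(keyword=None, date_from=None, date_to=None):
    """Assemble a query dict, formatting dates as the form expects (assumed dd-mm-yyyy)."""
    params = {}
    if keyword:
        params['q'] = keyword.strip()
    if date_from:
        params['from'] = date_from.strftime('%d-%m-%Y')
    if date_to:
        params['to'] = date_to.strftime('%d-%m-%Y')
    return params


def looks_like_case_number(value, fmt):
    """Check a candidate case number against a known format before querying."""
    return bool(CASE_NUMBER_PATTERNS[fmt].match(value))
```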
Dynamic Content
Modern court record systems use JavaScript frameworks that load data dynamically:
- AJAX-powered search results
- Lazy-loaded case details
- Paginated results with server-side rendering
- Pop-up windows for document viewing
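Paginated AJAX endpoints generally serve pages until the results run dry, so the pagination loop can be written once and reused with whatever fetch function a given court system needs. A generic sketch with a pluggable fetcher (stopping on the first empty page is an assumption; some systems expose a total count instead):

```python
def iter_all_pages(fetch_page, start=1, max_pages=500):
    """Yield result rows page by page until a page comes back empty.

    fetch_page(page_number) -> list of result dicts for that page.
    max_pages is a hard stop so a misbehaving endpoint can't loop forever.
    """
    for page in range(start, start + max_pages):
        rows = fetch_page(page)
        if not rows:
            break
        yield from rows
```

In practice `fetch_page` would wrap a proxied session request plus the result parser for that court's markup.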
Document Formats
Court documents come in various formats:
- HTML-rendered decisions
- PDF judgments (both text and scanned)
- Image-based documents requiring OCR
- Protected PDFs with copy restrictions
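Routing each downloaded document to the right handler avoids wasting OCR time on files whose text can be extracted directly. A minimal dispatcher keyed on the response's Content-Type — the 200-character cutoff for deciding a PDF is scanned is a heuristic assumption, not a standard:

```python
def classify_document(content_type, extracted_text_len=0):
    """Decide how a court document should be processed.

    content_type: the HTTP Content-Type header of the download.
    extracted_text_len: characters recovered by a PDF text extractor, if tried.
    """
    if 'html' in content_type:
        return 'parse_html'
    if 'pdf' in content_type:
        # A text-layer PDF yields substantial extractable text; a scanned
        # one yields almost none and needs OCR. 200 chars is an assumed cutoff.
        return 'extract_pdf_text' if extracted_text_len > 200 else 'ocr'
    if content_type.startswith('image/'):
        return 'ocr'
    return 'store_raw'
```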
Proxy Strategy for Legal Database Scraping
Mobile Proxies for Court Systems
Court record systems are public access portals that must remain accessible to citizens. Mobile proxies from DataResearchTools mimic legitimate citizen access, significantly reducing block rates.
Country-Specific Proxy Selection
```python
from urllib.parse import urlparse


class LegalProxyManager:
    """Manage proxies for court record scraping across ASEAN."""

    COURT_DOMAINS = {
        'supremecourt.gov.sg': {'country': 'SG', 'session_type': 'sticky'},
        'statecourts.gov.sg': {'country': 'SG', 'session_type': 'sticky'},
        'putusan3.mahkamahagung.go.id': {'country': 'ID', 'session_type': 'rotating'},
        'elibrary.judiciary.gov.ph': {'country': 'PH', 'session_type': 'sticky'},
        'deka.supremecourt.or.th': {'country': 'TH', 'session_type': 'rotating'},
        'efiling.kehakiman.gov.my': {'country': 'MY', 'session_type': 'sticky'},
    }

    def __init__(self, proxy_base_url):
        self.proxy_base = proxy_base_url

    def get_proxy(self, target_url):
        """Get an appropriate proxy for a court record URL."""
        domain = urlparse(target_url).netloc
        config = self.COURT_DOMAINS.get(domain, {
            'country': 'SG', 'session_type': 'rotating'
        })
        return self._build_proxy(
            country=config['country'],
            session_type=config['session_type']
        )

    def _build_proxy(self, country, session_type):
        """Build a requests-style proxy dict. The query-parameter format
        shown here is illustrative; adapt it to your proxy provider's API."""
        proxy_url = f"{self.proxy_base}?country={country}&session={session_type}"
        return {'http': proxy_url, 'https': proxy_url}
```
Session Management for Case Research
Court record research often involves navigating from search results to case details to documents. Maintain session continuity:
```python
import random
import time

import requests


class CourtSession:
    """Maintain a consistent session for court record research."""

    def __init__(self, proxy_manager, court_url):
        self.proxy_manager = proxy_manager
        self.court_url = court_url
        self.session = requests.Session()
        self.proxy = proxy_manager.get_proxy(court_url)
        self.session.proxies = self.proxy
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Encoding': 'gzip, deflate, br'
        })

    def search_cases(self, query_params):
        """Search for cases matching criteria."""
        response = self.session.get(
            f"{self.court_url}/search",
            params=query_params,
            timeout=30
        )
        return response

    def get_case_detail(self, case_id):
        """Fetch full case details, pausing first to keep request rates polite."""
        time.sleep(random.uniform(3, 6))
        response = self.session.get(
            f"{self.court_url}/case/{case_id}",
            timeout=30
        )
        return response

    def download_document(self, doc_url):
        """Download a case document."""
        time.sleep(random.uniform(2, 4))
        response = self.session.get(doc_url, timeout=120, stream=True)
        return response.content
```
Building Court Record Scrapers
Indonesia Supreme Court Decisions
The Indonesian Supreme Court decision directory is one of the largest legal databases in ASEAN:
```python
import requests
from bs4 import BeautifulSoup


class IndonesiaCourtScraper:
    """Scraper for the Indonesian Supreme Court decision directory."""

    BASE_URL = "https://putusan3.mahkamahagung.go.id"

    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def search_decisions(self, keyword=None, court_type=None,
                         date_from=None, date_to=None, page=1):
        """Search for court decisions."""
        proxy = self.proxy_manager.get_proxy(self.BASE_URL)
        session = requests.Session()
        session.proxies = proxy
        params = {'page': page}
        if keyword:
            params['q'] = keyword
        if court_type:
            params['cat'] = court_type
        response = session.get(
            f"{self.BASE_URL}/search",
            params=params,
            headers={
                'User-Agent': 'Mozilla/5.0 (Linux; Android 13)',
                'Accept-Language': 'id-ID,id;q=0.9'
            },
            timeout=30
        )
        return self.parse_search_results(response.text)

    @staticmethod
    def _text(node, selector):
        """Return the stripped text of the first selector match, or ''."""
        found = node.select_one(selector)
        return found.get_text(strip=True) if found else ''

    def parse_search_results(self, html):
        """Parse decision search results."""
        soup = BeautifulSoup(html, 'html.parser')
        decisions = []
        for entry in soup.select('.result-item'):
            link = entry.select_one('a')
            decisions.append({
                'title': self._text(entry, '.title'),
                'case_number': self._text(entry, '.case-number'),
                'court': self._text(entry, '.court'),
                'date': self._text(entry, '.date'),
                'link': link['href'] if link else ''
            })
        return decisions

    def fetch_decision_detail(self, decision_url):
        """Fetch full decision details."""
        proxy = self.proxy_manager.get_proxy(decision_url)
        session = requests.Session()
        session.proxies = proxy
        response = session.get(decision_url, timeout=30)
        return self.parse_decision(response.text)

    def parse_decision(self, html):
        """Parse a court decision page."""
        soup = BeautifulSoup(html, 'html.parser')
        detail = {
            'case_number': '',
            'court_level': '',
            'judges': [],
            'parties': {'plaintiff': '', 'defendant': ''},
            'decision_date': '',
            'decision_type': '',
            'summary': '',
            'full_text_url': ''
        }
        # Extract structured fields from the key/value table on the detail page
        info_table = soup.find('table', class_='table')
        if info_table:
            for row in info_table.find_all('tr'):
                cells = row.find_all('td')
                if len(cells) >= 2:
                    key = cells[0].get_text(strip=True).lower()
                    value = cells[1].get_text(strip=True)
                    if 'nomor' in key:        # case number
                        detail['case_number'] = value
                    elif 'tingkat' in key:    # court level
                        detail['court_level'] = value
                    elif 'hakim' in key:      # judge
                        detail['judges'].append(value)
                    elif 'tanggal' in key:    # date
                        detail['decision_date'] = value
        # Look for a PDF download link
        pdf_link = soup.find('a', href=lambda x: x and '.pdf' in str(x).lower())
        if pdf_link:
            detail['full_text_url'] = pdf_link['href']
        return detail
```
Singapore Supreme Court Judgments
```python
import requests
from bs4 import BeautifulSoup


class SingaporeCourtScraper:
    """Scraper for Singapore court judgments."""

    BASE_URL = "https://www.supremecourt.gov.sg"

    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def fetch_recent_judgments(self, year=None, page=1):
        """Fetch recently published judgments."""
        proxy = self.proxy_manager.get_proxy(self.BASE_URL)
        session = requests.Session()
        session.proxies = proxy
        response = session.get(
            f"{self.BASE_URL}/news/supreme-court-judgments",
            params={'year': year, 'page': page},
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
                'Accept-Language': 'en-SG,en;q=0.9'
            },
            timeout=30
        )
        return self.parse_judgment_list(response.text)

    def parse_judgment_list(self, html):
        """Parse the judgment listing page. The selector below is a
        placeholder; inspect the live page markup to fill in the real one."""
        soup = BeautifulSoup(html, 'html.parser')
        return [
            {'title': a.get_text(strip=True), 'url': a['href']}
            for a in soup.select('a.judgment-link')  # placeholder selector
        ]
```
Data Processing Pipeline
Entity Extraction
Extract key entities from court records:
```python
import re


class LegalEntityExtractor:
    """Extract entities from court record text."""

    def extract_parties(self, text):
        """Extract plaintiff and defendant names (English and Indonesian labels)."""
        parties = {'plaintiffs': [], 'defendants': []}
        # Pattern matching for common party-label formats
        plaintiff_pattern = r'(?:Plaintiff|Penggugat|Pemohon)[\s:]+([^\n]+)'
        defendant_pattern = r'(?:Defendant|Tergugat|Termohon)[\s:]+([^\n]+)'
        for match in re.finditer(plaintiff_pattern, text, re.IGNORECASE):
            parties['plaintiffs'].append(match.group(1).strip())
        for match in re.finditer(defendant_pattern, text, re.IGNORECASE):
            parties['defendants'].append(match.group(1).strip())
        return parties

    def extract_case_references(self, text):
        """Extract citations to other cases."""
        patterns = [
            r'\[\d{4}\]\s+\d+\s+SLR\s+\d+',        # Singapore Law Reports
            r'\[\d{4}\]\s+SGHC\s+\d+',             # Singapore High Court
            r'\[\d{4}\]\s+SGCA\s+\d+',             # Singapore Court of Appeal
            r'No\.\s+\d+/Pdt\.G/\d{4}/PN\s+\w+',   # Indonesian civil case
            r'G\.R\.\s+No\.\s+\d+',                # Philippine Supreme Court
        ]
        references = []
        for pattern in patterns:
            references.extend(m.group() for m in re.finditer(pattern, text))
        return references
```
Case Classification
```python
class CaseClassifier:
    """Classify court cases by subject matter using keyword counts."""

    CATEGORIES = {
        'commercial': ['contract', 'breach', 'commercial', 'business', 'perdagangan'],
        'property': ['property', 'land', 'lease', 'tenant', 'tanah', 'properti'],
        'employment': ['employment', 'termination', 'wages', 'labor', 'ketenagakerjaan'],
        'intellectual_property': ['patent', 'trademark', 'copyright', 'merek'],
        'tax': ['tax', 'revenue', 'assessment', 'pajak'],
        'criminal': ['criminal', 'fraud', 'corruption', 'pidana', 'korupsi'],
        'family': ['divorce', 'custody', 'maintenance', 'perceraian'],
        'constitutional': ['constitutional', 'fundamental rights', 'konstitusi'],
    }

    def classify(self, case_text):
        """Return (category, score) pairs sorted by descending keyword count."""
        text_lower = case_text.lower()
        scores = {}
        for category, keywords in self.CATEGORIES.items():
            # Keywords are lowercased to match the lowercased text
            score = sum(text_lower.count(kw.lower()) for kw in keywords)
            if score > 0:
                scores[category] = score
        if scores:
            return sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [('unclassified', 0)]
```
Ethical and Legal Considerations
Court record scraping requires careful attention to legal and ethical boundaries:
Public Records Principle
Court records are generally public documents, but access rules vary by jurisdiction:
- Singapore: Published judgments are freely accessible, but some case details may be restricted
- Indonesia: The Supreme Court actively publishes decisions online for public access
- Philippines: Supreme Court decisions are public, while lower court records may be restricted
- Malaysia: Some records are behind authentication
Data Sensitivity
Court records may contain sensitive personal information. Handle data responsibly:
- Do not extract or store unnecessary personal data
- Implement access controls on your database
- Comply with data protection laws (PDPA, UU PDP, DPA)
- Consider anonymizing personal details in your datasets
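Anonymization can start with regex redaction of obvious identifiers before records enter your dataset. A sketch covering a few common patterns — the Singapore NRIC, Indonesian NIK, and email patterns are illustrative assumptions, and a production pipeline needs jurisdiction-specific rules reviewed against the applicable data protection law:

```python
import re

# Illustrative redaction rules; extend per jurisdiction and data type
REDACTION_RULES = [
    (re.compile(r'\b[STFG]\d{7}[A-Z]\b'), '[NRIC]'),       # Singapore NRIC/FIN
    (re.compile(r'\b\d{16}\b'), '[NIK]'),                  # Indonesian national ID number
    (re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b'), '[EMAIL]'),
]


def redact(text):
    """Replace known personal identifiers with placeholder tokens."""
    for pattern, token in REDACTION_RULES:
        text = pattern.sub(token, text)
    return text
```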
Terms of Service
Review and respect the terms of service for each court record system. Some explicitly prohibit automated access, while others welcome it.
Scraping Etiquette
- Use conservative request rates (5-10 second delays)
- Avoid scraping during business hours when court staff rely on these systems
- Do not attempt to access restricted or sealed records
- Cache aggressively to minimize repeat requests
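The caching recommendation above can be a thin wrapper that checks a local store before fetching. A minimal sketch with an on-disk JSON cache keyed by a hash of the URL (the directory layout and file format are arbitrary choices):

```python
import hashlib
import json
from pathlib import Path


class ResponseCache:
    """Cache fetched pages on disk so repeat runs never re-request them."""

    def __init__(self, cache_dir='court_cache'):
        self.dir = Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def _path(self, url):
        return self.dir / (hashlib.sha256(url.encode()).hexdigest() + '.json')

    def get(self, url):
        """Return the cached body for a URL, or None on a miss."""
        path = self._path(url)
        if path.exists():
            return json.loads(path.read_text())['body']
        return None

    def put(self, url, body):
        self._path(url).write_text(json.dumps({'url': url, 'body': body}))

    def fetch(self, url, fetcher):
        """Return cached content, calling fetcher(url) only on a miss."""
        cached = self.get(url)
        if cached is not None:
            return cached
        body = fetcher(url)
        self.put(url, body)
        return body
```

Here `fetcher` would typically be a proxied session request; because cached hits never touch the network, re-runs of a crawl put no load on the court system.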
DataResearchTools for Legal Research
DataResearchTools provides the proxy infrastructure needed for court record research across Southeast Asia:
- Multi-country mobile proxies for accessing court systems in every ASEAN jurisdiction
- Sticky sessions for navigating complex court record interfaces
- High trust IPs that are not flagged by government security systems
- Reliable connectivity for consistent data collection
Our proxy network supports the sensitive nature of legal research while providing the technical capabilities needed for effective data extraction.
Conclusion
Court record scraping is a specialized but high-value application of proxy-powered data collection. The legal databases across Southeast Asia contain intelligence that supports due diligence, litigation research, compliance monitoring, and investigative journalism.
With DataResearchTools providing reliable proxy infrastructure across all ASEAN markets, organizations can build systematic court record monitoring capabilities. Start with the jurisdictions most relevant to your needs, build purpose-specific parsers, and always prioritize ethical data handling practices. The legal intelligence you extract will become an invaluable asset for risk management and decision-making.
Related Reading
- Best Proxies for Government Data Scraping
- Building a Government Contract Intelligence System with Proxies
- How AI + Proxies Are Transforming Drug Discovery Data Pipelines
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)