How to Scrape Government Budget and Spending Data

Government budget and spending data reveals where public money flows, which sectors receive priority investment, and how effectively funds are used. For businesses, investors, researchers, and civil society organizations across Southeast Asia, this data informs major strategic decisions.

This guide explains how to collect government budget and spending data systematically using proxy-powered scraping infrastructure.

The Value of Government Fiscal Data

For Businesses

Understanding government spending patterns helps businesses:

  • Identify growing sectors receiving increased government investment
  • Time market entry with major infrastructure spending cycles
  • Forecast procurement opportunities based on budget allocations
  • Assess country-level economic priorities for strategic planning

For Investors

Government fiscal data provides insights into:

  • Economic stability and fiscal discipline
  • Infrastructure investment pipelines
  • Sector-level growth trajectories
  • Currency and monetary policy direction

For Researchers and NGOs

Fiscal transparency data enables:

  • Public accountability and anti-corruption analysis
  • Aid effectiveness evaluation
  • Social spending adequacy assessments
  • Cross-country fiscal policy comparison

Government Budget Data Sources in ASEAN

Singapore

  • Ministry of Finance Budget Portal: Annual budget statements, revenue and expenditure details
  • data.gov.sg: Government financial datasets in machine-readable formats
  • Department of Statistics: Economic and fiscal statistics
  • AGO (Auditor-General’s Office): Audit reports on government spending

Singapore's fiscal data is among the most transparent and accessible in ASEAN, with structured datasets available through the open data portal.

Indonesia

  • Kementerian Keuangan (Ministry of Finance): Budget documents, fiscal reports
  • DJPB (Direktorat Jenderal Perbendaharaan): Treasury and spending data
  • data.go.id: Open data portal with fiscal datasets
  • BPK (Badan Pemeriksa Keuangan): Supreme audit institution reports
  • APBN (Anggaran Pendapatan dan Belanja Negara): National budget documents

Indonesia publishes extensive budget data, though much of it is in PDF format requiring additional processing.

Philippines

  • DBM (Department of Budget and Management): Budget documents, allocation data
  • COA (Commission on Audit): Annual audit reports
  • BTr (Bureau of the Treasury): Revenue and disbursement data
  • NEDA (National Economic and Development Authority): Development spending data

Thailand

  • Bureau of the Budget: Annual budget documents
  • Comptroller General’s Department: Disbursement data
  • NESDC (National Economic and Social Development Council): Development expenditure

Malaysia

  • Treasury (Perbendaharaan): Federal budget documents
  • Accountant General’s Department: Government financial statements
  • Auditor General: Audit reports on federal spending

Vietnam

  • Ministry of Finance: Budget documents and fiscal reports
  • State Treasury: Budget execution data
  • State Audit Office: Government audit reports

Building a Fiscal Data Collection System

Architecture

import requests

# The five parser classes referenced below are placeholders for
# format-specific parsers you supply; each must expose
# parse(payload, source_config).

class FiscalDataCollector:
    """Collect government budget and spending data across ASEAN."""

    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager
        self.parsers = {
            'html_table': HTMLTableParser(),
            'pdf_document': PDFDocumentParser(),
            'json_api': JSONAPIParser(),
            'csv_download': CSVDownloadParser(),
            'excel_download': ExcelDownloadParser()
        }

    def collect_from_source(self, source_config):
        """Collect fiscal data from a configured source."""
        country = source_config['country']
        # Expected to return a requests-style mapping: {'http': ..., 'https': ...}
        proxy = self.proxy_manager.get_proxy_for_country(country)

        session = requests.Session()
        session.proxies = proxy
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            # _get_language (not shown) maps a country code to an Accept-Language value
            'Accept-Language': self._get_language(country)
        })

        data_format = source_config.get('format', 'html_table')
        parser = self.parsers[data_format]

        try:
            response = session.get(
                source_config['url'],
                timeout=60
            )
            # Treat HTTP 4xx/5xx as failures instead of parsing an error page
            response.raise_for_status()

            # Binary formats need raw bytes; HTML and JSON are parsed as text
            if data_format in ['pdf_document', 'excel_download', 'csv_download']:
                return parser.parse(response.content, source_config)
            else:
                return parser.parse(response.text, source_config)

        except Exception as e:
            # Report the failure to the caller rather than aborting a batch run
            return {'error': str(e), 'source': source_config['name']}
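
The collector above relies on a proxy_manager object and a source_config dictionary that this guide does not define. The following is a minimal sketch of both, not a reference implementation. The ProxyManager class, the gateway host proxy.example.com, the port, and the "-country-xx" username scheme are all hypothetical; substitute whatever your proxy provider documents.

```python
from dataclasses import dataclass


@dataclass
class ProxyManager:
    """Hypothetical proxy manager; host, port and username scheme are placeholders."""
    username: str
    password: str
    host: str = "proxy.example.com"  # placeholder, not a real gateway
    port: int = 8000                 # placeholder port

    def get_proxy_for_country(self, country):
        """Return a requests-style proxies mapping for a country code."""
        # Assumption: the provider picks the exit country from the username.
        user = f"{self.username}-country-{country.lower()}"
        url = f"http://{user}:{self.password}@{self.host}:{self.port}"
        return {"http": url, "https": url}


# Example source configuration using the keys collect_from_source reads
SOURCE_CONFIG = {
    "name": "example-budget-table",
    "country": "SG",
    "url": "https://example.com/budget.html",  # placeholder URL
    "format": "html_table",
}
```

With these in place, a call would look like FiscalDataCollector(ProxyManager("user", "secret")).collect_from_source(SOURCE_CONFIG), assuming the parser classes exist.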

Handling PDF Budget Documents

Many government budget documents are published as PDFs:

import io

import requests
import tabula  # tabula-py wraps tabula-java, so a Java runtime is required
from PyPDF2 import PdfReader  # PyPDF2 is now maintained under the name "pypdf"

class PDFBudgetParser:
    """Parse budget data from government PDF documents."""

    def extract_budget_tables(self, pdf_content):
        """Extract tabular data from budget PDFs."""
        # Try tabula first; lattice mode suits tables with ruled cell borders
        try:
            tables = tabula.read_pdf(
                io.BytesIO(pdf_content),
                pages='all',
                multiple_tables=True,
                lattice=True
            )
        except Exception:
            tables = []  # fall through to text extraction on any tabula failure

        if tables:
            return self._process_tables(tables)

        # Fallback to text extraction when tabula fails or finds no tables
        reader = PdfReader(io.BytesIO(pdf_content))
        text = "\n".join(
            (page.extract_text() or "") for page in reader.pages
        )

        # _parse_text_tables (not shown) is source-specific
        return self._parse_text_tables(text)

    def _process_tables(self, tables):
        """Process extracted tables into structured budget data."""
        budget_items = []
        for table in tables:
            for _, row in table.iterrows():
                # _parse_budget_row (not shown) maps one row to a budget item
                item = self._parse_budget_row(row)
                if item:
                    budget_items.append(item)
        return budget_items

    def download_budget_pdf(self, url, proxy_manager, country):
        """Download a budget PDF through proxy; returns bytes or None."""
        proxy = proxy_manager.get_proxy_for_country(country)
        # Long timeout: budget PDFs can run to hundreds of pages
        response = requests.get(url, proxies=proxy, timeout=120)
        if response.status_code == 200:
            return response.content
        return None
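
The parser above falls back to a _parse_text_tables helper that is left undefined. One possible shape for it is sketched below, assuming each budget line in the extracted text consists of a label followed by a single amount at the end of the line. The function name parse_text_tables and the output keys are illustrative, and real documents with multi-column tables or wrapped labels will need a more specific pattern.

```python
import re

# A label containing at least one letter, then whitespace, then a number
# made of digits with optional "." or "," separators at the end of the line.
ROW_PATTERN = re.compile(
    r"^(?P<label>.*?[^\W\d_].*?)\s+(?P<amount>\d[\d.,]*)$"
)


def parse_text_tables(text):
    """Return [{'category': ..., 'amount': ...}] for lines ending in a number."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if not match:
            continue  # headings, blank lines and free text are skipped
        rows.append({
            # Trim dot leaders and colons left over from the PDF layout
            "category": match.group("label").strip(" .:"),
            # Kept as text; locale-aware number parsing is a separate step
            "amount": match.group("amount"),
        })
    return rows
```

Amounts are deliberately returned as strings, because the thousands and decimal separators differ between countries and should be interpreted per source.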

Scraping Open Data APIs

Where available, use structured APIs for cleaner data:

import time

import requests

class OpenDataFiscalScraper:
    """Scrape fiscal data from open data portals."""

    def scrape_singapore_budget(self, proxy_manager):
        """Fetch budget data from data.gov.sg."""
        proxy = proxy_manager.get_proxy_for_country('SG')

        # data.gov.sg provides structured API access.
        # The identifiers below are illustrative; confirm the actual
        # resource IDs on each dataset's page before running this.
        datasets = [
            'government-fiscal-position-annual',
            'government-operating-revenue',
            'government-operating-expenditure',
            'government-development-expenditure'
        ]

        fiscal_data = {}
        for dataset_id in datasets:
            response = requests.get(
                "https://data.gov.sg/api/action/datastore_search",
                params={
                    'resource_id': dataset_id,
                    'limit': 1000
                },
                proxies=proxy,
                timeout=30
            )

            # Datasets that fail to load are skipped rather than aborting the run
            if response.status_code == 200:
                fiscal_data[dataset_id] = response.json().get('result', {}).get('records', [])

            time.sleep(2)  # pause between requests to avoid hammering the portal

        return fiscal_data

    def scrape_indonesia_budget(self, proxy_manager):
        """Fetch budget data from Indonesian sources."""
        proxy = proxy_manager.get_proxy_for_country('ID')
        session = requests.Session()
        session.proxies = proxy

        # DJPB provides some data via web interface
        response = session.get(
            "https://www.djpb.kemenkeu.go.id/portal/id/data-belanja.html",
            headers={'Accept-Language': 'id-ID,id;q=0.9'},
            timeout=30
        )
        response.raise_for_status()  # do not parse an error page as budget data

        # _parse_indonesian_budget (not shown) is specific to the page layout
        return self._parse_indonesian_budget(response.text)
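
The Singapore example requests a single page of up to 1000 records, so longer datasets would be truncated. A sketch of offset-based paging follows, under the assumption that the endpoint honors the CKAN-style offset parameter alongside limit. The fetch_page callable is hypothetical: it stands for a function you write that performs the HTTP request for one page and returns the parsed "result" object.

```python
def fetch_all_records(resource_id, fetch_page, page_size=1000, max_pages=100):
    """Collect all records by advancing the offset until a short page arrives.

    fetch_page(resource_id, offset, limit) is a caller-supplied function
    (hypothetical here) that returns a dict with a 'records' list.
    """
    records = []
    offset = 0
    for _ in range(max_pages):  # hard cap guards against endless loops
        result = fetch_page(resource_id, offset, page_size)
        page = result.get("records", [])
        records.extend(page)
        if len(page) < page_size:
            break  # a short or empty page means the dataset is exhausted
        offset += page_size
    return records
```

Passing the fetcher in as an argument keeps the paging logic independent of proxies and sessions, and makes it easy to exercise with a stub before pointing it at a live portal.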

Data Normalization

Unified Budget Schema

Normalize budget data from different countries into a comparable format:

from datetime import datetime, timezone

class BudgetNormalizer:
    """Normalize budget data across countries into comparable format."""

    # Illustrative static rates (local currency to USD). For real analysis,
    # use rates that match each item's fiscal period.
    EXCHANGE_RATES = {
        'SGD': 0.74, 'IDR': 0.000063, 'PHP': 0.018,
        'THB': 0.028, 'MYR': 0.21, 'VND': 0.000040
    }

    def normalize(self, budget_item, country):
        """Normalize a budget item."""
        # _get_currency and _parse_amount (not shown) are helper methods
        currency = self._get_currency(country)
        amount_local = self._parse_amount(budget_item.get('amount', 0))
        rate = self.EXCHANGE_RATES.get(currency)

        return {
            'country': country,
            'fiscal_year': budget_item.get('fiscal_year'),
            'ministry_agency': budget_item.get('ministry', ''),
            'category': self._normalize_category(
                budget_item.get('category', ''), country
            ),
            'subcategory': budget_item.get('subcategory', ''),
            'budget_type': budget_item.get('type', ''),  # allocation/spending
            'amount_local': amount_local,
            'currency': currency,
            # None when no rate is known, rather than silently assuming 1:1 with USD
            'amount_usd': amount_local * rate if rate is not None else None,
            'period': budget_item.get('period', 'annual'),
            'source': budget_item.get('source', ''),
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            'scraped_at': datetime.now(timezone.utc).isoformat()
        }

    def _normalize_category(self, category, country):
        """Map country-specific categories to unified taxonomy."""
        category_map = {
            'defense': ['defense', 'pertahanan', 'military'],
            'education': ['education', 'pendidikan', 'schools'],
            'healthcare': ['health', 'kesehatan', 'medical'],
            'infrastructure': ['infrastructure', 'public works', 'pekerjaan umum'],
            'social_welfare': ['social', 'welfare', 'sosial', 'bantuan'],
            'economic': ['economic', 'trade', 'industry', 'ekonomi'],
            'debt_service': ['debt', 'interest', 'utang'],
        }

        category_lower = category.lower()
        for unified, keywords in category_map.items():
            if any(kw in category_lower for kw in keywords):
                return unified
        return 'other'
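
The normalizer calls a _parse_amount helper that is not shown. Amount strings differ by country: some sources write 1,234.56 while others write 1.234,56. The sketch below is one way to handle that, assuming the caller knows which decimal mark a source uses. The function name and the decimal_sep parameter are illustrative, and unit words such as "million" or "juta" are not handled.

```python
import re


def parse_amount(raw, decimal_sep="."):
    """Convert a localized amount string or number to float; None if unparseable.

    decimal_sep is the decimal mark used by the source: "." for formats
    such as 1,234.56 and "," for formats such as 1.234,56.
    """
    if isinstance(raw, (int, float)):
        return float(raw)
    text = str(raw).strip()
    # Accounting notation: parentheses denote a negative value
    negative = text.startswith("(") and text.endswith(")")
    # Keep digits, separators and minus; this drops currency codes and spaces.
    # Stray separators at either end (e.g. from "Rp.") are trimmed.
    cleaned = re.sub(r"[^\d.,-]", "", text).strip(".,")
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value
```

Returning None for unparseable input lets the caller decide whether to skip the row or log it, instead of recording a misleading zero.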

Analysis and Visualization

Spending Trend Analysis

class FiscalAnalyzer:
    """Analyze government fiscal data."""

    def spending_trends(self, db, country, category=None, years=10):
        """Analyze spending trends over time."""
        data = db.get_spending_history(
            country=country, category=category, years=years
        )

        return {
            'country': country,
            'category': category or 'all',
            'years': [d['fiscal_year'] for d in data],
            'amounts_local': [d['amount_local'] for d in data],
            'amounts_usd': [d['amount_usd'] for d in data],
            'yoy_growth': self._calculate_growth(data),
            'cagr': self._calculate_cagr(data),
            'as_pct_of_total': self._calculate_share(data, country)
        }

    def cross_country_comparison(self, db, category, year):
        """Compare spending across countries for a category."""
        countries = ['SG', 'ID', 'PH', 'TH', 'MY', 'VN']
        comparison = []

        for country in countries:
            data = db.get_spending(
                country=country, category=category, year=year
            )
            if data:
                comparison.append({
                    'country': country,
                    'amount_usd': data['amount_usd'],
                    'per_capita_usd': data['amount_usd'] / self._get_population(country),
                    'as_pct_gdp': data['amount_usd'] / self._get_gdp(country) * 100
                })

        return sorted(comparison, key=lambda x: x['per_capita_usd'], reverse=True)

    def identify_growth_sectors(self, db, country, years=5):
        """Identify sectors with fastest budget growth."""
        categories = db.get_all_categories(country)
        growth_data = []

        for cat in categories:
            trend = self.spending_trends(db, country, cat, years)
            if trend['cagr'] is not None:
                growth_data.append({
                    'category': cat,
                    'cagr': trend['cagr'],
                    'latest_amount': trend['amounts_usd'][-1] if trend['amounts_usd'] else 0
                })

        return sorted(growth_data, key=lambda x: x['cagr'], reverse=True)
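
The analyzer depends on _calculate_growth and _calculate_cagr, which are not defined above. A minimal sketch of both calculations follows, written as standalone functions. It assumes the records are sorted by ascending fiscal_year, that fiscal_year is an integer, and that the amount field is numeric. The field parameter is an addition for illustration; local-currency amounts are used by default so that exchange-rate movements do not distort the growth figures.

```python
def calculate_growth(data, field="amount_local"):
    """Year-over-year growth in percent for each consecutive pair of records.

    Returns one value per pair; None where the earlier amount is zero.
    """
    amounts = [d[field] for d in data]
    growth = []
    for previous, current in zip(amounts, amounts[1:]):
        if previous:
            growth.append((current - previous) / previous * 100)
        else:
            growth.append(None)  # growth from zero is undefined
    return growth


def calculate_cagr(data, field="amount_local"):
    """Compound annual growth rate in percent between first and last record.

    Returns None when fewer than two records exist or the rate is undefined.
    """
    if len(data) < 2:
        return None
    first, last = data[0], data[-1]
    periods = last["fiscal_year"] - first["fiscal_year"]
    if periods <= 0 or first[field] <= 0 or last[field] < 0:
        return None
    return ((last[field] / first[field]) ** (1 / periods) - 1) * 100
```

For example, amounts of 100, 110 and 121 over three consecutive fiscal years give year-over-year growth of about 10% for each step and a CAGR of about 10%.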

Budget Execution Monitoring

Track how well governments execute their budgets:

def analyze_budget_execution(db, country, fiscal_year):
    """Compare budget allocations to actual spending."""
    allocations = db.get_allocations(country, fiscal_year)
    spending = db.get_actual_spending(country, fiscal_year)

    execution = []
    for category in allocations:
        allocated = allocations[category]
        spent = spending.get(category, 0)

        execution.append({
            'category': category,
            'allocated': allocated,
            'spent': spent,
            'execution_rate': (spent / allocated * 100) if allocated > 0 else 0,
            'variance': spent - allocated,
            'under_spent': allocated - spent if spent < allocated else 0,
            'over_spent': spent - allocated if spent > allocated else 0
        })

    return sorted(execution, key=lambda x: x['execution_rate'])
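
The list returned by analyze_budget_execution can be screened for categories that deviate noticeably from plan. The helper below is a sketch of that step. The function name and the 80% and 105% thresholds are illustrative assumptions, not official benchmarks, and categories with no allocation are skipped because their execution rate of 0 is not meaningful.

```python
def flag_execution_outliers(execution, low=80.0, high=105.0):
    """Split execution records into under- and over-executing categories.

    execution is a list of dicts with 'category', 'allocated' and
    'execution_rate' keys. low and high are percentage thresholds.
    """
    flagged = {"under_executed": [], "over_executed": []}
    for row in execution:
        if row["allocated"] <= 0:
            continue  # no allocation, so the rate carries no information
        if row["execution_rate"] < low:
            flagged["under_executed"].append(row["category"])
        elif row["execution_rate"] > high:
            flagged["over_executed"].append(row["category"])
    return flagged
```

Persistent under-execution in a category can indicate delayed projects or procurement bottlenecks, which is useful context when reading headline allocation figures.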

Use Cases

Market Entry Timing

Use budget allocation data to time market entry. A significant budget increase for digital government services signals opportunities for IT vendors. Rising healthcare spending indicates demand for medical equipment and services.

Infrastructure Investment Analysis

Track infrastructure budget allocations across countries to identify where major construction and development opportunities will emerge. Combine budget data with procurement data for a complete picture.

Policy Impact Assessment

Budget changes reflect policy priorities. Track how budget allocations shift in response to policy announcements to understand real government priorities versus stated intentions.

Economic Forecasting

Government spending is a significant component of GDP in ASEAN economies. Fiscal data provides early indicators of economic direction and growth potential.

DataResearchTools for Fiscal Data Collection

DataResearchTools provides the proxy infrastructure for collecting fiscal data across ASEAN:

  • Multi-country access to ministry of finance websites, open data portals, and statistical agencies
  • High-bandwidth connections for downloading large PDF budget documents
  • Reliable scheduling for regular data collection cycles
  • Native mobile IPs that access government portals without restrictions
  • Cost-effective plans for ongoing fiscal monitoring operations

Conclusion

Government budget and spending data is a cornerstone of business intelligence in Southeast Asia. By building a systematic collection pipeline powered by DataResearchTools’ proxy infrastructure, organizations can transform scattered fiscal data from multiple countries into actionable intelligence for investment decisions, market strategy, and policy analysis.

Start with the countries and sectors most relevant to your operations, build robust parsers for each data format you encounter, and gradually expand to create a comprehensive fiscal intelligence capability across ASEAN.

