Build a Search Engine with Web Scraping

building your own search engine sounds like a massive undertaking. in reality, you can build a functional search engine over a specific domain or niche in a weekend using Python, web scraping, and a few well-chosen libraries. this guide walks you through the entire process from crawling to indexing to querying.

the result will be a vertical search engine that crawls specific websites, indexes their content, and lets you run full-text queries against everything it has collected. you will also learn how to integrate proxy rotation so your crawler can operate reliably at scale without getting blocked.

Why Build a Custom Search Engine

Google is great for general search, but it falls short when you need to search across a specific set of sites or data sources. a custom search engine gives you full control over what gets indexed, how results are ranked, and what metadata you capture.

common use cases include:

  • searching across competitor product catalogs
  • building an internal knowledge base from public documentation
  • creating a research tool that indexes academic sources
  • monitoring specific forums or communities for mentions of your brand

Architecture Overview

our search engine has four main components:

┌──────────────────────┐
│   Query Interface    │  ← Flask web app
├──────────────────────┤
│   Search/Ranking     │  ← TF-IDF + BM25
├──────────────────────┤
│   Inverted Index     │  ← Whoosh or custom
├──────────────────────┤
│   Web Crawler        │  ← Scrapy + proxies
└──────────────────────┘

the crawler feeds pages into the indexer, the indexer builds a searchable index, and the query interface lets users search and get ranked results.

Prerequisites

you will need Python 3.10 or later and the following packages:

pip install scrapy beautifulsoup4 whoosh flask requests lxml

for proxy integration, you will also want:

pip install scrapy-rotating-proxies

Step 1: Build the Web Crawler

the crawler is the foundation. it visits pages, extracts text content, and stores it for indexing. here is a Scrapy spider that crawls a target domain:

import scrapy
from urllib.parse import urlparse
import json
import hashlib


class SearchCrawler(scrapy.Spider):
    name = "search_crawler"

    custom_settings = {
        "DEPTH_LIMIT": 3,
        "DOWNLOAD_DELAY": 1.5,
        "CONCURRENT_REQUESTS": 8,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "SearchBot/1.0 (+https://yourdomain.com/bot)",
    }

    def __init__(self, domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if domains:
            self.start_urls = [f"https://{d}" for d in domains.split(",")]
            self.allowed_domains = domains.split(",")
        else:
            self.start_urls = ["https://example.com"]
            self.allowed_domains = ["example.com"]

    def parse(self, response):
        # skip non-html responses
        content_type = response.headers.get("Content-Type", b"").decode()
        if "text/html" not in content_type:
            return

        # extract page data
        title = response.css("title::text").get("").strip()

        # drop non-content elements (scripts, styles, navigation chrome)
        # from the parse tree so they don't pollute the indexed text
        body = response.css("body")
        for selector in ["script", "style", "nav", "footer", "header"]:
            body.css(selector).drop()

        # drop() mutates the shared parse tree, so this extraction only
        # sees the remaining content
        text = " ".join(response.css("body *::text").getall())
        text = " ".join(text.split())  # normalize whitespace

        # extract meta description
        meta_desc = response.css(
            'meta[name="description"]::attr(content)'
        ).get("")

        page_id = hashlib.md5(response.url.encode()).hexdigest()

        yield {
            "id": page_id,
            "url": response.url,
            "title": title,
            "meta_description": meta_desc,
            "content": text[:50000],  # cap content length
            "domain": urlparse(response.url).netloc,
        }

        # follow links within allowed domains
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

run the crawler and save results to a JSON Lines file (the -O flag overwrites any previous output, and the format is inferred from the .jsonl extension):

scrapy runspider crawler.py -a domains="example.com,docs.example.com" \
  -O crawled_pages.jsonl

Step 2: Add Proxy Rotation to the Crawler

when crawling multiple sites or large domains, you will likely hit rate limits and IP blocks. adding proxy rotation fixes this problem. update your Scrapy settings:

custom_settings = {
    "DEPTH_LIMIT": 3,
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": True,
    "ROTATING_PROXY_LIST_PATH": "proxies.txt",
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}

create a proxies.txt file with your proxy list:

http://user:pass@proxy1.example.com:8080
http://user:pass@proxy2.example.com:8080
socks5://user:pass@proxy3.example.com:1080

if you are using a proxy provider with a rotating gateway, you do not need the rotating-proxies middleware at all. Scrapy's built-in HttpProxyMiddleware picks up the standard proxy environment variables, so a single endpoint is enough:

export http_proxy="http://user:pass@gate.provider.com:7777"
export https_proxy="http://user:pass@gate.provider.com:7777"

alternatively, set request.meta["proxy"] on each request if you prefer to keep the configuration inside the spider.

Step 3: Build the Inverted Index

the inverted index maps every word to the documents that contain it. this is what makes search fast. we will use the Whoosh library, which provides full-text indexing in pure Python:

import os
import json
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, STORED
from whoosh.analysis import StemmingAnalyzer


def create_index(index_dir="search_index"):
    """create the search index schema."""
    schema = Schema(
        doc_id=ID(stored=True, unique=True),
        url=STORED(),
        title=TEXT(stored=True, analyzer=StemmingAnalyzer()),
        meta_description=TEXT(stored=True),
        # content is stored so the searcher can build highlighted snippets
        content=TEXT(stored=True, analyzer=StemmingAnalyzer()),
        domain=STORED(),
    )

    if not os.path.exists(index_dir):
        os.makedirs(index_dir)

    return index.create_in(index_dir, schema)


def index_documents(jsonl_path, index_dir="search_index"):
    """index all crawled documents."""
    ix = create_index(index_dir)
    writer = ix.writer()

    count = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            writer.add_document(
                doc_id=doc["id"],
                url=doc["url"],
                title=doc["title"],
                meta_description=doc.get("meta_description", ""),
                content=doc["content"],
                domain=doc["domain"],
            )
            count += 1
            if count % 1000 == 0:
                print(f"indexed {count} documents...")

    writer.commit()
    print(f"indexing complete. {count} documents indexed.")
    return ix


if __name__ == "__main__":
    index_documents("crawled_pages.jsonl")

the StemmingAnalyzer handles word stemming automatically, so searching for “running” will also match “runs” and “run.” this significantly improves search quality.
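
you can verify the stemming behavior directly by running the analyzer on a sample string (a quick interactive check, not part of the pipeline):

from whoosh.analysis import StemmingAnalyzer

analyzer = StemmingAnalyzer()
# prints the stemmed tokens: ['run', 'run', 'run']
print([token.text for token in analyzer("running runs run")])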

Step 4: Implement Search and Ranking

now you need a search function that queries the index and returns ranked results. Whoosh supports BM25 ranking out of the box:

from whoosh import index
from whoosh.qparser import MultifieldParser, OrGroup
from whoosh.scoring import BM25F


def search(query_string, index_dir="search_index", limit=20):
    """search the index and return ranked results."""
    ix = index.open_dir(index_dir)

    # search across title and content fields
    # title matches get higher weight
    parser = MultifieldParser(
        ["title", "content"],
        schema=ix.schema,
        group=OrGroup,
        fieldboosts={"title": 3.0, "content": 1.0},
    )

    query = parser.parse(query_string)

    results = []
    with ix.searcher(weighting=BM25F()) as searcher:
        hits = searcher.search(query, limit=limit)

        for hit in hits:
            results.append({
                "title": hit["title"],
                "url": hit["url"],
                "domain": hit["domain"],
                "score": hit.score,
                "snippet": hit.highlights("content", top=3),
                "meta_description": hit.get("meta_description", ""),
            })

    return results


if __name__ == "__main__":
    results = search("web scraping python tutorial")
    for i, r in enumerate(results, 1):
        print(f"{i}. [{r['score']:.2f}] {r['title']}")
        print(f"   {r['url']}")
        print(f"   {r['snippet'][:200]}")
        print()

Step 5: Build a Web Interface

a simple Flask app turns your search engine into something you can actually use in a browser:

from flask import Flask, request, render_template_string
from search import search

app = Flask(__name__)

SEARCH_TEMPLATE = """
<!DOCTYPE html>
<html>
<head><title>Custom Search Engine</title></head>
<body>
  <h1>Custom Search</h1>
  <form method="GET" action="/search">
    <input type="text" name="q" value="{{ query }}" size="60">
    <button type="submit">Search</button>
  </form>
  {% if results %}
  <p>{{ results|length }} results for "{{ query }}"</p>
  {% for r in results %}
  <div style="margin-bottom: 20px;">
    <a href="{{ r.url }}"><strong>{{ r.title }}</strong></a>
    <br><small>{{ r.url }} | Score: {{ "%.2f"|format(r.score) }}</small>
    <br>{{ r.meta_description or r.snippet }}
  </div>
  {% endfor %}
  {% endif %}
</body>
</html>
"""

@app.route("/")
def home():
    return render_template_string(SEARCH_TEMPLATE, query="", results=[])

@app.route("/search")
def search_page():
    query = request.args.get("q", "")
    results = search(query) if query else []
    return render_template_string(
        SEARCH_TEMPLATE, query=query, results=results
    )

if __name__ == "__main__":
    app.run(debug=True, port=5000)

Step 6: Add Scheduled Re-crawling

a search engine is only useful if its index stays fresh. set up a scheduled re-crawl using a simple Python script:

import subprocess
import time
from datetime import datetime


def recrawl_and_reindex(domains, interval_hours=24):
    """recrawl target domains and rebuild the index."""
    while True:
        print(f"[{datetime.now()}] starting recrawl...")

        # run the crawler; -O overwrites the previous output so repeated
        # crawls don't append duplicate records
        subprocess.run([
            "scrapy", "runspider", "crawler.py",
            "-a", f"domains={domains}",
            "-O", "crawled_pages.jsonl",
            "--set", "LOG_LEVEL=WARNING",
        ])

        # rebuild the index
        from indexer import index_documents
        index_documents("crawled_pages.jsonl")

        print(f"[{datetime.now()}] recrawl complete. "
              f"next run in {interval_hours} hours.")
        time.sleep(interval_hours * 3600)


if __name__ == "__main__":
    recrawl_and_reindex("example.com,docs.example.com")

for production use, replace the while loop with a cron job or a task scheduler like Celery.
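
if you go the cron route, a single nightly entry replaces the loop entirely; the path and schedule below are placeholders:

# hypothetical crontab entry: recrawl and rebuild the index every night at 03:00
0 3 * * * cd /path/to/project && scrapy runspider crawler.py -a domains="example.com,docs.example.com" -O crawled_pages.jsonl && python indexer.py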

Step 7: Improve Ranking Quality

the basic BM25 ranking works well but you can improve it. here are practical enhancements:

URL Depth Scoring

pages closer to the root of a domain tend to be more important:

from urllib.parse import urlparse

def url_depth_score(url):
    """assign higher scores to shallower URLs."""
    path = urlparse(url).path.strip("/")
    depth = len(path.split("/")) if path else 0
    return 1.0 / (1 + depth * 0.3)

Freshness Boost

if you track crawl timestamps, give newer pages a small ranking boost:

from datetime import datetime, timedelta

def freshness_score(crawled_at):
    """boost pages crawled recently."""
    age = datetime.now() - crawled_at
    if age < timedelta(days=7):
        return 1.2
    elif age < timedelta(days=30):
        return 1.0
    else:
        return 0.8

Domain Authority

if you want certain domains to rank higher, add a domain weight:

DOMAIN_WEIGHTS = {
    "docs.python.org": 1.5,
    "stackoverflow.com": 1.3,
    "realpython.com": 1.2,
}

def domain_score(domain):
    return DOMAIN_WEIGHTS.get(domain, 1.0)

combine all three into a final score:

def compute_final_score(bm25_score, url, domain, crawled_at):
    return (
        bm25_score
        * url_depth_score(url)
        * freshness_score(crawled_at)
        * domain_score(domain)
    )
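
here is a minimal sketch of applying this as a re-ranking pass over the output of search(); it assumes each result dict also carries a crawled_at timestamp, which you would need to capture during indexing:

def rerank(results):
    """re-rank search results with the combined score."""
    for r in results:
        r["final_score"] = compute_final_score(
            r["score"], r["url"], r["domain"], r["crawled_at"]
        )
    return sorted(results, key=lambda r: r["final_score"], reverse=True)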

Scaling Considerations

for a search engine indexing tens of thousands of pages, Whoosh works fine. if you need to scale beyond that, consider these alternatives:

  • Elasticsearch or OpenSearch for distributed indexing and search across millions of documents
  • Meilisearch for a lightweight, typo-tolerant search experience
  • SQLite FTS5 if you want something embedded without external dependencies

the crawling layer scales independently. you can run multiple Scrapy spiders in parallel across different domains, each writing to the same output file or a shared queue.
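
if you go the SQLite FTS5 route mentioned above, the embedded setup is only a few lines. a minimal sketch, assuming your Python build's SQLite includes the FTS5 extension (most modern builds do) and using illustrative table and column names:

import sqlite3

conn = sqlite3.connect("search.db")
# FTS5 virtual table holding roughly the same fields as the Whoosh schema
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(title, content, url UNINDEXED)"
)
conn.execute(
    "INSERT INTO pages (title, content, url) VALUES (?, ?, ?)",
    ("Example page", "some crawled text about web scraping", "https://example.com"),
)
conn.commit()

# bm25() returns a relevance score where lower means a better match
for url, title, score in conn.execute(
    "SELECT url, title, bm25(pages) FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("scraping",),
):
    print(url, title, score)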

Proxy Strategy for Search Engine Crawling

when crawling for a search engine, you need to think about proxy usage differently than one-off scraping:

  1. use residential proxies for sites with aggressive anti-bot protection. datacenter proxies work fine for most documentation sites and blogs.
  2. implement per-domain rate limiting. even with proxies, hammering a site with 100 concurrent requests is poor practice and will get your IP ranges flagged.
  3. respect robots.txt. your crawler should honor the crawl-delay directive and disallow rules.
  4. rotate user agents alongside IPs. some sites fingerprint by user agent string, not just IP.

one way to capture these rules is a per-domain configuration, as in the example below:

# example proxy configuration for different domain types
PROXY_CONFIG = {
    "default": {
        "proxy": "http://user:pass@dc-proxy.provider.com:8080",
        "delay": 2.0,
    },
    "aggressive_antibot": {
        "proxy": "http://user:pass@residential.provider.com:8080",
        "delay": 3.0,
    },
}
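
to actually apply this, a small downloader middleware can look up the entry for each request's domain and set the proxy on the request. a rough sketch, assuming the PROXY_CONFIG dict above and a hypothetical AGGRESSIVE_DOMAINS set; enable it via DOWNLOADER_MIDDLEWARES:

from urllib.parse import urlparse

# hypothetical list of domains known to run aggressive anti-bot protection
AGGRESSIVE_DOMAINS = {"shop.example.com"}


class DomainProxyMiddleware:
    """assign a proxy to each request based on its target domain."""

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        key = "aggressive_antibot" if domain in AGGRESSIVE_DOMAINS else "default"
        request.meta["proxy"] = PROXY_CONFIG[key]["proxy"]
        # per-domain delays are handled separately (e.g. with AutoThrottle);
        # the "delay" value in PROXY_CONFIG is informational here
        return None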

Common Pitfalls

indexing too much noise. if you index entire pages including navigation, footers, and sidebars, your search results will be polluted. strip out non-content elements before indexing.

not handling duplicates. many sites serve the same content at multiple URLs (with and without trailing slashes, with query parameters, etc.). deduplicate by content hash before indexing.
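
a simple approach is to hash the extracted text and skip any record whose hash has already been seen, for example in the indexing loop:

import hashlib

seen_hashes = set()

def is_duplicate(doc):
    """return True if an identical content body has already been indexed."""
    content_hash = hashlib.sha1(doc["content"].encode("utf-8")).hexdigest()
    if content_hash in seen_hashes:
        return True
    seen_hashes.add(content_hash)
    return False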

ignoring character encoding. some sites still serve pages in non-UTF-8 encodings. always normalize to UTF-8 before indexing.

not optimizing the index. Whoosh indexes can grow large and fragment into many segments as documents are added. for production deployments, commit with writer.commit(optimize=True) periodically to merge segments and keep the index compact.

FAQ

how many pages can Whoosh handle?
Whoosh works well up to around 500,000 documents. beyond that, switch to Elasticsearch or Meilisearch.

can I add AI-powered semantic search?
yes. you can generate embeddings for each document using sentence-transformers and store them alongside the text index. query with both keyword and semantic search, then merge the results.
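
a rough sketch of the embedding side, assuming the sentence-transformers package and the widely used all-MiniLM-L6-v2 model (docs here stands for the crawled records loaded from the JSONL file):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# embed documents once at index time and the query at search time
doc_embeddings = model.encode([d["content"][:2000] for d in docs])
query_embedding = model.encode("web scraping python tutorial")

# cosine similarity gives a semantic relevance score per document
scores = util.cos_sim(query_embedding, doc_embeddings)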

is it legal to crawl websites for a search engine?
generally yes, as long as you respect robots.txt and terms of service. search engine crawling has broad legal precedent. always check the specific site’s terms before crawling.

Conclusion

you now have a working search engine built from scratch with Python. the crawler gathers content, the indexer makes it searchable, and the web interface lets you query everything. with proxy rotation in place, your crawler can operate reliably across many domains without getting blocked.

building a search engine means crawling at scale without getting blocked. our mobile proxy service provides dedicated Singapore carrier IPs for reliable, large-scale crawling.

the complete project involves about 300 lines of Python across four files. from here you can extend it with features like autocomplete, search suggestions, faceted filtering by domain, or AI-powered re-ranking to improve result quality.
