Build a Search Engine with Web Scraping
Building your own search engine sounds like a massive undertaking. In reality, you can build a functional search engine for a specific domain or niche in a weekend using Python, web scraping, and a few well-chosen libraries. This guide walks you through the entire process, from crawling to indexing to querying.
The result is a vertical search engine that crawls specific websites, indexes their content, and lets you run full-text queries against everything it has collected. You will also learn how to integrate proxy rotation so your crawler can operate reliably at scale without getting blocked.
Why Build a Custom Search Engine
Google is great for general search, but it falls short when you need to search across a specific set of sites or data sources. A custom search engine gives you full control over what gets indexed, how results are ranked, and what metadata you capture.
Common use cases include:
- searching across competitor product catalogs
- building an internal knowledge base from public documentation
- creating a research tool that indexes academic sources
- monitoring specific forums or communities for mentions of your brand
Architecture Overview
Our search engine has four main components:
┌──────────────────────┐
│ Query Interface │ ← Flask web app
├──────────────────────┤
│ Search/Ranking │ ← TF-IDF + BM25
├──────────────────────┤
│ Inverted Index │ ← Whoosh or custom
├──────────────────────┤
│ Web Crawler │ ← Scrapy + proxies
└──────────────────────┘
The crawler feeds pages into the indexer, the indexer builds a searchable index, and the query interface lets users search and get ranked results.
Prerequisites
You will need Python 3.10 or later and the following packages:
pip install scrapy beautifulsoup4 whoosh flask requests lxml
For proxy integration, you will also want:
pip install scrapy-rotating-proxies
Step 1: Build the Web Crawler
The crawler is the foundation: it visits pages, extracts text content, and stores it for indexing. Here is a Scrapy spider that crawls a target domain:
import scrapy
from urllib.parse import urlparse
import hashlib


class SearchCrawler(scrapy.Spider):
    name = "search_crawler"

    custom_settings = {
        "DEPTH_LIMIT": 3,
        "DOWNLOAD_DELAY": 1.5,
        "CONCURRENT_REQUESTS": 8,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "SearchBot/1.0 (+https://yourdomain.com/bot)",
    }

    def __init__(self, domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if domains:
            self.start_urls = [f"https://{d}" for d in domains.split(",")]
            self.allowed_domains = domains.split(",")
        else:
            self.start_urls = ["https://example.com"]
            self.allowed_domains = ["example.com"]

    def parse(self, response):
        # skip non-HTML responses
        content_type = response.headers.get("Content-Type", b"").decode()
        if "text/html" not in content_type:
            return

        title = (response.css("title::text").get() or "").strip()

        # remove script, style, and page-chrome elements, then extract text
        body = response.css("body")
        for selector in ["script", "style", "nav", "footer", "header"]:
            body.css(selector).drop()
        text = " ".join(body.css("*::text").getall())
        text = " ".join(text.split())  # normalize whitespace

        # extract meta description
        meta_desc = response.css(
            'meta[name="description"]::attr(content)'
        ).get() or ""

        page_id = hashlib.md5(response.url.encode()).hexdigest()
        yield {
            "id": page_id,
            "url": response.url,
            "title": title,
            "meta_description": meta_desc,
            "content": text[:50000],  # cap content length
            "domain": urlparse(response.url).netloc,
        }

        # follow links within allowed domains
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
Run the crawler and save the results to a JSON Lines file (the -O flag overwrites any previous output; the format is inferred from the .jsonl extension):
scrapy runspider crawler.py -a domains="example.com,docs.example.com" \
    -O crawled_pages.jsonl
Step 2: Add Proxy Rotation to the Crawler
When crawling multiple sites or large domains, you will likely hit rate limits and IP blocks. Proxy rotation fixes this. Update your Scrapy settings:
custom_settings = {
    "DEPTH_LIMIT": 3,
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": True,
    "ROTATING_PROXY_LIST_PATH": "proxies.txt",
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}
Create a proxies.txt file with your proxy list:
http://user:pass@proxy1.example.com:8080
http://user:pass@proxy2.example.com:8080
socks5://user:pass@proxy3.example.com:1080
If you are using a proxy provider with a rotating gateway, you do not need the rotation middleware at all. Note that Scrapy does not read settings named HTTP_PROXY or HTTPS_PROXY; its built-in HttpProxyMiddleware picks up the standard proxy environment variables instead:
import os

os.environ["http_proxy"] = "http://user:pass@gate.provider.com:7777"
os.environ["https_proxy"] = "http://user:pass@gate.provider.com:7777"
Step 3: Build the Inverted Index
The inverted index maps every word to the documents that contain it; this is what makes search fast. We will use the Whoosh library, which provides full-text indexing in pure Python:
import os
import json
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, STORED
from whoosh.analysis import StemmingAnalyzer


def create_index(index_dir="search_index"):
    """Create the search index schema."""
    schema = Schema(
        doc_id=ID(stored=True, unique=True),
        url=STORED(),
        title=TEXT(stored=True, analyzer=StemmingAnalyzer()),
        meta_description=TEXT(stored=True),
        # stored=True so result snippets can be highlighted at query time
        content=TEXT(stored=True, analyzer=StemmingAnalyzer()),
        domain=STORED(),
    )
    os.makedirs(index_dir, exist_ok=True)
    return index.create_in(index_dir, schema)


def index_documents(jsonl_path, index_dir="search_index"):
    """Index all crawled documents."""
    ix = create_index(index_dir)
    writer = ix.writer()
    count = 0
    with open(jsonl_path) as f:
        for line in f:
            doc = json.loads(line)
            writer.add_document(
                doc_id=doc["id"],
                url=doc["url"],
                title=doc["title"],
                meta_description=doc.get("meta_description", ""),
                content=doc["content"],
                domain=doc["domain"],
            )
            count += 1
            if count % 1000 == 0:
                print(f"indexed {count} documents...")
    writer.commit()
    print(f"indexing complete. {count} documents indexed.")
    return ix


if __name__ == "__main__":
    index_documents("crawled_pages.jsonl")
The StemmingAnalyzer handles word stemming automatically, so searching for “running” will also match “runs” and “run.” This significantly improves search quality.
Step 4: Implement Search and Ranking
Now you need a search function that queries the index and returns ranked results. Whoosh supports BM25 ranking out of the box:
from whoosh import index
from whoosh.qparser import MultifieldParser, OrGroup
from whoosh.scoring import BM25F


def search(query_string, index_dir="search_index", limit=20):
    """Search the index and return ranked results."""
    ix = index.open_dir(index_dir)

    # search across title and content fields; title matches get higher weight
    parser = MultifieldParser(
        ["title", "content"],
        schema=ix.schema,
        group=OrGroup,
        fieldboosts={"title": 3.0, "content": 1.0},
    )
    query = parser.parse(query_string)

    results = []
    with ix.searcher(weighting=BM25F()) as searcher:
        hits = searcher.search(query, limit=limit)
        for hit in hits:
            # highlights() requires the content field to be stored in the
            # schema; fall back to the meta description when it is not
            snippet = (hit.highlights("content", top=3)
                       if "content" in hit
                       else hit.get("meta_description", ""))
            results.append({
                "title": hit["title"],
                "url": hit["url"],
                "domain": hit["domain"],
                "score": hit.score,
                "snippet": snippet,
                "meta_description": hit.get("meta_description", ""),
            })
    return results


if __name__ == "__main__":
    results = search("web scraping python tutorial")
    for i, r in enumerate(results, 1):
        print(f"{i}. [{r['score']:.2f}] {r['title']}")
        print(f"   {r['url']}")
        print(f"   {r['snippet'][:200]}")
        print()
Step 5: Build a Web Interface
A simple Flask app turns your search engine into something you can actually use in a browser:
from flask import Flask, request, render_template_string
from search import search

app = Flask(__name__)

SEARCH_TEMPLATE = """
<!DOCTYPE html>
<html>
<head><title>Custom Search Engine</title></head>
<body>
  <h1>Custom Search</h1>
  <form method="GET" action="/search">
    <input type="text" name="q" value="{{ query }}" size="60">
    <button type="submit">Search</button>
  </form>
  {% if results %}
    <p>{{ results|length }} results for "{{ query }}"</p>
    {% for r in results %}
      <div style="margin-bottom: 20px;">
        <a href="{{ r.url }}"><strong>{{ r.title }}</strong></a>
        <br><small>{{ r.url }} | Score: {{ "%.2f"|format(r.score) }}</small>
        <br>{{ r.meta_description or r.snippet }}
      </div>
    {% endfor %}
  {% endif %}
</body>
</html>
"""


@app.route("/")
def home():
    return render_template_string(SEARCH_TEMPLATE, query="", results=[])


@app.route("/search")
def search_page():
    query = request.args.get("q", "")
    results = search(query) if query else []
    return render_template_string(
        SEARCH_TEMPLATE, query=query, results=results
    )


if __name__ == "__main__":
    app.run(debug=True, port=5000)
Step 6: Add Scheduled Re-crawling
A search engine is only useful if its index stays fresh. Set up a scheduled re-crawl using a simple Python script:
import subprocess
import time
from datetime import datetime

from indexer import index_documents


def recrawl_and_reindex(domains, interval_hours=24):
    """Recrawl target domains and rebuild the index."""
    while True:
        print(f"[{datetime.now()}] starting recrawl...")
        # run the crawler; -O overwrites the output file so stale pages
        # from the previous crawl do not accumulate
        subprocess.run([
            "scrapy", "runspider", "crawler.py",
            "-a", f"domains={domains}",
            "-O", "crawled_pages.jsonl",
            "-s", "LOG_LEVEL=WARNING",
        ])
        # rebuild the index
        index_documents("crawled_pages.jsonl")
        print(f"[{datetime.now()}] recrawl complete. "
              f"next run in {interval_hours} hours.")
        time.sleep(interval_hours * 3600)


if __name__ == "__main__":
    recrawl_and_reindex("example.com,docs.example.com")
For production use, replace the while loop with a cron job or a task scheduler like Celery.
Step 7: Improve Ranking Quality
The basic BM25 ranking works well, but you can improve it. Here are some practical enhancements.
URL Depth Scoring
Pages closer to the root of a domain tend to be more important:
from urllib.parse import urlparse


def url_depth_score(url):
    """Assign higher scores to shallower URLs."""
    path = urlparse(url).path.strip("/")
    depth = len(path.split("/")) if path else 0
    return 1.0 / (1 + depth * 0.3)
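For example, the function gives a homepage the full multiplier and discounts deeper paths (repeated here so the snippet runs on its own):

```python
from urllib.parse import urlparse


def url_depth_score(url):
    """Assign higher scores to shallower URLs (same function as above)."""
    path = urlparse(url).path.strip("/")
    depth = len(path.split("/")) if path else 0
    return 1.0 / (1 + depth * 0.3)


print(url_depth_score("https://example.com/"))          # 1.0
print(url_depth_score("https://example.com/docs"))      # 1 / 1.3 ≈ 0.769
print(url_depth_score("https://example.com/docs/api"))  # 1 / 1.6 = 0.625
```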
Freshness Boost
If you track crawl timestamps, give newer pages a small ranking boost:
from datetime import datetime, timedelta


def freshness_score(crawled_at):
    """Boost pages crawled recently."""
    age = datetime.now() - crawled_at
    if age < timedelta(days=7):
        return 1.2
    elif age < timedelta(days=30):
        return 1.0
    else:
        return 0.8
Domain Authority
If you want certain domains to rank higher, add a domain weight:
DOMAIN_WEIGHTS = {
    "docs.python.org": 1.5,
    "stackoverflow.com": 1.3,
    "realpython.com": 1.2,
}


def domain_score(domain):
    return DOMAIN_WEIGHTS.get(domain, 1.0)
Combine all three into a final score:
def compute_final_score(bm25_score, url, domain, crawled_at):
    return (
        bm25_score
        * url_depth_score(url)
        * freshness_score(crawled_at)
        * domain_score(domain)
    )
Scaling Considerations
For a search engine indexing tens of thousands of pages, Whoosh works fine. If you need to scale beyond that, consider these alternatives:
- Elasticsearch or OpenSearch for distributed indexing and search across millions of documents
- Meilisearch for a lightweight, typo-tolerant search experience
- SQLite FTS5 if you want something embedded without external dependencies
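The SQLite FTS5 option is worth a quick sketch, since it works through Python's standard sqlite3 module (assuming your SQLite build includes the FTS5 extension, which most do):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# an FTS5 virtual table is itself the inverted index
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(title, content)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("Scrapy at scale", "rotating proxies for web scraping"),
)
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("Flask basics", "building small web apps"),
)
# MATCH runs a full-text query; bm25() ranks results (lower is better)
rows = conn.execute(
    "SELECT title FROM pages WHERE pages MATCH 'proxies' ORDER BY bm25(pages)"
).fetchall()
print(rows)  # [('Scrapy at scale',)]
```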
The crawling layer scales independently: you can run multiple Scrapy spiders in parallel across different domains, each writing to its own output file or a shared queue.
Proxy Strategy for Search Engine Crawling
When crawling for a search engine, you need to think about proxy usage differently than for one-off scraping:
- Use residential proxies for sites with aggressive anti-bot protection. Datacenter proxies work fine for most documentation sites and blogs.
- Implement per-domain rate limiting. Even with proxies, hammering a site with 100 concurrent requests is poor practice and will get your IP ranges flagged.
- Respect robots.txt. Your crawler should honor the crawl-delay directive and disallow rules.
- Rotate user agents alongside IPs. Some sites fingerprint by user agent string, not just IP.
# example proxy configuration for different domain types
PROXY_CONFIG = {
    "default": {
        "proxy": "http://user:pass@dc-proxy.provider.com:8080",
        "delay": 2.0,
    },
    "aggressive_antibot": {
        "proxy": "http://user:pass@residential.provider.com:8080",
        "delay": 3.0,
    },
}
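A small helper can turn a config like that into per-request settings while rotating the user agent at the same time. This is a sketch: the domain list and user agent strings are placeholders, and wiring the result into Scrapy (via request.meta["proxy"] and headers) is left to your middleware:

```python
import random
from urllib.parse import urlparse

PROXY_CONFIG = {
    "default": {
        "proxy": "http://user:pass@dc-proxy.provider.com:8080",
        "delay": 2.0,
    },
    "aggressive_antibot": {
        "proxy": "http://user:pass@residential.provider.com:8080",
        "delay": 3.0,
    },
}

# domains known to run aggressive anti-bot checks (placeholder list)
HARD_DOMAINS = {"shop.example.com"}

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


def request_settings(url):
    """Pick proxy tier, delay, and a random user agent for one request."""
    tier = "aggressive_antibot" if urlparse(url).netloc in HARD_DOMAINS else "default"
    cfg = PROXY_CONFIG[tier]
    return {
        "proxy": cfg["proxy"],
        "download_delay": cfg["delay"],
        "user_agent": random.choice(USER_AGENTS),
    }


settings = request_settings("https://shop.example.com/item/42")
print(settings["proxy"])  # the residential proxy, since the domain is flagged
```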
Common Pitfalls
Indexing too much noise. If you index entire pages, including navigation, footers, and sidebars, your search results will be polluted. Strip out non-content elements before indexing.
Not handling duplicates. Many sites serve the same content at multiple URLs (with and without trailing slashes, with query parameters, and so on). Deduplicate by content hash before indexing.
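Content-hash deduplication can be as simple as hashing the normalized text and skipping anything already seen:

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash the normalized text so cosmetic differences do not matter."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


seen: set[str] = set()


def is_duplicate(text: str) -> bool:
    """Return True if an equivalent page was already fingerprinted."""
    fp = content_fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False


print(is_duplicate("Hello   World"))  # False (first sighting)
print(is_duplicate("hello world"))    # True (same content after normalization)
```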
Ignoring character encoding. Some sites still serve pages in non-UTF-8 encodings. Always normalize to UTF-8 before indexing.
Not optimizing the index. Whoosh indexes can grow large and fragment across many segments. Periodically commit with writer.commit(optimize=True) to merge segments for production deployments.
FAQ
How many pages can Whoosh handle?
Whoosh works well up to around 500,000 documents. Beyond that, switch to Elasticsearch or Meilisearch.
Can I add AI-powered semantic search?
Yes. You can generate embeddings for each document using sentence-transformers and store them alongside the text index. Query with both keyword and semantic search, then merge the results.
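Merging the two result lists can be sketched as a weighted combination of normalized scores (the weights and document IDs here are illustrative; producing the embeddings themselves requires a library such as sentence-transformers):

```python
def merge_results(keyword_scores, semantic_scores, alpha=0.7):
    """Blend keyword (BM25) and semantic scores, both normalized to 0..1."""
    doc_ids = set(keyword_scores) | set(semantic_scores)
    combined = {
        doc_id: alpha * keyword_scores.get(doc_id, 0.0)
        + (1 - alpha) * semantic_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # highest combined score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


ranked = merge_results(
    {"doc1": 0.9, "doc2": 0.4},
    {"doc2": 0.8, "doc3": 0.7},
)
print(ranked[0][0])  # doc1 (0.63 vs doc2's 0.52 and doc3's 0.21)
```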
Is it legal to crawl websites for a search engine?
Generally yes, as long as you respect robots.txt and terms of service. Search engine crawling has broad legal precedent. Always check the specific site’s terms before crawling.
Conclusion
You now have a working search engine built from scratch with Python. The crawler gathers content, the indexer makes it searchable, and the web interface lets you query everything. With proxy rotation in place, your crawler can operate reliably across many domains without getting blocked.
Building a search engine means crawling at scale without getting blocked. Our mobile proxy service provides dedicated Singapore carrier IPs for reliable, large-scale crawling.
The complete project involves about 300 lines of Python across four files. From here you can extend it with features like autocomplete, search suggestions, faceted filtering by domain, or AI-powered re-ranking to improve result quality.