How to Scrape Industry Forums and Communities for Lead Signals

Industry forums, Reddit communities, Quora threads, and niche online communities are rich with buying-intent signals that most B2B sales teams overlook. When a CTO posts on a DevOps forum asking about container orchestration tools, that is a stronger buying signal than any firmographic filter on LinkedIn. When an operations manager describes their pain points with current software on an industry subreddit, that is an invitation for personalized outreach.

Scraping these communities with mobile proxies lets you identify prospects at the moment of highest intent — when they are actively discussing problems your product solves.

Why Community Data Beats Traditional Lead Sources

Community-sourced leads differ fundamentally from directory-based leads:

Attribute                 | Directory Leads        | Community Leads
--------------------------|------------------------|-----------------------------
Intent signal             | None (static listing)  | High (active discussion)
Timing                    | Unknown                | Real-time
Pain points               | Inferred               | Explicitly stated
Competition               | Everyone has same data | Few teams monitor this
Personalization potential | Low                    | High (reference their post)

A prospect who posted “We’re evaluating [tool] for our team of 50” is orders of magnitude more qualified than a random company matching your ICP filters.

Target Communities by Industry

Technology and SaaS

  • Hacker News (news.ycombinator.com) — CTOs, engineers, founders
  • Reddit — r/devops, r/sysadmin, r/webdev, r/startups, r/SaaS
  • Stack Overflow — Enterprise technology discussions
  • Dev.to — Developer community
  • Indie Hackers — Founders and bootstrappers

Marketing and Sales

  • GrowthHackers — Growth marketing professionals
  • Reddit — r/marketing, r/digital_marketing, r/sales, r/PPC
  • Warrior Forum — Internet marketing
  • Quora — Marketing strategy discussions

Finance and Business

  • Reddit — r/smallbusiness, r/entrepreneur, r/accounting
  • Quora — Business operations topics
  • Industry-specific forums — Varies by vertical

Scraping Reddit for Lead Signals

Reddit is the largest source of community-based lead signals:

import requests
import time
import random
import re
from datetime import datetime

class RedditLeadScraper:
    """Scrape Reddit for B2B lead signals"""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        self.session = requests.Session()
        self.session.proxies = {"http": proxy_url, "https": proxy_url}
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        })

    def search_subreddit(self, subreddit, query, sort="new", time_filter="week", limit=100):
        """Search a subreddit for relevant posts"""
        url = f"https://www.reddit.com/r/{subreddit}/search.json"
        params = {
            "q": query,
            "restrict_sr": "true",
            "sort": sort,
            "t": time_filter,
            "limit": min(limit, 100),  # Reddit caps page size at 100
        }

        all_posts = []
        after = None

        while len(all_posts) < limit:
            if after:
                params["after"] = after

            response = self.session.get(url, params=params, timeout=15)

            if response.status_code == 429:
                time.sleep(60)
                continue

            if response.status_code != 200:
                break

            data = response.json()
            posts = data.get("data", {}).get("children", [])

            if not posts:
                break

            for post in posts:
                post_data = post.get("data", {})
                all_posts.append({
                    "title": post_data.get("title"),
                    "body": post_data.get("selftext", ""),
                    "author": post_data.get("author"),
                    "subreddit": subreddit,
                    "url": f"https://reddit.com{post_data.get('permalink', '')}",
                    "score": post_data.get("score"),
                    "num_comments": post_data.get("num_comments"),
                    "created_utc": post_data.get("created_utc"),
                    "flair": post_data.get("link_flair_text"),
                })

            after = data.get("data", {}).get("after")
            if not after:
                break

            time.sleep(random.uniform(2, 5))

        return all_posts[:limit]

    def find_buying_signals(self, posts, signal_patterns):
        """Filter posts for buying intent signals"""
        qualified_posts = []

        for post in posts:
            text = f"{post['title']} {post['body']}".lower()
            signals = []

            for category, patterns in signal_patterns.items():
                for pattern in patterns:
                    if re.search(pattern, text, re.IGNORECASE):
                        signals.append(category)
                        break

            if signals:
                post['buying_signals'] = signals
                post['signal_strength'] = len(signals)
                qualified_posts.append(post)

        return sorted(qualified_posts, key=lambda x: x['signal_strength'], reverse=True)


# Define buying signal patterns
BUYING_SIGNALS = {
    "evaluating": [
        r"looking for\s+(?:a|an)\s+\w+\s+(?:tool|solution|platform|software)",
        r"evaluating\s+\w+",
        r"considering\s+(?:switching|migrating|upgrading)",
        r"any\s+recommendations\s+for",
        r"what\s+(?:do you|does everyone)\s+use\s+for",
    ],
    "pain_point": [
        r"frustrated\s+with",
        r"problem\s+with\s+(?:our|my|the)",
        r"struggling\s+(?:to|with)",
        r"(?:current|existing)\s+(?:tool|solution)\s+(?:isn't|doesn't|can't)",
        r"looking\s+for\s+(?:an?\s+)?alternative",
    ],
    "budget_ready": [
        r"budget\s+(?:of|for|around)",
        r"willing\s+to\s+(?:pay|spend|invest)",
        r"pricing\s+(?:for|of|comparison)",
        r"how\s+much\s+(?:does|would|should)",
        r"roi\s+(?:of|from|on)",
    ],
    "team_size": [
        r"team\s+of\s+\d+",
        r"\d+\s+(?:employees|people|users|seats)",
        r"(?:small|medium|large)\s+(?:team|company|organization)",
    ],
}
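
Putting the pieces together: a usage sketch against r/devops, with a placeholder proxy URL standing in for your own mobile proxy endpoint:

# Usage sketch; the proxy URL is a placeholder, not a real endpoint
scraper = RedditLeadScraper("http://user:pass@mobile-proxy.example.com:8080")

posts = scraper.search_subreddit("devops", "CI/CD tool", time_filter="week")
qualified = scraper.find_buying_signals(posts, BUYING_SIGNALS)

for post in qualified[:10]:
    print(f"[{post['signal_strength']}] {post['title']}")
    print(f"    signals: {', '.join(post['buying_signals'])}")
    print(f"    {post['url']}")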

Scraping Quora for Intent Data

Quora discussions reveal detailed intent signals:

import random
from urllib.parse import quote_plus

from playwright.async_api import async_playwright

async def scrape_quora_questions(topic, proxy_config, max_questions=50):
    """Scrape Quora for questions related to a topic"""
    async with async_playwright() as p:
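        # headless=False: a visible browser is harder to fingerprint,
        # and Quora is aggressive about blocking obvious headless clients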
        browser = await p.chromium.launch(proxy=proxy_config, headless=False)
        page = await browser.new_page()

        await page.goto(f"https://www.quora.com/search?q={quote_plus(topic)}", wait_until="networkidle")
        await page.wait_for_timeout(random.randint(3000, 6000))

        questions = []

        # Scroll to load more results
        for _ in range(10):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(random.randint(2000, 4000))

        # Extract questions
        question_els = await page.query_selector_all('[class*="question_link"]')

        for el in question_els[:max_questions]:
            question = {}
            question['text'] = (await el.inner_text()).strip()
            link = await el.get_attribute('href')
            if link:
                question['url'] = f"https://www.quora.com{link}" if not link.startswith('http') else link

            questions.append(question)

        await browser.close()
        return questions
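
A sketch of invoking it, using Playwright's standard proxy dict (server, username, password); the credentials below are placeholders:

import asyncio

# Placeholder credentials; substitute your own mobile proxy details
proxy_config = {
    "server": "http://mobile-proxy.example.com:8080",
    "username": "user",
    "password": "pass",
}

questions = asyncio.run(
    scrape_quora_questions("project management software", proxy_config)
)
for q in questions:
    print(q.get("text"), "->", q.get("url"))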

Hacker News Monitoring

Hacker News is where technology decision-makers discuss tools and challenges:

class HackerNewsScraper:
    """Monitor Hacker News for B2B lead signals"""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        self.api_base = "https://hn.algolia.com/api/v1"
        self.session = requests.Session()
        self.session.proxies = {"http": proxy_url, "https": proxy_url}

    def search_stories(self, query, num_results=50):
        """Search HN stories via Algolia API"""
        response = self.session.get(
            f"{self.api_base}/search",
            params={
                "query": query,
                "tags": "story",
                "hitsPerPage": num_results,
            },
            timeout=15,
        )

        if response.status_code == 200:
            hits = response.json().get("hits", [])
            return [
                {
                    "id": hit.get("objectID"),
                    "title": hit.get("title"),
                    "url": hit.get("url"),
                    "author": hit.get("author"),
                    "points": hit.get("points"),
                    "comments": hit.get("num_comments"),
                    "hn_url": f"https://news.ycombinator.com/item?id={hit.get('objectID')}",
                    "created_at": hit.get("created_at"),
                }
                for hit in hits
            ]
        return []

    def get_ask_hn_posts(self, query):
        """Search Ask HN posts (highest intent)"""
        response = self.session.get(
            f"{self.api_base}/search",
            params={
                "query": query,
                "tags": "ask_hn",
                "hitsPerPage": 30,
            },
            timeout=15,
        )

        if response.status_code == 200:
            return response.json().get("hits", [])
        return []

    def extract_comments_with_intent(self, story_id):
        """Extract comments from a story, looking for buying intent"""
        response = self.session.get(
            f"{self.api_base}/items/{story_id}",
            timeout=15,
        )

        if response.status_code != 200:
            return []

        data = response.json()
        intent_comments = []

        def process_comments(children):
            for child in children:
                # Algolia returns comment text as HTML; strip tags before matching
                text = re.sub(r"<[^>]+>", " ", child.get("text") or "")
                author = child.get("author")

                if text and self.has_buying_intent(text):
                    intent_comments.append({
                        "author": author,
                        "text": text[:500],
                        "story_id": story_id,
                    })

                if child.get("children"):
                    process_comments(child["children"])

        if data.get("children"):
            process_comments(data["children"])

        return intent_comments

    def has_buying_intent(self, text):
        """Check if comment shows buying intent"""
        intent_phrases = [
            "we use", "we switched to", "we're looking for",
            "we just migrated", "we evaluated", "I recommend",
            "our team uses", "we've been using",
            "looking for alternatives", "any suggestions for",
        ]
        text_lower = text.lower()
        return any(phrase in text_lower for phrase in intent_phrases)
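
A usage sketch tying search and comment extraction together; the proxy URL is a placeholder, and the id field comes from Algolia's objectID:

# Usage sketch; the proxy URL is a placeholder, not a real endpoint
hn = HackerNewsScraper("http://user:pass@mobile-proxy.example.com:8080")

for story in hn.search_stories("monitoring tools", num_results=10):
    print(story["title"], "->", story["hn_url"])
    for comment in hn.extract_comments_with_intent(story["id"]):
        print(f"  {comment['author']}: {comment['text'][:120]}")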

Identifying the Person Behind the Post

Forum posts are only useful as leads if you can tie the poster to a business identity. Public profile data and cross-platform username reuse are the main levers:

class ForumIdentityResolver:
    """Resolve forum usernames to real business identities"""

    def __init__(self, proxy_pool):
        self.proxy_pool = proxy_pool

    def resolve_reddit_user(self, username):
        """Attempt to identify a Reddit user"""
        proxy = self.proxy_pool.get_next()

        # Fetch the Reddit profile for account age and karma
        response = requests.get(
            f"https://www.reddit.com/user/{username}/about.json",
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=15,
        )

        identity = {"reddit_username": username}

        if response.status_code == 200:
            data = response.json().get("data", {})
            identity["reddit_karma"] = data.get("link_karma", 0) + data.get("comment_karma", 0)
            identity["reddit_created"] = data.get("created_utc")

        # Many people reuse the same handle across platforms,
        # so search for the username elsewhere
        identity["possible_matches"] = self.cross_platform_search(username)

        return identity

    def cross_platform_search(self, username):
        """Search for username across platforms"""
        matches = []
        proxy = self.proxy_pool.get_next()

        # GitHub (unauthenticated API requests are limited to 60/hour per IP)
        try:
            response = requests.get(
                f"https://api.github.com/users/{username}",
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
            if response.status_code == 200:
                gh_data = response.json()
                matches.append({
                    "platform": "github",
                    "name": gh_data.get("name"),
                    "company": gh_data.get("company"),
                    "email": gh_data.get("email"),
                    "url": gh_data.get("html_url"),
                })
        except Exception:
            pass

        return matches
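
The resolver (and the monitor below) assume a proxy_pool object exposing get_next(). That helper is not defined in this article; a minimal round-robin sketch, with placeholder endpoints, might look like this:

from itertools import cycle

class ProxyPool:
    """Minimal round-robin pool matching the get_next() interface used above."""

    def __init__(self, proxy_urls):
        self._cycle = cycle(proxy_urls)

    def get_next(self):
        return next(self._cycle)

# Placeholder endpoints; substitute your own mobile proxy URLs
pool = ProxyPool([
    "http://user:pass@mobile-proxy-1.example.com:8080",
    "http://user:pass@mobile-proxy-2.example.com:8080",
])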

Automated Monitoring Pipeline

Set up continuous monitoring for lead signals:

class CommunityMonitor:
    """Continuously monitor communities for lead signals"""

    def __init__(self, proxy_pool, alert_callback):
        self.proxy_pool = proxy_pool
        self.alert_callback = alert_callback
        self.seen_posts = set()
        self.monitors = []

    def configure_monitors(self, monitors):
        """Configure which communities and keywords to monitor"""
        self.monitors = monitors
        # Example:
        # [
        #     {"type": "reddit", "subreddit": "devops", "keywords": ["CI/CD tool", "deployment automation"]},
        #     {"type": "reddit", "subreddit": "sysadmin", "keywords": ["monitoring solution", "alert fatigue"]},
        #     {"type": "hn", "keywords": ["infrastructure automation"]},
        # ]

    def run_check(self):
        """Run a single check across all monitored communities"""
        new_leads = []

        for monitor in self.monitors:
            proxy = self.proxy_pool.get_next()

            if monitor["type"] == "reddit":
                scraper = RedditLeadScraper(proxy)
                for keyword in monitor["keywords"]:
                    posts = scraper.search_subreddit(
                        monitor["subreddit"],
                        keyword,
                        time_filter="day",
                    )
                    qualified = scraper.find_buying_signals(posts, BUYING_SIGNALS)

                    for post in qualified:
                        post_id = post.get("url")
                        if post_id not in self.seen_posts:
                            self.seen_posts.add(post_id)
                            new_leads.append(post)

                    time.sleep(random.uniform(3, 8))

            elif monitor["type"] == "hn":
                scraper = HackerNewsScraper(proxy)
                for keyword in monitor["keywords"]:
                    stories = scraper.search_stories(keyword, num_results=20)
                    for story in stories:
                        story_id = story.get("hn_url")
                        if story_id not in self.seen_posts:
                            self.seen_posts.add(story_id)
                            if scraper.has_buying_intent(story.get("title") or ""):
                                new_leads.append(story)

        if new_leads:
            self.alert_callback(new_leads)

        return new_leads
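
Wiring it up, assuming the ProxyPool sketch above and a simple console alert; a Slack or CRM webhook would slot into alert() the same way:

def alert(leads):
    # Printing keeps the sketch self-contained; swap in a webhook or CRM call here
    for lead in leads:
        print(f"NEW LEAD: {lead.get('title')} -> {lead.get('url') or lead.get('hn_url')}")

monitor = CommunityMonitor(pool, alert)
monitor.configure_monitors([
    {"type": "reddit", "subreddit": "devops", "keywords": ["CI/CD tool"]},
    {"type": "hn", "keywords": ["infrastructure automation"]},
])

while True:
    monitor.run_check()
    time.sleep(3600)  # check hourly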

Scoring Forum-Sourced Leads

Not all forum signals are equal. Score them by intent strength so your team works the strongest signals first:

def score_forum_lead(post):
    """Score a forum-sourced lead by quality and intent"""
    score = 0

    # Signal type scoring
    signal_scores = {
        "evaluating": 30,
        "pain_point": 20,
        "budget_ready": 40,
        "team_size": 15,
    }

    for signal in post.get("buying_signals", []):
        score += signal_scores.get(signal, 5)

    # Recency bonus
    if post.get("created_utc"):
        age_hours = (time.time() - post["created_utc"]) / 3600
        if age_hours < 24:
            score += 20  # Posted today
        elif age_hours < 168:
            score += 10  # Posted this week

    # Engagement indicates real discussion
    if post.get("num_comments", 0) > 5:
        score += 10
    if post.get("score", 0) > 10:
        score += 5

    # Author profile completeness
    if post.get("author") and post["author"] != "[deleted]":
        score += 5

    post["lead_score"] = min(score, 100)
    return post
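
Scoring slots naturally into the alert path. A sketch that ranks incoming leads and surfaces only the strongest (the 50-point threshold here is arbitrary):

def alert_with_scoring(leads):
    # Score each lead, then route only those above the threshold
    scored = [score_forum_lead(lead) for lead in leads]
    hot = [lead for lead in scored if lead["lead_score"] >= 50]
    for lead in sorted(hot, key=lambda l: l["lead_score"], reverse=True):
        print(f"[{lead['lead_score']}] {lead.get('title')} {lead.get('url', '')}")

monitor = CommunityMonitor(pool, alert_with_scoring)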

Conclusion

Forum and community scraping provides the highest-intent B2B lead signals available — prospects actively discussing the exact problems your product solves. While the volume is lower than directory scraping, the conversion rates are dramatically higher because every lead comes with context: their specific pain points, team size, current tools, and evaluation timeline. Mobile proxies ensure reliable access to Reddit, Quora, Hacker News, and niche forums without triggering rate limits. Build automated monitoring across your target communities, score leads by intent strength, and route the highest-scoring signals to your sales team for immediate personalized outreach.

