How to Scrape Reddit Posts and Comments at Scale

Reddit generates over 1.5 billion visits per month and hosts some of the internet’s most candid user-generated content. For market researchers, brand monitoring teams, NLP engineers, and sentiment analysts, Reddit data provides unfiltered consumer opinions that structured surveys cannot capture.

This guide covers two approaches to Reddit data collection: using Reddit’s official API through PRAW, and direct scraping with proxy rotation for workloads the API limits cannot cover. You will learn how to extract posts, comments, vote data, and metadata using Python and mobile proxies.

Reddit API vs. Direct Scraping

Before writing any code, understand the tradeoffs between the two primary data collection methods.

Reddit API (via PRAW)

Reddit offers a free API with generous rate limits for individual developers. The PRAW (Python Reddit API Wrapper) library makes this API accessible with clean Python abstractions.

Advantages:

  • Structured JSON responses with consistent schema
  • 60 requests per minute for OAuth-authenticated apps
  • Access to historical data through listing endpoints
  • No risk of IP blocks or CAPTCHAs

Limitations:

  • Maximum 1,000 items per listing (Reddit’s hard cap)
  • No access to deleted or removed content
  • Rate limits become restrictive at scale
  • Recent API pricing changes limit commercial use

Direct Scraping with Proxies

When API limits are insufficient, direct scraping through web scraping proxies removes those ceilings.

Advantages:

  • No hard cap on data volume
  • Can capture rendered page content including awards and flair
  • Access to old.reddit.com’s simpler HTML and public JSON endpoints
  • Not subject to API pricing or terms changes

Limitations:

  • Requires proxy rotation to avoid rate limits
  • HTML parsing is more fragile than API responses
  • Requires more infrastructure to operate reliably

Method 1: Scraping with PRAW

Start with the API approach for smaller-scale projects. Install PRAW and set up your credentials:

pip install praw pandas

Create a Reddit app at reddit.com/prefs/apps to get your client ID and secret. Then build the scraper:

import praw
import pandas as pd
import time
from datetime import datetime


class RedditAPIScraper:
    """Collects Reddit data through the official API using PRAW."""

    def __init__(self, client_id, client_secret, user_agent):
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent,
        )

    def scrape_subreddit_posts(self, subreddit_name, sort="hot", limit=500):
        """Scrape posts from a subreddit with metadata."""
        subreddit = self.reddit.subreddit(subreddit_name)

        sort_methods = {
            "hot": subreddit.hot,
            "new": subreddit.new,
            "top": subreddit.top,
            "rising": subreddit.rising,
        }

        posts = []
        fetcher = sort_methods.get(sort, subreddit.hot)

        for submission in fetcher(limit=limit):
            post = {
                "id": submission.id,
                "title": submission.title,
                "author": str(submission.author) if submission.author else "[deleted]",
                "score": submission.score,
                "upvote_ratio": submission.upvote_ratio,
                "num_comments": submission.num_comments,
                "created_utc": datetime.utcfromtimestamp(submission.created_utc).isoformat(),
                "selftext": submission.selftext[:2000] if submission.selftext else None,
                "url": submission.url,
                "permalink": f"https://reddit.com{submission.permalink}",
                "is_self": submission.is_self,
                "link_flair_text": submission.link_flair_text,
                "over_18": submission.over_18,
                "subreddit": subreddit_name,
            }
            posts.append(post)

        return posts

    def scrape_post_comments(self, post_id, replace_limit=None):
        """Scrape all comments from a specific post."""
        submission = self.reddit.submission(id=post_id)
        # limit=None expands every "load more comments" placeholder
        submission.comments.replace_more(limit=replace_limit)

        comments = []
        for comment in submission.comments.list():
            comment_data = {
                "id": comment.id,
                "post_id": post_id,
                "author": str(comment.author) if comment.author else "[deleted]",
                "body": comment.body,
                "score": comment.score,
                "created_utc": datetime.utcfromtimestamp(comment.created_utc).isoformat(),
                "parent_id": comment.parent_id,
                "is_submitter": comment.is_submitter,
                "depth": comment.depth,
            }
            comments.append(comment_data)

        return comments

    def search_subreddit(self, subreddit_name, query, sort="relevance", limit=250):
        """Search within a subreddit for posts matching a query."""
        subreddit = self.reddit.subreddit(subreddit_name)
        results = []

        for submission in subreddit.search(query, sort=sort, limit=limit):
            result = {
                "id": submission.id,
                "title": submission.title,
                "score": submission.score,
                "num_comments": submission.num_comments,
                "created_utc": datetime.utcfromtimestamp(submission.created_utc).isoformat(),
                "permalink": f"https://reddit.com{submission.permalink}",
                "selftext": submission.selftext[:2000] if submission.selftext else None,
            }
            results.append(result)

        return results

    def scrape_multiple_subreddits(self, subreddit_list, sort="hot", limit=100):
        """Scrape posts from multiple subreddits with rate limit handling."""
        all_posts = []

        for sub_name in subreddit_list:
            try:
                posts = self.scrape_subreddit_posts(sub_name, sort=sort, limit=limit)
                all_posts.extend(posts)
                print(f"r/{sub_name}: {len(posts)} posts collected")
                time.sleep(1)  # Respect rate limits
            except Exception as e:
                print(f"Error scraping r/{sub_name}: {e}")

        return all_posts
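
A minimal usage sketch for the class above; the credentials and subreddit names are placeholders to replace with your own:

scraper = RedditAPIScraper(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="DataCollector/1.0",
)

# Collect the newest posts from a few subreddits and export them to CSV
posts = scraper.scrape_multiple_subreddits(["python", "webdev"], sort="new", limit=200)
pd.DataFrame(posts).to_csv("reddit_posts.csv", index=False)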

Method 2: Direct Scraping with Proxy Rotation

For large-scale collection that exceeds API limits, scrape old.reddit.com directly. Its markup is simpler than the redesigned site’s, and appending .json to any listing or post URL returns structured data without HTML parsing, which is what the scraper below relies on:

import requests
import random
import time
import json


class RedditDirectScraper:
    """Scrapes Reddit directly using HTTP requests and proxy rotation."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.current_proxy_index = 0
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 "
                "Mobile/15E148 Safari/604.1"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
        })

    def _get_proxy(self):
        """Rotate to the next proxy in the pool."""
        proxy = self.proxy_list[self.current_proxy_index % len(self.proxy_list)]
        self.current_proxy_index += 1
        return {"http": proxy, "https": proxy}

    def _fetch_page(self, url, max_retries=3):
        """Fetch a page with automatic proxy rotation on failure."""
        for attempt in range(max_retries):
            proxy = self._get_proxy()
            try:
                response = self.session.get(
                    url, proxies=proxy, timeout=15
                )
                if response.status_code == 200:
                    return response.text
                elif response.status_code == 429:
                    print(f"Rate limited on {proxy}, rotating proxy...")
                    time.sleep(random.uniform(5, 10))
                else:
                    print(f"HTTP {response.status_code} from {proxy}")
            except requests.RequestException as e:
                print(f"Request failed via {proxy}: {e}")

            time.sleep(random.uniform(1, 3))

        return None

    def scrape_subreddit(self, subreddit, max_pages=10):
        """Scrape posts from a subreddit using old.reddit.com."""
        posts = []
        after = None

        for page in range(max_pages):
            url = f"https://old.reddit.com/r/{subreddit}/.json?limit=100"
            if after:
                url += f"&after={after}"

            raw = self._fetch_page(url)
            if not raw:
                break

            try:
                data = json.loads(raw)
                children = data["data"]["children"]
                after = data["data"]["after"]

                for child in children:
                    post_data = child["data"]
                    post = {
                        "id": post_data["id"],
                        "title": post_data["title"],
                        "author": post_data.get("author", "[deleted]"),
                        "score": post_data["score"],
                        "upvote_ratio": post_data.get("upvote_ratio"),
                        "num_comments": post_data["num_comments"],
                        "created_utc": post_data["created_utc"],
                        "selftext": post_data.get("selftext", "")[:2000],
                        "url": post_data["url"],
                        "permalink": f"https://reddit.com{post_data['permalink']}",
                        "subreddit": subreddit,
                    }
                    posts.append(post)

                print(f"Page {page + 1}: {len(children)} posts (total: {len(posts)})")

                if not after:
                    break

                time.sleep(random.uniform(2, 5))

            except (json.JSONDecodeError, KeyError) as e:
                print(f"Parse error on page {page + 1}: {e}")
                break

        return posts

    def scrape_post_comments(self, permalink, max_depth=5):
        """Scrape all comments from a post using the JSON endpoint."""
        url = f"https://old.reddit.com{permalink}.json?limit=500&depth={max_depth}"
        raw = self._fetch_page(url)

        if not raw:
            return []

        try:
            data = json.loads(raw)
            comments_data = data[1]["data"]["children"]
            return self._parse_comment_tree(comments_data)
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            print(f"Comment parse error: {e}")
            return []

    def _parse_comment_tree(self, children, depth=0):
        """Recursively parse nested comment trees."""
        comments = []

        for child in children:
            if child["kind"] != "t1":
                continue

            comment_data = child["data"]
            comment = {
                "id": comment_data["id"],
                "author": comment_data.get("author", "[deleted]"),
                "body": comment_data.get("body", ""),
                "score": comment_data.get("score", 0),
                "created_utc": comment_data.get("created_utc"),
                "depth": depth,
                "parent_id": comment_data.get("parent_id"),
            }
            comments.append(comment)

            # Recursively parse replies
            replies = comment_data.get("replies")
            if isinstance(replies, dict):
                reply_children = replies.get("data", {}).get("children", [])
                comments.extend(
                    self._parse_comment_tree(reply_children, depth + 1)
                )

        return comments
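
Used on its own, the direct scraper follows the same pattern; the proxy URLs here are placeholders:

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]
scraper = RedditDirectScraper(proxies)

# Roughly 500 posts (5 pages of 100), then the comment tree of the first post
posts = scraper.scrape_subreddit("python", max_pages=5)
if posts:
    permalink = posts[0]["permalink"].replace("https://reddit.com", "")
    comments = scraper.scrape_post_comments(permalink, max_depth=3)
    print(f"{len(posts)} posts, {len(comments)} comments on the first post")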

Scaling Reddit Data Collection

For large-scale projects that need data from hundreds of subreddits, combine both methods with intelligent scheduling:

class RedditScaleCollector:
    """Coordinates large-scale Reddit data collection across methods."""

    def __init__(self, api_scraper, direct_scraper):
        self.api_scraper = api_scraper
        self.direct_scraper = direct_scraper

    def collect_subreddit_data(self, subreddit, target_posts=5000):
        """Collect posts using API first, then switch to direct scraping."""
        # Phase 1: API collection (up to 1000 posts)
        api_posts = self.api_scraper.scrape_subreddit_posts(
            subreddit, sort="new", limit=min(1000, target_posts)
        )
        print(f"API phase: {len(api_posts)} posts from r/{subreddit}")

        if len(api_posts) >= target_posts:
            return api_posts

        # Phase 2: Direct scraping for additional posts
        remaining = target_posts - len(api_posts)
        pages_needed = remaining // 100 + 1

        direct_posts = self.direct_scraper.scrape_subreddit(
            subreddit, max_pages=pages_needed
        )
        print(f"Direct phase: {len(direct_posts)} posts from r/{subreddit}")

        # Deduplicate by post ID
        seen_ids = {p["id"] for p in api_posts}
        unique_direct = [p for p in direct_posts if p["id"] not in seen_ids]

        combined = api_posts + unique_direct
        print(f"Total unique posts: {len(combined)}")
        return combined

    def collect_with_comments(self, subreddit, post_limit=100, comment_depth=3):
        """Collect posts and their full comment trees."""
        posts = self.api_scraper.scrape_subreddit_posts(
            subreddit, sort="hot", limit=post_limit
        )

        for i, post in enumerate(posts):
            try:
                comments = self.direct_scraper.scrape_post_comments(
                    f"/r/{subreddit}/comments/{post['id']}/",
                    max_depth=comment_depth,
                )
                post["comments"] = comments
                print(f"Post {i + 1}/{len(posts)}: {len(comments)} comments")
                time.sleep(random.uniform(1, 3))
            except Exception as e:
                print(f"Error fetching comments for {post['id']}: {e}")
                post["comments"] = []

        return posts

Running the Complete Pipeline

def main():
    # API setup
    api_scraper = RedditAPIScraper(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="DataCollector/1.0",
    )

    # Proxy setup for direct scraping
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    ]
    direct_scraper = RedditDirectScraper(proxies)

    # Collect data from target subreddits
    collector = RedditScaleCollector(api_scraper, direct_scraper)

    target_subreddits = [
        "webdev", "datascience", "machinelearning",
        "python", "programming",
    ]

    all_data = {}
    for sub in target_subreddits:
        data = collector.collect_with_comments(sub, post_limit=50, comment_depth=3)
        all_data[sub] = data
        print(f"\nr/{sub}: {len(data)} posts with comments\n")

    # Export per subreddit
    for sub, posts in all_data.items():
        df = pd.DataFrame(posts)
        df.to_csv(f"reddit_{sub}_posts.csv", index=False)

        # Also export comments separately
        all_comments = []
        for post in posts:
            for comment in post.get("comments", []):
                comment["post_id"] = post["id"]
                comment["post_title"] = post["title"]
                all_comments.append(comment)

        if all_comments:
            pd.DataFrame(all_comments).to_csv(
                f"reddit_{sub}_comments.csv", index=False
            )

    print(f"\nTotal subreddits processed: {len(all_data)}")


if __name__ == "__main__":
    main()

Rate Limit Management

Reddit’s rate limiting operates at multiple levels:

  • API level: 60 requests per minute per OAuth token
  • IP level: Approximately 30 requests per minute for unauthenticated JSON endpoints
  • Account level: Temporary suspensions for excessive automated activity

When using proxies for direct scraping, distribute requests across your proxy pool to stay under the per-IP limit. With a pool of 10 mobile proxies, you can effectively make 300 requests per minute without triggering any single IP’s rate limit.
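
One way to enforce that budget is to track when each proxy was last used and wait before reusing it. The sketch below assumes a two-second minimum spacing per proxy, derived from the roughly 30 requests per minute figure above:

import time


class ProxyPacer:
    """Rotates proxies while keeping each one under ~30 requests per minute."""

    def __init__(self, proxies, min_interval=2.0):
        self.proxies = proxies
        self.min_interval = min_interval  # seconds between reuses of the same proxy
        self.last_used = {p: 0.0 for p in proxies}
        self.index = 0

    def acquire(self):
        """Return the next proxy, sleeping if it was used too recently."""
        proxy = self.proxies[self.index % len(self.proxies)]
        self.index += 1
        wait = self.min_interval - (time.time() - self.last_used[proxy])
        if wait > 0:
            time.sleep(wait)
        self.last_used[proxy] = time.time()
        return proxy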

Handling Reddit-Specific Challenges

Deleted and removed content. Reddit marks deleted user content as [deleted] and moderator-removed content as [removed]. Track these separately in your dataset as they may still carry metadata value.
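
A small post-processing sketch, assuming the field names produced by the scrapers above, that labels each record so deleted and removed items can be filtered or analyzed separately:

def classify_content_status(record):
    """Label a post or comment dict as live, deleted, or removed."""
    text = record.get("body") or record.get("selftext") or ""
    if record.get("author") == "[deleted]" or text == "[deleted]":
        return "deleted"
    if text == "[removed]":
        return "removed"
    return "live"


# Example: add a status column before exporting
# df["status"] = df.apply(classify_content_status, axis=1)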

Vote score fuzzing. Reddit intentionally fuzzes vote counts to prevent manipulation. For statistical analysis, treat vote scores as approximate ordinal rankings rather than exact counts.
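
For example, converting raw scores to percentile ranks within each subreddit gives a comparison that is robust to fuzzing; the sketch below assumes one of the CSV exports produced by the pipeline:

import pandas as pd

df = pd.read_csv("reddit_python_posts.csv")

# Percentile rank of each post's score within its subreddit
df["score_percentile"] = df.groupby("subreddit")["score"].rank(pct=True)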

Nested comment depth. Reddit allows deeply nested comment threads. Set a reasonable depth limit (5-7 levels) to avoid excessive request volume on deeply threaded discussions.

Subreddit restrictions. Some subreddits are private, quarantined, or age-restricted. Handle these gracefully with appropriate error handling rather than letting them crash your scraper.
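
With PRAW, those conditions surface as prawcore exceptions, so a thin wrapper around the API scraper can skip restricted subreddits instead of crashing; a minimal sketch:

from prawcore.exceptions import Forbidden, NotFound


def safe_scrape(api_scraper, subreddit_name, limit=100):
    """Scrape a subreddit, skipping it when access is restricted."""
    try:
        return api_scraper.scrape_subreddit_posts(subreddit_name, limit=limit)
    except Forbidden:
        print(f"r/{subreddit_name} is private or quarantined, skipping")
    except NotFound:
        print(f"r/{subreddit_name} is banned or does not exist, skipping")
    return []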

Conclusion

Reddit provides some of the richest unfiltered consumer opinion data available on the internet. The combination of PRAW for structured API access and direct scraping with proxy rotation for scale gives you the flexibility to handle projects of any size.

For related scraping guides, explore our social media proxy tutorials and web scraping proxy resources. If you are new to proxy terminology, the proxy glossary explains key concepts like proxy rotation and session management.


Related Reading

Related: For account-safety-first picks, see the best proxies for Reddit in 2026.
