How to Scrape YouTube Search Results and Video Metadata

YouTube is the second-largest search engine in the world and the dominant video platform, hosting over 800 million videos with 500 hours of new content uploaded every minute. For marketers, researchers, and content strategists, YouTube data provides critical insights into audience interests, content performance, trending topics, and competitor strategies.

While YouTube offers an official Data API, its quotas are restrictive and its scope limited. Scraping YouTube directly allows you to collect data at a scale and depth that the API does not support. This guide walks through building a YouTube scraper in Python using rotating proxies for reliability and scale.

Why You Need Proxies for YouTube Scraping

Google, which owns YouTube, operates one of the most advanced anti-bot detection systems in existence. Scraping YouTube without proxies will quickly result in:

  • CAPTCHA challenges after just a few dozen requests.
  • IP-level throttling that slows responses to a crawl.
  • Temporary or permanent IP bans that block all Google services from your address.
  • Degraded or altered search results once your traffic is flagged as automated.

Mobile proxies are particularly effective for YouTube because a significant portion of YouTube’s legitimate traffic comes from mobile devices. Mobile carrier IPs blend naturally with normal usage patterns.

Setting Up Your Environment

pip install requests beautifulsoup4 lxml pandas yt-dlp

We include yt-dlp as an optional tool for extracting metadata that is difficult to parse from HTML alone.
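
Before scraping, it helps to confirm the proxy is reachable and routing traffic correctly. A minimal check (the proxy URL is a placeholder, and httpbin.org is just a convenient IP echo service):

import requests

def verify_proxy(proxy_url):
    """Confirm the proxy routes traffic by echoing back its exit IP."""
    proxies = {"http": proxy_url, "https": proxy_url}
    try:
        resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
        resp.raise_for_status()
        print("Exit IP:", resp.json()["origin"])
        return True
    except requests.exceptions.RequestException as e:
        print(f"Proxy check failed: {e}")
        return False

verify_proxy("http://user:pass@proxy.dataresearchtools.com:8080")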

Building the YouTube Scraper

Step 1: Configure Session and Proxy

import requests
from bs4 import BeautifulSoup
import json
import time
import random
import re
import pandas as pd
from datetime import datetime

class YouTubeScraper:
    """Scrape YouTube search results and video metadata."""

    SEARCH_URL = "https://www.youtube.com/results"
    VIDEO_URL = "https://www.youtube.com/watch"
    CHANNEL_URL = "https://www.youtube.com"

    def __init__(self, proxy_url):
        self.session = requests.Session()
        self.session.proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
        })

    def _fetch_page(self, url, params=None, max_retries=3):
        """Fetch a page with retry logic."""
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, params=params, timeout=25)

                if response.status_code == 200:
                    if "captcha" in response.text.lower() or "unusual traffic" in response.text.lower():
                        print(f"CAPTCHA detected, attempt {attempt + 1}")
                        time.sleep(random.uniform(15, 30))
                        continue
                    return response.text
                elif response.status_code == 429:
                    print(f"Rate limited, waiting...")
                    time.sleep(random.uniform(30, 60))
                else:
                    print(f"Status {response.status_code}, attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 10))

            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                time.sleep(random.uniform(5, 10))

        return None

    def _extract_initial_data(self, html):
        """Extract the ytInitialData JSON from YouTube's HTML."""
        patterns = [
            r'var ytInitialData = ({.*?});</script>',
            r'window\["ytInitialData"\] = ({.*?});</script>',
            r'ytInitialData\s*=\s*({.*?});\s*</script>',
        ]

        for pattern in patterns:
            match = re.search(pattern, html, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group(1))
                except json.JSONDecodeError:
                    continue

        return None
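
To see what this extraction does in isolation, here is the same regex approach run against a toy snippet (the HTML below is a stand-in, not real YouTube markup):

import json
import re

sample_html = '<script>var ytInitialData = {"contents": {"items": [1, 2, 3]}};</script>'

match = re.search(r'var ytInitialData = ({.*?});</script>', sample_html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data["contents"]["items"])  # [1, 2, 3]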

Step 2: Scrape Search Results

    def search_videos(self, query, max_results=50):
        """Search YouTube and extract video results."""
        params = {"search_query": query}
        html = self._fetch_page(self.SEARCH_URL, params=params)
        if not html:
            return []

        data = self._extract_initial_data(html)
        if not data:
            print("Failed to extract initial data from search page")
            return []

        videos = []

        # Navigate the nested JSON structure
        try:
            contents = (
                data["contents"]["twoColumnSearchResultsRenderer"]
                ["primaryContents"]["sectionListRenderer"]["contents"]
            )

            for section in contents:
                items = (
                    section.get("itemSectionRenderer", {})
                    .get("contents", [])
                )

                for item in items:
                    renderer = item.get("videoRenderer")
                    if not renderer:
                        continue

                    video = self._parse_video_renderer(renderer)
                    if video:
                        videos.append(video)

                    if len(videos) >= max_results:
                        break

                # Stop the outer loop as well once we have enough results
                if len(videos) >= max_results:
                    break

        except (KeyError, TypeError) as e:
            print(f"Error parsing search results: {e}")

        # Note: a single results page carries roughly 20 videos; collecting
        # more requires continuation requests, which this guide does not cover.
        return videos[:max_results]

    def _parse_video_renderer(self, renderer):
        """Parse a videoRenderer object into structured data."""
        video = {}

        # Video ID -- bail out early if it is missing
        video_id = renderer.get("videoId")
        if not video_id:
            return None
        video["video_id"] = video_id
        video["url"] = f"https://www.youtube.com/watch?v={video_id}"

        # Title
        title_runs = renderer.get("title", {}).get("runs", [])
        video["title"] = "".join(run.get("text", "") for run in title_runs)

        # Channel
        channel_runs = (
            renderer.get("ownerText", {}).get("runs", [])
        )
        if channel_runs:
            video["channel_name"] = channel_runs[0].get("text")
            nav = channel_runs[0].get("navigationEndpoint", {})
            channel_url = nav.get("browseEndpoint", {}).get("canonicalBaseUrl")
            video["channel_url"] = f"https://www.youtube.com{channel_url}" if channel_url else None

        # View count
        view_text = renderer.get("viewCountText", {}).get("simpleText", "")
        video["views_text"] = view_text
        view_match = re.search(r"\d+", view_text.replace(",", ""))
        video["views"] = int(view_match.group()) if view_match else None

        # Published date
        published = renderer.get("publishedTimeText", {}).get("simpleText", "")
        video["published_text"] = published

        # Duration
        length_text = (
            renderer.get("lengthText", {}).get("simpleText", "")
        )
        video["duration"] = length_text

        # Thumbnail
        thumbnails = renderer.get("thumbnail", {}).get("thumbnails", [])
        video["thumbnail"] = thumbnails[-1].get("url") if thumbnails else None

        # Description snippet
        desc_snippets = renderer.get("detailedMetadataSnippets", [])
        if desc_snippets:
            snippet_runs = desc_snippets[0].get("snippetText", {}).get("runs", [])
            video["description_snippet"] = "".join(
                run.get("text", "") for run in snippet_runs
            )

        return video
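
The renderers above usually return full counts like "1,234,567 views", but some surfaces abbreviate them ("1.2M views"). A hypothetical helper (parse_count is our own addition, not a YouTube field) that normalizes both forms:

import re

def parse_count(text):
    """Convert strings like '1,234,567 views' or '1.2M views' to an int."""
    if not text:
        return None
    match = re.search(r"([\d.]+)\s*([KMB]?)", text.replace(",", ""), re.IGNORECASE)
    if not match:
        return None
    number = float(match.group(1))
    multiplier = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}.get(
        match.group(2).upper(), 1
    )
    return int(number * multiplier)

print(parse_count("1,234,567 views"))  # 1234567
print(parse_count("1.2M views"))       # 1200000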

Step 3: Extract Detailed Video Metadata

    def get_video_details(self, video_id):
        """Fetch detailed metadata for a single video."""
        params = {"v": video_id}
        html = self._fetch_page(self.VIDEO_URL, params=params)
        if not html:
            return None

        data = self._extract_initial_data(html)
        if not data:
            return None

        details = {"video_id": video_id}

        try:
            # Extract from videoPrimaryInfoRenderer and videoSecondaryInfoRenderer
            results = (
                data["contents"]["twoColumnWatchNextResults"]["results"]
                ["results"]["contents"]
            )

            for content in results:
                # Primary info (title, views, date, likes)
                primary = content.get("videoPrimaryInfoRenderer")
                if primary:
                    # Title
                    title_runs = primary.get("title", {}).get("runs", [])
                    details["title"] = "".join(r.get("text", "") for r in title_runs)

                    # View count
                    view_count = primary.get("viewCount", {}).get(
                        "videoViewCountRenderer", {}
                    )
                    view_text = view_count.get("viewCount", {}).get("simpleText", "")
                    details["views_text"] = view_text
                    view_match = re.search(r"\d+", view_text.replace(",", ""))
                    details["views"] = int(view_match.group()) if view_match else None

                    # Date
                    date_text = primary.get("dateText", {}).get("simpleText", "")
                    details["date_text"] = date_text

                    # Likes (from toggle buttons)
                    buttons = primary.get("videoActions", {}).get(
                        "menuRenderer", {}
                    ).get("topLevelButtons", [])
                    for btn in buttons:
                        toggle = btn.get("segmentedLikeDislikeButtonViewModel", {})
                        like_btn = toggle.get("likeButtonViewModel", {}).get(
                            "likeButtonViewModel", {}
                        )
                        like_count = like_btn.get("toggleButtonViewModel", {}).get(
                            "toggleButtonViewModel", {}
                        ).get("defaultButtonViewModel", {}).get(
                            "buttonViewModel", {}
                        ).get("title", "")
                        if like_count:
                            details["likes_text"] = like_count

                # Secondary info (channel, description, subscribers)
                secondary = content.get("videoSecondaryInfoRenderer")
                if secondary:
                    # Channel info
                    channel_data = secondary.get("owner", {}).get(
                        "videoOwnerRenderer", {}
                    )
                    title_data = channel_data.get("title", {}).get("runs", [])
                    if title_data:
                        details["channel_name"] = title_data[0].get("text")

                    sub_text = channel_data.get("subscriberCountText", {}).get(
                        "simpleText", ""
                    )
                    details["subscribers_text"] = sub_text

                    # Description
                    desc = secondary.get("attributedDescription", {}).get("content", "")
                    details["description"] = desc

            # Extract tags from the HTML meta tags
            soup = BeautifulSoup(html, "lxml")
            meta_keywords = soup.find("meta", {"name": "keywords"})
            if meta_keywords:
                details["tags"] = meta_keywords.get("content", "").split(",")

            # Category
            meta_genre = soup.find("meta", {"itemprop": "genre"})
            if meta_genre:
                details["category"] = meta_genre.get("content")

        except (KeyError, TypeError) as e:
            print(f"Error parsing video details: {e}")

        return details
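
A quick sanity check you might run against a single known video (the ID below is just an example):

scraper = YouTubeScraper("http://user:pass@proxy.dataresearchtools.com:8080")
details = scraper.get_video_details("dQw4w9WgXcQ")
if details:
    print(details.get("title"), "-", details.get("views_text"))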

Step 4: Scrape Channel Information

    def get_channel_info(self, channel_handle):
        """Fetch channel information and recent videos."""
        url = f"{self.CHANNEL_URL}/{channel_handle}"
        html = self._fetch_page(url)
        if not html:
            return None

        data = self._extract_initial_data(html)
        if not data:
            return None

        channel = {"handle": channel_handle}

        try:
            # Channel metadata
            metadata = data.get("metadata", {}).get("channelMetadataRenderer", {})
            channel["title"] = metadata.get("title")
            channel["description"] = metadata.get("description")
            channel["keywords"] = metadata.get("keywords")
            channel["external_url"] = metadata.get("vanityChannelUrl")

            # Channel header for subscriber count and avatar
            header = data.get("header", {}).get("c4TabbedHeaderRenderer", {})
            sub_text = header.get("subscriberCountText", {}).get("simpleText", "")
            channel["subscribers_text"] = sub_text

            avatar_thumbs = header.get("avatar", {}).get("thumbnails", [])
            channel["avatar_url"] = avatar_thumbs[-1].get("url") if avatar_thumbs else None

            # Banner
            banner_thumbs = header.get("banner", {}).get("thumbnails", [])
            channel["banner_url"] = banner_thumbs[-1].get("url") if banner_thumbs else None

            # Recent videos from the Videos tab
            tabs = data.get("contents", {}).get(
                "twoColumnBrowseResultsRenderer", {}
            ).get("tabs", [])

            for tab in tabs:
                tab_renderer = tab.get("tabRenderer", {})
                if tab_renderer.get("title") == "Videos":
                    grid_items = (
                        tab_renderer.get("content", {})
                        .get("richGridRenderer", {})
                        .get("contents", [])
                    )
                    channel["recent_videos"] = []
                    for item in grid_items[:10]:
                        vid_renderer = (
                            item.get("richItemRenderer", {})
                            .get("content", {})
                            .get("videoRenderer")
                        )
                        if vid_renderer:
                            video = self._parse_video_renderer(vid_renderer)
                            if video:
                                channel["recent_videos"].append(video)

        except (KeyError, TypeError) as e:
            print(f"Error parsing channel info: {e}")

        return channel
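
YouTube periodically renames these renderer keys (some channel pages now serve a different header renderer), so treat the paths above as a snapshot rather than a stable contract. A quick standalone check, with an example handle:

scraper = YouTubeScraper("http://user:pass@proxy.dataresearchtools.com:8080")
channel = scraper.get_channel_info("@Fireship")
if channel:
    print(channel.get("title"), "-", channel.get("subscribers_text"))
    for video in channel.get("recent_videos", []):
        print(" ", video["title"])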

Step 5: Run the Complete Pipeline

def main():
    proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
    scraper = YouTubeScraper(proxy_url)

    # Search for videos on multiple topics
    search_queries = [
        "python web scraping tutorial 2026",
        "best proxies for web scraping",
        "data science project ideas",
    ]

    all_search_results = {}
    for query in search_queries:
        print(f"\nSearching: '{query}'")
        results = scraper.search_videos(query, max_results=20)
        all_search_results[query] = results
        print(f"Found {len(results)} videos")
        time.sleep(random.uniform(5, 10))

    # Get detailed info for top videos
    detailed_videos = []
    for query, results in all_search_results.items():
        for video in results[:5]:  # Top 5 per query
            print(f"Fetching details: {video['title'][:50]}...")
            details = scraper.get_video_details(video["video_id"])
            if details:
                details["search_query"] = query
                detailed_videos.append(details)
            time.sleep(random.uniform(3, 7))

    # Scrape channel info
    channels_to_scrape = ["@TechWithTim", "@Fireship", "@NetworkChuck"]
    channel_data = []
    for handle in channels_to_scrape:
        print(f"\nScraping channel: {handle}")
        info = scraper.get_channel_info(handle)
        if info:
            channel_data.append(info)
            print(f"  Name: {info.get('title')}, Subs: {info.get('subscribers_text')}")
        time.sleep(random.uniform(5, 10))

    # Save all data
    output = {
        "search_results": all_search_results,
        "detailed_videos": detailed_videos,
        "channels": channel_data,
        "scraped_at": datetime.now().isoformat(),
    }

    with open("youtube_data.json", "w", encoding="utf-8") as f:
        json.dump(output, f, indent=2, ensure_ascii=False)

    # Flatten search results into a DataFrame and save to CSV
    all_videos = []
    for results in all_search_results.values():
        all_videos.extend(results)

    df = pd.DataFrame(all_videos)
    df.to_csv("youtube_search_results.csv", index=False)

    print(f"\nTotal videos scraped: {len(all_videos)}")
    print(f"Detailed videos: {len(detailed_videos)}")
    print(f"Channels analyzed: {len(channel_data)}")


if __name__ == "__main__":
    main()
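
With the CSV written, a few lines of pandas give a first look at the data. A sketch of a simple follow-up analysis, assuming the columns produced above:

import pandas as pd

df = pd.read_csv("youtube_search_results.csv")

# Top 10 results by parsed view count (rows without a count are dropped)
top = df.dropna(subset=["views"]).sort_values("views", ascending=False)
print(top[["title", "channel_name", "views"]].head(10))

# How many results each channel placed across all queries
print(df["channel_name"].value_counts().head(10))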

Using yt-dlp for Metadata Extraction

For scenarios where HTML parsing is unreliable, yt-dlp provides a robust alternative for extracting video metadata:

import subprocess
import json

def get_metadata_via_ytdlp(video_url, proxy_url=None):
    """Extract video metadata using yt-dlp."""
    cmd = [
        "yt-dlp",
        "--dump-json",
        "--no-download",
        "--no-warnings",
        video_url,
    ]

    if proxy_url:
        cmd.extend(["--proxy", proxy_url])

    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            data = json.loads(result.stdout)
            return {
                "title": data.get("title"),
                "views": data.get("view_count"),
                "likes": data.get("like_count"),
                "duration": data.get("duration"),
                "upload_date": data.get("upload_date"),
                "channel": data.get("channel"),
                "channel_id": data.get("channel_id"),
                "description": data.get("description"),
                "tags": data.get("tags"),
                "categories": data.get("categories"),
                "comment_count": data.get("comment_count"),
                "subscriber_count": data.get("channel_follower_count"),
            }
    except (subprocess.TimeoutExpired, json.JSONDecodeError, FileNotFoundError) as e:
        print(f"yt-dlp error: {e}")

    return None
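
If you would rather stay in-process, yt-dlp also exposes a Python API that returns the same metadata dictionary. A minimal sketch:

import yt_dlp

def get_metadata_inprocess(video_url, proxy_url=None):
    """Extract metadata with yt-dlp's Python API instead of the CLI."""
    opts = {"quiet": True, "no_warnings": True, "skip_download": True}
    if proxy_url:
        opts["proxy"] = proxy_url
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(video_url, download=False)
    return {
        "title": info.get("title"),
        "views": info.get("view_count"),
        "channel": info.get("channel"),
    }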

Anti-Detection Strategies for YouTube

Request Pacing

Google monitors request patterns with extreme precision. For YouTube scraping, pace requests as follows (a minimal helper implementing this cadence is sketched after the list):

  • Space search queries 5-10 seconds apart.
  • Wait 3-7 seconds between video detail pages.
  • Add longer pauses (15-30 seconds) after every 20-30 requests.
  • Randomize every delay; fixed, metronomic intervals are themselves a bot signal.
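
A minimal pacer implementing this cadence (the thresholds mirror the guidelines above and can be tuned):

import random
import time

class RequestPacer:
    """Sleep between requests, with a longer pause every `burst` requests."""

    def __init__(self, short=(3, 7), long=(15, 30), burst=25):
        self.short = short
        self.long = long
        self.burst = burst
        self.count = 0

    def wait(self):
        self.count += 1
        if self.count % self.burst == 0:
            time.sleep(random.uniform(*self.long))
        else:
            time.sleep(random.uniform(*self.short))

pacer = RequestPacer()
# Call pacer.wait() before each request in the scraping loop.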

Header and Cookie Management

YouTube sets consent cookies and regional preferences. Always handle these properly:

def initialize_youtube_session(scraper):
    """Set up cookies and consent for YouTube access."""
    # Accept cookie consent
    scraper.session.cookies.set("CONSENT", "YES+cb", domain=".youtube.com")
    # Set preferred language
    scraper.session.cookies.set("PREF", "hl=en&gl=US", domain=".youtube.com")
    # Warm up session
    scraper._fetch_page("https://www.youtube.com/")
    time.sleep(random.uniform(2, 4))
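
Call this once right after constructing the scraper and before the first search:

proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
scraper = YouTubeScraper(proxy_url)
initialize_youtube_session(scraper)
results = scraper.search_videos("python web scraping tutorial")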

Proxy Selection

For YouTube specifically, mobile proxies outperform residential proxies because YouTube’s mobile traffic volume is enormous. Google expects and accommodates mobile carrier IP traffic patterns.
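
If your provider exposes several mobile endpoints rather than a single rotating gateway, a small pool wrapper spreads traffic across them. A hypothetical sketch (the endpoint URLs are placeholders):

import random

PROXY_POOL = [
    "http://user:pass@mobile-1.dataresearchtools.com:8080",
    "http://user:pass@mobile-2.dataresearchtools.com:8080",
    "http://user:pass@mobile-3.dataresearchtools.com:8080",
]

def new_scraper():
    """Build a YouTubeScraper on a randomly chosen proxy from the pool."""
    return YouTubeScraper(random.choice(PROXY_POOL))

scraper = new_scraper()
# Rebuild with new_scraper() if a session starts hitting CAPTCHAs.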

Applications for YouTube Data

YouTube data collection powers numerous business applications:

  • Content strategy: Analyze what video formats, lengths, and topics perform best in your niche.
  • Competitor monitoring: Track competitor upload frequency, view growth, and engagement trends.
  • Ad intelligence: Monitor which ads appear on specific keywords or channels.
  • SEO research: YouTube is the second-largest search engine — understanding YouTube search rankings drives video SEO strategy.
  • Trend detection: Identify emerging topics by monitoring rapidly growing videos and channels.
  • Influencer vetting: Evaluate potential partners by analyzing their content quality, engagement rates, and audience demographics.

Conclusion

Scraping YouTube search results and video metadata opens up a wealth of data for content strategy, competitive analysis, and market research. The combination of YouTube’s embedded JSON data and supplementary tools like yt-dlp creates a robust extraction pipeline.

The foundation of successful YouTube scraping is reliable proxy infrastructure. Mobile proxies from DataResearchTools provide the trusted IP addresses that Google’s systems expect from legitimate YouTube users. For more web scraping tutorials and proxy concepts, explore our proxy glossary.

