How to Scrape Instagram Profiles and Posts Without Getting Blocked

Instagram is one of the most data-rich social platforms in existence. With over two billion monthly active users, it serves as a critical data source for influencer marketing, brand monitoring, competitive analysis, and trend research. The challenge lies in extracting this data without triggering Instagram’s formidable anti-bot defenses.

In this comprehensive guide, we walk through building an Instagram scraper in Python that leverages mobile proxies to remain undetected while extracting profiles, posts, hashtags, and engagement metrics at scale.

Understanding Instagram’s Detection Systems

Instagram, owned by Meta, has invested heavily in anti-automation technology. Their detection systems operate on multiple levels:

IP-Level Detection

Instagram maintains extensive blocklists of known datacenter IP ranges and VPN endpoints. They monitor request frequency per IP and flag addresses that exceed normal browsing patterns. This is why residential and mobile proxies are essential — they use IP addresses assigned to real internet subscribers.

Behavioral Analysis

Instagram tracks how users interact with the platform. Automated tools typically exhibit patterns that differ from human behavior: consistent timing between requests, linear navigation patterns, and accessing content types in sequences that real users would not follow.

Authentication and Session Tracking

Instagram heavily restricts what unauthenticated users can see. Even logged-in users face rate limits on how many profiles they can view or searches they can perform within a time window.

Device Fingerprinting

The Instagram app and website collect device characteristics — screen resolution, browser plugins, GPU information, and more — to create a fingerprint that persists across sessions.

Setting Up Your Environment

pip install requests beautifulsoup4 instaloader pandas pillow

We install instaloader as a convenient high-level option and supplement it with custom requests calls for data that the library does not expose directly.
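
For a quick baseline before writing custom requests code, instaloader can pull basic profile fields on its own. The sketch below uses instaloader's documented Profile API (`from_username`, `followers`, `mediacount`, `is_private`); routing it through a proxy touches the context's underlying requests session via a private attribute (`_session`), so treat that part as an assumption that may break between library versions.

```python
def fetch_profile_with_instaloader(username, proxy_url=None):
    """Sketch: pull basic profile fields via instaloader.

    Proxy routing below reaches into instaloader's internal requests
    session (a private attribute), so verify against your installed
    version before relying on it.
    """
    import instaloader  # imported lazily so the rest of the module works without it

    loader = instaloader.Instaloader(quiet=True)
    if proxy_url:
        # Assumption: InstaloaderContext keeps its requests.Session on _session
        loader.context._session.proxies = {"http": proxy_url, "https": proxy_url}

    profile = instaloader.Profile.from_username(loader.context, username)
    return {
        "username": profile.username,
        "followers": profile.followers,
        "posts": profile.mediacount,
        "is_private": profile.is_private,
    }
```

This is useful for spot checks; the custom scraper below gives finer control over pacing, headers, and retries.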

Approach 1: Using Instagram’s Web API

Instagram’s website makes GraphQL API calls that return structured JSON data. These endpoints are the most efficient way to extract data.

Configure Session with Proxy

import requests
import json
import time
import random
from datetime import datetime

class InstagramScraper:
    """Instagram scraper using web API endpoints with proxy support."""

    GRAPHQL_URL = "https://www.instagram.com/graphql/query/"
    BASE_URL = "https://www.instagram.com"

    # GraphQL query hashes (these may change — update as needed)
    USER_QUERY_HASH = "c9100bf9110dd6361671f113dd02e7d6"
    MEDIA_QUERY_HASH = "e769aa130647d2571c27c44596cb68c1"
    HASHTAG_QUERY_HASH = "174a21c41c89e3c8e0e7cc41f3e3ccab"

    def __init__(self, proxy_url, session_id=None):
        self.session = requests.Session()
        self.session.proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                "Version/17.0 Mobile/15E148 Safari/604.1"
            ),
            "Accept": "*/*",
            "Accept-Language": "en-US,en;q=0.9",
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest",
            "Referer": "https://www.instagram.com/",
        })

        if session_id:
            self.session.cookies.set("sessionid", session_id, domain=".instagram.com")

    def _request_with_retry(self, url, params=None, max_retries=3):
        """Make a request with retry logic and respectful delays."""
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, params=params, timeout=20)

                if response.status_code == 200:
                    return response.json()
                elif response.status_code == 429:
                    wait_time = random.uniform(30, 60)
                    print(f"Rate limited. Waiting {wait_time:.0f}s...")
                    time.sleep(wait_time)
                elif response.status_code == 401:
                    print("Authentication required. Session may have expired.")
                    return None
                else:
                    print(f"Status {response.status_code}, attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 15))

            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                time.sleep(random.uniform(5, 10))

        return None

Scrape User Profile Data

    def get_user_profile(self, username):
        """Extract complete profile information for a user."""
        url = f"{self.BASE_URL}/api/v1/users/web_profile_info/"
        params = {"username": username}

        data = self._request_with_retry(url, params)
        if not data:
            return None

        user = data.get("data", {}).get("user", {})
        if not user:
            return None

        return {
            "username": user.get("username"),
            "full_name": user.get("full_name"),
            "biography": user.get("biography"),
            "followers": user.get("edge_followed_by", {}).get("count"),
            "following": user.get("edge_follow", {}).get("count"),
            "posts_count": user.get("edge_owner_to_timeline_media", {}).get("count"),
            "is_verified": user.get("is_verified"),
            "is_private": user.get("is_private"),
            "is_business": user.get("is_business_account"),
            "business_category": user.get("business_category_name"),
            "profile_pic_url": user.get("profile_pic_url_hd"),
            "external_url": user.get("external_url"),
            "user_id": user.get("id"),
        }

Scrape User Posts

    def get_user_posts(self, user_id, num_posts=50):
        """Fetch posts from a user's timeline."""
        posts = []
        end_cursor = None
        has_next = True

        while has_next and len(posts) < num_posts:
            variables = {
                "id": user_id,
                "first": min(12, num_posts - len(posts)),
            }
            if end_cursor:
                variables["after"] = end_cursor

            params = {
                "query_hash": self.MEDIA_QUERY_HASH,
                "variables": json.dumps(variables),
            }

            data = self._request_with_retry(self.GRAPHQL_URL, params)
            if not data:
                break

            media = (
                data.get("data", {})
                .get("user", {})
                .get("edge_owner_to_timeline_media", {})
            )

            for edge in media.get("edges", []):
                node = edge.get("node", {})
                post = {
                    "id": node.get("id"),
                    "shortcode": node.get("shortcode"),
                    "url": f"https://www.instagram.com/p/{node.get('shortcode')}/",
                    "type": node.get("__typename"),
                    "timestamp": node.get("taken_at_timestamp"),
                    "date": datetime.fromtimestamp(
                        node.get("taken_at_timestamp", 0)
                    ).isoformat(),
                    "likes": node.get("edge_media_preview_like", {}).get("count"),
                    "comments": node.get("edge_media_to_comment", {}).get("count"),
                    "caption": (
                        node.get("edge_media_to_caption", {})
                        .get("edges", [{}])[0]
                        .get("node", {})
                        .get("text")
                        if node.get("edge_media_to_caption", {}).get("edges")
                        else None
                    ),
                    "is_video": node.get("is_video"),
                    "video_views": node.get("video_view_count"),
                    "display_url": node.get("display_url"),
                    "dimensions": node.get("dimensions"),
                }
                posts.append(post)

            page_info = media.get("page_info", {})
            has_next = page_info.get("has_next_page", False)
            end_cursor = page_info.get("end_cursor")

            # Respectful delay between pagination requests
            time.sleep(random.uniform(2, 5))

        return posts

Scrape Hashtag Data

    def get_hashtag_posts(self, hashtag, num_posts=50):
        """Fetch recent posts from a hashtag page."""
        posts = []
        end_cursor = None
        has_next = True

        while has_next and len(posts) < num_posts:
            variables = {
                "tag_name": hashtag,
                "first": min(12, num_posts - len(posts)),
            }
            if end_cursor:
                variables["after"] = end_cursor

            params = {
                "query_hash": self.HASHTAG_QUERY_HASH,
                "variables": json.dumps(variables),
            }

            data = self._request_with_retry(self.GRAPHQL_URL, params)
            if not data:
                break

            hashtag_data = data.get("data", {}).get("hashtag", {})
            media = hashtag_data.get("edge_hashtag_to_media", {})

            for edge in media.get("edges", []):
                node = edge.get("node", {})
                post = {
                    "shortcode": node.get("shortcode"),
                    "url": f"https://www.instagram.com/p/{node.get('shortcode')}/",
                    "likes": node.get("edge_liked_by", {}).get("count"),
                    "comments": node.get("edge_media_to_comment", {}).get("count"),
                    "timestamp": node.get("taken_at_timestamp"),
                    "is_video": node.get("is_video"),
                    "caption": (
                        node.get("edge_media_to_caption", {})
                        .get("edges", [{}])[0]
                        .get("node", {})
                        .get("text")
                        if node.get("edge_media_to_caption", {}).get("edges")
                        else None
                    ),
                    "hashtag": hashtag,
                }
                posts.append(post)

            page_info = media.get("page_info", {})
            has_next = page_info.get("has_next_page", False)
            end_cursor = page_info.get("end_cursor")

            time.sleep(random.uniform(3, 6))

        return posts

Approach 2: Engagement Rate Calculator

One of the most common applications of Instagram scraping is calculating influencer engagement rates for social media marketing campaigns.

def calculate_engagement_metrics(profile, posts):
    """Calculate engagement metrics for an Instagram account."""
    if not posts or not profile.get("followers"):
        return None

    # Scraped counts can be None (key present, value null), so "or 0" guards the sums
    total_likes = sum(p.get("likes") or 0 for p in posts)
    total_comments = sum(p.get("comments") or 0 for p in posts)
    total_video_views = sum(
        p.get("video_views") or 0 for p in posts if p.get("is_video")
    )

    num_posts = len(posts)
    followers = profile["followers"]

    metrics = {
        "username": profile["username"],
        "followers": followers,
        "posts_analyzed": num_posts,
        "avg_likes": round(total_likes / num_posts, 1),
        "avg_comments": round(total_comments / num_posts, 1),
        "engagement_rate": round(
            ((total_likes + total_comments) / num_posts / followers) * 100, 3
        ),
        "like_rate": round((total_likes / num_posts / followers) * 100, 3),
        "comment_rate": round((total_comments / num_posts / followers) * 100, 3),
        "video_posts": sum(1 for p in posts if p.get("is_video")),
        "image_posts": sum(1 for p in posts if not p.get("is_video")),
    }

    if total_video_views > 0:
        video_posts = [p for p in posts if p.get("is_video")]
        metrics["avg_video_views"] = round(total_video_views / len(video_posts), 1)
        metrics["video_view_rate"] = round(
            (total_video_views / len(video_posts) / followers) * 100, 3
        )

    return metrics
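
To sanity-check the engagement-rate formula, here is the same arithmetic on made-up numbers (two posts, 10,000 followers):

```python
# Made-up sample: two posts for an account with 10,000 followers
posts = [
    {"likes": 500, "comments": 50},
    {"likes": 300, "comments": 30},
]
followers = 10_000

# Average interactions per post, expressed as a percentage of followers
avg_interactions = sum(p["likes"] + p["comments"] for p in posts) / len(posts)
engagement_rate = round(avg_interactions / followers * 100, 3)
print(engagement_rate)  # 4.4
```

An engagement rate above roughly 3% on a 10k-follower account would generally be considered healthy, which is the kind of comparison this metric enables.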

Running the Complete Pipeline

def main():
    proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
    scraper = InstagramScraper(proxy_url, session_id="your_session_id")

    # Scrape multiple profiles
    usernames = ["natgeo", "nike", "nasa"]
    all_data = []

    for username in usernames:
        print(f"\nScraping @{username}...")

        # Get profile
        profile = scraper.get_user_profile(username)
        if not profile:
            print(f"Failed to scrape @{username}")
            continue

        print(f"  Followers: {profile['followers']:,}")

        # Get recent posts
        if not profile["is_private"]:
            posts = scraper.get_user_posts(profile["user_id"], num_posts=30)
            print(f"  Posts scraped: {len(posts)}")

            # Calculate engagement
            metrics = calculate_engagement_metrics(profile, posts)
            if metrics:
                print(f"  Engagement rate: {metrics['engagement_rate']}%")

            all_data.append({
                "profile": profile,
                "posts": posts,
                "metrics": metrics,
            })
        else:
            print("  Account is private, skipping posts")

        time.sleep(random.uniform(10, 20))  # Long delay between accounts

    # Save results
    with open("instagram_data.json", "w", encoding="utf-8") as f:
        json.dump(all_data, f, indent=2, ensure_ascii=False, default=str)

    print(f"\nSaved data for {len(all_data)} accounts")


if __name__ == "__main__":
    main()

The Mobile Proxy Advantage for Instagram

Instagram’s detection systems are particularly attuned to the type of IP address making requests. Here is how different proxy types compare:

| Proxy Type  | Success Rate | Detection Risk | Cost   | Best For             |
|-------------|--------------|----------------|--------|----------------------|
| Datacenter  | <10%         | Very High      | Low    | Not recommended      |
| Residential | 60-80%       | Medium         | Medium | Moderate volume      |
| Mobile      | 90-95%       | Very Low       | Higher | High-volume scraping |

Mobile proxies achieve the highest success rates because Instagram’s mobile app generates the majority of platform traffic. When your requests come from a mobile carrier IP, they blend seamlessly with legitimate app usage.
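
Many mobile proxy providers also support sticky sessions, so consecutive requests reuse the same carrier IP. The `-session-` username suffix below is a hypothetical convention for illustration only; every provider has its own syntax, so check your provider's documentation.

```python
def build_proxy_url(user, password, host, port, session_id=None):
    """Build a proxy URL, optionally pinning a sticky session.

    The "-session-" username suffix is a made-up convention here;
    real providers each define their own parameter format.
    """
    username = f"{user}-session-{session_id}" if session_id else user
    return f"http://{username}:{password}@{host}:{port}"


print(build_proxy_url("user", "pass", "proxy.example.com", 8080, "a1b2"))
# http://user-session-a1b2:pass@proxy.example.com:8080
```

Sticky sessions matter for Instagram in particular: a logged-in session whose IP changes mid-run is a strong automation signal.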

Anti-Detection Best Practices

Request Spacing

Instagram monitors request frequency aggressively. Follow these guidelines:

  • Between profile fetches: 10-20 seconds minimum
  • Between post pagination: 3-6 seconds
  • Between different data types: 15-30 seconds
  • Daily limits: Stay under 200 profiles per account per day
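
These guidelines can be centralized in a small pacing helper so individual scraping functions don't each reinvent their delays. This is an illustrative sketch: the delay ranges mirror the list above, and the sleep function is injectable so the logic can be tested without actually waiting.

```python
import random
import time


class RequestPacer:
    """Enforce per-action delays and a daily profile cap (illustrative sketch)."""

    # (min, max) delay in seconds per action type, from the guidelines above
    DELAYS = {
        "profile": (10, 20),   # between profile fetches
        "pagination": (3, 6),  # between post pagination requests
        "switch": (15, 30),    # between different data types
    }
    DAILY_PROFILE_LIMIT = 200

    def __init__(self, sleep=time.sleep):
        self._sleep = sleep  # injectable so tests can skip real waiting
        self.profiles_today = 0

    def wait(self, action):
        """Sleep a randomized interval for the given action; returns the delay used."""
        low, high = self.DELAYS[action]
        delay = random.uniform(low, high)
        self._sleep(delay)
        return delay

    def record_profile(self):
        """Count a profile fetch; returns False once the daily cap is reached."""
        if self.profiles_today >= self.DAILY_PROFILE_LIMIT:
            return False
        self.profiles_today += 1
        return True
```

In the pipeline above, you would call `pacer.wait("profile")` in place of the bare `time.sleep(random.uniform(10, 20))` and stop the run when `record_profile()` returns False.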

Session Management

Rotate your sessions strategically:

import random

class SessionManager:
    """Manage multiple Instagram sessions for rotation."""

    def __init__(self, session_ids, proxy_url):
        self.scrapers = [
            InstagramScraper(proxy_url, sid) for sid in session_ids
        ]
        self.current_index = 0

    def get_scraper(self):
        """Get the next scraper in rotation."""
        scraper = self.scrapers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.scrapers)
        return scraper

Cookie Management

Instagram tracks session cookies meticulously. Clear and regenerate cookies periodically, and ensure each session uses cookies consistent with the proxy location.
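
A minimal sketch of that reset, assuming you keep the sessionid value separately (as the scraper class above does), so the login cookie can be reapplied after everything else is dropped:

```python
import requests


def refresh_session_cookies(session, session_id=None):
    """Drop all accumulated cookies, then reapply only the login cookie."""
    session.cookies.clear()
    if session_id:
        session.cookies.set("sessionid", session_id, domain=".instagram.com")
    return session
```

Call this between scraping batches, or whenever you rotate to a proxy in a different region, so tracking cookies set during one batch do not carry over to the next.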

Data Storage and Analysis

For ongoing Instagram monitoring, structure your data pipeline for efficiency:

import pandas as pd

def analyze_collected_data(data_file):
    """Analyze scraped Instagram data for insights."""
    with open(data_file, "r") as f:
        data = json.load(f)

    # Build a DataFrame of engagement metrics
    metrics_list = [entry["metrics"] for entry in data if entry.get("metrics")]
    df = pd.DataFrame(metrics_list)

    print("Engagement Summary:")
    print(f"  Average engagement rate: {df['engagement_rate'].mean():.3f}%")
    print(f"  Highest engagement: @{df.loc[df['engagement_rate'].idxmax(), 'username']}")
    print(f"  Average likes per post: {df['avg_likes'].mean():,.0f}")
    print(f"  Average comments per post: {df['avg_comments'].mean():,.0f}")

    return df

Ethical Considerations

Instagram scraping raises important ethical questions. While the platform’s data is publicly visible, automated collection at scale requires careful consideration:

  • Privacy: Even public profiles belong to real people. Handle personal data responsibly.
  • Meta’s Terms: Instagram’s Terms of Use prohibit scraping. Meta has pursued legal action against scrapers in the past.
  • GDPR compliance: Processing European user data requires a lawful basis.
  • Competitive fairness: Using scraped data for manipulation (fake engagement, impersonation) crosses ethical lines.
  • Transparency: If you publish analysis based on scraped data, be transparent about your methodology.

Conclusion

Scraping Instagram profiles and posts is a powerful capability for marketing research, competitive analysis, and influencer evaluation. The key to doing it successfully — and sustainably — lies in combining intelligent code with quality proxy infrastructure.

Mobile proxies from DataResearchTools provide the carrier-grade IP reputation that Instagram’s detection systems trust. Pair them with the session management and rate-limiting practices outlined in this guide and you can build a reliable Instagram data collection pipeline.

For more social media scraping strategies, explore our other tutorials. Our proxy glossary provides definitions for all technical terms used in this guide.

