How to Scrape Instagram Profiles and Posts Without Getting Blocked
Instagram is one of the most data-rich social platforms in existence. With over two billion monthly active users, it serves as a critical data source for influencer marketing, brand monitoring, competitive analysis, and trend research. The challenge lies in extracting this data without triggering Instagram’s formidable anti-bot defenses.
In this comprehensive guide, we walk through building an Instagram scraper in Python that leverages mobile proxies to remain undetected while extracting profiles, posts, hashtags, and engagement metrics at scale.
Understanding Instagram’s Detection Systems
Instagram, owned by Meta, has invested heavily in anti-automation technology. Their detection systems operate on multiple levels:
IP-Level Detection
Instagram maintains extensive blocklists of known datacenter IP ranges and VPN endpoints. They monitor request frequency per IP and flag addresses that exceed normal browsing patterns. This is why residential and mobile proxies are essential — they use IP addresses assigned to real internet subscribers.
Behavioral Analysis
Instagram tracks how users interact with the platform. Automated tools typically exhibit patterns that differ from human behavior: consistent timing between requests, linear navigation patterns, and accessing content types in sequences that real users would not follow.
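One simple countermeasure is to avoid perfectly regular timing: sleep for a randomized interval rather than a fixed one. A minimal sketch (the base/jitter values and occasional long-pause probability are illustrative choices, not Instagram-specific thresholds):

```python
import random


def human_delay(base: float = 4.0, jitter: float = 3.0) -> float:
    """Return a randomized, human-like pause length in seconds."""
    delay = base + random.uniform(0, jitter)
    # Occasionally pause much longer, the way a person gets distracted
    if random.random() < 0.1:
        delay += random.uniform(10, 30)
    return delay
```

Call `time.sleep(human_delay())` between requests; the distribution of gaps then looks far less mechanical than a constant interval.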
Authentication and Session Tracking
Instagram heavily restricts what unauthenticated users can see. Even logged-in users face rate limits on how many profiles they can view or searches they can perform within a time window.
Device Fingerprinting
The Instagram app and website collect device characteristics — screen resolution, browser plugins, GPU information, and more — to create a fingerprint that persists across sessions.
Setting Up Your Environment
```bash
pip install requests beautifulsoup4 instaloader pandas pillow
```

We use instaloader as a foundation and supplement it with custom requests for data that the library does not cover directly.
Approach 1: Using Instagram’s Web API
Instagram’s website loads its data through GraphQL API calls that return structured JSON. Querying these endpoints directly is far more efficient and reliable than parsing rendered HTML.
Configure Session with Proxy
```python
import requests
import json
import time
import random
from datetime import datetime


class InstagramScraper:
    """Instagram scraper using web API endpoints with proxy support."""

    GRAPHQL_URL = "https://www.instagram.com/graphql/query/"
    BASE_URL = "https://www.instagram.com"

    # GraphQL query hashes (these may change — update as needed)
    USER_QUERY_HASH = "c9100bf9110dd6361671f113dd02e7d6"
    MEDIA_QUERY_HASH = "e769aa130647d2571c27c44596cb68c1"
    HASHTAG_QUERY_HASH = "174a21c41c89e3c8e0e7cc41f3e3ccab"

    def __init__(self, proxy_url, session_id=None):
        self.session = requests.Session()
        self.session.proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                "Version/17.0 Mobile/15E148 Safari/604.1"
            ),
            "Accept": "*/*",
            "Accept-Language": "en-US,en;q=0.9",
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest",
            "Referer": "https://www.instagram.com/",
        })
        if session_id:
            self.session.cookies.set("sessionid", session_id, domain=".instagram.com")

    def _request_with_retry(self, url, params=None, max_retries=3):
        """Make a request with retry logic and respectful delays."""
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, params=params, timeout=20)
                if response.status_code == 200:
                    return response.json()
                elif response.status_code == 429:
                    wait_time = random.uniform(30, 60)
                    print(f"Rate limited. Waiting {wait_time:.0f}s...")
                    time.sleep(wait_time)
                elif response.status_code == 401:
                    print("Authentication required. Session may have expired.")
                    return None
                else:
                    print(f"Status {response.status_code}, attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 15))
            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                time.sleep(random.uniform(5, 10))
        return None
```

Scrape User Profile Data
```python
    # Continuing the InstagramScraper class
    def get_user_profile(self, username):
        """Extract complete profile information for a user."""
        url = f"{self.BASE_URL}/api/v1/users/web_profile_info/"
        params = {"username": username}

        data = self._request_with_retry(url, params)
        if not data:
            return None

        user = data.get("data", {}).get("user", {})
        if not user:
            return None

        return {
            "username": user.get("username"),
            "full_name": user.get("full_name"),
            "biography": user.get("biography"),
            "followers": user.get("edge_followed_by", {}).get("count"),
            "following": user.get("edge_follow", {}).get("count"),
            "posts_count": user.get("edge_owner_to_timeline_media", {}).get("count"),
            "is_verified": user.get("is_verified"),
            "is_private": user.get("is_private"),
            "is_business": user.get("is_business_account"),
            "business_category": user.get("business_category_name"),
            "profile_pic_url": user.get("profile_pic_url_hd"),
            "external_url": user.get("external_url"),
            "user_id": user.get("id"),
        }
```

Scrape User Posts
```python
    # Continuing the InstagramScraper class
    def get_user_posts(self, user_id, num_posts=50):
        """Fetch posts from a user's timeline."""
        posts = []
        end_cursor = None
        has_next = True

        while has_next and len(posts) < num_posts:
            variables = {
                "id": user_id,
                "first": min(12, num_posts - len(posts)),
            }
            if end_cursor:
                variables["after"] = end_cursor

            params = {
                "query_hash": self.MEDIA_QUERY_HASH,
                "variables": json.dumps(variables),
            }

            data = self._request_with_retry(self.GRAPHQL_URL, params)
            if not data:
                break

            media = (
                data.get("data", {})
                .get("user", {})
                .get("edge_owner_to_timeline_media", {})
            )

            for edge in media.get("edges", []):
                node = edge.get("node", {})
                post = {
                    "id": node.get("id"),
                    "shortcode": node.get("shortcode"),
                    "url": f"https://www.instagram.com/p/{node.get('shortcode')}/",
                    "type": node.get("__typename"),
                    "timestamp": node.get("taken_at_timestamp"),
                    "date": datetime.fromtimestamp(
                        node.get("taken_at_timestamp", 0)
                    ).isoformat(),
                    "likes": node.get("edge_media_preview_like", {}).get("count"),
                    "comments": node.get("edge_media_to_comment", {}).get("count"),
                    "caption": (
                        node.get("edge_media_to_caption", {})
                        .get("edges", [{}])[0]
                        .get("node", {})
                        .get("text")
                        if node.get("edge_media_to_caption", {}).get("edges")
                        else None
                    ),
                    "is_video": node.get("is_video"),
                    "video_views": node.get("video_view_count"),
                    "display_url": node.get("display_url"),
                    "dimensions": node.get("dimensions"),
                }
                posts.append(post)

            page_info = media.get("page_info", {})
            has_next = page_info.get("has_next_page", False)
            end_cursor = page_info.get("end_cursor")

            # Respectful delay between pagination requests
            time.sleep(random.uniform(2, 5))

        return posts
```

Scrape Hashtag Data
```python
    # Continuing the InstagramScraper class
    def get_hashtag_posts(self, hashtag, num_posts=50):
        """Fetch recent posts from a hashtag page."""
        posts = []
        end_cursor = None
        has_next = True

        while has_next and len(posts) < num_posts:
            variables = {
                "tag_name": hashtag,
                "first": min(12, num_posts - len(posts)),
            }
            if end_cursor:
                variables["after"] = end_cursor

            params = {
                "query_hash": self.HASHTAG_QUERY_HASH,
                "variables": json.dumps(variables),
            }

            data = self._request_with_retry(self.GRAPHQL_URL, params)
            if not data:
                break

            hashtag_data = data.get("data", {}).get("hashtag", {})
            media = hashtag_data.get("edge_hashtag_to_media", {})

            for edge in media.get("edges", []):
                node = edge.get("node", {})
                post = {
                    "shortcode": node.get("shortcode"),
                    "url": f"https://www.instagram.com/p/{node.get('shortcode')}/",
                    "likes": node.get("edge_liked_by", {}).get("count"),
                    "comments": node.get("edge_media_to_comment", {}).get("count"),
                    "timestamp": node.get("taken_at_timestamp"),
                    "is_video": node.get("is_video"),
                    "caption": (
                        node.get("edge_media_to_caption", {})
                        .get("edges", [{}])[0]
                        .get("node", {})
                        .get("text")
                        if node.get("edge_media_to_caption", {}).get("edges")
                        else None
                    ),
                    "hashtag": hashtag,
                }
                posts.append(post)

            page_info = media.get("page_info", {})
            has_next = page_info.get("has_next_page", False)
            end_cursor = page_info.get("end_cursor")

            time.sleep(random.uniform(3, 6))

        return posts
```

Approach 2: Engagement Rate Calculator
One of the most common applications of Instagram scraping is calculating influencer engagement rates for social media marketing campaigns.
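The core arithmetic is simple: engagement rate is average interactions (likes plus comments) per post, divided by follower count, expressed as a percentage. A worked example with made-up numbers for a hypothetical account:

```python
# Hypothetical account: 100,000 followers, 10 recent posts
followers = 100_000
post_likes = [2400, 1800, 3100, 2200, 2750, 1950, 2600, 2300, 2850, 2050]
post_comments = [120, 95, 160, 110, 140, 90, 130, 105, 150, 100]

# Average interactions per post, then normalize by audience size
avg_interactions = (sum(post_likes) + sum(post_comments)) / len(post_likes)
engagement_rate = avg_interactions / followers * 100

print(f"{engagement_rate:.3f}%")  # 2.520%
```

The function below computes the same ratio, plus like/comment breakdowns and video view rates, from the scraped profile and post dictionaries.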
```python
def calculate_engagement_metrics(profile, posts):
    """Calculate engagement metrics for an Instagram account."""
    if not posts or not profile.get("followers"):
        return None

    # Counts scraped from the API can be None; treat missing values as 0
    total_likes = sum(p.get("likes") or 0 for p in posts)
    total_comments = sum(p.get("comments") or 0 for p in posts)
    total_video_views = sum(
        (p.get("video_views") or 0) for p in posts if p.get("is_video")
    )

    num_posts = len(posts)
    followers = profile["followers"]

    metrics = {
        "username": profile["username"],
        "followers": followers,
        "posts_analyzed": num_posts,
        "avg_likes": round(total_likes / num_posts, 1),
        "avg_comments": round(total_comments / num_posts, 1),
        "engagement_rate": round(
            ((total_likes + total_comments) / num_posts / followers) * 100, 3
        ),
        "like_rate": round((total_likes / num_posts / followers) * 100, 3),
        "comment_rate": round((total_comments / num_posts / followers) * 100, 3),
        "video_posts": sum(1 for p in posts if p.get("is_video")),
        "image_posts": sum(1 for p in posts if not p.get("is_video")),
    }

    if total_video_views > 0:
        video_posts = [p for p in posts if p.get("is_video")]
        metrics["avg_video_views"] = round(total_video_views / len(video_posts), 1)
        metrics["video_view_rate"] = round(
            (total_video_views / len(video_posts) / followers) * 100, 3
        )

    return metrics
```

Running the Complete Pipeline
```python
def main():
    proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
    scraper = InstagramScraper(proxy_url, session_id="your_session_id")

    # Scrape multiple profiles
    usernames = ["natgeo", "nike", "nasa"]
    all_data = []

    for username in usernames:
        print(f"\nScraping @{username}...")

        # Get profile
        profile = scraper.get_user_profile(username)
        if not profile:
            print(f"Failed to scrape @{username}")
            continue
        print(f"  Followers: {profile['followers']:,}")

        # Get recent posts
        if not profile["is_private"]:
            posts = scraper.get_user_posts(profile["user_id"], num_posts=30)
            print(f"  Posts scraped: {len(posts)}")

            # Calculate engagement
            metrics = calculate_engagement_metrics(profile, posts)
            if metrics:
                print(f"  Engagement rate: {metrics['engagement_rate']}%")

            all_data.append({
                "profile": profile,
                "posts": posts,
                "metrics": metrics,
            })
        else:
            print("  Account is private, skipping posts")

        time.sleep(random.uniform(10, 20))  # Long delay between accounts

    # Save results
    with open("instagram_data.json", "w", encoding="utf-8") as f:
        json.dump(all_data, f, indent=2, ensure_ascii=False, default=str)
    print(f"\nSaved data for {len(all_data)} accounts")


if __name__ == "__main__":
    main()
```

The Mobile Proxy Advantage for Instagram
Instagram’s detection systems are particularly attuned to the type of IP address making requests. Here is how different proxy types compare:
| Proxy Type | Success Rate | Detection Risk | Cost | Best For |
|---|---|---|---|---|
| Datacenter | <10% | Very High | Low | Not recommended |
| Residential | 60-80% | Medium | Medium | Moderate volume |
| Mobile | 90-95% | Very Low | Higher | High-volume scraping |
Mobile proxies achieve the highest success rates because Instagram’s mobile app generates the majority of platform traffic. When your requests come from a mobile carrier IP, they blend seamlessly with legitimate app usage.
Anti-Detection Best Practices
Request Spacing
Instagram monitors request frequency aggressively. Follow these guidelines:
- Between profile fetches: 10-20 seconds minimum
- Between post pagination: 3-6 seconds
- Between different data types: 15-30 seconds
- Daily limits: Stay under 200 profiles per account per day
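These intervals are easier to respect if they are enforced in one place instead of scattered `time.sleep` calls. A minimal sketch of a per-action throttle (the action names and bounds mirror the guidelines above; `Throttle` is a hypothetical helper, not part of any library):

```python
import random
import time

# Minimum/maximum pause per action type, in seconds
SPACING = {
    "profile": (10, 20),     # between profile fetches
    "pagination": (3, 6),    # between post pages
    "switch": (15, 30),      # between different data types
}


class Throttle:
    """Track the last request time per action and sleep only the remainder."""

    def __init__(self, spacing=SPACING):
        self.spacing = spacing
        self.last = {}

    def wait(self, action: str) -> float:
        low, high = self.spacing[action]
        target = random.uniform(low, high)
        elapsed = time.monotonic() - self.last.get(action, float("-inf"))
        pause = max(0.0, target - elapsed)
        if pause:
            time.sleep(pause)
        self.last[action] = time.monotonic()
        return pause
```

Call `throttle.wait("profile")` before each profile fetch; because the helper subtracts time already spent on other work, it never sleeps longer than needed.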
Session Management
Rotate your sessions strategically:
```python
class SessionManager:
    """Manage multiple Instagram sessions for rotation."""

    def __init__(self, session_ids, proxy_url):
        self.scrapers = [
            InstagramScraper(proxy_url, sid) for sid in session_ids
        ]
        self.current_index = 0

    def get_scraper(self):
        """Get the next scraper in rotation."""
        scraper = self.scrapers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.scrapers)
        return scraper
```

Cookie Management
Instagram tracks session cookies meticulously. Clear and regenerate cookies periodically, and ensure each session uses cookies consistent with the proxy location.
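A minimal sketch of a periodic cookie reset for the requests-based session used earlier (`refresh_session_cookies` is a hypothetical helper name, not an Instagram or requests API):

```python
import requests


def refresh_session_cookies(session: requests.Session, session_id: str) -> None:
    """Drop all accumulated cookies, then re-seed only the login cookie.

    Resetting state periodically keeps tracking cookies from piling up
    across long scraping runs while preserving the authenticated session.
    """
    session.cookies.clear()
    session.cookies.set("sessionid", session_id, domain=".instagram.com")
```

Run this between batches, and make sure the `session_id` you re-seed was originally created from the same proxy location the session is using.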
Data Storage and Analysis
For ongoing Instagram monitoring, structure your data pipeline for efficiency:
```python
import json

import pandas as pd


def analyze_collected_data(data_file):
    """Analyze scraped Instagram data for insights."""
    with open(data_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Build a DataFrame of engagement metrics
    metrics_list = [entry["metrics"] for entry in data if entry.get("metrics")]
    df = pd.DataFrame(metrics_list)

    print("Engagement Summary:")
    print(f"  Average engagement rate: {df['engagement_rate'].mean():.3f}%")
    print(f"  Highest engagement: @{df.loc[df['engagement_rate'].idxmax(), 'username']}")
    print(f"  Average likes per post: {df['avg_likes'].mean():,.0f}")
    print(f"  Average comments per post: {df['avg_comments'].mean():,.0f}")

    return df
```

Ethical Considerations
Instagram scraping raises important ethical questions. While the platform’s data is publicly visible, automated collection at scale requires careful consideration:
- Privacy: Even public profiles belong to real people. Handle personal data responsibly.
- Meta’s Terms: Instagram’s Terms of Use prohibit scraping. Meta has pursued legal action against scrapers in the past.
- GDPR compliance: Processing European user data requires a lawful basis.
- Competitive fairness: Using scraped data for manipulation (fake engagement, impersonation) crosses ethical lines.
- Transparency: If you publish analysis based on scraped data, be transparent about your methodology.
Conclusion
Scraping Instagram profiles and posts is a powerful capability for marketing research, competitive analysis, and influencer evaluation. The key to doing it successfully — and sustainably — lies in combining intelligent code with quality proxy infrastructure.
Mobile proxies from DataResearchTools provide the carrier-grade IP reputation that Instagram’s detection systems reward. By pairing these proxies with the careful session management and rate limiting practices outlined in this guide, you can build a reliable Instagram data collection pipeline.
For more social media scraping strategies, explore our other tutorials. Our proxy glossary provides definitions for all technical terms used in this guide.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix