Real estate agent and broker data is among the most commercially valuable information in the property industry. Knowing which agents dominate specific neighborhoods, what their transaction volumes look like, and how to reach them can power recruiting, marketing, vendor sales, and market research operations. But scraping personal and professional data carries unique ethical and legal responsibilities that go beyond typical web scraping projects. This guide covers the technical methods for collecting agent and broker data at scale, the proxy infrastructure required, and the critical ethical and legal frameworks you must operate within.
Why Agent and Broker Data Is Valuable
Real estate agents and brokers sit at the center of every residential transaction. Their data is valuable to a wide range of stakeholders, each with distinct use cases.
| Stakeholder | Use Case | Data Needed |
|---|---|---|
| Brokerages | Recruiting top-producing agents | Transaction volume, listings sold, contact info |
| Proptech companies | Selling software/services to agents | Agent contact info, brokerage affiliation, market focus |
| Mortgage lenders | Building referral partnerships | Agent production volume, geographic focus |
| Title companies | Marketing to active agents | Recent transaction activity, contact info |
| Investors | Finding agents who specialize in investment properties | Listing history, specializations |
| Market researchers | Analyzing market structure and competition | Agent density, brokerage market share |
The commercial demand for agent data is substantial, which makes it tempting to scrape broadly and aggressively. However, this is exactly the type of data collection where ethical guardrails are not optional — they are essential for legal compliance, reputation protection, and responsible business practices.
Sources of Agent and Broker Data
State Licensing Databases
Every real estate agent and broker must hold a state-issued license. State real estate commissions publish searchable databases of licensed professionals that include the agent’s name, license number, license type (agent vs. broker), status (active, inactive, expired), brokerage affiliation, and sometimes office address. These are public records maintained for consumer protection purposes.
State licensing databases are among the most legitimate sources for agent data because the information is published specifically for public access. However, the technical quality of these databases varies enormously. Some states offer modern searchable interfaces, while others provide only PDF downloads or require navigating legacy web applications.
MLS and Association Directories
Multiple Listing Services and local Realtor associations publish member directories on their websites. These directories typically include the agent’s name, photo, brokerage, phone number, email, and sometimes a list of active and recent listings. The data is more detailed than licensing databases but access is often restricted to members or protected by terms of service.
Real Estate Platform Profiles
Platforms like Zillow, Realtor.com, and Redfin publish agent profile pages that include transaction history, reviews, specializations, and contact information. These profiles aggregate data from multiple sources and present a comprehensive view of each agent’s activity. However, these platforms invest heavily in anti-scraping technology and their terms of service explicitly prohibit automated data collection.
Brokerage Websites
Individual brokerage firms — from national franchises like Keller Williams and RE/MAX to independent local brokerages — publish agent rosters on their websites. These typically include agent photos, contact information, biographies, and sometimes production statistics. Brokerage websites generally have lighter anti-scraping measures than major platforms but vary widely in their technical implementation.
Legal and Ethical Framework
Understanding the Legal Landscape
Scraping agent and broker data sits at the intersection of several legal frameworks. Understanding these is not optional — it is a prerequisite for any responsible data collection operation. For a comprehensive overview of the legal principles governing web scraping, see our detailed analysis of legal and ethical considerations for scraping with proxies.
| Legal Framework | Applicability | Key Requirements |
|---|---|---|
| Computer Fraud and Abuse Act (CFAA) | US federal law | Do not circumvent access controls or scrape password-protected areas |
| Terms of Service | Per-site contractual | Review and respect each site’s ToS regarding scraping |
| CAN-SPAM Act | US email marketing | If using scraped emails for marketing, comply with opt-out requirements |
| TCPA | US phone marketing | If using scraped phone numbers, comply with consent requirements |
| GDPR | EU residents’ data | Lawful basis required for processing personal data |
| CCPA/CPRA | California residents’ data | Right to know, delete, and opt out of data sales |
| State privacy laws | Various US states | Emerging patchwork of privacy regulations |
Ethical Guidelines for Agent Data Collection
Legal compliance is the floor, not the ceiling. Ethical data collection goes beyond what the law requires to respect the interests and expectations of the people whose data you are collecting. Follow these principles:
Transparency. If you are building a database of agent information, be prepared to explain what data you collect, how you collect it, and how you use it. If your data practices would embarrass you if made public, reconsider your approach.
Proportionality. Collect only the data you need for your specific use case. If you need agent names and brokerage affiliations for market analysis, you do not need their home addresses or personal phone numbers. Minimize the personal data you collect to what is directly relevant to your purpose.
Data minimization. Do not scrape everything just because you can. Targeted collection of specific data points is both more ethical and more practical than bulk harvesting of entire profiles.
Respect opt-out signals. If an agent contacts you to request removal from your database, honor the request promptly. Build opt-out mechanisms into your data management process from the beginning.
Secure storage. Personal data comes with a responsibility to protect it. Implement appropriate security measures — encryption at rest, access controls, audit logging — to prevent unauthorized access to your agent database.
Privacy Regulations and Personal Data
Agent contact information — email addresses, phone numbers, and office addresses — qualifies as personal data under most privacy regulations. This means that collecting and using this data triggers compliance obligations under applicable privacy laws.
Under GDPR, which applies if any of the agents in your database are EU residents or if your organization has EU operations, you need a lawful basis for processing personal data. Legitimate interest is the most common basis for B2B data collection, but it requires a documented balancing test showing that your interest does not override the individual’s rights.
Under CCPA and its successor CPRA, California residents have the right to know what personal data you have collected about them and to request deletion. If you are selling or sharing agent data, you must provide an opt-out mechanism. Several other states have enacted similar privacy laws with varying requirements.
Technical Implementation with Proxies
Proxy Setup for Agent Data Sources
Different agent data sources require different proxy approaches based on their anti-scraping measures and access patterns.
| Data Source | Proxy Type | Rate Limit | Key Challenge |
|---|---|---|---|
| State licensing databases | ISP or residential | 5-10 req/min | Legacy web apps, session management |
| MLS/association directories | Residential rotating | 3-8 req/min | Member-only access, CAPTCHAs |
| Zillow agent profiles | Residential rotating | 5-10 req/min | JavaScript rendering, fingerprinting |
| Realtor.com profiles | Residential rotating | 5-10 req/min | API-based loading, rate limiting |
| Brokerage websites | Datacenter or residential | 10-20 req/min | Varied site structures |
| Social media profiles | Residential or mobile | 2-5 req/min | Strong anti-scraping, legal risk |
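The table above can be encoded directly as a per-source policy that your scraper consults before each request. A minimal sketch, with hypothetical source keys and illustrative rate limits drawn from the table:

```python
# Hypothetical per-source proxy policy table mirroring the table above.
# Source keys and limits are illustrative, not tied to any real service.
PROXY_POLICIES = {
    "state_licensing": {"proxy_type": "isp",         "max_req_per_min": 8},
    "mls_directory":   {"proxy_type": "residential", "max_req_per_min": 5},
    "zillow_profiles": {"proxy_type": "residential", "max_req_per_min": 8},
    "brokerage_sites": {"proxy_type": "datacenter",  "max_req_per_min": 15},
}

def delay_for(source: str) -> float:
    """Seconds to sleep between requests to stay under the source's rate limit."""
    return 60.0 / PROXY_POLICIES[source]["max_req_per_min"]
```

Centralizing the policy this way means tightening a rate limit for one source is a one-line change rather than a hunt through scraper code.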
Browser Fingerprinting Considerations
Agent profile pages on major real estate platforms are protected by sophisticated fingerprinting systems that go beyond IP detection. These systems analyze browser characteristics, JavaScript execution patterns, mouse movements, and dozens of other signals to identify automated access.
For sites with advanced fingerprinting, consider using anti-detect browser configurations that randomize fingerprint parameters across browsing sessions. Each scraping session should present a unique but internally consistent browser fingerprint — a combination of user agent, screen resolution, installed fonts, WebGL renderer, and other characteristics that together resemble a real user’s browser. For an in-depth look at anti-detect browser technology, see our article on anti-detect browsers and their applications.
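The "unique but internally consistent" requirement is the part most home-grown scrapers get wrong: they randomize each attribute independently and produce impossible combinations. A minimal sketch of the idea, using a few hand-built profiles whose attribute values are placeholders, not verified real-world fingerprints:

```python
import random

# Illustrative only: real anti-detect browsers manage dozens more parameters.
# Each profile pairs attributes that plausibly co-occur (e.g. a Windows user
# agent with a Windows-style GPU string), so a session looks self-consistent.
FINGERPRINT_PROFILES = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "screen": (1920, 1080),
        "webgl_renderer": "ANGLE (NVIDIA GeForce GTX 1660)",
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "screen": (2560, 1440),
        "webgl_renderer": "Apple M1",
    },
]

def new_session_fingerprint() -> dict:
    """Pick one coherent profile per session rather than mixing attributes."""
    return random.choice(FINGERPRINT_PROFILES)
```

The key design point: select a whole profile once per session and keep it fixed, rather than re-randomizing attributes request by request.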
Scraping State Licensing Databases
State licensing databases are the most defensible data source for agent information because the data is published for public access. However, each state’s database is a separate scraping project with unique technical challenges.
Many state licensing sites use form-based searches that require submitting specific query parameters. To collect comprehensive data, you need to enumerate all possible search combinations — searching by last name initial, city, county, or license type to paginate through the full database. Some states limit results to 500 records per search, requiring you to use narrower search criteria to capture everything.
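The enumeration strategy above can be sketched as generating the cross product of search criteria, then flagging any query that hits the result cap for further narrowing. The 500-record cap and the query fields here are hypothetical, stand-ins for whatever limits a given state imposes:

```python
from itertools import product
from string import ascii_uppercase

MAX_RESULTS = 500  # hypothetical per-search cap imposed by a state site

def search_plan(counties: list[str]) -> list[dict]:
    """Enumerate last-name-initial x county queries to cover the full roster."""
    return [{"last_name_prefix": letter, "county": county}
            for letter, county in product(ascii_uppercase, counties)]

def needs_narrowing(result_count: int) -> bool:
    """A capped result set means records were truncated; split the query
    (e.g. into two-letter name prefixes) and re-run."""
    return result_count >= MAX_RESULTS
```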
Use ISP proxies with static IPs for licensing databases. Many of these sites use session cookies that are tied to IP addresses, and IP changes will invalidate your session and force re-authentication. Rate limit conservatively — these are government systems with limited infrastructure, and aggressive scraping could degrade service for legitimate users.
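A minimal stdlib-only sketch of this pattern, assuming a placeholder static ISP proxy endpoint: one opener holds both the proxy and the cookie jar, so session cookies stay paired with a single IP, and a fixed delay keeps the request rate conservative:

```python
import http.cookiejar
import time
import urllib.request

# Placeholder endpoint; substitute your ISP proxy's static host and port.
proxy = urllib.request.ProxyHandler({
    "http": "http://isp-proxy.example:8080",
    "https": "http://isp-proxy.example:8080",
})
# One opener with one cookie jar keeps session cookies paired with one IP.
opener = urllib.request.build_opener(
    proxy, urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))

MIN_INTERVAL = 8.0  # seconds between requests (~7 requests/minute)
_last = 0.0

def polite_fetch(url: str) -> bytes:
    """Fetch through the static proxy with a fixed conservative delay."""
    global _last
    wait = MIN_INTERVAL - (time.monotonic() - _last)
    if wait > 0:
        time.sleep(wait)
    _last = time.monotonic()
    with opener.open(url, timeout=30) as resp:
        return resp.read()
```

Because the proxy IP never rotates, the site sees one consistent visitor and the session cookies remain valid for the whole run.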
Parsing Agent Profile Pages
Agent profiles across different platforms contain similar data but in vastly different HTML structures. Build platform-specific parsers that extract standardized fields from each source. Your normalized agent record should include name, license number, license state, license type, brokerage name, office address, phone number, email address, years of experience, and transaction history summary.
Pay special attention to data quality when parsing transaction history. Some platforms report closed transactions, others report listings taken. Some include dollar volumes, others only show transaction counts. Document what each source reports so you can accurately compare agents across data sources.
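One way to enforce the normalized record is a shared dataclass that every platform-specific parser must produce. The field names below, and the raw keys like `displayName` in the example parser, are hypothetical, not any platform's actual API fields:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentRecord:
    """Normalized agent record; field names are illustrative, not a standard."""
    name: str
    license_number: Optional[str]
    license_state: str
    license_type: str                        # "agent" or "broker"
    brokerage: Optional[str]
    phone: Optional[str]
    email: Optional[str]
    source: str                              # keep provenance on every record
    transactions_12mo: Optional[int] = None  # some sources omit volume data

def parse_platform_profile(raw: dict) -> AgentRecord:
    """Sketch of one platform-specific parser mapping raw fields onto the schema.
    The raw key names here are made up for illustration."""
    return AgentRecord(
        name=raw.get("displayName", "").strip(),
        license_number=raw.get("licenseNumber"),
        license_state=raw.get("licenseState", ""),
        license_type="broker" if raw.get("isBroker") else "agent",
        brokerage=raw.get("brokerageName"),
        phone=raw.get("phone"),
        email=raw.get("email"),
        source="platform_x",
        transactions_12mo=raw.get("saleCount"),
    )
```

Each source gets its own parser, but downstream code only ever sees `AgentRecord`, which makes cross-source comparison and the transaction-history caveats above explicit in one place.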
Building an Agent Database
Schema Design
Design your database to handle data from multiple sources with potential conflicts. An agent might have different phone numbers listed on their brokerage website and their Zillow profile. Your schema should store source-attributed data — recording that Source A reports phone number X and Source B reports phone number Y — rather than arbitrarily choosing one value.
Core tables should include agents (unique individuals), licenses (one per state per agent), brokerage affiliations (historical with date ranges), contact information (source-attributed), and transaction history (source-attributed with date ranges). This structure supports deduplication, conflict resolution, and historical tracking.
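The source-attributed structure might look like the following SQLite sketch (table and column names are illustrative): note that `contact_info` records which source reported each value instead of storing one "winning" phone number per agent:

```python
import sqlite3

# Minimal sketch of a source-attributed schema (SQLite syntax, names illustrative).
SCHEMA = """
CREATE TABLE agents (
    agent_id   INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL
);
CREATE TABLE licenses (
    license_id     INTEGER PRIMARY KEY,
    agent_id       INTEGER NOT NULL REFERENCES agents(agent_id),
    state          TEXT NOT NULL,
    license_number TEXT NOT NULL,
    status         TEXT,
    UNIQUE (state, license_number)
);
CREATE TABLE contact_info (
    contact_id INTEGER PRIMARY KEY,
    agent_id   INTEGER NOT NULL REFERENCES agents(agent_id),
    kind       TEXT NOT NULL,   -- 'phone', 'email', 'address'
    value      TEXT NOT NULL,
    source     TEXT NOT NULL,   -- which site reported this value
    seen_at    TEXT NOT NULL    -- ISO date of last observation
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The `UNIQUE (state, license_number)` constraint doubles as a deduplication guard: an attempt to insert the same license twice fails loudly instead of creating a duplicate agent.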
Deduplication and Entity Resolution
The biggest technical challenge in building an agent database is deduplication. The same agent appears on multiple platforms with slightly different name spellings, different contact information, and different brokerage affiliations (especially if they recently switched firms). You need entity resolution logic that identifies when records from different sources refer to the same person.
License numbers are the most reliable deduplication key — if two records share the same state license number, they are the same agent. When license numbers are not available, use fuzzy matching on name, office address, and brokerage affiliation. A combination of name similarity (accounting for nicknames and abbreviations) and brokerage match produces reliable deduplication in most cases.
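That matching cascade can be sketched with the standard library's `difflib`; the 0.85 similarity threshold is an assumption you would tune against hand-labeled pairs:

```python
from difflib import SequenceMatcher

def same_agent(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """License number match is authoritative; otherwise fall back to fuzzy
    name similarity combined with an exact brokerage match."""
    if (a.get("license_number")
            and a.get("license_number") == b.get("license_number")
            and a.get("license_state") == b.get("license_state")):
        return True
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_brokerage = (a.get("brokerage") or "").lower() == (b.get("brokerage") or "").lower()
    return name_sim >= threshold and same_brokerage
```

A production version would also normalize nicknames ("Bob" vs. "Robert") before comparing, which plain character-level similarity does not capture.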
Data Freshness and Maintenance
Agent data changes frequently. Agents switch brokerages, change phone numbers, let licenses lapse, or retire from the industry. Schedule regular re-scraping of all sources to keep your database current. State licensing databases should be re-scraped quarterly to catch license status changes. Agent profiles on major platforms should be refreshed monthly to capture brokerage changes and updated contact information.
Implement change detection that flags significant updates — a brokerage change, a license status change, or a jump in transaction volume. These changes represent actionable events for recruiting, marketing, and partnership outreach.
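Change detection reduces to diffing two snapshots of the same agent. A minimal sketch, with an assumed field naming and an arbitrary 1.5x threshold for "a jump in transaction volume":

```python
def detect_changes(old: dict, new: dict,
                   watched=("brokerage", "license_status"),
                   volume_jump=1.5) -> list[str]:
    """Flag actionable differences between two snapshots of the same agent."""
    flags = [f"{field} changed: {old.get(field)} -> {new.get(field)}"
             for field in watched if old.get(field) != new.get(field)]
    old_vol, new_vol = old.get("transactions_12mo"), new.get("transactions_12mo")
    if old_vol and new_vol and new_vol >= old_vol * volume_jump:
        flags.append(f"transaction volume jumped: {old_vol} -> {new_vol}")
    return flags
```

Run this on every re-scrape and route non-empty flag lists to whoever owns recruiting or partnership outreach.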
Responsible Use of Agent Data
Marketing Compliance
If you plan to use scraped agent data for marketing outreach — email campaigns, phone calls, or direct mail — you must comply with applicable marketing laws. The CAN-SPAM Act requires that commercial emails include a physical mailing address, a clear opt-out mechanism, and honest subject lines. The TCPA restricts telemarketing calls and text messages, with particularly strict rules around auto-dialers and pre-recorded messages.
Best practices for marketing to agents using scraped data include sending only relevant, valuable communications, honoring opt-out requests within 10 business days (CAN-SPAM requires this), never misrepresenting your identity or the purpose of your outreach, and maintaining a suppression list of agents who have opted out.
Data Sharing and Sales
If you plan to sell or share your agent database with third parties, additional legal obligations apply. Under CCPA, you must provide California residents with the ability to opt out of data sales. Under GDPR, sharing personal data with third parties may require explicit consent or a separate lawful basis. Even where not legally required, providing transparency about data sharing builds trust and reduces legal risk.
Security and Access Controls
A database of agent contact information is a valuable target for malicious actors. Implement security measures proportional to the sensitivity of the data. Encrypt the database at rest and in transit. Restrict access to authorized personnel only. Log all data access and exports. Conduct regular security audits. If you experience a data breach, notify affected individuals and relevant authorities as required by applicable breach notification laws.
Practical Tips for Agent Data Projects
Focus on quality over quantity. A database of 10,000 verified, current agent records with accurate contact information is far more valuable than 100,000 stale, unverified records scraped indiscriminately. Invest in data validation and freshness rather than maximizing record count.
Start with public licensing data as your foundation. Build your database on the solid ground of state-published records, then enrich with data from other sources. This approach gives you a defensible data provenance story — your core data comes from public records intended for public access.
Respect robots.txt and terms of service. While the legal enforceability of robots.txt is debated, following it demonstrates good faith. If a site’s ToS explicitly prohibits scraping, consider whether the data is available from a more permissive source or whether an API or data licensing arrangement is more appropriate.
Build opt-out handling into your system from the beginning, not as an afterthought. Agents who discover they are in your database will contact you to request removal. Having a smooth, automated process for handling these requests is both legally required under many privacy laws and practically essential for maintaining your reputation.
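A simple pattern for permanent suppression is to key the list on a hash of the normalized email, so re-scraping runs can skip suppressed agents without keeping their address in plain text after deletion. A sketch (in production this would persist to your database rather than memory):

```python
import hashlib

class SuppressionList:
    """Permanent opt-out store keyed on a hash of the normalized email."""

    def __init__(self):
        self._hashes = set()

    @staticmethod
    def _key(email: str) -> str:
        # Normalize before hashing so "Agent@X.com " and "agent@x.com" match.
        return hashlib.sha256(email.strip().lower().encode()).hexdigest()

    def suppress(self, email: str) -> None:
        self._hashes.add(self._key(email))

    def is_suppressed(self, email: str) -> bool:
        return self._key(email) in self._hashes
```

Every scraping run and every outbound campaign should check `is_suppressed` before touching a record, which is what makes the opt-out permanent rather than one-time.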
Document your data practices. Maintain records of what data you collect, from which sources, for what purposes, and what legal basis supports the collection. This documentation is required under GDPR and is increasingly expected under emerging US privacy laws. It also protects you in the event of a legal challenge.
Frequently Asked Questions
Is it legal to scrape real estate agent contact information from public websites?
The legality depends on the source, the method, and how you use the data. Scraping publicly available state licensing records is generally permissible because these are public records published for public access. Scraping agent profiles from commercial platforms like Zillow or Realtor.com may violate their terms of service, which could expose you to breach of contract claims. The hiQ v. LinkedIn decision supports the legality of scraping publicly accessible data, but this area of law continues to evolve. Using scraped data for marketing must comply with CAN-SPAM, TCPA, and applicable privacy laws regardless of how the data was obtained.
How do I handle agents who request to be removed from my database?
Honor all removal requests promptly — ideally within 48 hours and no later than 30 days. Add the agent to a permanent suppression list so their data is not re-collected in future scraping runs. Under CCPA, California residents have an explicit right to deletion. Under GDPR, EU residents have a right to erasure. Even where no specific law mandates it, honoring opt-out requests is ethically necessary and protects your reputation. Automate the opt-out process so it does not require manual intervention for each request.
What data points about agents are considered “public” versus “private”?
Information published on public-facing websites, in state licensing databases, and in public records like recorded transactions is generally considered publicly accessible. However, “publicly accessible” and “free to collect and use without restriction” are not the same thing. An agent’s business email published on their brokerage website is more clearly public than their personal cell phone number shared in a private directory. Apply the principle of proportionality — collect data that is openly published for business purposes and avoid data that appears to be shared in a more private context.
Do I need different proxies for scraping each state’s licensing database?
Ideally, yes. Some state licensing databases restrict access to IP addresses within their state or flag out-of-state access for additional scrutiny. Using ISP proxies geolocated to the same state as the licensing database you are scraping improves success rates and reduces the chance of being blocked. At minimum, use US-based residential or ISP proxies for all state licensing sites. If you are scraping licensing data across many states, build a proxy allocation system that matches proxies to states based on geography.
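A proxy allocation system of this kind can be as simple as a state-keyed pool with a generic US fallback. The endpoints below are placeholders, not real proxy hosts:

```python
# Hypothetical pool: ISP proxy endpoints tagged with their geolocated state.
PROXY_POOL = {
    "TX": ["tx-proxy-1.example:8080", "tx-proxy-2.example:8080"],
    "CA": ["ca-proxy-1.example:8080"],
    "FL": ["fl-proxy-1.example:8080"],
}
FALLBACK = ["us-proxy-1.example:8080"]  # generic US proxy when no state match

def proxy_for_state(state: str) -> str:
    """Prefer an in-state proxy; fall back to a generic US endpoint."""
    pool = PROXY_POOL.get(state.upper(), FALLBACK)
    return pool[hash(state) % len(pool)]
```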
How can I verify that scraped agent data is accurate and current?
Cross-reference data across multiple sources. If an agent’s license status, brokerage affiliation, and contact information are consistent across the state licensing database, their brokerage website, and their real estate platform profile, you can be confident the data is accurate. When sources conflict, the state licensing database is the most authoritative for license status and brokerage affiliation, while the agent’s own brokerage profile is most likely to have current contact information. Flag records with significant cross-source discrepancies for manual verification before using them for outreach.