
Data Scraping and Lead Generation Model
The Architecture
Most lead generation tools are either paywalled SaaS products or brittle scripts that break on the first JS-heavy website. This system was built to be neither — a production-grade pipeline owned end-to-end, designed for repeatability across any industry and any geography.
The architecture is intentionally modular. Discovery, enrichment, deduplication, filtering, and export are isolated concerns — each independently replaceable without touching the rest of the system. The primary data source is a structured Maps API, with a headless browser fallback for when quota runs out. Every raw business discovered passes through a multi-stage enrichment pipeline before a single lead is written to memory.
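The modularity described above can be sketched as a shared stage contract. This is an illustrative sketch, not the project's actual code: the `Business` fields, `Stage` type, and `pipeline` helper are assumptions about how discovery, enrichment, deduplication, filtering, and export could compose while staying independently replaceable.

```typescript
// Hypothetical stage contract: every pipeline phase (discovery,
// enrichment, dedup, filtering, export) shares one shape, so any
// stage can be swapped without touching its neighbours.
interface Business {
  name: string;
  website?: string;
  email?: string;
  phone?: string;
}

type Stage = (leads: Business[]) => Promise<Business[]>;

// Compose stages left-to-right into a single pipeline function.
function pipeline(...stages: Stage[]): Stage {
  return async (leads) => {
    let current = leads;
    for (const stage of stages) {
      current = await stage(current);
    }
    return current;
  };
}

// Example stage: drop entries with no website to enrich.
const filterDeadEnds: Stage = async (leads) =>
  leads.filter((l) => l.website !== undefined);

// Usage: const run = pipeline(discover, enrich, dedupe, filterDeadEnds);
```

Because each stage takes and returns the same lead array, replacing the Maps API discovery stage with the headless-browser fallback is a one-line swap in the composition.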
Contact extraction is where most scrapers fail silently. This one doesn't — the email extraction layer is hardened against infrastructure noise, analytics tags, and freemail fallbacks, with a priority ranking that surfaces company-domain addresses first. Phone numbers are normalized to international format using a country hint derived from the location input itself — so +91 is never confused with +1.
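The contact-extraction behaviour above can be sketched in miniature. The filter lists, scoring weights, and calling-code table here are assumptions for illustration, not the real system's values: the point is the shape of the logic, namely rejecting noise, ranking company-domain addresses first, and normalizing phones from a country hint.

```typescript
// Hypothetical email ranking: drop infrastructure/analytics addresses,
// then sort company-domain emails above other business domains,
// with freemail last. The lists below are illustrative only.
const FREEMAIL = new Set(["gmail.com", "yahoo.com", "outlook.com", "hotmail.com"]);
const NOISE_PREFIXES = ["noreply", "no-reply", "sentry", "wixpress", "example"];

function score(email: string, companyDomain: string): number {
  const domain = email.split("@")[1];
  if (domain === companyDomain) return 2; // company-domain first
  if (!FREEMAIL.has(domain)) return 1;    // other business domains next
  return 0;                               // freemail fallback last
}

function rankEmails(emails: string[], companyDomain: string): string[] {
  return emails
    .map((e) => e.trim().toLowerCase())
    .filter((e) => {
      const [local, domain] = e.split("@");
      if (!local || !domain) return false;
      // Reject infrastructure noise and analytics addresses.
      return !NOISE_PREFIXES.some((p) => local.startsWith(p) || domain.includes(p));
    })
    .sort((a, b) => score(b, companyDomain) - score(a, companyDomain));
}

// Minimal phone normalization driven by a country hint derived from
// the location input (subset of calling codes, illustrative only).
const CALLING_CODES: Record<string, string> = { IN: "91", US: "1", GB: "44" };

function normalizePhone(raw: string, countryHint: string): string | null {
  const digits = raw.replace(/[^\d+]/g, "");
  if (digits.startsWith("+")) return digits; // already international
  const code = CALLING_CODES[countryHint];
  if (!code) return null;
  const national = digits.replace(/^0+/, ""); // strip trunk prefix
  return `+${code}${national}`;
}
```

The country hint is what prevents the +91 / +1 confusion: a bare ten-digit number is ambiguous on its own, but the search location already tells the pipeline which calling code applies.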
The frontend is a Next.js dashboard that receives results in real time as leads qualify, not after the job completes. A live discard counter shows exactly how many businesses were dropped and why. One button exports the full session to Excel, with the highest-quality leads (both email and phone present) sorted to the top and highlighted.
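The real-time streaming relies on the SSE wire format, which is simple enough to frame by hand. This is a minimal sketch: the event names ("lead", "discard") and payload fields are assumptions, not the project's actual API.

```typescript
// SSE framing: each message is "event: <name>\ndata: <json>\n\n".
// The dashboard subscribes with EventSource and appends a row per
// "lead" event; "discard" events feed the live discard counter.
function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// On the server, each qualifying lead is written the moment it passes
// enrichment, so rows appear before the job finishes, e.g.:
//   res.write(sseFrame("lead", { name: "Acme Co", email: "sales@acme.com" }));
//   res.write(sseFrame("discard", { reason: "no-contact-info" }));
```

SSE fits this shape better than WebSockets because the data flow is strictly server-to-client: the browser only listens, so a one-directional stream over plain HTTP avoids the extra protocol upgrade and reconnection logic.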
The system runs entirely without a database — data lives in memory for the session, making it portable, zero-infrastructure, and deployable to any VPS in under ten minutes.
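The zero-database design can be sketched as a per-session in-memory store. The names here are illustrative, not the project's actual modules; the sort also shows the export-time quality tiering described above, with both-contact leads first.

```typescript
// Hypothetical in-memory session store: leads live in a Map keyed by
// session id and vanish when the process exits. No database required.
interface Lead {
  name: string;
  email?: string;
  phone?: string;
}

const sessions = new Map<string, Lead[]>();

function addLead(sessionId: string, lead: Lead): void {
  const leads = sessions.get(sessionId) ?? [];
  leads.push(lead);
  sessions.set(sessionId, leads);
}

function exportSession(sessionId: string): Lead[] {
  // Quality tiering: leads with both email and phone sort to the top.
  return [...(sessions.get(sessionId) ?? [])].sort(
    (a, b) =>
      Number(!!b.email && !!b.phone) - Number(!!a.email && !!a.phone)
  );
}
```

Keeping state in process memory is what makes the deployment story trivial: there is no schema to migrate and no connection string to configure, only a Node process to start.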
Strategic Methodology
Pipeline-first, not feature-first. Every architectural decision — SSE over WebSockets, Cheerio before Playwright, in-memory over persistent storage — was made to keep the system fast, lightweight, and maintainable by one person. The constraint of no database is a feature, not a limitation.
Engineering Challenges
- Deciding at runtime whether to use a lightweight HTML parser or a full headless browser — per website, per request — without slowing down the overall pipeline
- Building an email extraction layer that rejects infrastructure noise, analytics addresses, and freemail fallbacks while still recovering valid contacts from unconventional page structures
- Handling domain deduplication correctly across subdomains and multi-part TLDs without false positives
- Implementing a stop handler that gracefully drains in-flight requests, enforces a hard timeout, and force-closes browser contexts — while preserving all collected leads for export
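The deduplication challenge in the list above comes down to extracting the registrable domain. This is a simplified sketch: a production system would consult the full Public Suffix List, while this version hardcodes a few multi-part TLDs just to show why naive last-two-label dedup produces false positives.

```typescript
// Simplified subdomain-aware dedup. "shop.example.co.uk" and
// "www.example.co.uk" must collapse to "example.co.uk", not "co.uk";
// the TLD set below is a small illustrative subset, not the real PSL.
const MULTI_PART_TLDS = new Set(["co.uk", "co.in", "com.au", "co.jp"]);

function registrableDomain(hostname: string): string {
  const parts = hostname.toLowerCase().split(".");
  const lastTwo = parts.slice(-2).join(".");
  // Multi-part TLDs need three labels to identify the registrant.
  const keep = MULTI_PART_TLDS.has(lastTwo) ? 3 : 2;
  return parts.slice(-keep).join(".");
}

function dedupeByDomain<T extends { website: string }>(leads: T[]): T[] {
  const seen = new Set<string>();
  return leads.filter((l) => {
    const key = registrableDomain(new URL(l.website).hostname);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Deduping on the registrable domain rather than the raw hostname is what keeps a business with separate `www.` and `shop.` subdomains from appearing twice, without merging two unrelated `.co.uk` companies.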
Project Impact
"Built a full pipeline that goes from keyword + location to a downloadable Excel of qualified business leads. No manual lookup, no guessing. The hardest client to find is the one you haven't automated yet."
Core Arsenal
Technical Log
A high-fidelity breakdown of the build's architectural achievements and performance markers.
Synthesis
"A self-contained lead generation pipeline: given a keyword and location, it discovers businesses, enriches them with scraped contact details, filters out dead-ends, and exports a ranked Excel file. Built with Next.js, Node.js, Playwright, and TypeScript."
Hard Evidence
Real-time lead streaming via SSE — rows appear as they qualify, not after job completion
Multi-stage contact enrichment with hardened email filtering and international phone normalization
Dynamic scraping strategy per website — fast-path for static pages, full browser for JS-heavy ones
In-memory architecture: no database, no infrastructure overhead, export-and-go
Excel export with automatic quality tiering; the best leads always appear first