Data Scraping and Lead Generation Model
Solo Engineer

The Architecture

Most lead generation tools are either paywalled SaaS products or brittle scripts that break on the first JS-heavy website. This system was built to be neither — a production-grade pipeline owned end-to-end, designed for repeatability across any industry and any geography.

The architecture is intentionally modular. Discovery, enrichment, deduplication, filtering, and export are isolated concerns — each independently replaceable without touching the rest of the system. The primary data source is a structured Maps API, with a headless browser fallback for when quota runs out. Every raw business discovered passes through a multi-stage enrichment pipeline before a single lead is written to memory.
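
As a rough sketch of that modularity (the type names here are illustrative, not the project's actual interfaces), each stage reduces to a standalone contract:

```ts
// Illustrative stage contracts; hypothetical names, not the real codebase.
interface RawBusiness {
  name: string;
  website?: string;
  phone?: string;
}

interface Lead extends RawBusiness {
  email?: string;
  normalizedPhone?: string;
  hasBothContacts: boolean; // drives the export's quality tiering
}

// Each concern is one async boundary; any stage can be swapped
// (a new discovery source, a different exporter) in isolation.
type Discover = (keyword: string, location: string) => AsyncIterable<RawBusiness>;
type Enrich = (raw: RawBusiness) => Promise<Lead>;
type Dedupe = (leads: Lead[]) => Lead[];
type Filter = (lead: Lead) => boolean;
type ExportXlsx = (leads: Lead[]) => Promise<Buffer>; // e.g. via exceljs
```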

Contact extraction is where most scrapers fail silently. This one doesn't — the email extraction layer is hardened against infrastructure noise, analytics tags, and freemail fallbacks, with a priority ranking that surfaces company-domain addresses first. Phone numbers are normalized to international format using a country hint derived from the location input itself — so +91 is never confused with +1.
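
A minimal sketch of the phone half of that step, assuming the libphonenumber-js library (the actual implementation may differ); the country hint is resolved from the location input before anything is parsed:

```ts
import { parsePhoneNumberFromString, CountryCode } from 'libphonenumber-js';

// Hypothetical helper: the hint comes from the location input,
// e.g. "Mumbai, India" resolves to "IN" before any parsing happens.
function normalizePhone(raw: string, countryHint: CountryCode): string | undefined {
  const parsed = parsePhoneNumberFromString(raw, countryHint);
  if (!parsed || !parsed.isValid()) return undefined; // reject number-shaped noise
  return parsed.number; // E.164: a local Indian number becomes "+91...", never a "+1" misread
}

normalizePhone('022 2270 0000', 'IN'); // "+912222700000" (illustrative input)
```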

The frontend is a Next.js dashboard that receives results in real-time as leads qualify — not after the job completes. A live discard counter shows exactly how many businesses were dropped and why. One button exports the full session to Excel, with the highest-quality leads (both email and phone present) sorted to the top and highlighted.
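
Streaming like this needs nothing exotic. A minimal sketch of the server side as a Next.js route handler, where the file path and the runPipeline generator are assumptions rather than the project's actual code:

```ts
// app/api/leads/stream/route.ts (hypothetical path)

// Assumed shape: the pipeline yields each lead the moment it qualifies.
async function* runPipeline(): AsyncIterable<Record<string, unknown>> {
  yield { name: 'Example Pvt Ltd', email: 'hello@example.com' }; // stub
}

export async function GET() {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      for await (const lead of runPipeline()) {
        // One SSE frame per qualified lead; the dashboard appends the row.
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(lead)}\n\n`));
      }
      controller.enqueue(encoder.encode('event: done\ndata: {}\n\n'));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```

On the client, a plain EventSource subscription is all the dashboard needs to append rows as they arrive, which is much of why SSE beats WebSockets for this one-directional flow.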

The system runs entirely without a database — data lives in memory for the session, making it portable, zero-infrastructure, and deployable to any VPS in under ten minutes.
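
The whole store can be as small as a module-level Map; a sketch under that assumption, with illustrative names:

```ts
// Zero infrastructure: the session store is process memory, full stop.
// Restarting the server clears it, which is exactly the contract:
// export to Excel before you restart.
type Lead = { name: string; email?: string; phone?: string };

const sessions = new Map<string, Lead[]>();

export function addLead(sessionId: string, lead: Lead): void {
  const leads = sessions.get(sessionId) ?? [];
  leads.push(lead);
  sessions.set(sessionId, leads);
}

export function getSessionLeads(sessionId: string): Lead[] {
  return sessions.get(sessionId) ?? [];
}
```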

Strategic Methodology

Pipeline-first, not feature-first. Every architectural decision — SSE over WebSockets, Cheerio before Playwright, in-memory over persistent storage — was made to keep the system fast, lightweight, and maintainable by one person. The constraint of no database is a feature, not a limitation.

Engineering Challenges

  • Deciding at runtime whether to use a lightweight HTML parser or a full headless browser — per website, per request — without slowing down the overall pipeline
  • Building an email extraction layer that rejects infrastructure noise, analytics addresses, and freemail fallbacks while still recovering valid contacts from unconventional page structures
  • Handling domain deduplication correctly across subdomains and multi-part TLDs without false positives (see the first sketch after this list)
  • Implementing a stop handler that gracefully drains in-flight requests, enforces a hard timeout, and force-closes browser contexts — while preserving all collected leads for export (see the second sketch after this list)
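
On the deduplication bullet: naive hostname comparison either merges unrelated sites on multi-part TLDs or treats every subdomain as a new business. A public-suffix-aware parse avoids both. A minimal sketch, assuming the tldts library:

```ts
import { getDomain } from 'tldts';

// Keyed on the registrable domain: "shop.example.co.uk" and
// "www.example.co.uk" both collapse to "example.co.uk", while
// "other.co.uk" stays distinct; ".co.uk" itself is never the key.
const seen = new Set<string>();

function isDuplicate(url: string): boolean {
  const domain = getDomain(url); // null for IPs or unparseable input
  if (!domain) return false;
  if (seen.has(domain)) return true;
  seen.add(domain);
  return false;
}
```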

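And on the stop handler: the drain-then-force-close sequence is essentially one race between the in-flight work and a hard deadline. A hypothetical sketch (the function shape and names are assumptions):

```ts
import type { BrowserContext } from 'playwright';

async function stop(
  inFlight: Promise<unknown>[],
  contexts: BrowserContext[],
): Promise<void> {
  const HARD_TIMEOUT_MS = 10_000; // the 10-second ceiling cited below
  // Graceful path: wait for in-flight requests to settle.
  // Hard path: the timer wins and we move on regardless.
  await Promise.race([
    Promise.allSettled(inFlight),
    new Promise((resolve) => setTimeout(resolve, HARD_TIMEOUT_MS)),
  ]);
  // Force-close whatever is still open; leads collected so far
  // already live in the session store, so nothing is lost.
  await Promise.allSettled(contexts.map((c) => c.close()));
}
```
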
Project Impact

"Built a full pipeline that goes from keyword + location to a downloadable Excel of qualified business leads — no manual lookup, no guessing because The hardest client to find is the one you haven't automated yet."

Core Arsenal

Next.js TypeScript Playwright exceljs TailwindCSS Crawlee Cheerio
STATUS: DEPLOYED
Intelligence Unit

Technical Log.

A high-fidelity breakdown of the build's architectural achievements and performance markers.

Synthesis

"A self-contained lead generation pipeline given a keyword and location, it discovers businesses, enriches them with details scraped , filters out dead-ends, and exports a ranked Excel file. Built with Next.js, Node.js, Playwright, and TypeScript. "

Hard Evidence

500–800 qualified leads per run after multi-stage filtering, from ~1000 raw discoveries
Email enrichment succeeds on approximately 90–95% of business websites
Phone data recovered on 90–98% of leads — primarily from the discovery source, supplemented from website scraping
Stop handler enforces a 10-second hard timeout with force-closure of all active browser contexts
Single-codebase pipeline handles global geography via ISO country code resolution from the location input
Marker 1

Real-time lead streaming via SSE — rows appear as they qualify, not after job completion

Marker 2

Multi-stage contact enrichment with hardened email filtering and international phone normalisation

Marker 3

Dynamic scraping strategy per website — fast-path for static pages, full browser for JS-heavy ones

Marker 4

In-memory architecture: no database, no infrastructure overhead, export-and-go

Marker 5

Excel export with automatic quality tiering; the best leads always appear first

Query Archive

01. Why does the system use two different scraping methods?
Not every website needs a full browser. A lightweight detection step checks for JS framework markers before deciding — static pages are parsed directly (faster, no browser overhead), JS-heavy pages go through a full headless browser with proper render waiting. This meaningfully reduces total scrape time across a 100-lead run.
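
A sketch of what that detection step can look like; the marker list and the text-length threshold are illustrative guesses, not the project's actual heuristics:

```ts
import * as cheerio from 'cheerio';

// Hypothetical markers that betray a client-rendered shell.
const JS_MARKERS = ['__NEXT_DATA__', 'data-reactroot', 'ng-version'];

// Fast path first: one plain fetch. Only escalate to Playwright
// when the static HTML looks like it renders client-side.
async function needsBrowser(url: string): Promise<boolean> {
  const html = await (await fetch(url)).text();
  if (JS_MARKERS.some((marker) => html.includes(marker))) return true;
  // A near-empty <body> is another strong hint of client-side rendering.
  const $ = cheerio.load(html);
  return $('body').text().trim().length < 200;
}
```
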
02. Why no database?
Deliberate constraint. Data lives in memory for the session only — export before restarting. This makes the system portable, zero-infrastructure, and deployable anywhere. Persistent storage is a planned future enhancement if cross-session history becomes a requirement.
03. How does it avoid picking up fake or useless emails?
The extraction layer runs a ranked priority system: company-domain addresses are preferred, infrastructure and no-reply patterns are rejected outright, and free-mail addresses are only used as a last resort if no company email exists. Script-injected and analytics noise addresses are also filtered out by pattern.
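
A minimal sketch of that ranking, with illustrative regexes rather than the production filters:

```ts
// Hypothetical filters; the real pattern lists are broader.
const REJECT = /(no-?reply|noreply|sentry|cloudflare|example\.com)/i;
const FREEMAIL = /@(gmail|yahoo|hotmail|outlook)\./i;

function pickBestEmail(found: string[], companyDomain: string): string | undefined {
  const domain = companyDomain.toLowerCase();
  const candidates = found.filter((e) => !REJECT.test(e));
  // Tier 1: an address on the company's own domain.
  const own = candidates.find((e) => e.toLowerCase().endsWith(`@${domain}`));
  if (own) return own;
  // Tier 2: any other non-freemail address (e.g. a parent brand's domain).
  const nonFree = candidates.find((e) => !FREEMAIL.test(e));
  if (nonFree) return nonFree;
  // Tier 3: freemail only when nothing better exists.
  return candidates[0];
}
```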