MODULE_12 // DATA RECONNAISSANCE & ATTACK SURFACE MAPPING

Recon Methodology

OPERATIONAL OBJECTIVE

Systematically enumerate a target's subdomains, historical endpoints, source-control exposure, and live infrastructure using a layered passive-then-active tooling pipeline — before issuing a single exploit payload.

Recon isn't the warm-up before "real hacking" in Chapters 7–11. On a mature target — one that's already been through a dozen pentests, with thousands of hunters poking the login form — the main application is usually the most hardened surface that exists. The bugs that actually pay are overwhelmingly found on infrastructure other hunters never looked at, because they never found it in the first place. A mediocre XSS skillset pointed at a forgotten staging server will outperform a brilliant XSS skillset pointed at a target everyone already tested. This chapter is where that leverage gets built.

Before You Touch Anything: Scope, Properly Read

Every program defines a scope, and reading it properly is a skill of its own — not just a glance at the asset table.

Wildcard scope (*.target.com) means everything under that root is fair game — generous, but still bounded. A wildcard on target.com says nothing about target-support.io, a separate brand domain the same company might own.
Explicit exclusions are listed separately from inclusions and are easy to miss if you only skim the "in scope" table. A program might list *.target.com as in-scope and then exclude legacy.target.com by name in a paragraph below the table — because that subdomain belongs to a recently acquired company still running its old, unaudited stack.
Multi-program setups are common at large companies. A parent company might run three separate bounty programs for three product lines, each with its own scope and payout tiers, and a subdomain that looks like it belongs to one might actually fall under another program's rules, or under none at all.

The habit to build now: read the entire scope page, not just the asset table. Note exclusions as carefully as inclusions. When in doubt about whether a discovered asset is covered, ask the program — most have a contact channel for exactly this — rather than assume. Recon tools have no concept of any of this. They will find what's there to be found, in-scope or not. The judgment about what to do with that information is entirely yours.

⛔ Scope Discipline Is Not Optional

subfinder will happily enumerate subdomains that belong to a company but were explicitly excluded from their bounty program — common with recent acquisitions, legacy brands, or third-party-hosted services.

Testing or even aggressively scanning an out-of-scope asset can get you removed from a program, and in rare cases creates real legal exposure. Before you move from passive to active recon on any target, read the program's full scope page — exclusions as carefully as inclusions — and cross-check every asset you find against it.
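
The cross-check described above can be sketched as a small filter. This is a minimal illustration, not a substitute for reading the scope page: the `Scope` class and its field names are hypothetical, and it assumes that a wildcard does not cover the bare root domain and that an exclusion also covers anything beneath the excluded host. Both assumptions should be confirmed with the program.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    """Hypothetical model of one program's scope page."""
    wildcard_roots: set = field(default_factory=set)   # "target.com" stands for *.target.com
    excluded_hosts: set = field(default_factory=set)   # hosts excluded by name

    def in_scope(self, host: str) -> bool:
        host = host.strip().lower().rstrip(".")
        # Assumption: an exclusion covers the named host and anything under it.
        for excluded in self.excluded_hosts:
            if host == excluded or host.endswith("." + excluded):
                return False
        # Match on a label boundary so "eviltarget.com" never passes as *.target.com.
        # Assumption: the bare root itself is not covered by the wildcard.
        return any(host.endswith("." + root) for root in self.wildcard_roots)


scope = Scope(wildcard_roots={"target.com"}, excluded_hosts={"legacy.target.com"})
for candidate in ["api.target.com", "legacy.target.com", "target-support.io"]:
    print(candidate, scope.in_scope(candidate))
```

Running every tool's output through a filter like this before the active phase keeps the scope decision in one place instead of in your memory.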

The Two Phases — and Why Order Matters

Recon splits into two phases for a specific reason: risk management, not just convenience.

Passive recon means gathering information that already exists somewhere else — a certificate log, a search index, an archive, a public code repository — without ever sending a single packet to the target's own infrastructure. Nothing you do here is visible to the target, because you're not talking to them; you're talking to third parties who happen to have already collected information about them.

Active recon means directly interacting with the target's own systems — resolving their DNS, sending them HTTP requests, probing their ports. This is detectable. It can trip rate limits, get logged by a WAF, alert a SOC analyst watching a dashboard, or — if you've made a scope mistake — touch something you had no business touching. You do this second, after passive recon has already given you a strong picture, so that by the time you start generating detectable traffic, you're doing it efficiently and deliberately against assets you've already reasoned about, not blindly exploring.

PHASE 1 — PASSIVE RECON

Certificate Transparency Logs

In practice, every publicly trusted TLS certificate gets logged in Certificate Transparency (CT), a set of public, append-only logs, because major browsers will not trust a certificate that lacks proof of logging. CT exists for an unrelated reason (to stop certificate authorities from secretly issuing fraudulent certificates for domains they don't control), but the side effect is one of the best recon sources available: a company cannot get a publicly trusted certificate that names internal-staging.target.com without that hostname becoming permanently visible in CT logs, even years after the subdomain is decommissioned.

A few things worth understanding here, not just the tool:

Wildcard certificates (*.target.com) show up in CT logs too, but they don't reveal individual subdomain names — they just confirm the org uses wildcard certs somewhere, which is itself a small signal about their infrastructure maturity.
SAN (Subject Alternative Name) fields often list multiple hostnames on a single certificate — a cert issued primarily for target.com might also cover www.target.com, api.target.com, and mail.target.com in the same SAN list, all visible in one CT log entry.
Occasionally a CT entry shows part of a hostname as redacted. Name redaction was proposed for CT but never widely adopted, so these records are rare; when you do see one, it can be a signal that something internal or pre-launch sits behind it.

crt.sh is the standard free web interface, and searching %.target.com (the SQL-style wildcard it supports directly) often surfaces subdomains nobody intended to advertise. For programmatic use, tlsx (ProjectDiscovery) works with CT logs and live TLS handshakes from the command line; keep in mind that a live handshake connects to the target's own host, so that mode belongs in Phase 2. Censys offers a broader certificate and host search with a more powerful query language if you need to go deeper than crt.sh's interface allows.
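
To make the SAN point concrete, here is a minimal sketch of pulling hostnames out of crt.sh results. It assumes the JSON that crt.sh returns when `output=json` is added to a query, where each entry has a `name_value` field holding one or more names separated by newlines; that layout is observed behavior rather than a formal contract, and the function name is hypothetical.

```python
import json


def names_from_crtsh(json_text, root):
    """Collect unique hostnames under `root` from crt.sh JSON output."""
    root = root.lower().rstrip(".")
    names = set()
    for entry in json.loads(json_text):
        # One certificate can list several SAN names in a single entry.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # a wildcard reveals no children, only its parent name
            if name == root or name.endswith("." + root):
                names.add(name)
    return sorted(names)


# Hand-written sample in the assumed shape; not real crt.sh data.
sample = json.dumps([
    {"name_value": "target.com\nwww.target.com\napi.target.com"},
    {"name_value": "*.target.com"},
    {"name_value": "internal-staging.target.com"},
    {"name_value": "unrelated.example.org"},
])
print(names_from_crtsh(sample, "target.com"))
```

The sample is parsed offline on purpose; fetching the JSON is a single HTTP GET to crt.sh, which is a third party and therefore still passive.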

Search Engine Dorking

Search engines have already crawled vast amounts of a target's public content, including things that were exposed by accident and indexed before anyone noticed. "Dorking" means using a search engine's advanced operators deliberately, instead of relying on luck.

The operators that matter most for recon work:

Operator	Effect
`site:`	Restrict results to one domain.
`inurl:`	Match a string anywhere in the URL.
`intitle:`	Match a string in the page title.
`intext:`	Match a string in the visible page body.
`ext:` / `filetype:`	Restrict to a file extension.
`-` (minus prefix)	Exclude a term from results.

Combining extensions is most portable with an explicit, capitalized OR inside parentheses. Google also accepts a pipe character as OR, but not every engine does, so the spelled-out form is the safer habit:

site:target.com (ext:env OR ext:log OR ext:sql OR ext:bak)

A few more patterns worth having memorized, because each targets a different class of accidental exposure:

site:target.com inurl:admin                  # exposed admin panels
site:target.com intitle:"index of"           # open directory listings
site:target.com ext:pdf intext:confidential  # leaked internal documents
site:target.com inurl:wp-content -inurl:plugins   # WordPress upload dirs, excluding plugin noise

One thing worth knowing about why this works at all: search engine indexes lag behind the live site, sometimes by weeks. A dork can surface a file or page that has since been removed from the live server but is still listed in the engine's index. That is useful in itself, because it tells you the route existed and may still resolve even if nothing links to it anymore.
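
If you run the same dorks against many domains, generating them avoids typos in the parenthesized OR clause. A minimal sketch follows; `build_dork` is a hypothetical helper, and it only assembles the query string shown earlier in this section without submitting anything to a search engine.

```python
def build_dork(domain, extensions):
    """Build a site-restricted dork with an explicit OR list of extensions."""
    if not extensions:
        return f"site:{domain}"
    clause = " OR ".join(f"ext:{ext.lstrip('.')}" for ext in extensions)
    return f"site:{domain} ({clause})"


print(build_dork("target.com", ["env", "log", "sql", "bak"]))
# → site:target.com (ext:env OR ext:log OR ext:sql OR ext:bak)
```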

The Wayback Machine and Its CDX API

The Internet Archive has been crawling and snapshotting the web since the late 1990s. The recon value isn't browsing old homepages for nostalgia — it's pulling the complete list of every URL the archive has ever observed for a domain, including query parameters and backend routes that vanished from the frontend years ago but may still resolve, because a redesign removed the link without anyone auditing whether the underlying server route was actually decommissioned.

Under the hood, this works through the Wayback Machine's CDX API — a queryable index of every snapshot the archive holds. A direct query looks like:

http://web.archive.org/cdx/search/cdx?url=target.com/*&output=text&fl=original&collapse=urlkey

That single request returns a flat list of every distinct URL the archive holds for that host (use url=*.target.com/* to include subdomains as well). The CLI tool waybackurls (by tomnomnom) wraps this same API so you don't have to construct the query by hand: feed it a domain, get back the URL list. Understanding that it's just a CDX query underneath means you can also hit the API directly when you need finer control, like filtering to a specific file extension or date range.
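
As a sketch of that finer control, the helper below builds a CDX query URL with an optional extension filter and date range. It assumes the CDX server's `filter=field:regex`, `from`, and `to` parameters; `cdx_query` and its argument names are hypothetical, and the block only constructs the URL without sending it.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query(domain, extension=None, from_year=None, to_year=None):
    """Build a CDX URL listing archived URLs for `domain`, optionally narrowed."""
    params = [
        ("url", f"{domain}/*"),
        ("output", "text"),
        ("fl", "original"),
        ("collapse", "urlkey"),
    ]
    if extension:
        # Regex applied to the "original" field; misses URLs that carry a query string.
        params.append(("filter", rf"original:.*\.{extension}$"))
    if from_year:
        params.append(("from", str(from_year)))
    if to_year:
        params.append(("to", str(to_year)))
    # Keep "/", "*" and ":" readable; everything else is percent-encoded.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="/*:")


print(cdx_query("target.com", extension="sql", from_year=2015))
```

The resulting URL can be pasted into a browser or handed to any HTTP client; the request goes to the Internet Archive, not to the target.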

Source Control Recon — GitHub, GitLab, and Leaked Secrets

This is a passive technique that's easy to skip entirely if you're only thinking in terms of DNS and web crawlers, and it's consistently one of the highest-value sources available. Developers commit code to public or semi-public repositories constantly, and commit history is forever — deleting a file in a later commit does not remove it from the repository's history unless someone explicitly rewrites that history, which almost nobody does.

What to look for:

Org-wide repo enumeration — search GitHub for the company name, or use their public org page directly, to find every public repository they maintain, including ones nobody bothered to make private after a project wound down.
Secrets in commit history — a .env file or hardcoded API key committed once and "removed" in a later commit is still sitting in the git history of that repo, retrievable with git log -p or by browsing individual commits on GitHub directly.
Dorking GitHub itself — searches like org:target-company password or org:target-company "api_key" inside GitHub's own search surface code that was never meant to be public, sitting in forks, gists, or old branches.
Tools built for this specifically — gitleaks and trufflehog scan repositories (including full commit history, not just the current state) for patterns that look like credentials, tokens, or keys, automating what would otherwise be a lot of manual git log archaeology.
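
To show what those scanners automate, here is a deliberately tiny sketch that walks the patch text produced by `git log -p` and flags added or removed lines that look like credentials. The rule names and the `scan_patch` function are hypothetical, the three patterns are a small illustrative subset, and real tools such as gitleaks and trufflehog ship far larger rule sets with verification logic this sketch lacks.

```python
import re

# Illustrative rules only; assumption: the commonly cited AKIA-prefixed key ID format.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(r"(?i)\bapi_key\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"),
}


def scan_patch(patch_text):
    """Return (line_number, rule_name) pairs for changed lines that match a rule."""
    hits = []
    for number, line in enumerate(patch_text.splitlines(), start=1):
        # Skip file headers and unchanged context; keep both "+" and "-" lines,
        # because a removed line is exactly where a "deleted" secret shows up.
        if line.startswith(("+++ ", "--- ")) or not line.startswith(("+", "-")):
            continue
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, rule))
    return hits


# Hand-written patch using AWS's documented example key, not a real credential.
patch = "\n".join([
    "--- a/.env",
    "+++ /dev/null",
    "-AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE",
    "-DEBUG=true",
    " api_key = 'abcdefghijklmnop1234'",
])
print(scan_patch(patch))
```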

💰 Bounty Angle — Secrets Outweigh Exploits

A leaked cloud credential or internal API key found this way can be worth more than any web vulnerability you'll find later in this course — and it costs nothing but careful searching.

PHASE 2 — ACTIVE RECON

By this point you should already have a long list of candidate subdomains and historical URLs, all checked against scope. Now you start touching the target's own infrastructure, deliberately and in order.

DNS-Level Recon

DNS itself, queried directly, reveals more than most beginners expect.

Zone transfers (AXFR) — a deeply misconfigured DNS server will sometimes hand over its entire zone file to anyone who asks, listing every record it manages in one response. Rare on well-run infrastructure today, but free to check and occasionally still works:
```
dig axfr target.com @ns1.target.com
```
Reverse DNS lookups — given an IP address (discovered through other means), a reverse lookup can reveal a hostname that was never linked anywhere on the public web, sometimes exposing internal naming conventions that help you guess at other hidden assets.
ASN (Autonomous System Number) lookups — large organizations often own entire blocks of IP space. Tools like amass intel -org "Target Company" map a company name to its owned ASNs and IP ranges, which can surface infrastructure with no DNS record pointing at it at all, but that still responds if you connect to the IP directly.
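
For the reverse lookup item above, this minimal sketch shows the PTR name that a reverse query actually asks for, plus a thin wrapper around the standard library's resolver call. The function names are hypothetical, and 192.0.2.10 is a documentation-range address used purely as an example. Calling `reverse_lookup` sends a real DNS query, so it belongs to the active phase.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the DNS name a reverse (PTR) lookup queries for this address."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip):
    """Resolve an IP to a hostname, or None if no PTR record answers."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


print(ptr_name("192.0.2.10"))
# → 10.2.0.192.in-addr.arpa
```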

Subdomain Enumeration, Properly Combined

Passive sources tell you what existed at some point; they don't confirm what's alive now, and no single source is complete. subfinder (ProjectDiscovery) aggregates dozens of passive sources into one command:

subfinder -d target.com -o subs_subfinder.txt

Running a second tool with a different source mix and merging results catches what any single tool misses — assetfinder and amass enum -passive are the common companions here:

assetfinder --subs-only target.com > subs_assetfinder.txt
amass enum -passive -d target.com -o subs_amass.txt
cat subs_*.txt | sort -u > all_subdomains.txt

This combine-and-deduplicate habit matters more than people expect — different tools query different passive sources under the hood, and the overlap between any two tools' results is usually far from total.
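
The same merge can be sketched in Python when you also want to normalize names and measure how much two tools actually overlap. Both function names are hypothetical, and the sample lists are invented for illustration rather than real tool output.

```python
def merge_subdomains(root, *sources):
    """Normalize, deduplicate, and keep only names under `root`."""
    root = root.lower().rstrip(".")
    merged = set()
    for source in sources:
        for raw in source:
            host = raw.strip().lower().rstrip(".")
            if host.startswith("*."):
                host = host[2:]
            if host == root or host.endswith("." + root):
                merged.add(host)
    return sorted(merged)


def overlap_ratio(first, second):
    """Share of all names that both lists contain (Jaccard index)."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 1.0


tool_a = ["api.target.com", "WWW.target.com.", "dev.target.com"]
tool_b = ["www.target.com", "mail.target.com", "cdn.othersite.net"]
print(merge_subdomains("target.com", tool_a, tool_b))
print(overlap_ratio(merge_subdomains("target.com", tool_a),
                    merge_subdomains("target.com", tool_b)))
```

A low ratio on your own results is a concrete reason to keep the second and third tool in the pipeline.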

Port Scanning Before You Assume "Web Only"

Before assuming every live host only matters over HTTP/HTTPS, a fast port scan can reveal services nobody documented — an exposed database port, an admin interface on a non-standard port, a management panel that was never meant to be internet-facing at all. naabu (ProjectDiscovery) is built for exactly this — fast, designed to feed straight into the next tool in the pipeline:

naabu -l all_subdomains.txt -o open_ports.txt

Traditional nmap remains the deeper, slower option when you need service-version fingerprinting on a specific host rather than a fast sweep across many.
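
A short sketch of turning a port sweep into a "what is not just web" list. It assumes the results file holds one `host:port` pair per line, and it treats a small hypothetical set of ports as "web"; adjust both to match what your scan actually produced.

```python
from collections import defaultdict

# Assumption: which ports count as ordinary web traffic is a judgment call.
WEB_PORTS = {80, 443, 8080, 8443}


def group_ports(lines):
    """Group host:port lines into {host: {ports}}."""
    ports = defaultdict(set)
    for line in lines:
        host, _, port = line.strip().rpartition(":")
        if host and port.isdigit():
            ports[host].add(int(port))
    return dict(ports)


def non_web_services(ports_by_host):
    """Keep only hosts exposing something other than the usual web ports."""
    return {
        host: sorted(ports - WEB_PORTS)
        for host, ports in ports_by_host.items()
        if ports - WEB_PORTS
    }


sweep = ["api.target.com:443", "db.target.com:5432", "db.target.com:443", ""]
print(non_web_services(group_ports(sweep)))
```

Hosts that appear in this output are the ones worth a slower, deeper nmap pass.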

Probing for Live Web Hosts

Once you know what's alive and which ports are open, httpx (ProjectDiscovery) sends real HTTP requests and reports status codes, response headers, page titles, and detected technologies in one pass:

cat all_subdomains.txt | httpx -o live_hosts.txt -title -tech-detect -status-code

⛔ This Is The Line

The port scan and this HTTP probe are where you're unambiguously generating traffic the target can see in their own logs. Confirm every host on your list is scope-checked before either step runs, not after.

Directory, File, and Virtual Host Fuzzing

Subdomains are one axis of attack surface; content within a single host is another, and there are two distinct things worth fuzzing for, not just one.

Path and file fuzzing brute-forces directory and file names against a target using a wordlist:

ffuf -u https://target.com/FUZZ -w wordlist.txt

Extending this to specific file extensions catches backup files and source artifacts a plain path wordlist misses:

ffuf -u https://target.com/FUZZ -w wordlist.txt -e .bak,.old,.zip,.sql,.env

Virtual host (vhost) fuzzing is a different technique entirely — instead of brute-forcing paths on a known host, you brute-force the Host header against a single IP address, looking for hidden virtual hosts that respond to a different hostname than the one you connected with. This catches internal applications co-located on shared infrastructure that were never given a public DNS record at all:

ffuf -u https://target.com -H "Host: FUZZ.target.com" -w subdomain_wordlist.txt -fs <baseline_response_size>

Both are noisier and slower than anything in the passive phase — point them only at confirmed, in-scope, live hosts.
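
The size filter in the vhost command is easier to reason about with a small sketch. Given response sizes per candidate hostname, it drops everything that matches the baseline, which is the same idea as filtering on a baseline response size. The function names are hypothetical, and guessing the baseline as the most common size is a heuristic introduced here for illustration, not a description of how ffuf works internally.

```python
from collections import Counter


def guess_baseline(sizes):
    """Assume the most common response size is the default vhost's page."""
    return Counter(sizes.values()).most_common(1)[0][0]


def interesting_vhosts(sizes, baseline=None):
    """Return candidates whose response size differs from the baseline."""
    if baseline is None:
        baseline = guess_baseline(sizes)
    return {host: size for host, size in sizes.items() if size != baseline}


# Invented sizes in bytes, keyed by the Host header that was sent.
observed = {
    "www.target.com": 1532,
    "mail.target.com": 1532,
    "jenkins.target.com": 8841,
    "test.target.com": 1532,
}
print(interesting_vhosts(observed))
```

Size alone produces false positives when pages embed the requested hostname, so treat each survivor as something to open manually.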

Reading Technology Signatures

Once a host responds, the response itself — and a couple of files most sites still serve by convention — tells you a great deal:

Signal	What It Reveals
`Server` / `X-Powered-By` headers	Web server and backend framework — nginx, Apache, Express, a specific PHP version.
JS bundle filenames & inline comments	Frontend framework — React, Vue, Angular build artifacts, sometimes a literal version string.
JS source maps (`.js.map` files)	Can leak original, unminified source code if a build pipeline forgot to strip them in production.
Cookie names	Session framework — `PHPSESSID`, `JSESSIONID`, `connect.sid` for Express.
`robots.txt`	Often lists paths the developers specifically wanted hidden from search engines — which is exactly why a hunter should read it.
`sitemap.xml`	A structured map of every page the site wants indexed — frequently includes admin or staging paths accidentally left in.
`CNAME` records	Third-party services in use — S3 buckets, CDNs, SaaS platforms, and the starting point for subdomain takeover hunting (below).

A robots.txt that disallows /admin-panel-v2/ is, ironically, one of the most reliable ways a site will tell you exactly where its admin panel lives.
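
Reading robots.txt by eye works for one host; across dozens, a small parser helps. This is a minimal sketch that extracts only `Disallow` values and ignores user-agent grouping; `disallowed_paths` is a hypothetical helper and the sample file is invented.

```python
def disallowed_paths(robots_txt):
    """Return the distinct Disallow paths in a robots.txt body, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        name, separator, value = line.partition(":")
        if separator and name.strip().lower() == "disallow":
            value = value.strip()
            if value and value not in paths:  # an empty Disallow blocks nothing
                paths.append(value)
    return paths


robots = """User-agent: *
Disallow: /admin-panel-v2/
Disallow: /tmp/  # scratch space
Disallow:
Allow: /public/
"""
print(disallowed_paths(robots))
# → ['/admin-panel-v2/', '/tmp/']
```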

💰 Bounty Angle — Stack Shapes Strategy

Vulnerability classes aren't evenly distributed across technologies. An old, unpatched PHP backend raises your expectation of SQL injection or LFI. A target fronted by an S3 bucket raises your expectation of misconfigured bucket permissions. Knowing the stack tells you which later chapters of this course to reach for first on a given target — recon isn't separate from exploitation, it's what aims it.

Subdomain Takeover — A Vulnerability Class Recon Finds By Itself

This deserves its own section because it's a real, frequently-paid vulnerability class that exists entirely inside the recon process — you don't craft a payload for it, you find it by paying attention to what your subdomain list is telling you.

The Mechanism

A subdomain has a DNS record — usually a CNAME — pointing to a third-party service (GitHub Pages, Heroku, an Azure App Service, an AWS S3 bucket, dozens of others). At some point the team stops using that service and deletes the resource on the third-party side, but forgets to remove the DNS record pointing to it. The DNS record is now "dangling" — it still resolves, but to a service that no longer exists. Because most of these platforms let any customer claim a hostname that isn't currently in use, an attacker can register that exact resource name on the third-party platform and suddenly control whatever the target's own subdomain serves.

Why It's Worth Real Money

Once you control the content served at forgotten.target.com, you can potentially read or set cookies scoped to the parent domain (if those cookies aren't properly scoped or flagged), bypass a Content Security Policy that explicitly trusts *.target.com as a script source, host a phishing page on a hostname users and browsers already trust, or intercept OAuth/SSO redirect flows that whitelist the subdomain as a valid callback target. This is consistently rated medium-to-critical depending on what the parent domain trusts that subdomain to do.

Finding Candidates From Output You Already Have

Look for subdomains whose CNAME points to one of the well-known vulnerable patterns (*.github.io, *.herokuapp.com, *.azurewebsites.net, *.s3.amazonaws.com, and dozens more — the community-maintained can-i-take-over-xyz repository tracks the current list and exact fingerprint for each service), then visit the subdomain and look for the specific error response that service shows for an unclaimed resource — GitHub Pages shows "There isn't a GitHub Pages site here," a dangling S3 bucket returns a NoSuchBucket XML error, and so on. Automated tools speed this up across a large subdomain list:

subzy run --targets all_subdomains.txt
nuclei -l all_subdomains.txt -tags takeover
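
The matching logic those tools apply can be sketched in a few lines: a CNAME suffix identifies the service, and a marker string in the response body suggests the resource is unclaimed. The table below holds only the two fingerprints quoted in this section, and `takeover_candidate` is a hypothetical name. The current, complete list lives in the can-i-take-over-xyz repository.

```python
# Illustrative subset: CNAME suffix -> text the service shows for an unclaimed resource.
FINGERPRINTS = {
    ".github.io": "There isn't a GitHub Pages site here",
    ".s3.amazonaws.com": "NoSuchBucket",
}


def takeover_candidate(cname, body):
    """Return the matching service suffix if CNAME and body both fit, else None."""
    cname = cname.lower().rstrip(".")
    for suffix, marker in FINGERPRINTS.items():
        if cname.endswith(suffix) and marker in body:
            return suffix
    return None


print(takeover_candidate("old-docs.github.io.", "<p>There isn't a GitHub Pages site here.</p>"))
print(takeover_candidate("old-docs.github.io.", "<h1>Project documentation</h1>"))
```

A non-None result is a candidate for manual verification, not a confirmed finding.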

📌 The Discipline Point

A dangling CNAME pointing to a fingerprint match is a candidate, not a confirmed bug. Programs are explicit that they don't want theoretical submissions — claim the resource (only when the program's rules permit it), serve a harmless proof-of-concept page identifying yourself and a timestamp, capture evidence, and remove it promptly. This is one of the few vulnerability classes in this entire course where the proof-of-concept step requires you to actually claim infrastructure, which makes the ethics of doing it cleanly matter more than usual.

Drill Sandbox 12 — Triage Console

A live feed of raw recon hits is streaming in against devcore-infrastructure.io. Each finding needs to land in the right bucket of the attack-surface ledger — or get discarded, if it's not actually fair game. Read the scope notice first. Misfiling an in-scope asset costs accuracy; misfiling an out-of-scope one is the mistake that gets hunters removed from programs.

TRIAGE CONSOLE v1.0 // devcore-infrastructure.io bounty program

SCOPE: *.devcore-infrastructure.io is in-scope (wildcard). EXCLUDED: legacy-auth.devcore-infrastructure.io (acquired entity, separate unaudited stack — explicitly out per program rules) and anything on devcore-support.io (different brand, different program).

Buckets: Subdomains, Endpoints, Tech Signatures, Takeover Candidates, Discarded

Building the Attack Surface Ledger

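Nothing above should live only in scrollback. Before testing anything, consolidate findings into a structured, living record. Four of its five buckets are the ones the drill above has you sort findings into; the fifth, for leaked secrets and source exposure, gets its own row because it is handled differently: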

Ledger Element	What Goes In It
Subdomain Inventory	Every live host, tagged by purpose (production, staging, internal tool, third-party integration), CNAME target if relevant, cross-checked against scope.
Endpoint Map	Every distinct path and parameter observed, especially ones with obvious data-shape signals — `id=`, `user=`, `redirect=`, `file=` — the parameter types most associated with IDOR (Chapter 9), open redirects, and SSRF (Chapter 11).
Tech Stack Signatures	What's running where, so you know which later chapters' techniques to prioritize on which host.
Takeover Candidates	Any dangling CNAME matches, flagged for manual verification before any claim attempt.
Leaked Secrets / Source Exposure	Anything found in git history, exposed `.env` files, or source maps — handled separately and carefully, since these often need immediate, responsible disclosure rather than routine reporting.

This ledger is what turns "I scanned a company" into "I have a prioritized, defensible plan." It also makes your eventual bug reports (Chapter 14) read very differently — a report that demonstrates you understood the target's structure lands very differently from one that looks like a lucky payload guess.
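
One way to keep such a ledger is a small structured record instead of loose text files. The sketch below mirrors the five rows of the table; the `Ledger` class, its field names, and the choice to key endpoints by host plus path are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlsplit

# Parameter names the Endpoint Map row calls out as worth flagging.
INTERESTING_PARAMS = {"id", "user", "redirect", "file"}


@dataclass
class Ledger:
    subdomains: dict = field(default_factory=dict)          # host -> purpose tag
    endpoints: dict = field(default_factory=dict)           # "host/path" -> flagged params
    tech: dict = field(default_factory=dict)                # host -> set of signatures
    takeover_candidates: set = field(default_factory=set)   # hosts awaiting manual checks
    secrets: list = field(default_factory=list)             # notes on leaked material

    def add_endpoint(self, url):
        """Record a URL and flag any parameter names from INTERESTING_PARAMS."""
        parts = urlsplit(url)
        key = parts.netloc + parts.path
        names = {name.lower() for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
        self.endpoints.setdefault(key, set()).update(names & INTERESTING_PARAMS)


ledger = Ledger()
ledger.subdomains["app.target.com"] = "production"
ledger.add_endpoint("https://app.target.com/download?file=report.pdf&lang=en")
ledger.add_endpoint("https://app.target.com/download?FILE=x&redirect=")
print(ledger.endpoints)
```

Feeding the historical URL list through `add_endpoint` gives you the flagged-parameter view of the Endpoint Map without reading every URL by hand.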

Your Hands-On Plan for This Chapter

1. Core Toolkit Deployment

Install subfinder (ProjectDiscovery): github.com/projectdiscovery/subfinder

Install httpx (ProjectDiscovery): github.com/projectdiscovery/httpx

Install naabu (ProjectDiscovery): github.com/projectdiscovery/naabu

Install waybackurls (tomnomnom): github.com/tomnomnom/waybackurls

Install ffuf, a fast web fuzzer: github.com/ffuf/ffuf

Install OWASP amass for ASN and passive enumeration: github.com/owasp-amass/amass

Install gitleaks for source-control secret scanning: github.com/gitleaks/gitleaks

2. Sandbox Recon Verification Exercises

Pick a program with a public scope page. Read the full scope, including exclusions, before touching any tool.

Execute a certificate search for that target on crt.sh, noting any SAN entries or redacted records.

Run the target through subfinder | httpx, compiling valid live responses into a local text ledger.

Pull the target's full historical URL list with waybackurls and flag any routes that no longer have a live frontend link.

Search GitHub for the target's org name and scan any public repos with gitleaks.

Check every discovered subdomain's CNAME against known subdomain-takeover fingerprints.

Cross-check every live host you found against the program's scope page before doing anything else with it.

Understanding Check

Verification Quest 01: Shadow Infrastructure Values

Can you explain why old/forgotten subdomains are often easier targets than the main site?

Verification Quest 02: Operational Attribution Modalities

Do you understand the difference between passive and active recon, and why passive comes first?

Verification Quest 03: Systematic Ledger Returns

Can you explain in your own words why mapping the attack surface before testing makes you more effective?

Verification Quest 04: Scope Discipline

Why can a subdomain that genuinely belongs to a company still be off-limits to test?

Verification Quest 05: Dangling Pointers

Explain the mechanism behind subdomain takeover — what makes a CNAME "dangling," and why does claiming the underlying resource give you control of the subdomain?

The Mindset Shift

The throughline of this entire chapter: recon is not a checklist you rush through once before the "real" testing starts. It's a discipline you return to, layer by layer, for every single target — and the hunters who consistently get paid are rarely running fundamentally different exploits than everyone else. They're looking at infrastructure nobody else bothered to map.