Systematically enumerate a target's subdomains, historical endpoints, source-control exposure, and live infrastructure using a layered passive-then-active tooling pipeline — before issuing a single exploit payload.
Recon isn't the warm-up before "real hacking" in Chapters 7–11. On a mature target — one that's already been through a dozen pentests, with thousands of hunters poking the login form — the main application is usually the most hardened surface that exists. The bugs that actually pay are overwhelmingly found on infrastructure other hunters never looked at, because they never found it in the first place. A mediocre XSS skillset pointed at a forgotten staging server will outperform a brilliant XSS skillset pointed at a target everyone already tested. This chapter is where that leverage gets built.
Every program defines a scope, and reading it properly is a skill of its own — not just a glance at the asset table.
- A wildcard scope (*.target.com) means everything under that root is fair game: generous, but still bounded. A wildcard on target.com says nothing about target-support.io, a separate brand domain the same company might own.
- Exclusions override wildcards. A program can list *.target.com as in-scope and then exclude legacy.target.com by name in a paragraph below the table, because that subdomain belongs to a recently acquired company still running its old, unaudited stack.
The habit to build now: read the entire scope page, not just the asset table. Note exclusions as carefully as inclusions. When in doubt about whether a discovered asset is covered, ask the program (most have a contact channel for exactly this) rather than assume. Recon tools have no concept of any of this. They will find what's there to be found, in-scope or not. The judgment about what to do with that information is entirely yours.
subfinder will happily enumerate subdomains that belong to a company but were explicitly excluded from their bounty program — common with recent acquisitions, legacy brands, or third-party-hosted services.
Testing or even aggressively scanning an out-of-scope asset can get you removed from a program, and in rare cases creates real legal exposure. Before you move from passive to active recon on any target, read the program's full scope page — exclusions as carefully as inclusions — and cross-check every asset you find against it.
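The cross-check described above can be sketched in a few lines. This is a minimal illustration, not a program's official rule set: the function names are hypothetical, and it assumes that a wildcard does not cover the bare root domain and that an exclusion also covers everything beneath the excluded host. Real programs may define both differently, so the scope page stays the authority.

```python
def _covers(pattern: str, host: str) -> bool:
    """True if a scope pattern covers a hostname."""
    host = host.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if pattern.startswith("*."):
        # "*.target.com" covers subdomains only; the leading dot
        # prevents look-alikes such as "eviltarget.com" from matching.
        return host.endswith(pattern[1:])
    # Assumption: a named host also covers anything beneath it.
    return host == pattern or host.endswith("." + pattern)


def in_scope(host: str, includes: list[str], excludes: list[str]) -> bool:
    """Exclusions are checked first, so they always win over wildcards."""
    if any(_covers(p, host) for p in excludes):
        return False
    return any(_covers(p, host) for p in includes)


includes = ["*.target.com"]
excludes = ["legacy.target.com"]
print(in_scope("api.target.com", includes, excludes))      # True
print(in_scope("legacy.target.com", includes, excludes))   # False
print(in_scope("target-support.io", includes, excludes))   # False
```

Checking exclusions before inclusions is the design choice that matters here: a host that matches both lists is treated as out of scope.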
Recon splits into two phases for a specific reason: risk management, not just convenience.
Passive recon means gathering information that already exists somewhere else — a certificate log, a search index, an archive, a public code repository — without ever sending a single packet to the target's own infrastructure. Nothing you do here is visible to the target, because you're not talking to them; you're talking to third parties who happen to have already collected information about them.
Active recon means directly interacting with the target's own systems — resolving their DNS, sending them HTTP requests, probing their ports. This is detectable. It can trip rate limits, get logged by a WAF, alert a SOC analyst watching a dashboard, or — if you've made a scope mistake — touch something you had no business touching. You do this second, after passive recon has already given you a strong picture, so that by the time you start generating detectable traffic, you're doing it efficiently and deliberately against assets you've already reasoned about, not blindly exploring.
Every publicly trusted TLS certificate issued by any certificate authority gets permanently logged in a public, append-only ledger system called Certificate Transparency (CT). It exists for a completely unrelated reason — to stop certificate authorities from secretly issuing fraudulent certificates for domains they don't control — but the side effect is one of the best recon sources available: a company cannot get a valid certificate for internal-staging.target.com without that hostname becoming permanently visible in CT logs, even years after the subdomain is decommissioned.
A few things worth understanding here, not just the tool:
- Wildcard certificates (*.target.com) show up in CT logs too, but they don't reveal individual subdomain names. They only confirm the org uses wildcard certs somewhere, which is itself a small signal about their infrastructure maturity.
- A single certificate often lists several hostnames in its Subject Alternative Name (SAN) field: a certificate for target.com might also cover www.target.com, api.target.com, and mail.target.com in the same SAN list, all visible in one CT log entry.
- crt.sh is the standard free web interface, and searching %.target.com (SQL wildcard syntax it supports directly) often surfaces subdomains nobody intended to advertise. For programmatic use, tlsx (ProjectDiscovery) queries CT logs and live TLS handshakes from the command line, and Censys offers a broader certificate and host search with a more powerful query language if you need to go deeper than crt.sh's interface allows.
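To make the SAN point concrete, here is a small sketch of turning CT search results into a clean hostname list. It assumes the results come from crt.sh's JSON output and that each entry carries a newline-separated name_value field; the function name and the sample entries are made up for illustration.

```python
def hostnames_from_ct(entries: list[dict], root: str) -> list[str]:
    """Collect distinct hostnames under `root` from CT search results."""
    root = root.lower()
    suffix = "." + root
    found = set()
    for entry in entries:
        # One certificate can list many names (its SAN list).
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # a wildcard only tells us the parent exists
            if name == root or name.endswith(suffix):
                found.add(name)
    return sorted(found)


# Illustrative entries only, shaped like the assumed JSON output.
sample = [
    {"name_value": "target.com\nwww.target.com\napi.target.com"},
    {"name_value": "*.target.com"},
    {"name_value": "internal-staging.target.com"},
    {"name_value": "mail.nottarget.com"},
]
print(hostnames_from_ct(sample, "target.com"))
```

The suffix check includes the leading dot so that an unrelated domain ending in the same letters is dropped rather than merged into the list.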
Search engines have already crawled vast amounts of a target's public content, including things that were exposed by accident and indexed before anyone noticed. "Dorking" means using a search engine's advanced operators deliberately, instead of relying on luck.
The operators that matter most for recon work:
| Operator | Effect |
|---|---|
| site: | Restrict results to one domain. |
| inurl: | Match a string anywhere in the URL path. |
| intitle: | Match a string in the page title. |
| intext: | Match a string in the visible page body. |
| ext: / filetype: | Restrict to a file extension. |
| - (minus prefix) | Exclude a term from results. |
Combining extensions works most reliably with an explicit, capitalized OR inside parentheses. Some engines also accept a pipe character as shorthand, but OR is the form to rely on:
site:target.com (ext:env OR ext:log OR ext:sql OR ext:bak)
A few more patterns worth having memorized, because each targets a different class of accidental exposure:
site:target.com inurl:admin # exposed admin panels
site:target.com intitle:"index of" # open directory listings
site:target.com ext:pdf intext:confidential # leaked internal documents
site:target.com inurl:wp-content -inurl:plugins # WordPress upload dirs, excluding plugin noise
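If you run the same dork shapes against many targets, generating them avoids typos in the OR grouping. The helper below is a hypothetical convenience, not part of any tool mentioned here; it only builds the query string and does not contact a search engine.

```python
def ext_dork(domain: str, extensions: list[str]) -> str:
    """Build a site-restricted dork that ORs several file extensions."""
    # Accept "env" or ".env"; the operator wants the bare extension.
    alternatives = " OR ".join(f"ext:{e.lstrip('.')}" for e in extensions)
    return f"site:{domain} ({alternatives})"


print(ext_dork("target.com", ["env", "log", "sql", "bak"]))
# site:target.com (ext:env OR ext:log OR ext:sql OR ext:bak)
```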
One thing worth knowing about why this works at all: search engine indexes lag behind the live site, sometimes by weeks. A dork can surface a file or page that has since been removed from the live server but is still listed in the engine's index. That is itself useful, because it tells you the route existed and may still resolve even if nothing links to it anymore.
The Internet Archive has been crawling and snapshotting the web since the late 1990s. The recon value isn't browsing old homepages for nostalgia — it's pulling the complete list of every URL the archive has ever observed for a domain, including query parameters and backend routes that vanished from the frontend years ago but may still resolve, because a redesign removed the link without anyone auditing whether the underlying server route was actually decommissioned.
Under the hood, this works through the Wayback Machine's CDX API — a queryable index of every snapshot the archive holds. A direct query looks like:
http://web.archive.org/cdx/search/cdx?url=target.com/*&output=text&fl=original&collapse=urlkey
That single request returns a flat list of every distinct URL the archive has under that domain. The CLI tool waybackurls (by tomnomnom) wraps this exact API so you don't have to construct the query by hand — feed it a domain, get back the URL list. Understanding that it's just a CDX query underneath means you can also hit the API directly when you need finer control, like filtering to a specific file extension or date range.
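As a sketch of that finer control, the helper below builds the same CDX query with optional narrowing. The function names are hypothetical. The filter, from, and to parameters are CDX server options as I understand them, and the exact regex matching behavior of filter is an assumption worth verifying against the archive's documentation before relying on it.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query(domain, extension=None, year_from=None, year_to=None):
    """Build a CDX URL listing distinct archived URLs for a domain."""
    params = [
        ("url", f"{domain}/*"),
        ("output", "text"),
        ("fl", "original"),
        ("collapse", "urlkey"),
    ]
    if extension:
        # Assumed syntax: filter=<field>:<regex>, applied to the original URL.
        params.append(("filter", rf"original:.*\.{extension}(\?.*)?$"))
    if year_from:
        params.append(("from", str(year_from)))
    if year_to:
        params.append(("to", str(year_to)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def distinct_urls(body: str) -> list[str]:
    """Turn a text response (one URL per line) into an ordered unique list."""
    return list(dict.fromkeys(line.strip() for line in body.splitlines() if line.strip()))


print(cdx_query("target.com", extension="php", year_from=2015))
```

Nothing here sends a request; pass the resulting URL to whatever HTTP client you already use, then feed the response body to distinct_urls.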
This is a passive technique that's easy to skip entirely if you're only thinking in terms of DNS and web crawlers, and it's consistently one of the highest-value sources available. Developers commit code to public or semi-public repositories constantly, and commit history is forever — deleting a file in a later commit does not remove it from the repository's history unless someone explicitly rewrites that history, which almost nobody does.
What to look for:
- A .env file or hardcoded API key committed once and "removed" in a later commit is still sitting in the git history of that repo, retrievable with git log -p or by browsing individual commits on GitHub directly.
- Queries like org:target-company password or org:target-company "api_key" inside GitHub's own search surface code that was never meant to be public, sitting in forks, gists, or old branches.
- gitleaks and trufflehog scan repositories (including full commit history, not just the current state) for patterns that look like credentials, tokens, or keys, automating what would otherwise be a lot of manual git log archaeology.
A leaked cloud credential or internal API key found this way can be worth more than any web vulnerability you'll find later in this course, and it costs nothing but careful searching.
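To show why a "removed" secret is still findable, here is a toy scanner over git log -p style output. It is a rough illustration of the idea behind pattern-based scanners, not how gitleaks or trufflehog are actually implemented. The pattern set is a small assumption of mine, and the key in the sample is AWS's published documentation example, not a real credential.

```python
import re

# A deliberately tiny, illustrative pattern set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_patch(patch_text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for added or removed lines that match."""
    hits = []
    for lineno, line in enumerate(patch_text.splitlines(), start=1):
        # Only diff content lines; skip the "+++"/"---" file headers.
        if not line.startswith(("+", "-")) or line.startswith(("+++", "---")):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample_patch = """--- a/.env
+++ b/.env
-AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
+AWS_ACCESS_KEY_ID=
"""
print(scan_patch(sample_patch))  # [(3, 'aws_access_key_id')]
```

Note that the hit is on a removed line: the commit that deleted the key is exactly the one that exposes it in history.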
By this point you should already have a long list of candidate subdomains and historical URLs, all checked against scope. Now you start touching the target's own infrastructure, deliberately and in order.
DNS itself, queried directly, reveals more than most beginners expect.
- A zone transfer (AXFR) request asks an authoritative nameserver for a copy of the entire zone. Properly configured servers refuse it, but a misconfigured one returns every record at once:
dig axfr target.com @ns1.target.com
- Organization lookups such as amass intel -org "Target Company" map a company name to its owned ASNs and IP ranges, which can surface infrastructure that has no DNS record pointing at it but still responds if you connect to the IP directly.
Passive sources tell you what existed at some point; they don't confirm what's alive now, and no single source is complete. subfinder (ProjectDiscovery) aggregates dozens of passive sources into one command:
subfinder -d target.com -o subs_subfinder.txt
Running a second tool with a different source mix and merging results catches what any single tool misses — assetfinder and amass enum -passive are the common companions here:
assetfinder target.com >> subs_assetfinder.txt
amass enum -passive -d target.com -o subs_amass.txt
cat subs_*.txt | sort -u > all_subdomains.txt
This combine-and-deduplicate habit matters more than people expect — different tools query different passive sources under the hood, and the overlap between any two tools' results is usually far from total.
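One way to see that overlap for yourself is to keep track of which tool reported each host while merging. The sketch below does the same job as sort -u but adds normalization and provenance; the function names and the sample results are invented for illustration.

```python
def normalize(host: str) -> str:
    """Lowercase, drop a trailing dot, and strip a leading wildcard label."""
    host = host.strip().lower().rstrip(".")
    return host[2:] if host.startswith("*.") else host


def merge_sources(results: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each distinct hostname to the set of tools that reported it."""
    merged: dict[str, set[str]] = {}
    for tool, hosts in results.items():
        for raw in hosts:
            host = normalize(raw)
            if host:
                merged.setdefault(host, set()).add(tool)
    return merged


# Made-up results standing in for the three output files.
results = {
    "subfinder": ["api.target.com", "WWW.target.com"],
    "assetfinder": ["www.target.com", "dev.target.com."],
    "amass": ["api.target.com", "*.target.com"],
}
merged = merge_sources(results)
single_source = sorted(h for h, tools in merged.items() if len(tools) == 1)
print(single_source)  # ['dev.target.com', 'target.com']
```

Hosts that only one tool found are the ones you would have missed by running a single tool, which is the whole argument for merging.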
Before assuming every live host only matters over HTTP/HTTPS, a fast port scan can reveal services nobody documented — an exposed database port, an admin interface on a non-standard port, a management panel that was never meant to be internet-facing at all. naabu (ProjectDiscovery) is built for exactly this — fast, designed to feed straight into the next tool in the pipeline:
naabu -l all_subdomains.txt -o open_ports.txt
Traditional nmap remains the deeper, slower option when you need service-version fingerprinting on a specific host rather than a fast sweep across many.
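A fast sweep produces a lot of lines, and most of them are ports 80 and 443. The sketch below pulls out the ones that deserve a closer look. It assumes the scan output is one host:port pair per line, and the port-to-service table is a short, conventional list of my own choosing rather than anything exhaustive.

```python
# Conventional default ports for services that are rarely meant to be public.
INTERESTING_PORTS = {
    3306: "MySQL",
    3389: "RDP",
    5432: "PostgreSQL",
    6379: "Redis",
    9200: "Elasticsearch",
    27017: "MongoDB",
}


def flag_ports(lines: list[str]) -> list[tuple[str, int, str]]:
    """Return (host, port, likely_service) for lines on a watched port."""
    flagged = []
    for line in lines:
        host, sep, port = line.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            continue  # skip blank or malformed lines
        service = INTERESTING_PORTS.get(int(port))
        if service:
            flagged.append((host, int(port), service))
    return flagged


scan = ["api.target.com:443", "db.target.com:5432", "cache.target.com:6379"]
print(flag_ports(scan))
# [('db.target.com', 5432, 'PostgreSQL'), ('cache.target.com', 6379, 'Redis')]
```

A port number only suggests a service. Confirming what is actually listening is the job of the version fingerprinting mentioned above.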
Once you know what's alive and which ports are open, httpx (ProjectDiscovery) sends real HTTP requests and reports status codes, response headers, page titles, and detected technologies in one pass:
cat all_subdomains.txt | httpx -o live_hosts.txt -title -tech-detect -status-code
This is the point where you're unambiguously generating traffic the target can see in their own logs. Confirm everything above is scope-checked before this step runs, not after.
Subdomains are one axis of attack surface; content within a single host is another, and there are two distinct things worth fuzzing for, not just one.
Path and file fuzzing brute-forces directory and file names against a target using a wordlist:
ffuf -u https://target.com/FUZZ -w wordlist.txt
Extending this to specific file extensions catches backup files and source artifacts a plain path wordlist misses:
ffuf -u https://target.com/FUZZ -w wordlist.txt -e .bak,.old,.zip,.sql,.env
Virtual host (vhost) fuzzing is a different technique entirely — instead of brute-forcing paths on a known host, you brute-force the Host header against a single IP address, looking for hidden virtual hosts that respond to a different hostname than the one you connected with. This catches internal applications co-located on shared infrastructure that were never given a public DNS record at all:
ffuf -u https://target.com -H "Host: FUZZ.target.com" -w subdomain_wordlist.txt -fs <baseline_response_size>
Both are noisier and slower than anything in the passive phase — point them only at confirmed, in-scope, live hosts.
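The baseline size filter in the vhost command is worth understanding rather than copying. Most guessed hostnames get the server's default response, so they share one body size, and anything that differs is a candidate. The sketch below applies that logic to already-collected results; the tuple layout and function names are assumptions for illustration.

```python
from collections import Counter


def baseline_size(sizes: list[int]) -> int:
    """The most common response size is treated as the default-vhost response."""
    return Counter(sizes).most_common(1)[0][0]


def vhost_candidates(responses: list[tuple[str, int, int]]) -> list[tuple[str, int, int]]:
    """Keep (hostname, status, size) rows whose size differs from the baseline."""
    if not responses:
        return []
    base = baseline_size([size for _, _, size in responses])
    return [row for row in responses if row[2] != base]


# Invented results: three guesses hit the default page, one does not.
observed = [
    ("aaa.target.com", 200, 1532),
    ("admin.target.com", 200, 8841),
    ("bbb.target.com", 200, 1532),
    ("ccc.target.com", 200, 1532),
]
print(vhost_candidates(observed))  # [('admin.target.com', 200, 8841)]
```

Size alone can mislead when the default page echoes the requested hostname, since every response then differs slightly. In that case filtering on word or line count is a common fallback.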
Once a host responds, the response itself — and a couple of files most sites still serve by convention — tells you a great deal:
| Signal | What It Reveals |
|---|---|
| Server / X-Powered-By headers | Web server and backend framework: nginx, Apache, Express, a specific PHP version. |
| JS bundle filenames & inline comments | Frontend framework: React, Vue, Angular build artifacts, sometimes a literal version string. |
| JS source maps (.js.map files) | Can leak original, unminified source code if a build pipeline forgot to strip them in production. |
| Cookie names | Session framework: PHPSESSID, JSESSIONID, connect.sid for Express. |
| robots.txt | Often lists paths the developers specifically wanted hidden from search engines, which is exactly why a hunter should read it. |
| sitemap.xml | A structured map of every page the site wants indexed; frequently includes admin or staging paths accidentally left in. |
| CNAME records | Third-party services in use: S3 buckets, CDNs, SaaS platforms, and the starting point for subdomain takeover hunting (below). |
A robots.txt that disallows /admin-panel-v2/ is, ironically, one of the most reliable ways a site will tell you exactly where its admin panel lives.
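Pulling those paths out is simple enough to script. The parser below is a minimal sketch that handles only the Disallow field and inline comments; the function name is hypothetical and it makes no attempt to honor per-user-agent groups.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the distinct Disallow paths from a robots.txt body, in order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            # An empty Disallow means "allow everything", so skip it.
            if value and value not in paths:
                paths.append(value)
    return paths


sample_robots = """User-agent: *
Disallow: /admin-panel-v2/
Disallow: /tmp/  # scratch space
Disallow:
Allow: /public/
"""
print(disallowed_paths(sample_robots))  # ['/admin-panel-v2/', '/tmp/']
```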
Vulnerability classes aren't evenly distributed across technologies. An old, unpatched PHP backend raises your expectation of SQL injection or LFI. A target fronted by an S3 bucket raises your expectation of misconfigured bucket permissions. Knowing the stack tells you which later chapters of this course to reach for first on a given target — recon isn't separate from exploitation, it's what aims it.
This deserves its own section because it's a real, frequently-paid vulnerability class that exists entirely inside the recon process — you don't craft a payload for it, you find it by paying attention to what your subdomain list is telling you.
A subdomain has a DNS record — usually a CNAME — pointing to a third-party service (GitHub Pages, Heroku, an Azure App Service, an AWS S3 bucket, dozens of others). At some point the team stops using that service and deletes the resource on the third-party side, but forgets to remove the DNS record pointing to it. The DNS record is now "dangling" — it still resolves, but to a service that no longer exists. Because most of these platforms let any customer claim a hostname that isn't currently in use, an attacker can register that exact resource name on the third-party platform and suddenly control whatever the target's own subdomain serves.
Once you control the content served at forgotten.target.com, you can potentially read or set cookies scoped to the parent domain (if those cookies aren't properly scoped or flagged), bypass a Content Security Policy that explicitly trusts *.target.com as a script source, host a phishing page on a hostname users and browsers already trust, or intercept OAuth/SSO redirect flows that whitelist the subdomain as a valid callback target. This is consistently rated medium-to-critical depending on what the parent domain trusts that subdomain to do.
Look for subdomains whose CNAME points to one of the well-known vulnerable patterns (*.github.io, *.herokuapp.com, *.azurewebsites.net, *.s3.amazonaws.com, and dozens more — the community-maintained can-i-take-over-xyz repository tracks the current list and exact fingerprint for each service), then visit the subdomain and look for the specific error response that service shows for an unclaimed resource — GitHub Pages shows "There isn't a GitHub Pages site here," a dangling S3 bucket returns a NoSuchBucket XML error, and so on. Automated tools speed this up across a large subdomain list:
subzy run --targets all_subdomains.txt
nuclei -l all_subdomains.txt -t nuclei-templates/takeovers/
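The manual check described above reduces to two conditions that must both hold: the CNAME points at a known service, and the response carries that service's unclaimed-resource message. The sketch below encodes only the two fingerprints quoted in this chapter; the function name is hypothetical, and a real list should come from the can-i-take-over-xyz repository rather than from this table.

```python
# Only the two fingerprints named in the text; not a complete list.
FINGERPRINTS = {
    ".github.io": "There isn't a GitHub Pages site here",
    ".s3.amazonaws.com": "NoSuchBucket",
}


def takeover_candidate(cname: str, body: str):
    """Return the matched service suffix, or None if either condition fails."""
    cname = cname.lower().rstrip(".")
    for suffix, marker in FINGERPRINTS.items():
        if cname.endswith(suffix) and marker in body:
            return suffix
    return None


print(takeover_candidate("old-docs.github.io.", "<p>There isn't a GitHub Pages site here.</p>"))
# .github.io
print(takeover_candidate("old-docs.github.io.", "<h1>Project docs</h1>"))
# None
```

Requiring both conditions keeps live, correctly claimed resources out of the candidate list.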
A dangling CNAME pointing to a fingerprint match is a candidate, not a confirmed bug. Programs are explicit that they don't want theoretical submissions — claim the resource (only when the program's rules permit it), serve a harmless proof-of-concept page identifying yourself and a timestamp, capture evidence, and remove it promptly. This is one of the few vulnerability classes in this entire course where the proof-of-concept step requires you to actually claim infrastructure, which makes the ethics of doing it cleanly matter more than usual.
A live feed of raw recon hits is streaming in against devcore-infrastructure.io. Each finding needs to land in the right bucket of the attack-surface ledger — or get discarded, if it's not actually fair game. Read the scope notice first. Misfiling an in-scope asset costs accuracy; misfiling an out-of-scope one is the mistake that gets hunters removed from programs.
*.devcore-infrastructure.io is in-scope (wildcard).
EXCLUDED: legacy-auth.devcore-infrastructure.io (acquired entity, separate unaudited stack — explicitly out per program rules) and anything on devcore-support.io (different brand, different program).
Nothing above should live only in scrollback. Before testing anything, consolidate findings into a structured, living record — exactly the five buckets the console above just made you sort by hand:
| Ledger Element | What Goes In It |
|---|---|
| Subdomain Inventory | Every live host, tagged by purpose (production, staging, internal tool, third-party integration), CNAME target if relevant, cross-checked against scope. |
| Endpoint Map | Every distinct path and parameter observed, especially ones with obvious data-shape signals — id=, user=, redirect=, file= — the parameter types most associated with IDOR (Chapter 9), open redirects, and SSRF (Chapter 11). |
| Tech Stack Signatures | What's running where, so you know which later chapters' techniques to prioritize on which host. |
| Takeover Candidates | Any dangling CNAME matches, flagged for manual verification before any claim attempt. |
| Leaked Secrets / Source Exposure | Anything found in git history, exposed .env files, or source maps — handled separately and carefully, since these often need immediate, responsible disclosure rather than routine reporting. |
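The table above maps naturally onto a small data structure. The sketch below is one possible shape, not a prescribed format: the class, field, and method names are hypothetical, and the set of signal parameters is taken directly from the Endpoint Map row.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlsplit

# Parameter names the Endpoint Map row calls out as data-shape signals.
SIGNAL_PARAMS = {"id", "user", "redirect", "file"}


@dataclass
class Ledger:
    subdomains: dict = field(default_factory=dict)      # host -> purpose, CNAME, scope status
    endpoints: dict = field(default_factory=dict)       # (host, path) -> set of parameter names
    tech: dict = field(default_factory=dict)            # host -> list of stack signatures
    takeover_candidates: list = field(default_factory=list)
    secrets: list = field(default_factory=list)

    def add_endpoint(self, url: str) -> None:
        """Record a URL's host, path, and parameter names (values are dropped)."""
        parts = urlsplit(url)
        names = {k for k, _ in parse_qsl(parts.query, keep_blank_values=True)}
        key = (parts.hostname, parts.path or "/")
        self.endpoints.setdefault(key, set()).update(names)

    def priority_endpoints(self) -> list:
        """Endpoints that carry at least one signal parameter."""
        return sorted(
            (host, path, sorted(params & SIGNAL_PARAMS))
            for (host, path), params in self.endpoints.items()
            if params & SIGNAL_PARAMS
        )


ledger = Ledger()
ledger.add_endpoint("https://api.target.com/download?file=report.pdf&lang=en")
ledger.add_endpoint("https://api.target.com/download?id=7")
ledger.add_endpoint("https://www.target.com/about")
print(ledger.priority_endpoints())
# [('api.target.com', '/download', ['file', 'id'])]
```

Storing parameter names without their values keeps the ledger small and avoids copying user data out of archived URLs.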
This ledger is what turns "I scanned a company" into "I have a prioritized, defensible plan." It also makes your eventual bug reports (Chapter 14) read very differently — a report that demonstrates you understood the target's structure lands very differently from one that looks like a lucky payload guess.
Can you explain why old/forgotten subdomains are often easier targets than the main site?
Do you understand the difference between passive and active recon, and why passive comes first?
Can you explain in your own words why mapping the attack surface before testing makes you more effective?
Why can a subdomain that genuinely belongs to a company still be off-limits to test?
Explain the mechanism behind subdomain takeover — what makes a CNAME "dangling," and why does claiming the underlying resource give you control of the subdomain?
The throughline of this entire chapter: recon is not a checklist you rush through once before the "real" testing starts. It's a discipline you return to, layer by layer, for every single target — and the hunters who consistently get paid are rarely running fundamentally different exploits than everyone else. They're looking at infrastructure nobody else bothered to map.