Crawling a Website with Katana for QA and Development
Katana is useful when you need a practical list of real pages on a site, especially before QA, redirects, migrations, or content cleanup work.
Crawling a website sounds dramatic, but most of the time it is just housekeeping. You want to know which pages exist, which URLs are noisy, and where QA or migration checks should start.
Use this only on sites you own, maintain, or have permission to test. A crawler still creates real traffic, even when it is doing something useful.
A Katana Command I Start With
Katana is a command-line crawler from ProjectDiscovery. This example uses the system Chrome browser, crawls up to five levels deep, filters common static assets, and writes the discovered URLs to a file.
katana `
-depth 5 `
-system-chrome `
-u 'https://www.example.com/' `
-crawl-out-scope '.*\?,[a-f0-9]{32}\.aspx,/siteassets/,/globalassets/' `
-extension-filter '7z,aspx,css,doc,docx,jpeg,jpg,js,json,map,mp3,mp4,pdf,png,ppt,pptx,rar,svg,tiff,txt,ttf,webmanifest,webp,woff,woff2,xls,xlsx,xml,zip' `
-output ./urls.txt
Replace https://www.example.com/ with the site you want to inspect. The backticks are PowerShell line continuations; in Bash, use backslashes instead. Each list is passed as a single comma-separated string, which is the form Katana documents for multi-value flags (for example -ef png,css).
What The Filters Are Doing
The goal is to keep the output useful for humans. QA usually needs page URLs first, not every font, image, PDF, source map, and tracking variation the crawler can find.
- -depth 5 lets the crawler follow links up to five levels from the starting URL.
- -system-chrome uses the locally installed Chrome, which helps with sites that rely on browser-rendered navigation. Katana groups this flag with its headless options, so check whether your run also needs -headless.
- -crawl-out-scope excludes URLs that match patterns known to create noise, such as query-string URLs or generated asset paths.
- -extension-filter removes common file extensions so the output stays focused on pages.
- -output ./urls.txt saves the result so it can be shared, sorted, diffed, or used by another script.
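To see what the patterns actually match before running a crawl, it can help to test them against a few sample URLs. The sketch below is only an approximation in Python: the regexes are the ones from the command, but the extension set is shortened, the sample URLs are made up, and Katana's own matching may differ in details.

```python
import re
from urllib.parse import urlsplit

# Patterns copied from the -crawl-out-scope values in the command.
OUT_OF_SCOPE = [r".*\?", r"[a-f0-9]{32}\.aspx", r"/siteassets/", r"/globalassets/"]
# A shortened extension list, just for this demonstration.
FILTERED_EXTENSIONS = {"css", "js", "pdf", "png", "webp"}


def url_extension(url: str) -> str:
    """Return the lowercase extension of the last path segment, or ''."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return ""
    return last_segment.rsplit(".", 1)[-1].lower()


def is_kept(url: str) -> bool:
    """Approximate the two filters: regex exclusion first, then extension."""
    if any(re.search(pattern, url) for pattern in OUT_OF_SCOPE):
        return False
    return url_extension(url) not in FILTERED_EXTENSIONS


# Hypothetical sample URLs, not real crawl output.
samples = [
    "https://www.example.com/about/",
    "https://www.example.com/search?q=test",
    "https://www.example.com/0123456789abcdef0123456789abcdef.aspx",
    "https://www.example.com/siteassets/logo.png",
    "https://www.example.com/docs/report.pdf",
]
for url in samples:
    print("keep" if is_kept(url) else "drop", url)
```

Only the first sample survives here; the rest are dropped by a regex or by the extension check.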
For a first crawl, keep the filters strict. If the output looks too small, loosen one filter at a time. That makes mistakes easier to spot than changing five things and hoping the spreadsheet forgives you.
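One way to see what a single loosened filter changed is to diff the two output files. This is a minimal sketch, assuming each run was saved to its own file; the in-memory lists stand in for those files and their contents are invented.

```python
def load_urls(lines):
    """Build a set of non-empty, whitespace-trimmed URLs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def diff_runs(before, after):
    """Return (added, removed) as sorted lists of URLs."""
    return sorted(after - before), sorted(before - after)


# With real files, pass an open file instead of a list, for example:
#   load_urls(open("urls.txt", encoding="utf-8"))
strict_run = load_urls([
    "https://www.example.com/",
    "https://www.example.com/about/",
])
loosened_run = load_urls([
    "https://www.example.com/",
    "https://www.example.com/about/",
    "https://www.example.com/docs/handbook.pdf",
    "",  # blank lines are ignored
])

added, removed = diff_runs(strict_run, loosened_run)
print("added:", added)
print("removed:", removed)
```

If the added list is mostly noise, the filter was doing its job and can go back in.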
How I Use The Output
Once urls.txt exists, it becomes a simple input for other checks.
- Compare old and new URLs during a migration.
- Pick representative pages for responsive and accessibility testing.
- Check whether redirects and canonical URLs behave as expected.
- Find pages that should be removed, merged, or excluded from search.
- Hand QA a concrete list instead of asking them to "click around a bit".
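As one example of the checks above, a shortlist of representative pages can be built by grouping URLs by their first path segment and taking a few from each group. The sketch below is illustrative only: the grouping rule, the per_section parameter, and the sample URLs are assumptions, and alphabetical order is just a deterministic way to pick.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_section(urls):
    """Group URLs by their first path segment; the root page goes under '(home)'."""
    sections = defaultdict(list)
    for url in urls:
        segments = [part for part in urlsplit(url).path.split("/") if part]
        key = segments[0] if segments else "(home)"
        sections[key].append(url)
    return sections


def pick_representatives(urls, per_section=2):
    """Return up to per_section URLs for each section, in alphabetical order."""
    return {
        section: sorted(members)[:per_section]
        for section, members in group_by_section(urls).items()
    }


# Hypothetical crawl output.
crawled = [
    "https://www.example.com/",
    "https://www.example.com/news/b",
    "https://www.example.com/news/a",
    "https://www.example.com/news/c",
    "https://www.example.com/about/team",
]
for section, picks in sorted(pick_representatives(crawled).items()):
    print(section, picks)
```

A human should still review the shortlist, since the first two URLs in a section are not always the most typical templates.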
If the site already has a sitemap, start there too. Crawling and sitemap checks answer slightly different questions: the sitemap shows what the site claims is important, while the crawl shows what a browser-like visitor can actually discover.
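The two sources can be compared directly. This sketch assumes a standard sitemap in the sitemaps.org 0.9 format and uses an inline sample instead of a downloaded file; it does not handle sitemap index files, and the URLs are invented.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract the <loc> values of all <url> entries from sitemap XML."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def compare(sitemap, crawled):
    """Report URLs that appear in only one of the two sets."""
    return {
        "only_in_sitemap": sorted(sitemap - crawled),
        "only_in_crawl": sorted(crawled - sitemap),
    }


# Hypothetical sitemap content and crawl output.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/</loc></url>"
    "<url><loc>https://www.example.com/about/</loc></url>"
    "<url><loc>https://www.example.com/old-campaign/</loc></url>"
    "</urlset>"
)
crawled = {
    "https://www.example.com/",
    "https://www.example.com/about/",
    "https://www.example.com/news/",
}

report = compare(sitemap_urls(SAMPLE_SITEMAP), crawled)
print(report)
```

Pages only in the sitemap may be orphaned or unlinked; pages only in the crawl may be missing from the sitemap or deliberately excluded.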
For quick browser-side experiments, a small sitemap crawler script can still be handy. For repeatable QA work, Katana is usually the cleaner starting point.