Crawling a Website with Katana for QA and Development

Katana is useful when you need a practical list of real pages on a site, especially before QA, redirects, migrations, or content cleanup work.

Crawling a website sounds dramatic, but most of the time it is just housekeeping. You want to know which pages exist, which URLs are noisy, and what QA or migration checks should start from.

Use this only on sites you own, maintain, or have permission to test. A crawler still creates real traffic, even when it is doing something useful.

A Katana Command I Start With

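Katana is a command-line crawler from ProjectDiscovery. This example uses the system Chrome browser, crawls up to five levels deep, filters common static assets, and writes the discovered URLs to a file. It is written for PowerShell, where a trailing backtick continues the command onto the next line.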

katana `
  -depth 5 `
  -system-chrome `
  -u `
    'https://www.example.com/' `
  -crawl-out-scope `
    '.*\?' `
    '[a-f0-9]{32}\.aspx' `
    '/siteassets/' `
    '/globalassets/' `
  -extension-filter `
    '7z' `
    'aspx' `
    'css' `
    'doc' `
    'docx' `
    'jpeg' `
    'jpg' `
    'js' `
    'json' `
    'map' `
    'mp3' `
    'mp4' `
    'pdf' `
    'png' `
    'ppt' `
    'pptx' `
    'rar' `
    'svg' `
    'tiff' `
    'txt' `
    'ttf' `
    'webmanifest' `
    'webp' `
    'woff' `
    'woff2' `
    'xls' `
    'xlsx' `
    'xml' `
    'zip' `
  -output ./urls.txt

Replace https://www.example.com/ with the site you want to inspect.

What The Filters Are Doing

The goal is to keep the output useful for humans. QA usually needs page URLs first, not every font, image, PDF, source map, and tracking variation the crawler can find.

-depth 5 lets the crawler follow links several levels from the starting URL.
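-system-chrome tells Katana to use the locally installed Chrome rather than a browser it downloads itself, which helps with sites that rely on browser-rendered navigation. Katana lists this flag among its headless options, so check whether your version also needs -headless before the browser is actually used.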
-crawl-out-scope removes URL patterns that are known to create noise, such as query-string URLs or generated asset paths.
-extension-filter removes common file extensions so the output stays focused on pages. The list above includes aspx, so drop that entry if the site serves its pages as .aspx files.
-output ./urls.txt saves the result so it can be shared, sorted, diffed, or used by another script.
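To make the effect of the two filters easier to reason about, here is a rough Python sketch that applies the same kind of rules to a list of URLs after the fact. It is not how Katana implements them. The function names are made up for illustration, the extension set is a shortened copy of the list in the command, and the sketch assumes the out-of-scope patterns match anywhere in the URL.

```python
import re
from urllib.parse import urlsplit

# Patterns copied from the -crawl-out-scope values in the command.
# Assumption: a pattern counts as a match if it is found anywhere in the URL.
OUT_OF_SCOPE = [re.compile(p) for p in (
    r".*\?",
    r"[a-f0-9]{32}\.aspx",
    r"/siteassets/",
    r"/globalassets/",
)]

# A shortened copy of the -extension-filter list, for illustration only.
FILTERED_EXTENSIONS = {"css", "js", "json", "map", "pdf", "png", "jpg", "svg", "woff2"}


def is_noise(url):
    """Return True if the URL matches an out-of-scope pattern or a filtered extension."""
    if any(pattern.search(url) for pattern in OUT_OF_SCOPE):
        return True
    # Only the last path segment can carry a file extension.
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." in last_segment:
        extension = last_segment.rsplit(".", 1)[-1].lower()
        return extension in FILTERED_EXTENSIONS
    return False


def keep_pages(urls):
    """Keep only the URLs that look like pages, preserving their order."""
    return [url for url in urls if not is_noise(url)]
```

Running keep_pages over an unfiltered crawl is a cheap way to preview what a stricter or looser filter would remove before crawling the site again.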

For a first crawl, keep the filters strict. If the output looks too small, loosen one filter at a time. That makes mistakes easier to spot than changing five things and hoping the spreadsheet forgives you.

How I Use The Output

Once urls.txt exists, it becomes a simple input for other checks.

Compare old and new URLs during a migration.
Pick representative pages for responsive and accessibility testing.
Check whether redirects and canonical URLs behave as expected.
Find pages that should be removed, merged, or excluded from search.
Hand QA a concrete list instead of asking them to "click around a bit".
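The migration comparison in the first item can be sketched in a few lines of Python. This is a minimal version under some assumptions: each crawl was saved as one URL per line, the old and new sites may use different hostnames, and a trailing slash should not count as a difference. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_path(url):
    """Reduce a URL to its path without a trailing slash, ignoring host and query."""
    path = urlsplit(url.strip()).path.rstrip("/")
    return path or "/"


def load_paths(lines):
    """Build a set of normalized paths from an iterable of URL lines, skipping blanks."""
    return {normalize_path(line) for line in lines if line.strip()}


def compare_crawls(old_lines, new_lines):
    """Report paths that disappeared from, or only exist on, the new site."""
    old_paths = load_paths(old_lines)
    new_paths = load_paths(new_lines)
    return {
        "missing_on_new": sorted(old_paths - new_paths),
        "only_on_new": sorted(new_paths - old_paths),
    }
```

Both arguments can be open file objects, for example one for the urls.txt from the old site and one from the new site. Everything under missing_on_new is a candidate for a redirect or a deliberate removal.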

If the site already has a sitemap, start there too. Crawling and sitemap checks answer slightly different questions: the sitemap shows what the site claims is important, while the crawl shows what a browser-like visitor can actually discover.
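To see where the two views disagree, a short Python sketch can compare a sitemap with the crawl output. It assumes a plain urlset sitemap in the standard sitemaps.org namespace (not a sitemap index) and a crawl saved as one URL per line. URLs are compared as exact strings, so trailing-slash differences show up as mismatches. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> values found in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def compare_with_sitemap(xml_text, crawled_lines):
    """Split URLs into those only the sitemap lists and those only the crawl found."""
    listed = sitemap_urls(xml_text)
    crawled = {line.strip() for line in crawled_lines if line.strip()}
    return {
        "listed_not_crawled": sorted(listed - crawled),
        "crawled_not_listed": sorted(crawled - listed),
    }
```

Pages that are listed but never crawled may be orphaned or hidden behind the filters. Pages that are crawled but not listed may be missing from the sitemap or may not belong in search at all.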

For quick browser-side experiments, a small sitemap crawler script can still be handy. For repeatable QA work, Katana is usually the cleaner starting point.
