A web crawler usually starts from a given URL and tries to find and visit every link that can be followed. A search engine crawler, however, doesn't care much about the content it misses; even though parsing all of the dynamic content would be better, it is not expected to do so.
Thanks to SEO, it is now the webmaster's duty to make the site search engine friendly.
The crawler component of a web application security scanner (WASS) is critical to its success. A scanner's crawler has to parse all of the dynamic content and fight through broken HTML code to find all of the pages, variables, etc.
Crawling a whole web site usually takes a lot of time. Here is an overview of how we can do a Quick Scan to find remote web application/site vulnerabilities with a minimum number of crawler requests. This also gives us a quick overview of the site structure, and it can help us find web application vulnerabilities faster when we are doing a quick assessment of a large network.
Robots.txt - The famous Robots Exclusion Protocol
If you don't know how and why a /robots.txt file is used, please read About /robots.txt.
I think some readers skip this part: "don't try to use /robots.txt to hide information"
When a search engine crawler wants to visit and index a website, it first fetches the /robots.txt file to find the rules it should follow. If a link is forbidden there, the search engine crawler will exclude it.
Use these rules in your crawler to find more information, like new directories and links.
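As a rough illustration, here is a minimal Python sketch that fetches /robots.txt and pulls out the Disallow/Allow paths, plus any Sitemap: entries, as extra crawl candidates. The function name and the simple line-based parsing are my own; this is not a full Robots Exclusion Protocol implementation.

```python
# Minimal sketch: harvest crawl candidates from /robots.txt.
# Assumption: the simple line-based parsing below is illustrative only.
from urllib.parse import urljoin
from urllib.request import urlopen


def harvest_robots_txt(base_url):
    """Return (paths, sitemaps) found in the site's /robots.txt."""
    paths, sitemaps = [], []
    robots_url = urljoin(base_url, "/robots.txt")
    try:
        body = urlopen(robots_url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        return paths, sitemaps  # no robots.txt, nothing to harvest

    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = [part.strip() for part in line.split(":", 1)]
        field = field.lower()
        if field in ("disallow", "allow") and value:
            paths.append(urljoin(base_url, value))  # rules point at real content
        elif field == "sitemap" and value:
            sitemaps.append(value)  # already an absolute URL per the protocol
    return paths, sitemaps


if __name__ == "__main__":
    links, maps = harvest_robots_txt("http://example.com/")
    print(links, maps)
```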
Sitemaps
"Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site."*
I love sitemaps. They allow me to find most of the links and map out the directory structure of the whole web site with 2-3 web requests! Of course, we still have to look for pages that don't appear in the sitemaps.
There are many reasons to use sitemaps. A huge website can use the sitemap protocol to have search engine crawlers index only the required content, when necessary. A webmaster can also point out existing links that the search engine couldn't otherwise find and index.
When webmasters follow Google's FAQ for Submitting a Sitemap, they usually place their sitemap file at /sitemap.xml or /sitemap.xml.gz.
Most of the time, they also add the sitemap location to the /robots.txt file, with a line like:
Sitemap: http://domain.com/sitemap_location.xml
Another part of the sitemap protocol is the sitemap index file, which lists the locations of the target website's other sitemaps. Large sitemaps can be split into multiple files, listed in a sitemap index file, and submitted to search engines as a single entry.
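As an illustration, the sketch below fetches a sitemap URL, collects its <loc> entries, and recurses into the child sitemaps when the file turns out to be a sitemap index. The function name, the depth limit, and the decision to skip gzip-compressed sitemaps are my own simplifications.

```python
# Minimal sketch: collect URLs from a sitemap or sitemap index file.
# Assumption: gzip-compressed sitemaps are ignored to keep the example short.
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def collect_sitemap_urls(sitemap_url, depth=0, max_depth=2):
    """Return page URLs listed in a sitemap, following sitemap index entries."""
    urls = []
    if depth > max_depth:
        return urls
    try:
        root = ET.fromstring(urlopen(sitemap_url, timeout=10).read())
    except (OSError, ET.ParseError):
        return urls

    # Strip the XML namespace so the root tag check works regardless of schema URI.
    root_tag = root.tag.rsplit("}", 1)[-1]
    locs = [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]

    if root_tag == "sitemapindex":
        # Each <loc> points at another sitemap file; recurse into it.
        for child_sitemap in locs:
            urls.extend(collect_sitemap_urls(child_sitemap, depth + 1, max_depth))
    else:
        # Plain <urlset>: each <loc> is a page URL.
        urls.extend(locs)
    return urls


if __name__ == "__main__":
    print(collect_sitemap_urls("http://domain.com/sitemap_location.xml"))
```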
Conclusion - How to find almost all URLs of a target website with minimum web requests?
My approach for a quick scan of a standard web application is (a rough sketch of the flow follows the list):
- Request and parse the URL entered by the WASS user in the scan settings (1 request)
- Request /robots.txt, parse and extract links, sitemap index files and sitemaps (1 request)
- Request the default sitemap and sitemap index file locations, parse them, and extract links (6 requests, more or fewer depending on your list of default sitemap locations)
- Webmasters sometimes create a site map page for visitors. Using this page, a visitor can quickly reach the content they are looking for. In the response to your first request, search for links containing the word "sitemap" or similar to find these pages. Access these manually created sitemap pages first! (1 request)
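To make the flow concrete, here is a rough, hedged sketch of the steps above. It assumes the hypothetical harvest_robots_txt() and collect_sitemap_urls() helpers from the earlier sketches are available, and the list of default sitemap locations is my own guess, not part of the original methodology.

```python
# Rough sketch of the quick-scan flow. Assumes harvest_robots_txt() and
# collect_sitemap_urls() from the earlier sketches are defined in the same
# module; the default sitemap paths below are a guess.
from urllib.parse import urljoin

DEFAULT_SITEMAP_PATHS = [
    "/sitemap.xml", "/sitemap.xml.gz",
    "/sitemap_index.xml", "/sitemap-index.xml",
    "/sitemapindex.xml", "/sitemap1.xml",
]


def quick_scan(start_url):
    """Collect as many URLs as possible with roughly ten HTTP requests."""
    found = set()

    # Step 1: the URL entered by the user. (Parsing its HTML for links,
    # including links to human-readable "site map" pages, is omitted here.)
    found.add(start_url)

    # Step 2: /robots.txt gives us disallowed/allowed paths plus declared sitemaps.
    paths, declared_sitemaps = harvest_robots_txt(start_url)
    found.update(paths)

    # Step 3: declared sitemaps first, then a handful of default locations.
    candidates = declared_sitemaps + [urljoin(start_url, p)
                                      for p in DEFAULT_SITEMAP_PATHS]
    for sitemap_url in candidates:
        found.update(collect_sitemap_urls(sitemap_url))

    return sorted(found)
```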
Using this methodology in our scanner's crawler component, we make fewer than 10 requests to the remote web server and can:
- Extract the directory/file structure of the target web server/application,
- Possibly identify the technologies in use, based on the file extensions found,
- Find variables, if links like "news.asp?id=123" exist,
- Find almost all URLs, even if your WASS can't parse JavaScript/Flash content,
- Find pages that we couldn't find during a normal crawling operation (some webmasters misconfigure sitemap generators, which leads to the disclosure of sensitive files),
- Visit these newly found URLs to find more links, variables, weaknesses etc.
- And more!