Heritrix Web Crawler
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Heritrix (sometimes spelled heretrix, or misspelled or mispronounced as heratrix, heritix, heretix, or heratix) is an archaic word for heiress (a woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.
Heritrix is designed to respect robots.txt exclusion directives and META robots tags, and to collect material at a measured, adaptive pace that is unlikely to disrupt normal website activity.
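To make "measured, adaptive pace" concrete, here is a minimal sketch of one common politeness scheme: wait a multiple of the time the previous fetch took, clamped between a floor and a ceiling, so that slow (possibly loaded) servers are visited less often. This illustrates the general idea only and is not taken from the Heritrix source; the class name, method name, and all parameter values are hypothetical.

```java
// Hypothetical illustration of an adaptive politeness delay.
// Not Heritrix's actual code; names and values are assumptions.
class PolitenessDelay {

    /**
     * Returns how long to wait before contacting the same host again:
     * factor * lastFetchMillis, clamped to [minMillis, maxMillis].
     */
    public static long nextDelayMillis(long lastFetchMillis, double factor,
                                       long minMillis, long maxMillis) {
        long scaled = Math.round(lastFetchMillis * factor);
        return Math.max(minMillis, Math.min(maxMillis, scaled));
    }

    public static void main(String[] args) {
        // A fast response is still held to the minimum delay.
        System.out.println(nextDelayMillis(200, 5.0, 2000, 30000));   // 2000
        // A moderate response scales the delay proportionally.
        System.out.println(nextDelayMillis(1500, 5.0, 2000, 30000));  // 7500
        // A very slow response is capped at the maximum delay.
        System.out.println(nextDelayMillis(10000, 5.0, 2000, 30000)); // 30000
    }
}
```

The clamp keeps the crawler from hammering fast hosts while also preventing a single slow response from stalling a host's queue indefinitely.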
System Runtime Requirements
Java Runtime Environment
The Heritrix crawler is implemented purely in Java, so the only hard requirement for running it is an installed JRE.
The crawler uses Java 5.0 features, so your JRE must be version 5.0 (1.5.0) or later.
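If you are unsure which JRE is on your path, a small check such as the following can confirm that it meets the 5.0 minimum. This is a sketch rather than part of the Heritrix distribution; the class and method names are hypothetical. It reads the standard `java.specification.version` system property, which reports values such as "1.5" on older runtimes and "9" or "17" on newer ones.

```java
// Hypothetical helper, not shipped with Heritrix.
class JreVersionCheck {

    /** Returns the major Java version: 5 for "1.5", 8 for "1.8", 17 for "17". */
    public static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Releases before Java 9 use the "1.x" numbering scheme.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        if (majorVersion(spec) < 5) {
            System.err.println("Java " + spec + " is too old; 5.0 (1.5.0) or later is required.");
            System.exit(1);
        }
        System.out.println("Java " + spec + " meets the 5.0 minimum.");
    }
}
```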
We currently include all of the free/open-source third-party libraries needed to run Heritrix in the distribution package. See dependencies for the complete list. Licenses for these libraries are given in the dependencies section of the raw project.xml at the root of the source download, or here on SourceForge.
Hardware
The default heap size is 256 MB of RAM, which should be suitable for crawls that range over hundreds of hosts.
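To see how much heap a JVM has actually been given, the standard `Runtime.maxMemory()` call can be used, as in the sketch below. The class and method names are hypothetical and this is not part of Heritrix. The reported figure may be slightly below the configured limit, depending on the JVM.

```java
// Hypothetical helper for inspecting the JVM heap limit.
class HeapReport {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    public static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap the JVM will attempt to use.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Maximum heap: " + toMegabytes(maxBytes) + " MB");
    }
}
```

The maximum heap is normally set with the standard JVM option `-Xmx` (for example `-Xmx256m`); how that option reaches the crawler's JVM depends on how you launch it.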
Linux
The Heritrix crawler has been built and tested primarily on Linux. It has seen some informal use on Macintosh, Windows 2000, and Windows XP, but it is not tested, packaged, or supported on platforms other than Linux at this time.
Heritrix Home Page
http://crawler.archive.org/
Heritrix Documentation
http://crawler.archive.org/articles/user_manual/index.html
Download Heritrix
http://crawler.archive.org/downloads.html