
June 2008 Archives

June 1, 2008

JoBo, crawler program to download complete websites to computer

JoBo is a simple program for downloading complete websites to your local computer. Internally it is basically a web spider. Its main advantage over other download tools is that it can automatically fill out forms (e.g. for automated login) and use cookies for session handling. Compared to other products the GUI is very simple, but it is the internal features that matter. Do you know any other download tool that can log in to a web server and download content when that server uses a web form for login and cookies for session handling? JoBo also features very flexible rules to limit downloads by URL, size and/or MIME type.
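To make the form-login idea concrete, here is a minimal sketch using the standard `java.net.http` client (Java 11+). It is not JoBo's code or API: the class name `FormLoginSketch`, the `example.com` URLs and the field names `username` and `password` are all assumptions made up for illustration. It only shows the general pattern of posting predefined form values and letting a cookie store carry the session to the next request.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: not JoBo's API. URLs and field names are hypothetical.
class FormLoginSketch {

    // Encode predefined field values as an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build a client whose in-memory cookie store keeps session cookies between requests.
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = newSessionClient();

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "alice");   // hypothetical field name and value
        fields.put("password", "s3cret");  // hypothetical field name and value

        // Step 1: submit the login form; any Set-Cookie header ends up in the cookie store.
        HttpRequest login = HttpRequest.newBuilder(URI.create("https://example.com/login"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        // Step 2: request a protected page; stored cookies are sent automatically.
        HttpRequest page = HttpRequest.newBuilder(URI.create("https://example.com/members/index.html"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(page, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

A real spider would reuse the same client for every later request, so the session established by the login stays valid for the whole crawl.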

For programmers it offers a very flexible object model and is easily extensible; expect new modules in the future. It is implemented in Java and the source code is available. If you want to implement your own web spider, the WebRobot class is a good starting point. JoBo is also suitable if you want to use it not as a download tool but for indexing, link checking or similar tasks. Retrieving documents and handling them are completely separated, so you can easily plug in your own module.
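The separation of retrieval from handling can be pictured with a small sketch. The types below (`DocumentFetcher`, `DocumentHandler`, `MiniRobot`, `SizeIndexer`) are hypothetical names invented for this example and are not JoBo's actual classes; the point is only that a robot can fetch a document once and pass it to any number of pluggable handlers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; these are not JoBo's classes.
interface DocumentFetcher {
    byte[] fetch(String url);
}

interface DocumentHandler {
    void handle(String url, byte[] content);
}

// The robot only knows how to fetch and dispatch; what happens to a
// document is decided entirely by the registered handlers.
class MiniRobot {
    private final DocumentFetcher fetcher;
    private final List<DocumentHandler> handlers = new ArrayList<>();

    MiniRobot(DocumentFetcher fetcher) {
        this.fetcher = fetcher;
    }

    MiniRobot addHandler(DocumentHandler handler) {
        handlers.add(handler);
        return this;
    }

    void visit(String url) {
        byte[] content = fetcher.fetch(url);
        for (DocumentHandler handler : handlers) {
            handler.handle(url, content);
        }
    }

    public static void main(String[] args) {
        // A stub fetcher stands in for real HTTP retrieval.
        DocumentFetcher stub = url -> ("<html>" + url + "</html>").getBytes(StandardCharsets.UTF_8);
        SizeIndexer indexer = new SizeIndexer();
        new MiniRobot(stub).addHandler(indexer).visit("http://example.com/");
        System.out.println(indexer.sizes);
    }
}

// Example plug-in: records the size of every document instead of saving it.
class SizeIndexer implements DocumentHandler {
    final Map<String, Integer> sizes = new LinkedHashMap<>();

    @Override
    public void handle(String url, byte[] content) {
        sizes.put(url, content.length);
    }
}
```

Swapping `SizeIndexer` for a handler that writes files, checks links or feeds an index would not require any change to the fetching side.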

Features

* command line and graphical versions (the command line version needs a major update; the GUI version currently has many more features)
* recursive search of all documents starting from a given start document
* support of tags (with fault tolerance)
* support of the robot exclusion protocol
* user controlled maximal search depth
* user agent name can be defined
* support of referrer headers
* support of automated form handling (JoBo can fill fields with predefined values)
* cookie support
* XML configuration
* used bandwidth can be limited
* allow/deny downloads by MIME type and document size (e.g. ignore all image/* files)
* allow/deny downloads by regular expressions (e.g. don't download /cgi-bin)
* can convert absolute links to relative
* download only files newer than a given age
* resume job
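
The allow/deny rules in the list above can be sketched in a few lines of Java. This is an assumption about how such a filter might look, not JoBo's implementation or its XML configuration format; the class `DownloadRules` and its methods are hypothetical. It denies a download when the URL matches a regular expression, when the MIME type matches a pattern such as `image/*`, or when the document exceeds a size limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical rule set for illustration; not JoBo's own classes or config format.
class DownloadRules {
    private final List<Pattern> deniedUrls = new ArrayList<>();
    private final List<String> deniedMimeTypes = new ArrayList<>();
    private long maxBytes = Long.MAX_VALUE;

    // Deny every URL in which the regular expression matches somewhere.
    DownloadRules denyUrl(String regex) {
        deniedUrls.add(Pattern.compile(regex));
        return this;
    }

    // Deny a MIME type; a trailing "/*" matches every subtype (e.g. "image/*").
    DownloadRules denyMime(String pattern) {
        deniedMimeTypes.add(pattern.toLowerCase(Locale.ROOT));
        return this;
    }

    // Deny documents larger than the given number of bytes.
    DownloadRules maxSize(long bytes) {
        maxBytes = bytes;
        return this;
    }

    boolean allows(String url, String mimeType, long sizeBytes) {
        if (sizeBytes > maxBytes) {
            return false;
        }
        String mime = mimeType.toLowerCase(Locale.ROOT);
        for (String denied : deniedMimeTypes) {
            boolean wildcard = denied.endsWith("/*");
            // For "image/*" compare against the prefix "image/".
            if (wildcard ? mime.startsWith(denied.substring(0, denied.length() - 1))
                         : mime.equals(denied)) {
                return false;
            }
        }
        for (Pattern denied : deniedUrls) {
            if (denied.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DownloadRules rules = new DownloadRules()
                .denyUrl("/cgi-bin")
                .denyMime("image/*")
                .maxSize(1_000_000);
        System.out.println(rules.allows("http://example.com/index.html", "text/html", 2048));
        System.out.println(rules.allows("http://example.com/cgi-bin/search", "text/html", 2048));
    }
}
```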

JoBo Crawler Home Page
http://www.matuschek.net/jobo/

JoBo Crawler Download
http://www.matuschek.net/jobo-download/

WebLech URL Spider

WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.

Similar in some aspects to tools such as wget (in recursive retrieval mode), WebSuck or Teleport Pro, WebLech allows you to "spider" a website and to recursively download all the pages on it. You can then browse the site offline for your convenience, or even "mirror" the website and re-publish it yourself. Note that WebLech is not suited to downloading single URLs -- use wget for this kind of thing.

Features

WebLech has a number of features that make it useful:

* Open Source MIT Licence means it's totally free and you can do what you want with it
* Pure Java code means you can run it on any Java-enabled computer
* Multi-threaded operation for downloading lots of files at once
* Supports basic HTTP authentication for accessing password-protected sites
* HTTP referer support maintains link information between pages (needed to spider some websites)
* Lots of configuration options:
  o Depth-first or breadth-first traversal of the site
  o Candidate URL filtering, so you can stick to one web server, one directory, or just spider the whole web
  o Configurable caching of downloaded files allows restart without needing to download everything again
  o URL prioritisation, so you can get interesting files first and leave boring files till last (or ignore them completely)
  o Checkpointing so you can snapshot spider state in the middle of a run and restart without lots of processing
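
The depth-first versus breadth-first option mentioned above comes down to which end of the crawl frontier the next URL is taken from. The sketch below is not WebLech's code; `TraversalSketch` and the in-memory link map are hypothetical stand-ins for a real spider that would fetch pages and extract links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: an in-memory link graph replaces real page downloads.
class TraversalSketch {

    // Returns the order in which URLs are visited, starting from 'start'.
    static List<String> crawl(Map<String, List<String>> links, String start, boolean depthFirst) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> order = new ArrayList<>();

        frontier.addLast(start);
        seen.add(start);
        while (!frontier.isEmpty()) {
            // Depth-first takes the newest URL (stack); breadth-first the oldest (queue).
            String url = depthFirst ? frontier.pollLast() : frontier.pollFirst();
            order.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {          // skip URLs that were already queued
                    frontier.addLast(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/a/1"),
                "/b", List.of("/b/1"));
        System.out.println("breadth-first: " + crawl(links, "/", false));
        System.out.println("depth-first:   " + crawl(links, "/", true));
    }
}
```

URL prioritisation would follow the same shape with a `java.util.PriorityQueue` as the frontier and a comparator that ranks interesting URLs first.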

WebLech URL Spider Home Page
http://weblech.sourceforge.net/

Download WebLech URL Spider
http://prdownloads.sourceforge.net/weblech/weblech-0.0.3.tar.gz?download

About June 2008

This page contains all entries posted to Open Source Java Community and OpenJDK Resources (latest news, podcasts, updates, downloads) in June 2008. They are listed from oldest to newest.
