SYNOPSIS
harvestman [options] [-C configfile]
DESCRIPTION
HarvestMan is a desktop web crawler written entirely in the Python programming language. It allows you to download a whole website from the Internet and mirror it to the disk for offline browsing. HarvestMan has many customizable options for the end user. HarvestMan works by scanning a web page for links that point to other web pages or files. It downloads the files and copies them to the disk, maintaining the directory structure of the remote website. Every html file is scanned recursively in this way until the whole website is downloaded.
Once the download is complete, the links in the downloaded html files are localized to point to the files on the disk. This ensures that the user does not need to connect to the Internet again when browsing the downloaded pages. If a file could not be downloaded for some reason, HarvestMan converts its relative Internet address to the complete Internet address, so that clicking the link connects the user to the Internet instead of producing a dead-link (404) error.
From version 1.2, HarvestMan uses two families of threads for downloading, the "Fetchers" and the "Getters". The Fetchers are threads responsible for crawling web pages and finding links, and the Getters are threads which download those links (the non-html files).
HarvestMan, as of the latest version, is a console application. It can be launched by running the HarvestMan script (HarvestMan.py) if you are using the source code, or the HarvestMan executable if you are using the executable (available on Win32 platforms). It prints informational messages to the console while it is working. These messages can be used to debug the program and locate any errors.
HarvestMan works by reading its options either from the command line or from a configuration file. The configuration file is named "config.xml" by default.
A major change from HarvestMan 1.5 onwards is that the configuration is now in an XML file called "config.xml". You can use the convertconfig.py script, present in HarvestMan/tools/ of your installation, to convert your configuration from text to XML and vice versa. For full details, see the Changes.txt file and the website at http://harvestmanontheweb.com
HarvestMan writes a binary project file using the Python pickle protocol. This project file is saved under the HarvestMan base directory with the extension .hbp. It is a complete record of all the settings which were used to start HarvestMan and can be read back later using the --projectfile option to restart a HarvestMan project.
MODES OF OPERATION
HarvestMan has two major modes of operation. One is a fully multithreaded mode, also called fast mode. The other is a single-threaded mode, called slow mode.
- Fast Mode
-
Fast Mode is the most useful mode of HarvestMan. In this mode,
HarvestMan launches multiple threads for each url link, and stores
them in an internal queue. Also, HarvestMan will launch a separate
download thread for each non-html file encountered. This process is
very fast and you can download websites very quickly using this mode
as multiple downloads occur at the same time.
This mode is the default. You can use this mode if you have a relatively large bandwidth, and a reliable connection to the Internet.
Since HarvestMan is network-bound, using multiple threads speeds up the download.
- Slow Mode
-
In the Slow Mode, download of websites happens in a single thread, the main
program thread.
Each download will have to wait for the previous one to get completed, so
this is a relatively slow process. You can use this mode, if you have an
unreliable Internet connection or a relatively small bandwidth, which does
not support opening of multiple sockets at the same time.
This mode is disabled by default. You can enable it by setting the variable system.fastmode in the configuration file to zero (see Mode Selection below).
If you see a lot of "Socket" type errors when you launch a HarvestMan project by using the default mode (fastmode), switch to this mode. This would give you a very reliable download, though a slow one.
USAGE
As said earlier, HarvestMan reads its options from a configuration file or from the command line. The configuration file by default is named "config.xml". You can pass another configuration file name to the program by using the command line options -configfile/-C.
From version 1.1, HarvestMan can also read back previous project files by using the command line option -projectfile.
We will first discuss the structure of the configuration file and how it can be used to create a HarvestMan project. For more information on the command line arguments, run the program with the -help or -h option.
CONFIGURATION FILE
The configuration file is a simple text file with many options, each of which is a pair of variable/value strings separated by tabs or spaces. Each variable/value pair appears on a separate line. Comments can be added by placing the hash character '#' before any line.
HarvestMan has three basic options and some 50 advanced options.
BASIC OPTIONS
HarvestMan needs three basic configuration options to work. These are described below:
project.name: This is the project name of the current download. HarvestMan creates a directory of this name in its base directory (described below) where it keeps all the downloaded files. The project name needs to be a non-empty string. (Spaces are allowed.)
project.url: This is the starting url from which the program begins the download. HarvestMan supports the WWW/HTTP/HTTPS/FTP protocols in this url. If a url does not begin with any of these, it will be considered as an HTTP url. For example, http://www.python.org, www.yahoo.com, cnn.com
project.basedir: This is the base directory for the program where it creates the project directories and stores all downloaded files. If this directory does not exist, HarvestMan will attempt to create it.
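As a rough illustration of the text format described under CONFIGURATION FILE, the following Python sketch parses variable/value lines and skips '#' comment lines. The function name parse_config, the sample values, and the handling of blank lines are assumptions made for this example. It is not HarvestMan's own parser.

```python
def parse_config(text):
    """Parse 'variable value' lines into a dict, ignoring '#' comment lines."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (blank-line handling is an assumption)
        # Split on the first run of tabs/spaces; the value itself may contain spaces.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        options[name] = value
    return options


# Hypothetical sample holding only the three basic options.
SAMPLE = """\
# Minimal project (values are made up for illustration)
project.name\tpython site
project.url\thttp://www.python.org
project.basedir\t/tmp/websites
"""

if __name__ == "__main__":
    for name, value in parse_config(SAMPLE).items():
        print(name, "=", value)
```

Because the line is split only once, a project name that contains spaces is kept whole.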
ADVANCED OPTIONS
For precisely configuring your download, HarvestMan supports about 30 advanced options. You will need to use many of them, if you would like to control your download exactly the way you want. The following section describes each of these settings and what they do. Read on.
- The Fetchlevel setting
-
From Version 1.2, there is a change in this setting. Read on.
This is one of the most useful options to tweak in a HarvestMan project. The option is controlled by the variable download.fetchlevel in the configuration file.
Make sure you read the following documentation very carefully.
When you are downloading files from a website, you would prefer to limit your download to certain areas of the Internet. For example, you might want to download all links pointed to by the url http://www.foo.com/bar (a hypothetical example) that come under the www.foo.com web server. Or you might want to download all links under the directory path http://www.foo.com/pics and no more. You can use this option to do exactly that.
The option download.fetchlevel has 5 possible values, ranging from 0 to 4.
A value of 0 limits the download to the directory path from where you start your download. For example, if your starting url was http://www.foo.com/bar/index.html, this option makes sure that all links downloaded belong to the directory url path http://www.foo.com/bar and below it. Any web links pointing to outside directories or to other web servers would be ignored.
A value of 1 limits the download to the starting server, but does not limit it to paths below the starting directory.
For example, if your starting url was http://www.foo.com/bar/index.html, this option would also download files from the http://www.foo.com/other/index.html page, since it belongs to the starting server.
A value of 2 performs the next level of fetching. It allows all paths in the starting server, and also all urls external to the starting server that are linked directly from pages in the starting server. For example, if your starting url http://www.foo.com/bar/index.html contained a link to http://www.foo2.com/bar2/index.html (an external server), HarvestMan will try to download this link also. But all urls linked from this link, i.e. from http://www.foo2.com/bar2/index.html, would be ignored.
A value of 3 performs fetching similar to the above, with the difference that it does not get files which are linked outside the directory of the starting url, but does get the external links which are linked one level from the starting url. For example, if your starting url http://www.foo.com/bar/index.html contained a link to http://www.foo2.com/bar2/index.html (an external server), HarvestMan will try to download this link also. But a url like http://www.foo.com/other/index.html (a link outside the starting url's directory) will be ignored.
A value of 4 places no restriction on the fetching process. It will allow all web pages to be downloaded, including web pages linked from external server links encountered in the starting url's page. Setting this option will mostly result in the crawler trying to crawl the entire Internet, assuming that your starting url has links to other outside servers. Set this option only if you are very sure of what you are doing. Any value above 4 has no special meaning, and would behave just like 4.
For most downloads, this value can be specified between 0 and 2.
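The five fetch levels can be read as a decision rule on a link, the page it was found on, and the starting url. The Python sketch below is one possible reading of the descriptions above, not HarvestMan's actual code. The function names and the parent_url argument are hypothetical, and the level 3 rule (an external link is accepted only when its parent page lies in the starting directory) is an interpretation of "linked one level from the starting url". Values from 0 to 4 are assumed.

```python
import posixpath
from urllib.parse import urlsplit


def _server(url):
    return urlsplit(url).netloc.lower()


def _directory(url):
    # "/bar/index.html" -> "/bar/", "/index.html" -> "/"
    path = urlsplit(url).path or "/"
    directory = posixpath.dirname(path)
    return directory if directory.endswith("/") else directory + "/"


def same_server(url, start_url):
    return _server(url) == _server(start_url)


def under_start_dir(url, start_url):
    return same_server(url, start_url) and _directory(url).startswith(
        _directory(start_url)
    )


def link_allowed(url, parent_url, start_url, fetchlevel):
    """Decide whether `url`, found on page `parent_url`, may be fetched."""
    if fetchlevel >= 4:
        return True  # no restriction
    if fetchlevel == 0:
        return under_start_dir(url, start_url)
    if fetchlevel == 1:
        return same_server(url, start_url)
    if fetchlevel == 2:
        # Starting server, or external urls linked directly from it.
        return same_server(url, start_url) or same_server(parent_url, start_url)
    # fetchlevel == 3 (interpretation, see lead-in)
    if same_server(url, start_url):
        return under_start_dir(url, start_url)
    return under_start_dir(parent_url, start_url)
```

With the starting url http://www.foo.com/bar/index.html, a link to http://www.foo.com/other/index.html is rejected at levels 0 and 3 and accepted at levels 1, 2 and 4.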
- The Depth Setting
-
This is another setting that gives you control over your download. It is denoted by the variable control.depth in the configuration file.
This value specifies the distance of any url from the starting url's directory in terms of the directory path offset. This is applicable only to the directories (links) in the starting server, below the starting url's directory. The default value is 10.
If a directory is found whose offset is more than this value, any links under it will not be downloaded.
You can specify a depth of zero, in which case the download will be limited to files just below the directory of the starting url.
Examples: If the starting url is http://www.foo.com/bar/foo.html, then the url http://www.foo.com/bar/images/graphics/flowers/flower.jpg has a depth of 3 relative to the starting url.
- The External Depth Setting
-
This option also helps you to control downloads. It is denoted by variable control.extdepth in the configuration file.
This value specifies the distance of a url from its base server directory. This is applicable to urls which belong to external servers and to urls outside the directory of the starting url.
If a directory is found whose distance from the base server path is more than this value, any files under it will be ignored.
Note that this option does not support the notion of zero depth. A valid value for this has to be greater than or equal to one.
Examples: The url http://www.foo.com/bar/images.html has an external depth of 1 relative to the base server directory, http://www.foo.com.
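The two depth measures, the directory offset from the starting url's directory (control.depth) and the distance from the base server directory (control.extdepth), can be sketched in Python as follows. The function names are hypothetical, and counting directory components is an assumption that reproduces the two examples given in this document.

```python
import posixpath
from urllib.parse import urlsplit


def _dir_parts(url):
    """Directory components of a url path: '/bar/images/a.jpg' -> ['bar', 'images']."""
    directory = posixpath.dirname(urlsplit(url).path or "/")
    return [part for part in directory.split("/") if part]


def depth(url, start_url):
    """Offset of url's directory below the starting url's directory.

    Returns None when url does not lie below the starting directory.
    """
    start, target = _dir_parts(start_url), _dir_parts(url)
    if target[: len(start)] != start:
        return None
    return len(target) - len(start)


def external_depth(url):
    """Distance of url's directory from its base server directory."""
    return len(_dir_parts(url))


if __name__ == "__main__":
    print(depth("http://www.foo.com/bar/images/graphics/flowers/flower.jpg",
                "http://www.foo.com/bar/foo.html"))
    print(external_depth("http://www.foo.com/bar/images.html"))
```

A crawler following this reading would skip a link whenever the computed value exceeds the configured limit.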
- The External Servers Setting
-
This option tells the program whether to follow links belonging to outside web servers. This is denoted by the variable control.extserverlinks. By default, the program ignores external server links.
The option has lower precedence than the download.fetchlevel setting. If download.fetchlevel is set to a value of 2 or above, this setting is ignored.
- The External Directories Setting
-
This option tells the program whether to follow links and download files belonging to outside directories, i.e. directories external to the directory of the starting url. This is denoted by the option control.extpagelinks in the configuration file.
The default value is 1 (Enabled). The download.fetchlevel setting has precedence over this value. If download.fetchlevel is set to a value of 1 or more, this setting is ignored.
- The Images Setting
-
Tells the program whether to download images linked from pages. Enabled by default. This option is denoted by the variable download.images in the configuration file.
- The Html Setting
-
Tells the program whether to download html files. Enabled by default. Denoted by the variable download.html.
- Maximum limit of External Servers
-
You can limit the number of external servers from which you want to download files by setting this option to a non-zero value. It takes precedence over the download.fetchlevel setting. This option is controlled by the variable control.maxextservers in the configuration file. The default value is zero, which means that this option is ignored.
To enable this option, set it to a value greater than zero.
- Maximum limit on External Directories
-
You can limit the number of external directories from which you want to download files by setting this option to a non-zero value. It takes precedence over the download.fetchlevel setting. This option is controlled by the variable control.maxextdirs in the configuration file.
The default value is zero which means that this option is ignored.
To enable this option, set it to a value greater than zero.
- Maximum limit on Number of Files
-
You can precisely control the number of total files you want to download by setting this option. It is denoted by the variable, control.maxfiles. The default value is 3000.
- Default download of images
-
This option tells the program to always fetch images linked from pages, even if they belong to external servers/directories or violate the depth rules.
This option takes precedence over the control.extpagelinks/control.extserverlinks settings and the control.depth/control.extdepth settings.
The download.images setting has a higher precedence than this setting.
This option is enabled by default. Denoted by the variable download.linkedimages.
- Default download of style sheets (.css files)
-
Same as the above option, except that this option checks for stylesheet (css) links. This has higher precedence than the control.extpagelinks/control.extserverlinks and the control.depth/control.extdepth settings. Enabled by default.
This option is denoted by the variable download.linkedstylesheets.
- Maximum thread setting
-
This option sets the maximum number of separate threads (trackers) launched by the program at a time. This is not an exact setting. It does not mean that this many connections are running at any given moment, only that the program cannot launch more threads than this limit.
This option makes sense only in multithreaded downloads, i.e. only when the program is running in fastmode. In slowmode, this setting has no effect.
Denoted by the variable system.maxtrackers. The default value is 10.
- Separate threads for file download
-
This option controls the multithreaded download of non-html files in the fastmode. In fastmode, separate download threads are launched to retrieve non-html files. If you disable this option, these files will be downloaded in the main downloader thread.
By default, this option is enabled. You can tweak it by the variable system.usethreads.
- Mode Selection
-
As described in the beginning, there are two modes for HarvestMan, the fast one and the slow one. This option allows you to choose your mode of operation.
The variable for this option is system.fastmode. The default value is 1, which means that the program uses fastmode. To disable fastmode, and switch to slowmode, set this variable to zero.
- Size of the thread pool
-
This value controls the size of the thread pool used to download non-html files when the program runs in fastmode and system.usethreads is enabled. The default value is 10.
This option is controlled by the variable system.threadpoolsize. It makes sense only if the program is running in fastmode and the system.usethreads option is enabled.
- Timeout value for a thread
-
This specifies the timeout value for a single download thread. The default value is 200 seconds. Threads which overrun this value are eventually killed and cleaned up.
This option is controlled by the variable system.threadtimeout.
This value is ignored when you are running the program in slowmode, without using multiple threads.
- Robot Exclusion Protocol
-
The Robot Exclusion Protocol control flag. This tells the spider whether to follow rules specified by the robots.txt file on some web servers. Enabled by default.
We advise you to always enable this option, since it shows good Internet etiquette and respect for the download rules laid down by webmasters of sites. Disable it only after reading any legalities laid down by the website, according to your discretion. We are not responsible for any eventuality that arises from a user violating these rules. (See LICENSE.txt file.)
The variable for this value is control.robots.
- Proxy Server Support
-
HarvestMan is written taking into account corporate users (like the authors!) who connect to the Internet from behind firewalls/proxies. Such users should set this option to the IP address/name of their proxy server with the proxy port appended to it.
The variables for this option are network.proxyserver and network.proxyport. Set the first one to the ip address/name of your proxy server and the second one to its port number.
Default values: proxy and 80.
Note: If you are creating the configuration file using the script provided for that purpose, the proxy server string is encrypted and does not appear in plain text in the configuration file.
- Proxy Authentication Support
-
HarvestMan also supports proxies that require user authentication.
The variables for this are network.proxyuser and network.proxypasswd.
Note: If you are creating the configuration file using the script provided for that purpose, these values are encrypted and do not appear in plain text in the configuration file.
- Intranet Crawling
-
This option is disabled from version 1.3.9 onwards, since HarvestMan can now figure out whether a url is in the intranet or the Internet by trying to resolve the host name in the url. Hence the option is not required anymore.
From version 1.3.9, we can mix urls in the internet/intranet in the same project.
- Renaming of Dynamically Generated Files
-
Dynamically generated files (images/html) will usually have file extensions that bear no connection to their actual content. You will not be able to open these files correctly, especially on the Windows platform which depends on file extensions to launch applications. This option will tell HarvestMan to try to rename these files by looking at their content. HarvestMan will also appropriately rename any link which points to these files.
This option right now works well only for gif/jpeg/bmp files. Disabled by default.
The variable for this option is download.rename.
- Console Message Settings
-
HarvestMan prints out a lot of informational messages to the console while it is running. These can be controlled by the project.verbosity variable in the configuration file. This value ranges from 0 to 5.
The default value is 2.
Here is each value and a description of its meaning to the program.
0: Minimal messages, displays only the Program Information/Copyright.
1: Basic messaging, displays above, plus information on the current project including the statistics.
2: More messaging, displays above, plus information on each url as it is being downloaded.
3: Extended messaging, displays above, plus information on each thread that is downloading a certain file. Also displays thread killing/joining information and directory creation, file saving/deletion information.
4: Debug messaging, displays above, plus debugging information for the programmer. Not recommended for the end-user.
5: Extended debug messaging, displays maximal messages, including the debug information from the web page parser. (Use this at your own risk!)
Please note that these guidelines are flexible and can change as new versions are being developed, especially the behavior of values from 3 - 5.
- Filters
-
HarvestMan allows the user to refine downloads further by specifying filtering options for urls. These are of two kinds:
1. Filters for urls (plain vanilla links), which are controlled by the control.urlfilter variable.
2. Filters for external servers, which are controlled by the control.serverfilter variable.
The filter strings are a kind of regular expression. They are internally converted to python regular expressions by the program.
Writing filter regular expressions
a. URL Filters (for the control.urlfilter setting)
URL filters supported by HarvestMan are of 3 types. These are:
1. Filename extensions
2. Servers/urls
3. Servers/urls + filename extensions
An example of the first type is *.gif
Examples of the second type are: www.yahoo.com, */advocacy/*, */images/sex/*, */avoid.gif, ad.doubleclick.net/*
Examples of the third type are: /images/*.gif, ad.doubleclick.net/images/*.jpg, yimg.yahoo.com/*.gif
You can build a 'no-pass' (block) filter by prepending a regular expression as described above with a '-' (minus) sign. (Example: -*.gif).
You can build a 'go-through' (allow) filter by prepending a regular expression as described above with a '+' (plus) sign. (Example: +*.gif).
You can concatenate regular expressions of the block/allow kind and create custom url filters.
Example: (Block all jpeg images, as well as all urls containing "/images/" in their path, but always allow the path "/preferred/images/"):
-*.jpg+*/preferred/images/*-*/images/*
Example: (Block all gif files from the server "toomanygifs.com"):
-toomanygifs.com/*.gif
Example: (Block all files with the name "bad.jpg" from all servers.)
-*/bad.jpg
Example: (Block all jpeg/gif/png images but allow pdf/doc/xls files.):
-*.jpg-*.jpeg-*.gif-*.png+*.pdf+*.doc+*.xls
If there is a collision between the results of an inclusion filter and an exclusion filter, the program gives precedence to the decision of the filter which comes first in the filter expression. If there is still ambiguity, the inclusion filter is given precedence.
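To make the block/allow syntax concrete, here is a minimal Python sketch that splits a filter expression into signed patterns, turns each into a Python regular expression, and lets the first matching pattern decide. It is an illustration under several assumptions, not HarvestMan's internal conversion: the function names are hypothetical, '*' is treated as a wildcard, patterns are matched anywhere in the url, a url matched by no pattern is allowed, and patterns containing a literal '+' or '-' (such as hyphenated host names) are not handled.

```python
import re


def compile_filter(expression):
    """Split '-*.jpg+*/preferred/images/*' into (allow, regex) rules, in order."""
    rules = []
    for sign, pattern in re.findall(r"([+-])([^+-]+)", expression):
        # Escape the literal pieces and join them with '.*' for each '*'.
        regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
        rules.append((sign == "+", re.compile(regex)))
    return rules


def url_passes(url, rules, default=True):
    """The first rule whose pattern matches the url decides; else `default`."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return default


if __name__ == "__main__":
    rules = compile_filter("-*.jpg+*/preferred/images/*-*/images/*")
    for url in ("http://a.com/x.jpg",
                "http://a.com/preferred/images/x.png",
                "http://a.com/images/x.png",
                "http://a.com/doc.pdf"):
        print(url, url_passes(url, rules))
```

Under this first-match reading, the order of the patterns in the expression matters: a jpeg inside /preferred/images/ is still blocked by the example filter because the -*.jpg pattern comes first.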
b. Server filters (for the control.serverfilter setting)
If you are enabling fetching links from external servers, you can write a server filter in a similar way to url filters. This also allows you to write no-pass and go-through filters. The main difference is that in urlfilters, the character "*" is ignored, whereas in server filters, this matches any character or sequence of characters.
Example: Block all files from the server adserver.com: -adserver.com/*
Example: Block all files from the server niceimages.com in the path /advertising/, but allow all other paths.
-*niceimages.com/*/advertising/*
Note that the control.serverfilter if specified, is checked before control.urlfilter. So any result of the control.serverfilter setting takes precedence.
- Retrieval of failed links
-
Tells the program whether to try refetching links that failed to download. Retry is attempted the number of times specified by this variable's value.
For every url that failed due to a non-fatal error, retry is attempted after a gap of 0.5 seconds following the first attempt. Retry is also attempted once again for all failed links at the end of the mirroring.
This option is controlled by the variable download.retryfailed. The default value is 1. (Retry will be attempted once for every failed link, and once again at the end of the download.)
To disable retry, set this variable to zero.
- Localization of URLs
-
Tells the program whether to localize the links in all downloaded html files (Internet links are modified to file links). This helps the user to browse the website as if it were local. HarvestMan also converts any relative url links to absolute url links if their files were not downloaded.
This is enabled by default. It is a good idea to always enable it.
Note that localization of links is done at the end of the download.
Controlled by the variable indexer.localise.
From version 1.1.2, this option supports 3 values. A value of zero of course disables it. A value of 1 will perform localization by replacing url links with absolute file path names.
A value of 2 will perform localization by replacing url links with relative file path names. Relative localization helps you to browse the downloaded website from different file systems since the url paths are relative (to directory). Absolute localization locks your downloaded website to the filesystem of the machine where you ran HarvestMan. From version 1.1.2, the default value of this option is 2, i.e it performs a relative localization by default.
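The difference between absolute and relative localization can be shown with a few lines of Python. This is a sketch only: the function name local_link and the example paths are hypothetical, and POSIX-style paths are assumed for the files on disk.

```python
import posixpath


def local_link(html_path, target_path, mode=2):
    """Return the link text to write into `html_path` for a downloaded file.

    mode 0: localization disabled (returns None, link left untouched)
    mode 1: absolute file path name
    mode 2: file path relative to the directory of the html file
    """
    if mode == 0:
        return None
    if mode == 1:
        return target_path
    return posixpath.relpath(target_path, posixpath.dirname(html_path))


if __name__ == "__main__":
    page = "/home/user/websites/proj/www.foo.com/bar/index.html"
    image = "/home/user/websites/proj/www.foo.com/bar/images/a.gif"
    print(local_link(page, image, mode=1))
    print(local_link(page, image, mode=2))
```

The relative form contains no reference to the base directory, which is why a mirror localized with a value of 2 can be moved to another machine or file system.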
Another variable related to localization has been added in the 1.1.2 release. This allows you to perform JIT (Just In Time) localization of html files, i.e, immediately after they are downloaded, instead of at the end of download.
This option is described under JIT Localization below.
- URL List File
-
You can tell HarvestMan to dump a list of crawled urls to a file by setting this option. The variable for this is files.urlslistfile and is disabled by default.
- Error log file
-
A file to write error logs into. This by default is 'errors.log'. This file will be created in the project directory of the current project.
Variable: files.errorfile
Note: From version 1.2, this feature is disabled. Don't use it.
- Message Log File
-
From version 1.4 (this version), the message log file is named <project>.log for a project 'project' and is automatically created in the project directory of the project. This is not a configurable option anymore.
- Browse Index Page
-
HarvestMan creates an html project browser page in the Project Directory and appends the starting (index) files of each project to this page at the end of each project. This option can be enabled or disabled by setting the variable display.browsepage. By default, this is enabled.
- JIT Localization
-
HarvestMan, from version 1.1.2, has an option to localize HTML files immediately after they are downloaded, instead of at the end of the project. This option can be enabled by setting the variable, indexer.jitlocalise, to a value greater than zero.
By default this is disabled.
Note: From version 1.2, this option is disabled. Don't use it.
- File Integrity Verification
-
HarvestMan verifies the integrity of downloaded files by performing an md5 checksum check. From version 1.4 this option is disabled and is not available in the configuration file.
- Cookie Support
-
From version 1.2, we have added support for cookies. The support is basic and is based on RFC 2109. By default, cookies in web pages are saved in a cookie file inside the project directory and read back for pages which require these cookies. This can be controlled by the variable download.cookies. The default value is 1.
For disabling cookies, set this variable to zero (0).
- Files Caching
-
From version 1.2, we support caching/update of downloaded files. A binary cache file is created for every project. This file contains an md5 checksum of each file, its location on the disk and the url from which it was downloaded. The next time the project is restarted, the program checks the urls against this cache file. The files are downloaded only if their checksum differs from the checksum of the cached file, otherwise they are ignored.
This option is enabled by default. It is controlled by the variable control.pagecache. To disable caching, set this variable to zero (0).
From version 1.4, a sub-option named control.datacache is available. If set to 1 (default), the data of each url is also saved in the cache file. So if you lose your original files but the cache is present, HarvestMan can recreate the files of the project from the cache, if the cache files are not out of date.
You can enable data caching for small projects where the number of files downloaded is not too large. If the project downloads a lot of files, say > 5000, you might disable data caching.
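The checksum comparison behind the page cache can be sketched as follows. This is a simplified illustration, not HarvestMan's cache format: the dictionary layout and the function names are assumptions, and the cache is kept in memory instead of a binary file.

```python
import hashlib


def checksum(data: bytes) -> str:
    """md5 checksum of the file contents, as a hex string."""
    return hashlib.md5(data).hexdigest()


def update_cache(cache, url, data, location):
    """Return True if the file is new or changed and must be written to disk.

    `cache` maps url -> {"checksum": ..., "location": ...} (assumed layout).
    """
    digest = checksum(data)
    entry = cache.get(url)
    if entry is not None and entry["checksum"] == digest:
        return False  # unchanged since the last run, skip it
    cache[url] = {"checksum": digest, "location": location}
    return True


if __name__ == "__main__":
    cache = {}
    url = "http://www.foo.com/bar/index.html"
    print(update_cache(cache, url, b"<html>v1</html>", "bar/index.html"))
    print(update_cache(cache, url, b"<html>v1</html>", "bar/index.html"))
    print(update_cache(cache, url, b"<html>v2</html>", "bar/index.html"))
```

Storing the file data alongside the checksum, as control.datacache does, would add a third field to each entry and grow the cache with the size of the project.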
- Number of Simultaneous Network Connections
-
From version 1.2, the number of simultaneous network connections can be controlled by modifying a config variable.
For all 1.0 (major) versions and the 1.2 alpha version, HarvestMan had a global download lock that denied more than one network connection at a given instant. This slowed down downloads considerably.
From 1.2 onwards, many simultaneous downloads (network connections) are possible apart from multiple threads. The number of simultaneous connections by default is 5. The user can change this by modifying the variable control.connections in the config file. If set to a higher value, the download threads can use more connections at a given instant and the download is faster. If set to a lower value, the threads will have to wait for a free connection slot when the number of connections reaches the limit. You can set it to a reasonable value depending on your network bandwidth. A value below 10 is desirable for low-bandwidth connections and above 10 for high-bandwidth connections. If you have a broadband or DSL connection allowing very high speeds, set this to a relatively large value like 20.
If the number of connections is much smaller than the number of url trackers, downloads will suffer. It is a good idea to keep these two values approximately the same.
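The idea of threads waiting for a free connection slot can be modelled with a counting semaphore. The Python sketch below is a generic illustration of that technique and makes no claim about how HarvestMan implements control.connections. The class ConnectionLimiter, the function fetch_all, and the fetch callable are all hypothetical.

```python
import threading


class ConnectionLimiter:
    """Context manager that allows at most `max_connections` holders at once."""

    def __init__(self, max_connections=5):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.active = 0  # connections currently in use
        self.peak = 0    # highest number seen in use at the same time

    def __enter__(self):
        self._slots.acquire()  # blocks while all slots are taken
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc_info):
        with self._lock:
            self.active -= 1
        self._slots.release()
        return False


def fetch_all(urls, fetch, limiter):
    """Run fetch(url) on one thread per url, each holding a connection slot."""
    results = {}

    def worker(url):
        with limiter:
            results[url] = fetch(url)

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With many more threads than slots, most threads spend their time blocked in acquire(), which matches the advice to keep the two values approximately the same.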
- Project Timeout
-
From version 1.2 onwards, HarvestMan allows for a way to exit projects which hang due to some network or system problems in threading. The program monitors reads/writes from the url queue and keeps a time difference value between now and the last read/write operation on the queue. If no threads are writing to/reading from the queue, the program exits automatically if this time difference exceeds a certain timeout value. This value can be controlled by the variable control.projtimeout in the config file. Its value by default is 5 minutes (300 seconds).
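The timeout check described above amounts to remembering when the queue was last read or written and comparing that moment with the current time. A minimal Python sketch follows. The class name QueueWatchdog and its methods are hypothetical, and the injectable clock exists only to make the example easy to test.

```python
import time


class QueueWatchdog:
    """Track the last queue read/write and report when the project has hung."""

    def __init__(self, timeout=300.0, clock=time.monotonic):
        self.timeout = timeout          # seconds; 300 matches the stated default
        self._clock = clock
        self._last_activity = clock()

    def touch(self):
        """Call on every read from or write to the url queue."""
        self._last_activity = self._clock()

    def expired(self):
        """True when no queue activity has occurred for longer than the timeout."""
        return self._clock() - self._last_activity > self.timeout
```

A main loop would call touch() from the queue's get and put operations and exit the project once expired() returns True.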
- Javascript retrieval
-
From version 1.2, HarvestMan can fetch javascript source files (.js files) from webpages. This has been done by using an enhanced HTML parser that can download javascript files and java applets.
The variable for this is download.javascript. This option is enabled by default.
For skipping javascript files, set this option to zero(0).
- Java applets retrieval
-
From version 1.2, HarvestMan can fetch java applets (.class files) from webpages. This has been done by using an enhanced HTML parser that can download javascript files and java applets.
The variable for this is download.javaapplet. This option is enabled by default.
For skipping java applet files, set this option to zero(0).
- Keyword(s) Search ( Word Filtering )
-
This is a new feature from the 1.3 release. HarvestMan accepts complex boolean regular expressions for word matches inside web pages. HarvestMan will download only those pages which match the word regular expressions.
For example, to download only those webpages containing the words, HarvestMan and Crawler, you create the following regular expression and pass it as the config option control.wordfilter.
control.wordfilter (HarvestMan & Crawler)
Only the webpages which contain both these words will be spidered and downloaded. Note that the filter is not applied to the starting page.
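As an illustration of how a boolean word expression such as (HarvestMan & Crawler) could be evaluated against a page, here is a small Python sketch. It is not the ASPN recipe referenced in this section and does not reproduce its grammar. The function name page_matches, case-insensitive whole-word matching, and support for '|' as OR are assumptions.

```python
import re


def page_matches(expression, page_text):
    """Evaluate a word filter like '(HarvestMan & Crawler)' against page text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    parts = []
    for token in re.findall(r"\w+|[&|()]", expression):
        if token == "&":
            parts.append("and")
        elif token == "|":
            parts.append("or")
        elif token in ("(", ")"):
            parts.append(token)
        else:
            # Replace each word by whether it occurs in the page.
            parts.append(str(token.lower() in words))
    # Only True/False/and/or/parentheses reach eval, with no builtins exposed.
    return bool(eval(" ".join(parts), {"__builtins__": {}}, {}))


if __name__ == "__main__":
    text = "HarvestMan is a desktop web crawler"
    print(page_matches("(HarvestMan & Crawler)", text))
```

A malformed expression (for example unbalanced parentheses) raises SyntaxError in this sketch, so a caller would need to validate the option value first.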
This feature is based on an ASPN recipe by Anand Pillai available at the URL http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526
- Subdomain Setting
-
New feature from 1.3.1 release. HarvestMan allows you to control whether subdomains in a domain are treated as external servers or not, using the variable control.subdomain. If this is set to 1, then subdomains will be considered as external servers.
If set to 0, which is the default, subdomains in a domain will not be considered as external servers.
For example, if the starting server is http://www.yahoo.com and this variable is disabled (set to zero), then the server http://in.yahoo.com will be considered as part of the same domain and not as an external server.
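The effect of control.subdomain can be sketched in Python as a comparison of either full host names or of domains. The helper below is a deliberately naive illustration: the function names are hypothetical, and taking the last two labels of the host name as "the domain" is an assumption that fails for suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def _domain(host):
    """Naive domain: the last two labels of the host name."""
    return ".".join(host.lower().split(".")[-2:])


def is_external_server(url, start_url, subdomain=0):
    """Decide whether `url` is on an external server relative to `start_url`.

    subdomain=1: subdomains count as external servers (hosts must be equal).
    subdomain=0: hosts in the same domain count as the same server.
    """
    host = urlsplit(url).hostname or ""
    start_host = urlsplit(start_url).hostname or ""
    if subdomain:
        return host != start_host
    return _domain(host) != _domain(start_host)


if __name__ == "__main__":
    start = "http://www.yahoo.com"
    print(is_external_server("http://in.yahoo.com/news", start, subdomain=0))
    print(is_external_server("http://in.yahoo.com/news", start, subdomain=1))
```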
- Skipping query forms
-
To skip server side or cgi query forms, set this variable to 1. The variable is named control.skipqueryforms and is set to 1 (enabled) by default.
This skips links of the form http://server.com/formquery?key=value
To download these links set the variable to 0.
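As a rough illustration, a link of this form can be recognised by the presence of a query string. The function names below are hypothetical, and treating any non-empty query string as a query form is an assumption about how the check works.

```python
from urllib.parse import urlparse


def is_query_form(url):
    """True if the url carries a query string such as ?key=value."""
    return bool(urlparse(url).query)


def should_skip(url, skipqueryforms=1):
    """Mirror control.skipqueryforms: skip query urls when enabled."""
    return bool(skipqueryforms) and is_query_form(url)
```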
- Controlling number of requests per server
-
This is a new feature in version 1.3.2. You can control the number of simultaneous requests to the same server by editing the config variable named control.requests. This is set to 10 by default.
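One common way to cap simultaneous requests per server is a semaphore per host. The sketch below shows that general technique only; the class ServerRequestLimiter is hypothetical and is not taken from HarvestMan's source.

```python
import threading
from urllib.parse import urlparse


class ServerRequestLimiter:
    """Allow at most max_requests concurrent requests per host."""

    def __init__(self, max_requests=10):
        self._max = max_requests
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlparse(url).hostname
        with self._lock:
            # Create the per-host semaphore on first use.
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._max)
            return self._semaphores[host]

    def try_acquire(self, url):
        """Return True if a request slot for this host was obtained."""
        return self._semaphore_for(url).acquire(blocking=False)

    def release(self, url):
        """Give back a slot after the request completes."""
        self._semaphore_for(url).release()
```

A download thread would call try_acquire before contacting a server and release once the request has finished.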
- Html cleaning up (Tidy Interface)
-
From version 1.3.9, HarvestMan has an option to clean up html pages before sending them to the parser. This removes errors from web pages so that they are parsed correctly. This in turn helps to download web sites that otherwise might not get downloaded because of parser errors, for example in the starting html page.
The tidylib source code is included along with HarvestMan distribution, so you don't need to install it separately.
This option is enabled by default and is controlled by the variable "control.tidyhtml".
- URL and Website Priorities
-
From this version onwards, HarvestMan allows the user to specify priorities for urls and servers.
Every url has a default priority, assigned based on its "generation". The generation of a url is the level at which the url was found, counted from the starting url. The starting url has generation 0, all urls found in it have generation 1, and so on.
URLs with a lower generation number are given higher priority than urls with a higher generation. Also, html/web page urls get a higher priority than other urls in the same generation.
The user can specify priorities for urls by using the config variable named "control.urlpriority". This works on the basis of file extensions, and has a range from -5 to 5, -5 denoting lowest priority and 5 denoting maximum priority.
For example, to specify that pdf files should have a higher priority we can make the following entry in the config file.
control.urlpriority pdf+1
If you want to give word documents a higher priority than pdf files, you can give the following priority specification.
control.urlpriority pdf+1,doc+2
Priority settings are separated by commas.
If you want to put gif images at the lowest priority and jpg images at the highest priority, use:
control.urlpriority gif-5, jpg+5
Similar syntax can be used for setting server priorities. The variable named control.serverpriority can be used to control this.
Assume that you want to download files from the server http://yahoo.com with a higher priority when compared to the server http://www.cnn.com, in the same download project.
control.serverpriority yahoo.com+1, cnn.com-1
Other combinations are also possible.
A priority which is less than -5 or greater than 5 is ignored by the config parser.
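The priority syntax above can be read as a comma-separated list of name and signed-number pairs. The following sketch parses such a string; the function parse_priorities is hypothetical and only illustrates the format described here, including the rule that values outside -5 to 5 are ignored. It is not HarvestMan's actual config parser.

```python
import re


def parse_priorities(spec):
    """Parse a string like 'pdf+1,doc+2' into {'pdf': 1, 'doc': 2}.

    Entries that do not match name+N / name-N, or whose value lies
    outside the range -5..5, are ignored.
    """
    priorities = {}
    for item in spec.split(","):
        item = item.strip()
        match = re.fullmatch(r"(.+?)([+-]\d+)", item)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2))
        if -5 <= value <= 5:
            priorities[name] = value
    return priorities
```

The same function handles both file extensions (control.urlpriority) and server names (control.serverpriority), since the two settings share one syntax.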
- Time Limits
-
From version 1.4, a project can specify a time limit in which to complete downloads. When this time limit is reached, HarvestMan automatically terminates the project by stopping all download threads and cleaning up.
This option can be specified by using the variable control.timelimit.
- Asynchronous URL Server
-
From version 1.4, another way of managing downloads is available. This is an asynchronous url server, which serves urls to the fetcher threads. Crawler threads send urls to the server and fetcher threads receive them from it. The server is based on the asyncore module in Python, hence it offers superior performance and faster multiplexing of threads than the simple Queue. The server uses an internal queue to store urls, which also increases performance.
To use this feature, enable the variable network.urlserver. This option is disabled by default.
The server listens by default to the port 3081. You can change it by modifying the variable network.urlport in the config file.
- Locale Settings
-
From version 1.4, you can set a specific locale for HarvestMan. Sometimes when parsing non-English websites, the parser can fail to report some pages, because the language is not set to the language of the webpage. In such cases, you can manually change the language and other settings by changing the locale of HarvestMan.
The locale can be changed by modifying the variable system.locale. This is set to the American locale by default on non-Windows platforms and to the default locale ('C') on Windows platforms.
For example, if you see a lot of html parsing errors when crawling a Russian site, you could try setting the locale to, say, 'russian'.
- Maximum File Size
-
A new option from version 1.4. By default, HarvestMan limits the size of a single file to 1 MB. A url whose file size is more than this will be skipped. The limit can be changed with the variable control.maxfilesize.
- URL Tree File
-
From version 1.4, a url tree file, i.e. a file displaying the relation of parent and child urls in a project, can be saved at the end of the project. This file can be saved in two formats, text or html. This option is controlled by the variable named files.urltreefile. The program figures out which format to use by looking at the file name extension.
- Ad Filtering
-
A new feature from version 1.4. URLs which look like advertisement graphics, banners or pop-ups will be filtered by HarvestMan. This works by using regular expressions. The logic is borrowed from the Internet Junkbuster program. The option is control.junkfilter.
This option is enabled by default.
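Regular-expression filtering of this kind can be sketched as below. The patterns are invented for illustration and are neither HarvestMan's nor Junkbuster's actual rules; the name looks_like_junk is likewise hypothetical.

```python
import re

# Hypothetical patterns, for illustration only.
JUNK_PATTERNS = [
    re.compile(r"/ads?/", re.IGNORECASE),
    re.compile(r"/banners?/", re.IGNORECASE),
    re.compile(r"popup", re.IGNORECASE),
]


def looks_like_junk(url):
    """True if the url matches any of the advertisement patterns."""
    return any(pattern.search(url) for pattern in JUNK_PATTERNS)
```

A crawler applying such a filter would drop matching urls before queueing them for download.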
OPTIONS
- -h, --help
- Show help message and exit
- -v, --version
- Print version information and exit
- -p, --project=PROJECT
- Set the (optional) project name to PROJECT.
- -b, --basedir=BASEDIR
- Set the (optional) base directory to BASEDIR.
- -C, --configfile=CFGFILE
- Read all options from the configuration file CFGFILE.
- -P, --projectfile=PROJFILE
- Load the project file PROJFILE.
- -V, --verbosity=LEVEL
- Set the verbosity level to LEVEL. Ranges from 0-5.
- -f, --fetchlevel=LEVEL
- Set the fetch-level of this project to LEVEL. Ranges from 0-4.
- -N, --nocrawl
- Only download the passed url (wget-like behaviour).
- -l, --localize=yes/no
- Localize urls after download.
- -r, --retry=NUM
- Set the number of retry attempts for failed urls to NUM.
- -Y, --proxy=PROXYSERVER
- Enable and set proxy to PROXYSERVER (host:port).
- -U, --proxyuser=USERNAME
- Set username for proxy server to USERNAME.
- -W, --proxypass=PASSWORD
- Set password for proxy server to PASSWORD.
- -n, --connections=NUM
- Limit number of simultaneous network connections to NUM.
- -c, --cache=yes/no
- Enable/disable caching of downloaded files. If enabled, files won't be downloaded unless their timestamp is newer than the cache timestamp.
- -d, --depth=DEPTH
- Set the limit on the depth of urls to DEPTH.
- -w, --workers=NUM
- Enable worker threads and set the number of worker threads to NUM.
- -T, --maxthreads=NUM
- Limit the number of tracker threads to NUM.
- -M, --maxfiles=NUM
- Limit the number of files downloaded to NUM.
- -t, --timelimit=TIME
- Run the program for the specified time TIME.
- -s, --urlserver=yes/no
- Enable/disable urlserver running on port 3081.
- -S, --subdomain=yes/no
- Enable/disable subdomain setting. If this is enabled, servers with the same base server name such as http://img.foo.com and http://pager.foo.com will be considered as distinct servers.
- -R, --robots=yes/no
- Enable/disable Robot Exclusion Protocol.
- -u, --urlfilter=FILTER
- Use regular expression FILTER for filtering urls.
- --urlslist=FILE
- Dump a list of urls to file FILE.
- --urltree=FILE
- Dump a file containing hierarchy of urls to FILE.
FILES
config.xml
AUTHOR
harvestman was written by Anand Pillai <[email protected]>. For latest info, visit http://harvestmanontheweb.com
This manual page was written by Kumar Appaiah <[email protected]>, for the Debian project (but may be used by others).