One of the potential pitfalls of writing spiders, or any application that makes HTTP requests, is that a slow or intermittent connection to the destination server can make your application hang. While modules such as LWP do have a timeout parameter, it is implemented in a way that only works well for completely unresponsive sites: the timer resets on every socket read, so a responsive but very slow site can keep the connection alive and leave your application hanging far longer than you intended. One way to deal with this issue is the LWPx::ParanoidAgent module. It is a derivative of LWP, but rather than basing its timeout on the time since the last socket read, it starts the timeout counter when the request is made. Thus, if you specify a 10-second timeout, 10 seconds is the maximum amount of time allotted for completing the entire request. The module is used almost identically to LWP. For example:
use strict;
use warnings;
use LWPx::ParanoidAgent;

my $ua = LWPx::ParanoidAgent->new;
$ua->timeout(30); # hard cap, in seconds, on the entire request
my $response = $ua->get('http://potentiallyslowsite.com');
my $result = $response->content;
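Because the timeout caps the whole request, a slow site simply yields a failed response once the time budget is exhausted. Here is a minimal sketch of handling that case, assuming the module reports failures through the response object as LWP-family agents generally do:

use strict;
use warnings;
use LWPx::ParanoidAgent;

my $ua = LWPx::ParanoidAgent->new;
$ua->timeout(10); # hard cap on total request time

my $response = $ua->get('http://potentiallyslowsite.com');
if ($response->is_success) {
    print $response->content;
}
else {
    # On a timeout or other failure, status_line describes what went wrong.
    warn 'Request failed: ' . $response->status_line . "\n";
}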
While the near universality of LWP may often make it the better choice, the LWPx::ParanoidAgent module is worth keeping in mind for any project that may deal with HTTP requests to sites with questionable network connectivity. Another interesting feature of the module is that it allows you to specify whitelists and blacklists, giving you control over which hosts it will actually attempt to connect to. For example:
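The host names below are placeholders; per the module's documentation, the blocked_hosts and whitelisted_hosts methods accept plain host names, IP addresses, or compiled regular expressions. A minimal sketch:

use strict;
use warnings;
use LWPx::ParanoidAgent;

my $ua = LWPx::ParanoidAgent->new;
$ua->timeout(10);

# Never attempt connections to these hosts (example names only).
$ua->blocked_hosts('ads.example.com', qr/\.internal$/);

# Exempt specific hosts from blocking; the module refuses private and
# loopback addresses by default, so whitelisting is how you let one through.
$ua->whitelisted_hosts('127.0.0.1');

my $response = $ua->get('http://potentiallyslowsite.com');
print $response->content if $response->is_success;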