Cool! Thank you. I accidentally found Arthur's Classic Novels quite a while ago, but I'd forgotten about it, and he seems to have added a lot more since I last looked.
[rubs hands together mwahahahah] :) I just finished a fairly long download (681 MB) of all the science fiction titles there using this sweet command:
wget -m -k -p -E -l 0 -np http://arthursbookshelf.com/sci-fi/index.html
wget downloads stuff to the current directory, that is, wherever your terminal is when you issue the command (use pwd to find out where that is)
-m mirror mode; it's shorthand for -r (recursive) plus the next three options, which is why they're listed here even though they aren't typed in the command
-N don't re-retrieve files unless they're newer (useful if you need to interrupt and restart later)
-l inf infinite levels of recursion depth
-nr don't remove '.listing' files (useful for FTP sites)
-k converts all links to point to local files (so the pages work offline)
-p gets all parts of pages (e.g. images) even if outside the recursion tree
-E save all html pages with .html extension (good for php and asp pages)
-l 0 (same as -l inf) seems to be needed sometimes even with -m
-np don't ascend to parent directories
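If the single-letter soup is hard to read, here's the same command spelled out with the long option names (note that on older wget builds -E is spelled --html-extension rather than --adjust-extension):
wget --mirror --convert-links --page-requisites --adjust-extension --level=0 --no-parent http://arthursbookshelf.com/sci-fi/index.html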
Some sites need -e robots=off, which tells wget to ignore the site's robots.txt, the file that asks robots to stay out of certain areas. Arthur's Classic Novels doesn't need it. In any case, be very careful about ignoring robots.txt restrictions. They can be for your own good as well as the site's, because you could end up downloading volumes of useless data from old forums, wasting time, bandwidth, and disk space.
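If you do need it, it just goes on the command line like any other option, for example (example.com here is only a placeholder, not a real suggestion):
wget -m -k -p -E -np -e robots=off http://example.com/books/index.html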
Another thing to be careful of: don't use the full power of a high-speed connection for this. If you have some rate-limiting software, good; otherwise you can use --wait=SECONDS to wait SECONDS between retrievals and --limit-rate=RATE to limit the download rate to RATE.
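Something like this, say two seconds between files and capped at 100 KB/s (pick whatever numbers seem polite):
wget -m -k -p -E -l 0 -np --wait=2 --limit-rate=100k http://arthursbookshelf.com/sci-fi/index.html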
wget has incredible capabilities. You can give it a list of file types you want, or a list of file types to specifically exclude, and you can also exclude certain directories. It can do lots, lots more.
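For instance, to keep only the zip and html files and skip a directory you don't want (the audio directory below is made up, just to show the syntax), something like this should do it:
wget -m -k -p -E -np -A zip,html -X /sci-fi/audio http://arthursbookshelf.com/sci-fi/index.html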
no subject
Date: 2010-09-26 06:59 am (UTC)
Here's a bunch of e-book stuff you might like, if you don't know already:
http://www.metafilter.com/95773/Arthurs-Classic-Novels-his-Love-of-Mankind-and-the-Internet
no subject
Date: 2010-09-26 08:11 am (UTC)
By the way, I've added you to my circle too.
no subject
Date: 2010-09-27 07:58 am (UTC)
wget http://www.gutenberg.org/files/30452/30452-h.zip
wget http://www.gutenberg.org/files/33016/33016-h.zip
wget http://www.gutenberg.org/files/30124/30124-h.zip
wget http://www.gutenberg.org/files/31168/31168-h.zip
wget http://www.gutenberg.org/files/31893/31893-h.zip
wget http://www.gutenberg.org/files/30166/30166-h.zip
wget http://www.gutenberg.org/files/30532/30532-h.zip
wget http://www.gutenberg.org/files/29390/29390-h.zip
wget http://www.gutenberg.org/files/29768/29768-h.zip
wget http://www.gutenberg.org/files/30691/30691-h.zip
wget http://www.gutenberg.org/files/28617/28617-h.zip
wget http://www.gutenberg.org/files/30177/30177-h.zip
wget http://www.gutenberg.org/files/29198/29198-h.zip
wget http://www.gutenberg.org/files/29848/29848-h.zip
wget http://www.gutenberg.org/files/29607/29607-h.zip
wget http://www.gutenberg.org/files/29809/29809-h.zip
wget http://www.gutenberg.org/files/29919/29919-h.zip
wget http://www.gutenberg.org/files/29882/29882-h.zip
wget http://www.gutenberg.org/files/29255/29255-h.zip
I really should have had the script unzip the files and rename them with the name and date of each magazine too, but I'll do that later; something like the sketch below would do it.
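For anyone doing the same, a rough sketch of the unzip step (the rename at the end is just the pattern, with a made-up title and date; you'd look up the real ones on each book's Gutenberg page):
for f in *-h.zip; do
  id=${f%-h.zip}                            # e.g. 30452
  mkdir -p "$id" && unzip -o "$f" -d "$id"  # one directory per e-text
done
# then rename by hand once you know what each issue is, e.g.
# mv 30452 "Magazine Name YYYY-MM"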