A Practical Exercise in Web Scraping
Yesterday a friend of mine linked me to a fictional web serial that he was reading and enjoying, but could be enjoying more if it were available as a Kindle book. The author hasn't made one available yet and has asked that fan-made versions not be linked publicly. That said, it's a very long story that would be much easier to read in a dedicated reading app, so I built my own Kindle version to enjoy. This post is the story of how I built it.
Step 1: Source Analysis
The first step of any kind of web scraping is to understand your target. Here's what the first blog post looks like (with different content):
<h1 class="entry-title">The Whale</h1>
<div class="entry-content">
  <p>
    <a title="Next Chapter" href="http://example.com/the/next/chapter">
      Next Chapter
    </a>
  </p>
  <p>"And what tune is it ye pull to, men?"</p>
  <p>"A dead whale or a stove boat!"</p>
  <p>
    More and more strangely and fiercely glad and approving, grew the
    countenance of the old man at every shout; while the mariners
    began to gaze curiously at each other, as if marvelling how it
    was that they themselves became so excited at such seemingly
    purposeless questions.
  </p>
  <p>
    But, they were all eagerness again, as Ahab, now half-revolving in
    his pivot-hole, with one hand reaching high up a shroud, and
    tightly, almost convulsively grasping it, addressed them
    thus:—
  </p>
  <p>
    <a title="Next Chapter" href="http://example.com/the/next/chapter">
      Next Chapter
    </a>
  </p>
</div>
After browsing around I found a table of contents, but since all of the posts were linked together with "Next Chapter" pointers it seemed easier to just walk those. The other interesting thing here is that there's a comment section that I didn't really care about.
Step 2: Choose Your Tools
The next stage of web scraping is to choose the appropriate tools. I started with just curl and probably could have gotten pretty far, but I knew the DOM futzing I wanted to do would require something more powerful later on. At the moment Ruby is what I turn to for most things, so naturally I picked Nokogiri. The first example on the Nokogiri docs page is actually a web scraping example, and that's basically what I cribbed from. Here's the initial version of the scraping function:
require 'open-uri'
require 'nokogiri'

def scrape_page(url)
  html = open(url)                              # open-uri makes the URL readable like a file
  doc = Nokogiri::HTML(html.read)               # parse a String rather than the IO itself
  doc.encoding = 'utf-8'

  content = doc.css('div.entry-content').first  # the post body
  title = doc.css('h1.entry-title')             # the chapter heading

  # Grab the href from the "Next Chapter" links, then strip them out of the content
  next_url = ""
  content.search('a[title="Next Chapter"]').each do |node|
    next_url = node['href']
    node.parent.remove
  end

  {
    title: title,
    content: content,
    next_url: next_url
  }
end
Ruby has a built-in capability for opening URLs as readable files with the open-uri standard library module. Because of various problems with Nokogiri's unicode handling that I learned about in previous web scraping experiences, the best thing to do is to pass Nokogiri a string instead of the actual IO handle. Setting the encoding explicitly is also a best practice.
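Roughly, the difference looks like this (a sketch, with url standing in for any chapter address; on Ruby 3+ the open-uri entry point is spelled URI.open rather than a bare open):

# Letting Nokogiri read the IO means it also gets to guess the encoding:
#   doc = Nokogiri::HTML(open(url))
# Reading first hands it a plain String, and then we set the encoding ourselves:
doc = Nokogiri::HTML(open(url).read)
doc.encoding = 'utf-8'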
Then it's a simple matter of using Nokogiri's css selector method to pick out the nodes we're interested in and return them to the caller. The idea is that, since each page is linked to its successor, we can just follow the links.
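Called on a chapter URL, the function hands back everything the crawl loop will need (the URL below is the same kind of placeholder used throughout this post):

res = scrape_page('http://example.com/the/first/chapter/')
res[:title].text    # the chapter heading text, e.g. "The Whale" in the sample markup above
res[:content]       # the entry-content div, with the "Next Chapter" paragraphs stripped out
res[:next_url]      # the href those links pointed at, or "" if there wasn't one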
Step 3: The Inevitable Bugfix Iteration
Of course it's never that easy. It turns out these links are written by hand, and across hundreds of blog posts there are bound to be some inconsistencies. At some point the author stopped using the title attribute, so instead of the super clever CSS selector a[title="Next Chapter"] I had to switch to grabbing all of the anchor tags and selecting based on the text:
content.search('a').each do |node|
  if node.text == "Next Chapter"
    next_url = node['href']
  end
  node.parent.remove
end
This works great, except that in a few cases there's some whitespace in the text of the anchor node, so I had to switch to a regex:
content.search('a').each do |node|
  if node.text =~ /\s*Next Chapter\s*/
    next_url = node['href']
  end
  node.parent.remove
end
Another sticking point was that sometimes (but not always) the author used non-ASCII characters in their URLs. The trick for dealing with possibly-escaped URLs is to check whether decoding does anything. If it does, the URL is already escaped and shouldn't be messed with:
def escape_if_needed(url)
  if URI.unescape(url) == url   # decoding changed nothing, so this URL isn't escaped yet
    return URI.escape(url)
  end
  url
end
Step 4: Repeat As Necessary
Now that we can reliably scrape one URL, it's time to actually follow the links:
task :scrape do
  next_url = 'http://example.com/the/first/chapter/'
  sh "mkdir -p output"
  counter = 0

  # Keep going until there's no next link or it points off the site
  while next_url && next_url =~ /example.com/
    STDERR.puts(next_url)
    res = scrape_page(next_url)
    next_url = res[:next_url]
    title = res[:title].text

    File.open("output/#{sprintf('%04d', counter)}.html", "w+") do |f|
      f.puts res[:title]
      f.puts res[:content]
    end

    counter += 1
    sleep 1   # don't hammer the author's server
  end
end
This is pretty simple. Set some initial state, make a directory to put the scraped pages in, then follow each link in turn and write the interesting content out to sequential files. Note that the file names are all four-digit numbers so that the sequence is preserved even under lexicographical sorting.
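That last point is easy to sanity-check in irb:

sprintf('%04d', 7)                             # => "0007"
['0002.html', '0010.html', '0001.html'].sort   # => ["0001.html", "0002.html", "0010.html"]
['2.html', '10.html', '1.html'].sort           # => ["1.html", "10.html", "2.html"], which is why the padding matters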
Step 5: Actually Build The Book
At first I wanted to use Docverter, my project that mashes up pandoc and calibre for building rich documents (including ebooks) out of plain text files. I tried the demo installation first, but that runs on Heroku and repeatedly ran out of memory, so I tried a local installation. That timed out (did I mention that this web serial is also very long?), so instead I just ran pandoc and ebook-convert directly:
task :build do
  File.open("input.html", "w+") do |f|
    Dir.glob('output/*.html').sort.each do |filename|
      f.write File.read(filename)
    end
  end

  STDERR.puts "Running conversion..."
  sh("pandoc --standalone --output=output.epub --from=html --to=epub --epub-metadata=metadata.xml --epub-stylesheet=epub_stylesheet.css input.html")
  sh("ebook-convert output.epub output.mobi")
end
Pandoc can take multiple input files, but it was easier to manage one input file on the command line. The stylesheet and metadata XML files are lifted directly from the mmp-builder project that I use to build Mastering Modern Payments, with appropriate changes to the authorship information.
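For what it's worth, the multi-file form would look something like this (an untested sketch, reusing the same flags as above; pandoc concatenates its input files in argument order):

sh("pandoc --standalone --output=output.epub --from=html --to=epub " \
   "--epub-metadata=metadata.xml --epub-stylesheet=epub_stylesheet.css " +
   Dir.glob('output/*.html').sort.join(' '))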
In Conclusion, Please Don't Violate Copyright
Making your own ebooks is not hard with the tools that are out there. It's really just a matter of gluing them together with an appropriate amount of duct tape and baling twine.
That said, distributing content that isn't yours without permission directly affects authors, and platform shifting like this is sort of a gray area. The author of this web serial seems to be fine with fan-made ebook editions as long as they don't get distributed, which is why I anonymized this post.