Replicating Flipboard Part I – Site Scraping

Having taken a good long look at the social magazine Flipboard, it was time to dig beneath the cover and contemplate what kind of technology lies behind its minimalistic interface.

I began by studying the inconspicuous yet critically important Reader feature. Personally, if Flipboard were to pop open mobile Safari each time I clicked into an article, I would have deleted the app right away, so to me this is the number one feature. One of the wonders of Flipboard is how it manages to scrape relevant content from websites in such an elegant and effortless manner.

Expanded web content view on Flipboard (image: wpcdn.padgadget.com)

Part of my job in recent years has involved web scraping and crawling, and precisely because I know how scrapers and crawlers work, the speed and precision with which Flipboard does it left me very impressed – there are apparently no machine learning models, no editorial selection, no reliance on metadata, no nonsense. Most of the time, you can throw any link at the app – Facebook, bit.ly, Flickr, YouTube videos, RSS – and it swallows it and churns out the core content in the most aesthetically pleasing form, like a humble neoclassicist painter. What technical options out there are capable of such an achievement?

Safari Reader

After days of testing and failing, I was ready to throw in the towel, abandon the idea and go back to reading about iOS and cocos2d – until one day a little blue icon on the desktop screamed out at me. It was the “Reader” icon in Safari’s toolbar.

Introduced in Safari 5, Reader is a feature that extracts the most prominent content of a web page and displays it in a clutter-free overlay. Reader works best with text-heavy pages, such as news articles and the like. Now that’s not unlike what Flipboard does.

Content extracted by Safari Reader

Is there any chance that Flipboard employs a scraping technique similar to Apple’s? Would it take us a step closer if we could somehow crack Safari Reader’s secret? Clues were revealing themselves here and there. After a bit of googling, I landed on the site of the Readability project by a group called Arc90.

The main work of the Readability project is a JavaScript file that parses the DOM tree of a document and extracts the section where the most relevant content lies. I threw a few URLs at Readability’s test page, and was pleasantly surprised by how comparable the results were to Safari’s Reader, even for more complicated pages (some sources say that Safari actually ships with a version of Readability.js bundled in).
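The gist of that kind of extraction can be sketched in a few lines. The scoring below is a toy approximation, not Readability’s actual algorithm: it rewards candidate blocks for text length and commas (a rough proxy for prose) and punishes link-heavy blocks such as navigation bars.

```python
import re

def link_density(block_html):
    """Fraction of a block's visible text that sits inside <a> tags."""
    text = re.sub(r"<[^>]+>", "", block_html)
    link_text = "".join(re.findall(r"<a\b[^>]*>(.*?)</a>", block_html, re.S))
    link_text = re.sub(r"<[^>]+>", "", link_text)
    return len(link_text) / max(len(text), 1)

def score_block(block_html):
    text = re.sub(r"<[^>]+>", "", block_html).strip()
    score = min(len(text) / 100, 3)        # reward longer text, capped
    score += text.count(",")               # prose tends to contain commas
    score *= 1 - link_density(block_html)  # punish link-heavy blocks (nav bars)
    return score

def extract_main_block(blocks):
    """Return the candidate block most likely to hold the article body."""
    return max(blocks, key=score_block)

article = ("<p>A long, rambling paragraph of body text, with commas, "
           "clauses, and enough length to look like real prose.</p>")
navbar = '<a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a>'
print(extract_main_block([navbar, article]) == article)  # → True
```

The real Readability.js works on a parsed DOM rather than regexes and propagates scores up to parent nodes, but the principle – score candidates, keep the winner – is the same.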

Since the JavaScript isn’t viable in non-browser environments, I turned to the PHP port by Keyvan Minoukadeh. With some extra cleaning logic thrown on top, I was able to get very good results with the sample news article pages.
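The “extra cleaning logic” I mention is nothing exotic – roughly the sort of post-extraction tidying sketched below (the rules here are illustrative, not the exact ones I used):

```python
import re

def clean_extracted_html(html):
    # Drop any <script> blocks the extractor let through.
    html = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    # Drop empty or whitespace-only paragraphs.
    html = re.sub(r"<p>\s*(&nbsp;)?\s*</p>", "", html, flags=re.I)
    # Collapse runs of spaces and tabs.
    html = re.sub(r"[ \t]+", " ", html)
    return html.strip()

messy = "<p>Real   text.</p><p> </p><script>var ads=1;</script>"
print(clean_extracted_html(messy))  # → <p>Real text.</p>
```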

Source: Times article versus Readability PHP extracted result

original article on Times.com

Left to right: Flipboard render, Safari 5 Reader, Readability PHP port

Source: a cartoon on cracked.com versus the cracked.com article scraped by Readability PHP

A cartoon page on cracked.com

Clockwise: Safari 5 wouldn't show "Reader" mode; the Readability PHP port extracted gibberish JS code; Flipboard gave up crawling the page and showed the URL instead

The tool was not without limitations, of course. With looser layouts, or pages that aren’t text-centric, the results are often gibberish and unusable. For instance, Readability would throw its hands up in the air when given a Flickr photoset URL.
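One practical defence against that failure mode is to sanity-check the extracted text and fall back to showing the bare URL when it doesn’t look like prose – which is apparently what Flipboard did with the cracked.com page above. The thresholds here are my guesses, not Flipboard’s actual rules:

```python
def looks_like_prose(text, min_length=200):
    """Crude check: prose is long and mostly letters; JS gibberish isn't."""
    if len(text) < min_length:
        return False
    letters = sum(c.isalpha() or c.isspace() for c in text)
    return letters / len(text) > 0.8  # gibberish/JS is punctuation-heavy

def render(url, extracted_text):
    """Show the extracted article if it passes the check, else just the URL."""
    return extracted_text if looks_like_prose(extracted_text) else url

print(render("http://example.com/photos", "var a={x:1};" * 30))
# → http://example.com/photos
```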

Flipboard most probably employs several scraping techniques suited to different types of web pages. With lots more to study, I shall leave the scraping part at that and revisit this space later.

Continued:
Part II – Social Signals

 



8 Comments

  1. Hi kenshin,

    This replicating flipboard project sounds great. I’m really interested in seeing how you get on. I notice in this post you tried my PHP Readability port for site scraping. I use this in (and ported it for) a tool that would let people create full-text RSS feeds from partial ones: http://fivefilters.org/content-only/

    I’ve been thinking a little bit about how to improve the content extraction side of things. One idea is to supplement this auto-extraction with CSS/XPath patterns for specific sites. There are already two sets of community maintained patterns available that I know of.

    The largest database that I know of was pointed out to me by someone who came across my full-text feed project. It contains XPath patterns for (at time of writing) 2861 sites. You can see a list here: http://wedata.net/databases/LDRFullFeed/items
    Searching for ‘cracked’ shows up an entry for cracked.com: http://wedata.net/databases/LDRFullFeed/items?query=cracked

    Another, somewhat shorter, list of XPath patterns is maintained by the Instapaper project: http://www.instapaper.com/bodytext (you may need to login to see the list). It seems to contain just over 100 sites. Users can contribute patterns for sites they’d like supported, and in return the developer has agreed to share the database of patterns with anyone interested in using it.
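The per-site pattern idea Keyvan describes can be sketched like this. The domain entry and selector below are invented for illustration, and real HTML would need a tolerant parser (the LDRFullFeed database also uses richer XPath than ElementTree's subset supports):

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site pattern table, in the spirit of LDRFullFeed/Instapaper.
SITE_PATTERNS = {
    "example.com": ".//div[@id='article-body']",
}

def extract_with_pattern(domain, html):
    """Use a site-specific selector when one exists; otherwise signal a
    fallback to generic auto-extraction by returning None."""
    xpath = SITE_PATTERNS.get(domain)
    if xpath is None:
        return None
    root = ET.fromstring(html)
    node = root.find(xpath)
    return node.text if node is not None else None

page = ("<html><body><div id='nav'>menu</div>"
        "<div id='article-body'>The story itself.</div></body></html>")
print(extract_with_pattern("example.com", page))  # → The story itself.
```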

    Finally, the strange Javascript code produced by the PHP port of Readability is a result of the PHP DOM processor not being able to handle script elements very well. In the Full-Text RSS project I get around this by running everything through Tidy first before giving it to PHP Readability (which relies on PHP’s DOM extension). Although in this case, doing so results in identification and extraction of the comments block rather than the main content. You can try it at: http://fivefilters.org/content-only/

    Cheers,

    Keyvan

    • Thanks Keyvan, that’s really an eye opener for me! I’m still pondering whether the XPath-repository approach or machine-learned classifiers work better for a medium as volatile as the web. Let me do a follow-up post later. Cheers for your great effort on the project!

      Btw, it seems that Flipboard really does make use of Readability, with additional “Regex based rule system for special case handling on a per-domain basis” – source. Impressive stuff.
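A “regex based rule system … on a per-domain basis” might look something like the following in miniature – the domain and rules here are invented purely for illustration:

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain cleanup rules: (pattern, replacement) pairs.
DOMAIN_RULES = {
    "example.org": [
        (re.compile(r'<div class="promo">.*?</div>', re.S), ""),  # strip promo boxes
        (re.compile(r"&nbsp;"), " "),
    ],
}

def apply_domain_rules(url, html):
    """Apply any special-case rules registered for the URL's domain."""
    domain = urlparse(url).netloc
    for pattern, replacement in DOMAIN_RULES.get(domain, []):
        html = pattern.sub(replacement, html)
    return html

html = '<p>Story&nbsp;text.</p><div class="promo">Subscribe!</div>'
print(apply_domain_rules("http://example.org/a", html))  # → <p>Story text.</p>
```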

  2. Thanks kenshin – would be very interested in seeing any machine-learning approach, although that’s really not a topic I’m at all familiar with. :)

    Interesting about Flipboard using readability too! :)

    • Andreas: I use the following (not sure if HTML elements are supported, so this may not appear properly):

      $tidy_config = array(
        'clean' => true,
        'output-xhtml' => true,
        'logical-emphasis' => true,
        'show-body-only' => false,
        'wrap' => 0,
        'drop-empty-paras' => true,
        'drop-proprietary-attributes' => false,
        'enclose-text' => true,
        'enclose-block-text' => true,
        'merge-divs' => true,
        'merge-spans' => true,
        'char-encoding' => 'utf8',
        'hide-comments' => true
      );

      Although leaving all that out and relying on the defaults will be enough to fix the type of JS gibberish output kenshin encountered here.

      Also might be useful to say that in my tests I’ve come across a few sites where tidy chokes on the input and fails to produce anything at all – in these cases it would be better to bypass tidy.
