Replicating Flipboard Part I – Site Scraping

Having taken a good long look at the social magazine Flipboard, I decided it was time to dig beneath the cover and consider what kind of technology lies behind its minimalistic interface.

I began by studying the inconspicuous yet critically important Reader feature. Personally, if Flipboard were to pop open mobile Safari each time I tapped into an article, I would have deleted the app right away, so to me that is the number one feature. One of the wonders of Flipboard is how it manages to scrape relevant content from web sites in such an elegant and effortless manner.

Expanded web content view on Flipboard. wpcdn.padgadget.com

Part of my job in recent years has involved web scraping and crawling, and precisely because I know how scrapers and crawlers work, the speed and precision with which Flipboard does it left me very impressed. There are apparently no machine learning models, no editorial selection, no reliance on metadata, no nonsense. Most of the time, you just throw any link at the app (Facebook, bit.ly, Flickr, YouTube videos, RSS) and it swallows it and churns out the core content in the most aesthetically pleasing format, like a humble neoclassical painter. What technical options out there are capable of such an achievement?

Safari Reader

After days of testing and failing, I was prepared to throw in the towel, abandon the idea and go back to reading about iOS and cocos2d – until, by chance, a little blue icon on the desktop screamed out at me. It was the “Reader” icon in Safari’s toolbar.

Introduced in Safari 5, the Reader is a feature that extracts the most prominent content of a web page and displays it in a clutter-free overlay. The Reader works best with text-heavy pages, such as news articles. That’s not unlike what Flipboard does.

Content extracted by Safari Reader

Is there any chance that Flipboard employs a scraping technique similar to Apple’s? Would it take us a step closer if we could somehow crack Safari Reader’s secret? Clues were revealing themselves here and there. After a bit of googling, I landed on the site of the Readability project by a group called Arc90.

The core of the Readability project is a JavaScript file that parses the DOM tree of a document and extracts the section where the most relevant content lies. I threw a few URLs at Readability’s test page and was pleasantly surprised by how comparable the results were to Safari’s Reader, even for more complicated pages (some sources say that Safari ships with a version of Readability.js bundled in).
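To make the idea of "finding the section where the most relevant content lies" concrete, here is a minimal Python sketch of that kind of heuristic. It is not Readability's actual algorithm: the scoring rule (points for each direct `<p>` child, its commas and its length, discounted by how much of the container's text sits inside links) is a simplified assumption, and every name in it (`Node`, `TreeBuilder`, `score`, `extract_main_text`) is made up for illustration. It uses only the standard library.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}          # their text is never content
CANDIDATE_TAGS = {"article", "div", "main", "section", "td"}


class Node:
    """A minimal DOM node: children are either Node objects or raw strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def text(self):
        """Concatenate all descendant text and collapse whitespace."""
        parts = [c.text() if isinstance(c, Node) else c for c in self.children]
        return " ".join("".join(parts).split())


class TreeBuilder(HTMLParser):
    """Builds a rough element tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.children.append(data)


def iter_nodes(node):
    """Yield every descendant element of node, depth first."""
    for child in node.children:
        if isinstance(child, Node):
            yield child
            yield from iter_nodes(child)


def link_density(node):
    """Fraction of the node's text that sits inside <a> elements."""
    total = len(node.text())
    if total == 0:
        return 0.0
    linked = sum(len(n.text()) for n in iter_nodes(node) if n.tag == "a")
    return min(1.0, linked / total)


def score(node):
    """Assumed heuristic: reward direct <p> children, penalise link-heavy blocks."""
    points = 0.0
    for child in node.children:
        if isinstance(child, Node) and child.tag == "p":
            text = child.text()
            points += 1 + text.count(",") + len(text) / 100
    return points * (1 - link_density(node))


def extract_main_text(html):
    """Return the text of the best-scoring container, or None if nothing scores."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in iter_nodes(builder.root) if n.tag in CANDIDATE_TAGS]
    best = max(candidates, key=score, default=None)
    if best is None or score(best) <= 0:
        return None
    return best.text()
```

Calling `extract_main_text(page_html)` on a typical article page should favour the block holding the paragraphs over navigation bars and footers, since those consist mostly of links. A page with no paragraph-bearing container returns `None`, which matches the way the real tools give up on pages that are not text-centric.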

Since the JS isn’t very practical to run in non-browser environments, I turned to the PHP port by Keyvan Minoukadeh. With some extra cleaning logic thrown on top, I was able to get very good results with the sample news article pages.

Source: Times article versus Readability PHP extracted result

original article on Times.com

Left to right: Flipboard render, Safari 5 Reader, Readability PHP port

Source: A cartoon on cracked.com versus the cracked.com article scraped by Readability PHP

A cartoon page on cracked.com

Clockwise: Safari 5 wouldn't show "Reader" mode; Readability PHP extracted gibberish JS code; Flipboard gave up crawling the page and shows the URL instead

Of course, the tool is not without limitations. With loosely structured pages, or pages that aren’t text-centric, the results are often gibberish and unusable. For instance, Readability would throw its hands up in the air when given a Flickr photoset URL.

Flipboard most probably employs several scraping techniques suited to different types of web pages. With lots more to study, I shall leave the scraping part at that and revisit this space later.
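As a rough illustration of what "several techniques for different types of pages" could look like, here is a small Python sketch that picks an extraction strategy from the URL and the response content type. This is purely a guess at the shape of such a dispatcher, not Flipboard's actual design: the strategy names (`"photo"`, `"video"`, `"feed"`, `"article"`), the host table and the function `pick_strategy` are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from well-known hosts to a specialised strategy.
MEDIA_HOSTS = {
    "flickr.com": "photo",
    "youtube.com": "video",
}

# Content types that indicate a feed rather than an HTML page.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def pick_strategy(url, content_type="text/html"):
    """Return the name of the (assumed) extraction strategy for a link."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    if mime in FEED_TYPES:
        return "feed"

    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    if host in MEDIA_HOSTS:
        return MEDIA_HOSTS[host]

    # Fall back to Readability-style article extraction for everything else.
    return "article"
```

With a table like this, a Flickr photoset would go to a photo-specific handler instead of the text-centric extractor that chokes on it, while ordinary news pages still take the article path.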

Continued:
Part II – Social Signals
