Replicating Flipboard Part I – Site Scraping
Having taken a good long look at the social magazine Flipboard, it was time to dig beneath the cover and contemplate what kind of technology lies behind its minimalistic interface.
I began by studying the inconspicuous yet critically important Reader feature. Personally, if Flipboard were to pop open mobile Safari each time I tapped into an article, I would have deleted the app right away, so to me that is the number one feature. One of the wonders of Flipboard is how it manages to scrape relevant content from web sites in such an elegant and effortless manner.
Part of my job in recent years has involved web scraping and crawling, and precisely because I know how scrapers and crawlers work, the speed and precision with which Flipboard does it left me very impressed: apparently no machine learning models, no editorial selection, no reliance on metadata, no nonsense. Most of the time you can throw any link at the app (Facebook, bit.ly, Flickr, YouTube videos, RSS) and it swallows it and churns out the core content in the most aesthetically pleasing format, like a humble neoclassical painter. What technical options out there are capable of such a feat?
After days of testing and failing, I was prepared to throw in the towel, abandon the idea and go back to reading about iOS and cocos2d – until incidentally a little blue icon on the desktop screamed out at me. That was the “Reader” icon in Safari’s toolbar.
Introduced in Safari 5, the Reader is a feature that extracts the most prominent content of a web page and displays it in a clutter-free overlay. The Reader works best with pages that are text heavy, such as news articles and the like. Now that’s not unlike what Flipboard does.
Is there any chance that Flipboard employs a scraping technique similar to Apple’s? Would it take us a step closer if we could somehow crack Safari Reader’s secret? Clues were revealing themselves here and there. After a bit of googling, I landed on the site of the Readability project by a group called Arc90.
Since Readability’s JavaScript isn’t very viable for running in non-browser environments, I turned to the PHP port by Keyvan Minoukadeh. With some extra cleaning logic thrown on top, I was able to get very good results with the sample news article pages.
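To give a feel for how this family of tools can pick out an article body, here is a minimal Python sketch of a Readability-style heuristic: credit each paragraph's text to its nearest enclosing container and keep the container with the highest score. This is not the Readability code or the PHP port; the tag sets, the 25-character cutoff and the scoring weights are my own illustrative assumptions.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents are never article text.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
# Assumption: containers that may hold the article body.
CANDIDATE_TAGS = {"div", "article", "section", "td"}
# Assumption: shorter paragraphs are treated as menu or caption noise.
MIN_PARAGRAPH_CHARS = 25


class CandidateScorer(HTMLParser):
    """Credits each paragraph's text to its innermost enclosing container."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open candidate containers, innermost last
        self.candidates = []   # every candidate container seen so far
        self.skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self.in_p = False
        self.buf = []

    def flush_paragraph(self):
        """Score the paragraph being collected, if any, and reset state."""
        if not self.in_p:
            return
        self.in_p = False
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if self.stack and len(text) >= MIN_PARAGRAPH_CHARS:
            node = self.stack[-1]
            node["paragraphs"].append(text)
            # Illustrative weights: a base point, one per comma,
            # and up to three more for sheer length.
            node["score"] += 1 + text.count(",") + min(len(text) // 100, 3)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CANDIDATE_TAGS:
            self.flush_paragraph()  # a new block ends an unclosed <p>
            node = {"tag": tag, "score": 0, "paragraphs": []}
            self.candidates.append(node)
            self.stack.append(node)
        elif tag == "p":
            self.flush_paragraph()  # tolerate a missing </p>
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CANDIDATE_TAGS:
            self.flush_paragraph()
            if self.stack:
                self.stack.pop()
        elif tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.buf.append(data)


def extract_main_text(html):
    """Return the paragraphs of the best-scoring container, or ''."""
    parser = CandidateScorer()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # in case the document ends inside a <p>
    best = max(parser.candidates, key=lambda n: n["score"], default=None)
    if best is None or best["score"] == 0:
        return ""
    return "\n\n".join(best["paragraphs"])
```

Calling `extract_main_text(page_html)` on a news article page should return the story paragraphs while navigation and footer blocks, whose paragraphs are short and comma-free, score nothing. A real implementation would also weigh class names, link density and ancestor containers.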
The tool was certainly not without limitations. With loosely structured pages, or pages that aren’t text-centric, the results are often gibberish and unusable. For instance, Readability would throw its hands up in the air when given a Flickr photoset URL.
Flipboard most probably employs several scraping techniques suited to different types of web pages. With lots more to study, I shall leave the scraping part at that and revisit this space later.
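If a scraper does combine several techniques, the first step is presumably deciding which one to apply to a given link. The Python sketch below shows one naive way to do that by looking at the URL alone. It is purely speculative: the host table, the feed suffixes and the strategy names are hypothetical, and nothing here is known about how Flipboard actually routes links.

```python
from urllib.parse import urlparse

# Hypothetical host-to-strategy table, for illustration only.
STRATEGY_BY_HOST = {
    "flickr.com": "photo_gallery",
    "youtube.com": "video_embed",
    "youtu.be": "video_embed",
}
# Hypothetical path suffixes that suggest a feed rather than a page.
FEED_SUFFIXES = (".rss", ".xml", ".atom")


def pick_strategy(url):
    """Guess which extraction strategy suits a URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for known_host, strategy in STRATEGY_BY_HOST.items():
        # Match the host itself or any of its subdomains.
        if host == known_host or host.endswith("." + known_host):
            return strategy
    if parsed.path.lower().endswith(FEED_SUFFIXES):
        return "feed"
    # Fall back to Readability-style text extraction.
    return "article_text"
```

A Flickr photoset link would then be routed to a gallery extractor instead of being fed to a text-density heuristic that has nothing to score. In practice the decision would more likely depend on the fetched page as well, for example its content type or markup, not just the URL.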
Part II – Social Signals
Scraping and other issues:
- How @flipboard does image cropping. Interview of CTO @avh
- Is Flipboard legal
- From a lawyer’s point of view
- About Flipboard and reading surfaces