The Three Pillars of Social Reader Relevancy (I)

In web search, the ranking of results is primarily determined by their freshness, their relevancy to the search query, and their content quality. Freshness needs little explanation. Relevancy approximates how much of a web site’s content has something to do with the user’s query. Content quality indicates how “good” the site’s information is, given factors like PageRank, spam scores and so on.

Once we cross the line into the mobile world, however, these three factors lose much of their usefulness. Text input on mobile devices is largely impractical and traditional web pages don’t render well, so media discovery and consumption on mobile devices is generally inferior to the same experience with print media like newspapers and magazines. While big-name players attempt to tackle the issue simply by snapping on extra features (Google Mobile Voice Search, Google Instant Preview for Mobile, etc.), the underlying problems remain unresolved because the ranking algorithm is the same as its desktop counterpart. Flipboard is so great because, in my opinion, it has found and defined the new three pillars of relevancy for mobile content consumption: freshness, social, and readability*. And they work wonderfully.

With this in mind, I put some work into the server components of Cassius. From a simple script that turns a tweet into a JSON feed, the pipeline has grown to include saving documents into a transitional store (MongoDB) and running a series of quality measurement calculations. While the extra processing means we won’t be able to serve the feed in real time, I hope the results justify the cost.

How well does your article read?

In Zite or Flipboard, it’s not uncommon to run into articles with summary texts that resemble gibberish (see below). The issue is often a result of raw HTML elements being incorrectly identified as meaningful content, and it is very hard to avoid. I have seen attempts to solve this problem using NLP and machine learning classification methods, with varying degrees of success. Since those are beyond my capabilities, I opted for a more traditional way to measure the quality of a piece of writing: taking its readability metrics. According to Wikipedia, readability evaluation refers to “the ease in which text can be read and understood”, and “…various factors to measure readability have been used, such as speed of perception, perceptibility at a distance, perceptibility in peripheral vision, visibility, the reflex blink technique, rate of work, eye movements, and fatigue in reading…”. Readability measurement tools are widely available, and are embedded in word processors and email clients.

Figure: results of bad scraping

In a nutshell, the tools apply different statistical formulas to a piece of English writing, and the resulting scores form an impression of how understandable it is. The formulas typically break text into syntactic components such as words and sentences and count their distribution or frequency in relation to the text being analyzed. The most common readability formulas are Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog, SMOG, Coleman-Liau and the Automated Readability Index (ARI).
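As a rough sketch of how such formulas work, the Python below computes six widely published ones from simple counts of sentences, words, letters and syllables. The function names (`count_syllables`, `readability_scores`) and the dictionary keys are hypothetical, and the syllable counter is a crude vowel-group heuristic that I am assuming is good enough for illustration; real tools use more careful tokenization.

```python
import math
import re


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per vowel group."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a trailing silent 'e' ("made"), but keep "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability_scores(text):
    """Return a dict of six readability scores for a piece of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")

    n_sent = len(sentences)
    n_words = len(words)
    # Letters only; ARI is normally defined over letters and digits.
    n_letters = sum(1 for w in words for ch in w if ch.isalpha())
    syllables = [count_syllables(w) for w in words]
    n_syll = sum(syllables)
    # "Complex" words: three or more syllables (no proper-noun exclusions here).
    n_complex = sum(1 for s in syllables if s >= 3)

    words_per_sentence = n_words / n_sent
    syllables_per_word = n_syll / n_words

    return {
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "gunning_fog": 0.4 * (words_per_sentence + 100 * n_complex / n_words),
        "smog": 1.043 * math.sqrt(n_complex * 30 / n_sent) + 3.1291,
        "coleman_liau": (0.0588 * (100 * n_letters / n_words)
                         - 0.296 * (100 * n_sent / n_words) - 15.8),
        "ari": 4.71 * (n_letters / n_words) + 0.5 * words_per_sentence - 21.43,
    }
```

Note that Flesch Reading Ease runs the opposite way from the others: a higher score means easier text, while the remaining five approximate a school grade level, so a higher score means harder text.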

I find it more pleasing to read blog posts and articles on Flipboard/Zite that are about a page in length. Content that spans multiple pages is too demanding for a casual read, while short tweets or one-liners aren’t worth the two-click effort to expand and shrink them from the page (yes, really). For simplicity, let’s take my reading habits as the standard, and use the following thresholds for computation:

  • Flesch – 50 (Time magazine has a score of about 52)
  • Flesch-Kincaid – 13 (pre-college level)
  • Gunning Fog – 12 (texts for a wide audience have a fog index of less than 12)
  • SMOG – 13 (pre-college level)
  • Coleman Liau – 13 (pre-college level)
  • ARI – 13 (pre-college level)
The gist here takes the readability scores, evaluates their distances from the thresholds and combines the results as a mean. Very straightforward.
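To make that idea concrete, here is a minimal sketch of one way the combination could work. It is not the gist itself: the names `THRESHOLDS` and `readability_quality` are hypothetical, and I am assuming that distance is measured symmetrically, relative to each threshold, and turned into a closeness value between 0 and 1 so that metrics on different scales can be averaged.

```python
# Thresholds from the list above; the key names are illustrative assumptions.
THRESHOLDS = {
    "flesch": 50,
    "flesch_kincaid": 13,
    "gunning_fog": 12,
    "smog": 13,
    "coleman_liau": 13,
    "ari": 13,
}


def readability_quality(scores, thresholds=THRESHOLDS):
    """Mean closeness of the given scores to their thresholds, in [0, 1]."""
    closeness = []
    for name, target in thresholds.items():
        if name not in scores:
            continue  # skip metrics that were not computed
        # Assumption: relative, symmetric distance from the threshold.
        relative_distance = abs(scores[name] - target) / target
        # 1.0 means exactly on the threshold; clamp far-off scores to 0.0.
        closeness.append(max(0.0, 1.0 - relative_distance))
    if not closeness:
        raise ValueError("no known readability metrics in scores")
    return sum(closeness) / len(closeness)
```

With this scheme an article whose scores sit right on the thresholds gets 1.0, while badly scraped text, whose scores tend to land far from any of them, drifts toward 0.0. Whether overshooting a threshold should be penalized as much as undershooting it is a design choice this sketch does not settle.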

In the next post we’ll continue to explore the three pillars, and look at some test results to see whether the additional aspect of readability would help us create a feed that is better optimized for the user’s final reading experience.