Replicating Flipboard Part II – Social Signals

Image: http://www.flickr.com/photos/19marksdesign/

A quick update on the project.

Most people never stray far from Flipboard’s default sections (e.g. Tech, Gaming, Fashion), which feature collections of articles from sources hand-picked by the Flipboard editorial team. Pay closer attention, though, and you will notice that the layout of those articles in the Flipboard grids is not simply based on published time or source. Length of the headline, size of the embedded image, similarity of the topics: all these factors seem to come into consideration. Then, of course, the social strength (tweet count, retweet count, Facebook likes, etc.) of a particular article is a major relevancy factor too. In my opinion, the adoption of a brand new ranking paradigm, social strength, was the single most ground-breaking idea in Flipboard’s design, not unlike how Google invented PageRank and changed the game in search.

This post is not about Flipboard’s layout algorithm (I hope to write one someday), but rather a quick detour into scraping social signals. The more social signal information we can gather about an article, the easier it is to rank articles by interestingness.

After a few hours of fiddling (and countless wasted moments not realizing that Twitter has removed basic authentication support), the two pieces of information I was looking for became accessible.

Twitter Shares

This couldn’t be simpler. Just do a curl to the Count API:

curl "http://urls.api.twitter.com/1/urls/count.json?url=web_site_url"
e.g. curl "http://urls.api.twitter.com/1/urls/count.json?url=http://edition.cnn.com/2011/WORLD/europe/02/10/index.html"

which returns

{"count":4,"url":"http://edition.cnn.com/2011/WORLD/europe/02/10/egypt.protests.london/index.html/"}
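The same call can be sketched in Python. This is only an illustration: the endpoint is the one quoted above, while the helper names `build_count_url` and `parse_count` are made up for this example. It percent-encodes the target URL before appending it (something the raw curl call skips) and pulls the integer count out of the JSON body.

```python
import json
from urllib.parse import urlencode

# Endpoint exactly as quoted in the post.
COUNT_API = "http://urls.api.twitter.com/1/urls/count.json"


def build_count_url(page_url):
    """Return the Count API request URL with the target URL percent-encoded."""
    return COUNT_API + "?" + urlencode({"url": page_url})


def parse_count(body):
    """Extract the integer share count from a Count API JSON response body."""
    return int(json.loads(body)["count"])


# The response shown above, used here as offline sample data.
sample = ('{"count":4,"url":"http://edition.cnn.com/2011/WORLD/europe/'
          '02/10/egypt.protests.london/index.html/"}')
print(parse_count(sample))
```

Encoding the target URL matters once it carries its own query string, since an unencoded "&" would otherwise be read as a parameter of the Count API call itself.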

There are some limitations, though. Most notably, it doesn’t follow redirects, and it treats different addresses of the same web page as totally separate:

# all three return different results
curl "http://urls.api.twitter.com/1/urls/count.json?url=http://www.facebook.com/BufSabres/posts/10150099820062954"

curl "http://urls.api.twitter.com/1/urls/count.json?url=http://fb.me/R97GMpug"

curl "http://urls.api.twitter.com/1/urls/count.json?url=http://www.facebook.com/BufSabres/posts/10150099820062954?somefakeparam"

We could definitely use some pre-processing before calling the Count API.
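One possible shape for that pre-processing is sketched below. It is not what I actually ran, and the rules are assumptions: lowercase the host, drop fragments, drop query parameters that have no value (such as "somefakeparam") and drop a hypothetical list of tracking prefixes. Some sites do use valueless parameters to identify content, so treat the rules as a starting point.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes to drop (an assumption).
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url):
    """Return a canonical form of url so equivalent addresses compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path or "/"
    # parse_qsl skips parameters without a value by default.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    # The fragment is always dropped.
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))


print(normalize_url(
    "http://www.facebook.com/BufSabres/posts/10150099820062954?somefakeparam"
))
```

Short links such as the fb.me one still need a network round trip to resolve the redirect before normalizing; that step is left out here.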

Facebook Likes

Then there’s the beast of Facebook Likes. Posing as an innocent function for simply declaring your interest in any entity (a web page, a person, an event, etc.), Likes is a tour de force from Facebook’s arsenal for pushing forward their Social Graph vision. When a Like button is added to any web page, that page automatically becomes a living entity in Facebook’s Open Graph repository. In other words, if there’s a Like button on your web site, Facebook is able to analyze what type of entity your site represents (a blog, a movie, a sports team, etc.) and who has expressed interest in it. Whether that ambition of categorizing the entire web and building social graphs around it is a step towards a semantic web or a potentially huge privacy exploit is for another discussion.

As of now, Facebook’s API does not have a convenient way of returning the Likes count, so as usual, one has to do it via scraping. The official documentation points to the “Facebook URL Linter” for obtaining the Social Graph info that Facebook stores on any web site. The URL Linter page shows the unique Social Graph ID for a web site. We can grab that ID and then invoke the Graph API to obtain the Likes count.

For example, the Matrix page on RottenTomatoes has a Social Graph ID of 119655798047100, and from "http://graph.facebook.com/?id=119655798047100" we know it’s been liked on Facebook 571 times.

{
   "id": "119655798047100",
   "name": "The Matrix",
   "picture": "http://profile.ak.fbcdn.net/hprofile-ak-snc4/162036_119655798047100_2294421_s.jpg",
   "link": "http://www.rottentomatoes.com/m/matrix/",
   "category": "Movie",
   "website": "http://www.rottentomatoes.com/m/matrix/",
   "description": "Average Rating: 7.4/10 \t\t\tReviews Counted: 126 \t\t\tFresh: 109 | Rotten: 17",
   "likes": 571
}
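Reading the count out of that response is a one-liner once the JSON is parsed. A minimal sketch follows, with `likes_from_graph` as an illustrative name and a trimmed copy of the response above as offline sample data; it returns None when the object has no "likes" field rather than guessing at zero.

```python
import json


def likes_from_graph(body):
    """Return the 'likes' field of a Graph API object, or None if absent."""
    value = json.loads(body).get("likes")
    return int(value) if value is not None else None


# Trimmed copy of the response shown above.
sample = """
{
   "id": "119655798047100",
   "name": "The Matrix",
   "link": "http://www.rottentomatoes.com/m/matrix/",
   "category": "Movie",
   "likes": 571
}
"""
print(likes_from_graph(sample))
```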

Except that, for reasons unknown, the Linter does not always return a Social Graph ID. It returns none for "http://developers.facebook.com/tools/lint/?url=http%3A%2F%2Fwww.oscars.com", "http://developers.facebook.com/tools/lint/?url=http%3A%2F%2Fwww.apple.com%2F" and many others.

For that reason, it is perhaps easier to bypass the Graph API altogether and directly scrape the iframe that site owners use to add the Like button.

All that’s required is a curl followed by some ugly regex filtering on the returned markup; the count sits in the “You and 611 others like this.” and “611 likes.” spans:

<div id="connect_widget_4d6b574b6fe272202491168" class="connect_widget">
<table class="connect_widget_interactive_area">
<tbody>
<tr>
<td class="connect_widget_vertical_center connect_widget_button_cell">
<div class="connect_button_slider">
<div class="connect_button_container">
<a class="connect_widget_like_button clearfix like_button_no_like">
<span class="liketext">Like</span></a></div>
</div></td>
<td class="connect_widget_vertical_center">
<div class="connect_confirmation_cell connect_confirmation_cell_no_like">
<div class="connect_widget_text_summary connect_text_wrapper">
<span class="connect_widget_facebook_favicon"> </span>
<span class="connect_widget_user_action connect_widget_text hidden_elem">You like <strong>The Matrix</strong>.<span class="unlike_span hidden_elem">
<a class="mls connect_widget_unlike_link">Unlike</a></span>
<span class="connect_widget_admin_span hidden_elem">
<a class="connect_widget_admin_option">Admin Page</a><span class="connect_widget_insights_span hidden_elem">
<a class="connect_widget_insights_link">Insights</a></span></span>
<span class="connect_widget_error_span hidden_elem">
<a class="connect_widget_error_text">Error</a></span></span>
<span class="connect_widget_summary connect_widget_text"><span class="connect_widget_connected_text hidden_elem">You and 611 others like this.</span>
<span class="connect_widget_not_connected_text">611 likes.
<a href="/campaign/landing.php?campaign_id=137675572948107&partner_id&placement=like_button&extra_2=HK" target="_blank">Sign Up</a> to see what your friends like.</span>
<span class="unlike_span hidden_elem">
<a class="mls connect_widget_unlike_link">Unlike</a></span>
<span class="connect_widget_admin_span hidden_elem">
<a class="connect_widget_admin_option">Admin Page</a><span class="connect_widget_insights_span hidden_elem">
<a class="connect_widget_insights_link">Insights</a></span></span>
<span class="connect_widget_error_span hidden_elem">
<a class="connect_widget_error_text">Error</a></span></span></div>
</div></td>
</tr>
</tbody>
</table>
<div id="connect-widget-comment-box-markup">
<!--
<div class="connect_widget_comment_box hidden_elem">
<div class="connect_widget_comment_box_upward_nub"></div>
<div class="connect_widget_comment_area">
<table class="uiGrid" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td><input type="text" class="inputtext connect_widget_comment_textinput DOMControl_placeholder" placeholder="Share it on Facebook with a comment..." value="Share it on Facebook with a comment..." title="Share it on Facebook with a comment..." /></td>
<td><label class="connect_widget_comment_button hidden_elem uiButton uiButtonConfirm" for="u033146_1"><input value="Post" type="submit" id="u033146_1" /></label></td>
</tr>
</tbody>
</table>
</div>
</div>
--></div>
</div>
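The regex filtering could look roughly like the sketch below, which keys on the "connect_widget_not_connected_text" span from the markup above. The function name is made up, and the handling of thousands separators is an assumption, since the sample only shows a three-digit count. Scraping like this is brittle by nature and breaks whenever the widget markup changes.

```python
import re

# Matches e.g. '<span class="connect_widget_not_connected_text">611 likes.'
LIKES_RE = re.compile(
    r'connect_widget_not_connected_text">\s*([\d,]+)\s+likes?\b'
)


def likes_from_widget(markup):
    """Return the like count found in the widget markup, or None if missing."""
    match = LIKES_RE.search(markup)
    if match is None:
        return None
    # Assumption: larger counts may be written with comma separators.
    return int(match.group(1).replace(",", ""))


# Two lines taken from the markup above, used as offline sample data.
sample = (
    '<span class="connect_widget_summary connect_widget_text">'
    '<span class="connect_widget_connected_text hidden_elem">'
    'You and 611 others like this.</span>\n'
    '<span class="connect_widget_not_connected_text">611 likes.\n'
)
print(likes_from_widget(sample))
```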

So here we are, able to extract Twitter Shares and Facebook Likes for any web site. To complete the test, I rounded up a handful of random web sites from Delicious.com and re-ranked them based on social strength. The result was quite clear in terms of showing what kind of web sites might be more appealing to the masses.

Reranking with Social Strengths

Random sites from Delicious, with their social strength scores extracted and normalized

Before and After

A bunch of latest sites from Delicious

Same group of sites, ranked by their Social Strength
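For the curious, the re-ranking step can be sketched as follows. The exact formula is an assumption made for illustration: each signal is min-max normalized to the 0..1 range and the two are summed with equal weight. The site names and counts are made-up sample data, not the Delicious results.

```python
def min_max(values):
    """Scale values to the 0..1 range; all-equal input maps to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_by_social_strength(sites):
    """Rank (url, tweets, likes) tuples by summed normalized signals.

    Returns a list of (url, score) pairs, strongest first.
    """
    tweets = min_max([site[1] for site in sites])
    likes = min_max([site[2] for site in sites])
    scored = [(site[0], t + l) for site, t, l in zip(sites, tweets, likes)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up sample data: (url, tweet count, like count).
sites = [
    ("a.example", 10, 5),
    ("b.example", 200, 50),
    ("c.example", 50, 100),
]
for url, score in rank_by_social_strength(sites):
    print(url, round(score, 3))
```

Normalizing first keeps one signal from drowning out the other when the raw counts live on very different scales.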

Previous:
Part 1: Site Scraping

2 responses on “Replicating Flipboard Part II – Social Signals”

  1. The problem with simplistic ranking of web sites by social signals, as the observant eye can see from the screenshots above, is that top-level web sites (e.g. “engadget.com”) are always going to have stronger signals than, say, a popular article that was only posted the night before.

    Thus, efforts dealing with social signals need to monitor how a resource’s social signals change over time. Resources with sudden spikes in Facebook Likes should in fact be treated with higher priority.
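A rough way to act on that is to rank by how fast the count is growing instead of by the raw total. The sketch below assumes we keep periodic (timestamp, like count) samples per resource; the sampling scheme and the function name are hypothetical.

```python
def growth_per_hour(samples):
    """Likes gained per hour between the two most recent samples.

    samples: list of (unix_seconds, like_count) tuples sorted by time.
    Returns 0.0 when there is not enough data to compute a rate.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[-2], samples[-1]
    hours = (t1 - t0) / 3600.0
    if hours <= 0:
        return 0.0
    return (c1 - c0) / hours


# Made-up sample data: an established site versus a fresh article.
established = [(0, 50000), (7200, 50020)]
fresh_article = [(0, 100), (7200, 300)]
print(growth_per_hour(established), growth_per_hour(fresh_article))
```

With these numbers the fresh article gains likes ten times faster than the established site, even though its total is far smaller.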
