Reuters News & the Semantic Web

As time goes on, I’m becoming more and more confident of the potential of the semantic web. Typically, I tend to be a ‘second mover’ when it comes to new technologies or trends: I’ll usually let someone else be the early adopter, or at least remain skeptical until a quorum has emerged. But with the semantic web, and Sir Tim Berners-Lee’s open, linked data vision, I was sold instantly; it just seemed to make sense.

As a result, my recent thesis focused on certain semantic web concepts in relation to news headline development, specifically looking at the ‘best practice’ elements which form an objective, relevant, descriptive and cognitively cost-effective headline, to see if a best practice framework, or methodology, could be derived from user preference. Initial results appeared to confirm my hypotheses, but along the way, I began to explore an interesting parallel with linked data standards.

Typically, news headlines tend to follow the Subject-Verb-Object (S-V-O) word order described in linguistic typology. Taking a collection of Reuters headlines, for example, we can see that same structure present:

Islamic State battling Kurdish forces in Northeast Syria
Pakistan paramilitary raids HQ of major party MQM in volatile Karachi
Obama announces changes for student loan repayment
PayPal sets up Israeli security center, buys CyActive

Each of these headlines begins with an S-V-O triple, with a little more information appended to the end. Prior research indicates that this structure is more immediately ‘obvious’ to human psychology and easier to process cognitively. Interestingly, the same structure is used in the RDF specification, a data model influenced by ideas from knowledge representation. Within the RDF world, information is presented as a Subject-Predicate-Object triple, which mirrors the linguistic Subject-Verb-Object triple. For example, if I take one of the headlines above (the PayPal entry), run it through an entity extraction tool such as Calais Viewer, and parse the resulting RDF using the W3C RDF Validation Service, I end up with a set of triples that look very similar to the linguistic subject-verb-object triple:

PayPal sets up Israeli security center, buys CyActive
Subject: http://d.opencalais.com/er/company/ralg-tr1r/58dfdbf1-c0c8-3859-ad42-c9c6de8ca6e7
Predicate: http://s.opencalais.com/1/pred/ticker
Object: “EBAYP”

Roughly translated from the URIs, this is telling us “PayPal Inc’s ticker symbol is EBAYP”. So within the news headline triple we can see embedded RDF triples based upon the particular entity, and in many cases RDF triples can be directly transposed into headlines themselves (although in this case the RDF triple is probably not exactly newsworthy!).
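To make the structure concrete, here is a minimal Python sketch that holds the PayPal triple as a plain tuple and serialises it as a single N-Triples line. The `Triple` type and the `to_ntriples` helper are illustrative names of my own, not part of Calais or any RDF library, and the serialisation only covers the simple case shown here (URI subject, URI predicate, plain string literal as the object).

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement. Illustrative type: object is treated as a plain literal."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialise a triple as one N-Triples line (URI, URI, string literal)."""
    # Escape backslashes and double quotes inside the literal value.
    literal = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{literal}" .'


# The triple extracted for the PayPal headline.
paypal_ticker = Triple(
    subject="http://d.opencalais.com/er/company/ralg-tr1r/58dfdbf1-c0c8-3859-ad42-c9c6de8ca6e7",
    predicate="http://s.opencalais.com/1/pred/ticker",
    obj="EBAYP",
)

print(to_ntriples(paypal_ticker))
```

Keeping the three parts as named fields makes the parallel with subject, verb and object easy to see; a real application would more likely lean on an RDF library than on hand-rolled serialisation.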

I’m not breaking any new ground here, simply restating what is already known, but thinking conceptually about how we access triples helps us understand how we can derive value from that information. In the same way that we use language to retrieve information from an S-V-O triple in our interactions with people (e.g. “Chris works at Thomson Reuters”), we can access similar information from RDF triples using a query language such as SPARQL. And that’s really the idea behind the semantic web: to promote a common framework that allows us to share data, making the connection between something ‘real’ like our Reuters news editorial function and the work our Big, Open, Linked Data (BOLD) team is undertaking around linked data.
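As a rough illustration of what querying triples means, the sketch below matches a pattern against a small in-memory list of triples, with `None` playing the role that a variable plays in a SPARQL query. It is a plain-Python stand-in rather than SPARQL itself; the `match` function, the `ex:` identifiers and the sample data are all made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data; the "ex:" names are placeholders, not real vocabulary terms.
TRIPLES = [
    ("ex:Chris", "ex:worksAt", "ex:ThomsonReuters"),
    ("ex:PayPal", "ex:ticker", "EBAYP"),
    ("ex:PayPal", "ex:acquired", "ex:CyActive"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly equivalent in spirit to:
#   SELECT ?employer WHERE { ex:Chris ex:worksAt ?employer }
employers = [o for _, _, o in match(TRIPLES, s="ex:Chris", p="ex:worksAt")]
print(employers)
```

Asking “where does Chris work?” becomes a matter of fixing the subject and predicate and leaving the object open, which is the same move we make when we ask the question in plain language.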