Monday, February 08, 2010

Linking data in XML and HATEOAS

To continue in that vein, we start with Tim Berners-Lee. He spoke at TED about linked data as the next step for the web. He's thought deeply about the formats and protocols of the web, and I think he's right about the overall benefits of linking data. If you can link data together, you can easily join different datasets simply by traversing the links.

However, for a programmer implementing a web app, there's no immediate benefit to linking data. It doesn't show up in web browsers, usable semantic browsers are pretty much non-existent (maybe Disco), and none of the web frameworks make it easy to link data. The fact that none of the maintainers of Rails, Django, etc. have added support is indicative of the high cost-to-benefit ratio of doing so.

Taking a look at the Linked Data homepage, I felt the barrier to simply linking data together was pretty high and heavyweight. You'd have to learn not only RDF, but also OWL and SPARQL. And a simple search for RDF projects on GitHub reveals only one project (reddy) with any attention from other devs, at 3 forks and 39 watchers. It seems overcomplicated to have a separate RDF file linking data together.

While having ontologies is great, I don't think it's low-hanging fruit. I was reading about REST when I tripped over a concept called HATEOAS: hypermedia as the engine of application state. It's a design constraint of REST that often gets overlooked.

Given that idea, here's the punchline: why don't we link data directly from within the XML data? Instead of messing around with RDF, why can't we link in XML? Here's part of the data returned by the Sunlight Foundation's API when I query for a single legislator.



Notice that some of the fields are pointers to URLs, but twitter_id holds only the Twitter ID, and state holds only "HI" for Hawaii. Why not point the twitter attribute at the URL of the Twitter API? Instead of just naming the state, why not link it to geonames.org? It might look something like this:



This way, you can traverse from XML document to XML document. If you need to look up further information about the attribute "state" with value "HI", you can do so by following the link to http://ws.geonames.org/search?name_equals=hawaii&country=US. In your application, you can traverse it as if it were composed data. Now, when you execute @legislator.state, you don't just get back "HI"; you get back another set of data, with the attributes geonames returns for the state of Hawaii.
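
As a rough sketch of what that traversal could look like in a client, here's a small Python wrapper. The element names, the href attribute convention, the example.org URLs, and the fetch function are all my own assumptions for illustration; neither Sunlight nor geonames returns XML shaped like this.

```python
import xml.etree.ElementTree as ET


class LinkedNode:
    """Wraps an XML element; child elements with an href are followed lazily."""

    def __init__(self, element, fetch):
        self._element = element
        self._fetch = fetch  # callable: URL -> XML string (hypothetical)

    def __getattr__(self, name):
        # Only called for names that are not regular instance attributes.
        child = self._element.find(name)
        if child is None:
            raise AttributeError(name)
        href = child.get("href")
        if href is None:
            return child.text  # plain value, nothing to traverse
        # Linked value: fetch the target document and wrap its root element.
        return LinkedNode(ET.fromstring(self._fetch(href)), self._fetch)


# Made-up documents standing in for two different web services.
DOCUMENTS = {
    "http://example.org/legislators/1": (
        "<legislator>"
        "<lastname>Doe</lastname>"
        '<state href="http://example.org/places/hawaii">HI</state>'
        "</legislator>"
    ),
    "http://example.org/places/hawaii": (
        "<place><name>Hawaii</name><countryCode>US</countryCode></place>"
    ),
}


def fetch(url):
    # A real client would do an HTTP GET here; this just looks up a string.
    return DOCUMENTS[url]


legislator = LinkedNode(ET.fromstring(fetch("http://example.org/legislators/1")), fetch)
print(legislator.lastname)    # plain text field
print(legislator.state.name)  # follows the link, then reads a field there
```

The wrapper never needs to know which service a link points at. It just keeps following hrefs, which is the whole appeal.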

You don't have to link only to other web services; you can also link back to your own API. Instead of having just the district number our legislator works in, Sunlight should link back to its own API for districts. When you do this, you push the burden of maintaining application state to the client. The state of the client is merely the XML document it last requested.

More importantly, if the methods in your API need to be called in a specific order, you no longer need to spell that out in your API documentation. The only methods a client is allowed to call are the ones whose href links appear in the XML document. It's best described by an example quoted by subbu:

There are three pages in a UI. The first page has a link to go to the second page. The second page has a link to go to the previous page as well as the third page. The third has a link to the second page and another link to the first page.

A client starts from the first page, and then through the link on that page, goes to the second page. The fact that this page has one link to the first page and another to the third page implies that the current state of the application (i.e. the interactions) is that "the client is viewing the second page". That is what it means by hypermedia as the engine of application state. It does not necessarily mean serializing application state, such as "<page>2</page>" into representations.
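
The three-page example can be sketched in a few lines of Python. The page XML, the URLs, and the rel names below are invented for illustration; the point is only that the client learns its allowable next steps from the links in the document it's currently holding.

```python
import xml.etree.ElementTree as ET

# Hypothetical representations of the three pages, keyed by URL.
PAGES = {
    "/pages/1": '<page><link rel="next" href="/pages/2"/></page>',
    "/pages/2": '<page><link rel="prev" href="/pages/1"/>'
                '<link rel="next" href="/pages/3"/></page>',
    "/pages/3": '<page><link rel="prev" href="/pages/2"/>'
                '<link rel="first" href="/pages/1"/></page>',
}


def links(document):
    """Return the transitions a document offers as {rel: href}."""
    root = ET.fromstring(document)
    return {link.get("rel"): link.get("href") for link in root.findall("link")}


def follow(document, rel):
    """Move to the next document by following one of the current links."""
    available = links(document)
    if rel not in available:
        raise ValueError(f"no '{rel}' link here; choices: {sorted(available)}")
    return PAGES[available[rel]]  # stands in for an HTTP GET


current = PAGES["/pages/1"]        # the client starts on the first page
current = follow(current, "next")  # now it is holding the second page
print(links(current))              # the only moves available from here
```

Nothing in the client hard-codes the page order. Asking the first page for a "prev" link simply fails, because that document doesn't offer one.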

Right now, our REST APIs are returning XML with just IDs. It's up to you to figure out what they're pointing to. It's as if we had webpages that just said "next page" and expected you to know which URL to go to and type it into the address bar of your browser.

Obviously, I'm not the first to think about this. Tim Bray has talked about linking plain ole XML, and others have mentioned XLink for XML, but nowhere in my searches did there seem to be any explicit connection with Tim Berners-Lee's Linked Data or with finding a consistent way to access REST APIs. None of the XML data returned by the REST APIs I looked at had links in it, and none of the API wrapper libraries I've used tried to traverse to a different REST API URI using links in the XML documents.

It seems like a really simple solution to linking data, and it's way overlooked. When it all came together for me, it seemed like something people would be excited about, but after searching Google, Google Trends, and Google AdSense keywords, no one seems to be talking about it.

Anyone know why XLink was abandoned, or why linked data doesn't follow this concept?

Posted via email from The Web and all that Jazz

5 comments:

  1. Interesting angle. See my 2c re HATEOS and linked data [1] - any comments?

    Just for the record: Linked Data *is* RESTful read-only Web Service style of data access.

    Cheers,
    Michael



    [1] http://webofdata.wordpress.com/2009/12/15/hateos-revisited-rdfa/

  2. Here is the kicker in your post above re. Linked Data: "..It doesn't show up in web browsers.."

    Linked Data does show up in your browser (LINKs), it always has, the difference is that when you de-reference the LINKs you get granular structured data based on a standard data model: Entity-Attribute-Value Graph.

    Linked Data is fundamentally an unobtrusive tweak that makes the Web a bona fide distributed database comprised of network oriented records (resources, entities, data items) endowed with generic HTTP URIs (for identifying records, their attributes, and, optionally, attribute values).

  3. Vasiliy Faronov, 5:54 AM

    You say:

    "Why not point twitter attribute at the URL of the twitter API? Instead of stating the state, why not link it to geonames.org?"

    Suppose your application, written originally to work with the Sunlight Foundation's API, discovers this link to geonames.org and follows it. In response, it receives an XML document describing something cool about a geographical place. What next? How does your app make any sense of that document without knowing its schema beforehand? And of what use is this link if you can dereference it but not understand the results?

    This is (one) problem that the Semantic Web stack tries to solve. With RDF, there's a good chance that your app can make some use of data from a random site even if it wasn't originally designed to work with that site specifically. The following two traits help with this.

    1. RDF uses URIs both for resources and the properties describing them (which are actually resources themselves). Basically, people agree on which vocabularies to use for things, and then stick to those vocabularies to ensure interoperability. So when you stumble upon the URI "http://purl.org/dc/terms/title", you know that it denotes a document title, wherever you find it. You can do more or less the same thing with Namespaces in XML, but it's not used too widely, it doesn't provide an easy built-in way for referencing objects (resources), and I don't know of any equivalent for JSON.
    2. With ontologies, you can do logical inference on data, obtaining new data that hasn't been spelled out in your sources. This may include, for example, statements that "align" one vocabulary to another, making it possible for your app to "reduce" data to terms it already knows about.

    This is what makes RDF-based Linked Data actually useful: when you follow a link, there's a decent chance that you'll understand at least something from it.

  4. Hrm, thanks, that explains quite a bit. Looks like I have to read more about RDF.

    So how come the XML returned from a REST request can't use URI for describing the properties?

  5. XLink was introduced in 1998/9 to provide linking from XML. This was the NEW web of the time. People were thinking in terms of an extended HTML, in which an xlink:href attribute could be applied to any element. This was to extend/replace the href that existed at the time in HTML. The value of xlink:href was to be an XPointer expression that enabled linking to any part of a target web page instead of just to particular anchors. So linking from plain XML has been around a long time. It was part of the design of GML (Geography Markup Language) from the very beginning.
