Is XML the best we can do?
I’ve been working on a web app lately that involves pulling in XML data via an API through PHP. I’ve come to the conclusion that XML is a pig to work with. First of all, the format sucks. It’s neither fish nor fowl – not particularly human-readable and also difficult to parse with PHP. The whole concept of attributes is poorly thought out – whether a datum deserves its own tag or should be an attribute of a tag seems to be an entirely arbitrary decision. There are a number of ways to tackle the parsing job in PHP, including SimpleXML, XMLReader and some third-party classes. I ended up using the built-in DOMDocument class as the best of a bad lot.
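To make the attribute-versus-tag complaint concrete, here is a minimal sketch. The payload, the element names and the `fieldValue` helper are all made up for illustration and are not taken from the API in question. It shows that the same datum needs two different lookups in DOMDocument depending on how the producer chose to encode it.

```php
<?php
// Hypothetical payload: the price appears once as an attribute
// and once as a child element.
$xml = <<<XML
<items>
  <item id="1" price="9.99"><name>Widget</name></item>
  <item id="2"><name>Gadget</name><price>4.50</price></item>
</items>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Hypothetical helper: try the attribute first, then fall back to a
// direct child element with the same name.
function fieldValue(DOMElement $el, string $name): ?string
{
    if ($el->hasAttribute($name)) {
        return $el->getAttribute($name);
    }
    foreach ($el->childNodes as $child) {
        if ($child instanceof DOMElement && $child->tagName === $name) {
            return $child->textContent;
        }
    }
    return null;
}

// Collect the price of every item, keyed by its id attribute.
$prices = [];
foreach ($doc->getElementsByTagName('item') as $item) {
    $prices[$item->getAttribute('id')] = fieldValue($item, 'price');
}

print_r($prices);
```

A consumer that cannot predict which encoding it will get ends up writing this kind of fallback for every field.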
Then the fun started… DOMDocument seems to be very computationally expensive. My task involved reading in a small-ish number of XML pages via the API, stripping and sanitizing relevant data, and storing it in a database. This task would be performed daily with a cron job, to keep the database up-to-date. Despite optimising my code in every way I could think of, this simple job was taking an age to run. I’ve not profiled it yet, but I strongly suspect the DOM traversal is the bottleneck.
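Short of running a proper profiler, a rough way to test that suspicion is to time the parsing step on its own. The sketch below is only an illustration: the synthetic document and the `timed` helper are assumptions, and the numbers will depend on the real payload. It counts the same elements once with DOMDocument and once with the streaming XMLReader, so the two timings can be compared.

```php
<?php
// Build a synthetic document; the element names are made up.
$rows = '';
for ($i = 0; $i < 2000; $i++) {
    $rows .= "<item id=\"$i\"><name>Item $i</name></item>";
}
$xml = "<items>$rows</items>";

// Hypothetical helper: run a callable, return [result, elapsed seconds].
function timed(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

// Candidate 1: load the whole tree with DOMDocument, then traverse it.
[$domCount, $domSeconds] = timed(function () use ($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $count = 0;
    foreach ($doc->getElementsByTagName('item') as $item) {
        $count++;
    }
    return $count;
});

// Candidate 2: stream the same document with XMLReader.
[$readerCount, $readerSeconds] = timed(function () use ($xml) {
    $reader = new XMLReader();
    $reader->XML($xml);
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            $count++;
        }
    }
    $reader->close();
    return $count;
});

printf("DOMDocument: %d items in %.4fs\n", $domCount, $domSeconds);
printf("XMLReader:   %d items in %.4fs\n", $readerCount, $readerSeconds);
```

If the parsing times turn out to be small next to the total run time, the bottleneck is more likely in the HTTP requests or the database writes than in the DOM.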
I had more problems with the API, to do with UTF-8 encoding and URL encoding, and debugging wasn’t made any simpler by my IDE’s (Eclipse PDT with Xdebug) inability to see inside the DOMDocument. I ended up having to use the class’s saveXML method just to look inside the data so I could debug the app.
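For anyone stuck at the same point, the saveXML workaround looks roughly like this. It is a sketch with a made-up document. Passing a node to `saveXML` serialises only that subtree, which keeps the debug output manageable.

```php
<?php
$doc = new DOMDocument();
// Both flags are set before loading so the output can be re-indented.
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML('<items><item id="1"><name>Widget</name></item></items>');

// Serialise a single node instead of the whole document.
$first = $doc->getElementsByTagName('item')->item(0);
$fragment = $doc->saveXML($first);

echo $fragment, "\n";
```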
Platform and language independence are hugely important for a universal data format. If you’re serving data via the web, you never know if your client will be a Linux host querying with Ruby or an AS/400 using COBOL to ask for data. Remembering the horror of interfacing with CORBA technologies in the past, I can understand why people find the simplicity of XML attractive. Still, I wonder why we can’t use a relational database querying system instead. In fact, this model is already in use, with widespread adoption by the serial innovators at Yahoo. Their YQL query language presents a wide range of common web data sources ready to be queried.
What about human readability? Not a problem really. RSS readers today assemble a raw XML feed into a nice list of magazine articles, because they know the XML structure of RSS feeds. We already have the technology to abstract a relational query language into a GUI tool that makes sense to humans, giving us the ability to construct a personalised superfeed aggregated from multiple sources, with filters to remove what we don’t want.
So I’m calling it – XML has served its purpose. We need a new universal data model, and it should be relational (and yes – I’m aware of the NoSQL movement).