Low-memory, Human-readable, Injection-resistant, Data-interchange Format (LHIDF, pronounced “Lidif”)

I’m working on some mobile phone applications that have to accept data from sources outside of my application, although I can choose the data-interchange format the developers of those data sources will have to use.  This raises two big concerns when choosing a format:

  1. Resources are few, so how do I limit the cost of parsing the incoming data?
  2. How do I ensure those providing the data have properly validated and escaped the output coming to my application so as to preserve the intended message structure?  Anyone familiar with SQL-injection, email-injection, or XSS attacks will understand the general concern described here.

There are several data-interchange formats I could use, including XML and JSON.  However, XML parsers can suck up large amounts of resources (and I can’t blame them; parsing XML can get pretty tricky), as can JSON parsers.  Additionally, neither JSON nor XML does much to ensure that the intended structure of the message is preserved when the sender fails to properly validate or escape the data used to generate the message.

To address these concerns, I’ve developed a new data-interchange format, which I’ve called the Low-memory, Human-readable, Injection-resistant, Data-interchange Format (LHIDF, pronounced “Lidif”).  Is the name a ridiculously long, forgettable acronym?  You betcha!

Anyways, let’s first look at a sample of LHIDF:

{leaders:256}
    {leader:256}
        {username:256}bill178{username:256}
        {rank:256}78{rank:256}
        {time:256}89.56{time:256}
    {leader:256}
    {leader:256}
        {username:256}thomas890{username:256}
        {rank:256}45{rank:256}
        {time:256}109.52{time:256}
    {leader:256}
{leaders:256}

Hopefully you agree that it is human readable.  I wouldn’t call it beautiful, but I can easily read the data and make edits as needed.

Every node follows the simple convention below:
{nodeName:messageID}content{nodeName:messageID}

Additionally, there are no attributes like those found in XML, there must be one top-level node, and each node can contain either other nodes -OR- a simple value (unlike XML, nodes can’t contain a mixture of other nodes and simple values).
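
To make the convention concrete from the sending side, here is a minimal sketch of an encoder. It is not part of the parser in this post: the function name lhidfEncode and the list-of-pairs input shape are my own assumptions for illustration, and it does no validation of the values it is given.

```php
<?php
/**
 * Illustrative encoder (hypothetical, not part of the original parser).
 * $content is either a simple value or a list of [name, content] pairs,
 * which keeps repeated node names such as "leader" possible.
 */
function lhidfEncode(string $name, $content, string $messageID, int $depth = 0): string {
    $indent = str_repeat('    ', $depth);
    // Opening and closing tags are identical: {nodeName:messageID}
    $tag = '{' . $name . ':' . $messageID . '}';

    if (is_array($content)) {
        // Container node: child nodes only, never mixed with a simple value.
        $lines = [$indent . $tag];
        foreach ($content as [$childName, $childContent]) {
            $lines[] = lhidfEncode($childName, $childContent, $messageID, $depth + 1);
        }
        $lines[] = $indent . $tag;
        return implode("\n", $lines);
    }

    // Leaf node: the simple value sits between two identical tags.
    return $indent . $tag . $content . $tag;
}

echo lhidfEncode('leaders', [
    ['leader', [
        ['username', 'bill178'],
        ['rank', '78'],
        ['time', '89.56'],
    ]],
], '256'), "\n";
```

Run as written, this prints the first leader of the sample message, with the same message ID on every tag.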

Of note, the message ID is the same for all of the nodes, and we’ll get to its use in a bit.  The important point is that the format is really simple, and it’s this simplicity that facilitates the “low-memory” component of its name.  Because of the simple structure, a PHP class such as the one below is all that’s needed to parse an LHIDF response:

<?php
/**
 * Simple class to parse LHIDF (pronounced "Lidif") data-interchange format.
 * The Low-memory, Human-readable, Injection-resistant, Data-interchange Format.
 *
 * @author Adam Richardson of Envision Internet Consulting, LLC
 * @license http://www.opensource.org/licenses/mit-license.php MIT License
 */
class LHIDFNode {
    public $name;
    public $value;
    public static $messageID;
    /**
     * Create a new node object.
     * @param string $name
     * @param string $value
     */
    function __construct($name, $value) {
        $this->name = $name;
        $this->value = trim($value);
    }
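    private static function setMessageID($data, $rootNodeName) {
        // Index the explode() result directly; next() needs a variable passed by reference.
        $parts = explode('{'.$rootNodeName.':', $data, 2);
        if (count($parts) < 2) throw new Exception("The root node $rootNodeName could not be found.");
        $positionClosingCurly = strpos($parts[1], '}');
        if ($positionClosingCurly === false) throw new Exception("The root node $rootNodeName has no message ID.");
        self::$messageID = substr($parts[1], 0, $positionClosingCurly);
    }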
    /**
     * Create the root level node and store the messageID.
     * @param string $data
     * @param string $name
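     * @return LHIDFNode
     */
    public static function getRootNode($data, $name) {
        self::setMessageID($data, $name);
        // explode() returns a temporary, so take element 1 (the root's content) by index.
        $parts = explode('{'.$name.':'.self::$messageID.'}', $data);
        if (count($parts) < 3) throw new Exception("The root node $name is not closed.");
        return new LHIDFNode($key = $name, $value = $parts[1]);
    }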
    /**
     * Retrieve the matching nodes found within the current LHIDFNode object.
     * @param string $name
     * @return array An array of LHIDFNode objects.
     */
    public function getChildren($name) {
        $tempArray = explode('{'.$name.':'.self::$messageID.'}', $this->value);
 
        if (count($tempArray) < 3) throw new Exception("The node $name could not be found.");
 
        $children = array();
 
        for ($i = 1; $i < (count($tempArray) -1); $i++) {
            if (strlen(trim($tempArray[$i])) < 1) {
                continue;
            }
            else {
                $children[] = new LHIDFNode($key = $name, $value = $tempArray[$i]);
            }
        }
 
        return $children;
    }
    /**
     * Retrieve the matching child node of the current node (a missing node or multiple matching nodes will throw an exception).
     * @param string $name
     * @return LHIDFNode
     */
    public function getChild($name) {
        $tempArray = explode('{'.$name.':'.self::$messageID.'}', $this->value);
 
        if (count($tempArray) != 3) throw new Exception("The node $name was not found exactly once.");
 
        return new LHIDFNode($key = $name, $value = $tempArray[1]);
    }
    /**
     * Retrieve the value of the current node.
     * @return string
     */
    public function value() {
        return $this->value;
    }
}
 
?>

You’ll notice the class is pretty small, and it makes it easy to parse the contents of an LHIDF message.  That said, this is just a simple example; you could take a more SAXy approach, as I’m doing in Objective-C, and do much better in terms of memory.  To parse the example LHIDF, the code below would work:

<?php
 
class Leader {
	public $username;
	public $rank;
	public $time;
}
 
// $data is assumed to hold the raw LHIDF message shown earlier.
$rootNode = LHIDFNode::getRootNode($data, $name = 'leaders');
$leaderNodes = $rootNode->getChildren($name = 'leader');
$leaderObjects = array();
 
foreach ($leaderNodes as $leader){
	$newLeaderObj = new Leader();
	$newLeaderObj->username = $leader->getChild($name = 'username')->value();
	$newLeaderObj->rank = $leader->getChild($name = 'rank')->value();
	$newLeaderObj->time = $leader->getChild($name = 'time')->value();
	$leaderObjects[] = $newLeaderObj;
}
 
?>

So, what about the funny message IDs?  What do they do?  Well, the reason injection attacks work is that the attacker knows some of the general qualities of the intended structure of the message.

But, what if we wrote a data-interchange format and used it only once?  Attackers wouldn’t be able to purposefully craft messages that alter the intended structure of a message if they didn’t know the message structure in the first place.  That said, changing the entire format every time you need to send a new message would prove very wasteful.  But, do you really need to change the entire format to neutralize injection attacks?

No, as it turns out you can just append a random message ID to the nodes you’re expecting, and now injection attacks are thwarted while the parsing scheme stays simple.  For instance, let’s say an attacker wanted to try and inject another leader node through the first username field.  They might try injecting the following value:

                       user_name1{username:111}
        {rank:111}1{rank:111}
        {time:111}1{time:111}
    {leader:111}
    {leader:111}
        {username:111}extra_username

A new message ID is randomly generated for every message, so the attacker would now have to successfully guess the message ID of a particular message to perform the attack. Because the message ID of the injection attempt above does not match the nodes of the current message, the entire injected value would be parsed as one long, ugly username. No injection!
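
Here is a small, self-contained sketch of that behaviour. The helpers lhidfValues and buildMessage are hypothetical and exist only for this demo; lhidfValues borrows the explode() idea from the class above rather than reusing the class itself, and the message is built by naive concatenation on purpose, to imitate a sender that does no escaping.

```php
<?php
// Hypothetical helper: returns the trimmed contents of every node called
// $name that carries the given message ID.
function lhidfValues(string $data, string $name, string $messageID): array {
    $parts = explode('{' . $name . ':' . $messageID . '}', $data);
    $values = [];
    // Odd-numbered pieces lie between an opening tag and its closing twin.
    for ($i = 1; $i < count($parts) - 1; $i += 2) {
        $values[] = trim($parts[$i]);
    }
    return $values;
}

// Hypothetical sender that drops the username into the message unescaped.
// The real message ID here is 256.
function buildMessage(string $username): string {
    return '{leaders:256}{leader:256}'
         . '{username:256}' . $username . '{username:256}'
         . '{rank:256}78{rank:256}{time:256}89.56{time:256}'
         . '{leader:256}{leaders:256}';
}

// The attacker's value from above, which guesses 111 as the message ID.
$injected = "user_name1{username:111}\n{rank:111}1{rank:111}\n{time:111}1{time:111}\n"
          . "{leader:111}\n{leader:111}\n{username:111}extra_username";

// Wrong guess: the 111 tags are inert text inside a single username.
$wrongGuess = lhidfValues(buildMessage($injected), 'leader', '256');
echo count($wrongGuess), "\n"; // prints 1

// Right guess: the same payload with the real ID does add a second leader.
$rightGuess = lhidfValues(buildMessage(str_replace('111', '256', $injected)), 'leader', '256');
echo count($rightGuess), "\n"; // prints 2
```

The second case is the reason the ID has to be unpredictable: the protection rests entirely on the attacker not knowing it.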

Now, you might be wondering why the message ID is so small. Eventually, an attacker could guess a 3-digit number. However, LHIDF doesn’t care how big the message ID is; it just uses the root node to determine its value and then parses accordingly, so if you want more security, you can have it by using a larger number. You just have to balance the security against the bandwidth requirements of your particular application.
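
As a rough sketch of how a sender might pick a longer ID, the snippet below builds one from random decimal digits. The function name newMessageID and the default of 12 digits are assumptions of mine, not part of the format; random_int() is PHP’s built-in cryptographically secure generator (PHP 7 and later).

```php
<?php
// Hypothetical helper: returns a message ID made of $digits random decimal
// digits, so a blind guess succeeds with probability 1 in 10^$digits.
function newMessageID(int $digits = 12): string {
    $id = '';
    for ($i = 0; $i < $digits; $i++) {
        // random_int() draws from the operating system's CSPRNG.
        $id .= random_int(0, 9);
    }
    return $id;
}

$id = newMessageID();
// Every tag in the message repeats the ID, so each extra digit costs one
// byte per tag, i.e. two bytes per node.
echo '{leaders:' . $id . '}', "\n";
```

Digits alone can never contain a closing curly brace, so an ID built this way cannot be mistaken for the end of a tag.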

Finally, just to be clear, this isn’t a form of encryption or message signing.  There are already tried-and-true options out there for those needs.  This data-interchange format merely protects the intended structure of the message, so if the message was intended to encode data for 2 leaders, the person decoding the message would also receive data for 2 and only 2 leaders.  If you go on to use the data from the message in a SQL query, as output on an XHTML page, or in an email, you’ll still have to properly validate and escape the data.  That said, I believe one could extend the above technologies to be injection resistant, too, such as augmenting XHTML so all tags and attributes include a randomly generated runtime message ID, thereby rendering injection attacks futile.