Virginia Decoded is Live

The very first State Decoded site went into public beta this morning: Virginia Decoded. This is the site that snowballed into the State Decoded project, and it proved to be a good testing ground for the software and, indeed, for the concept. Virginia Decoded isn’t so much done as it’s done enough. There’s so very much more to be done, but the site has reached a point where it will benefit strongly from having actual people use it, and where actual people will—hopefully—benefit strongly from using it.

Virginia provides its code as SGML (which it, in turn, receives from LexisNexis), making it relatively easy to extract the laws and store them in The State Decoded. Many states don’t provide bulk downloads at all, so extracting their laws requires the laborious work of screen-scraping. Virginia is well ahead in that regard.
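
To give a sense of what that extraction involves, here is a minimal sketch, assuming SGML tidy enough to be parsed leniently as XML. The element names (section, catchline, text) and the filename are hypothetical; every state’s markup differs, and Virginia’s actual schema is more involved.

```python
# A minimal sketch of pulling laws out of bulk structured data. The element
# names and filename are hypothetical; real state markup is messier.
from lxml import etree

def extract_sections(path):
    parser = etree.XMLParser(recover=True)  # tolerate minor malformations
    tree = etree.parse(path, parser)
    for section in tree.iter('section'):
        yield {
            'number': section.get('number'),
            'catchline': section.findtext('catchline'),
            'text': section.findtext('text'),
        }

for law in extract_sections('code-of-virginia.sgml'):
    print(law['number'], law['catchline'])
```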

More helpful than anything else was the Virginia Code Commission. That’s the official body that oversees the laws of the commonwealth, and its members proved enormously generous in testing the site. They provided invaluable information about how bills really become laws (it’s not as simple as you might think), and they obsess about the details of the code the same way a great programmer obsesses about the details of…er…code. Without their input, Virginia Decoded would be a very convincing-looking but ultimately inaccurate website.

The process by which this website was put together is one that will be replicated in other states. We’ll find partners in states throughout the nation and, whenever possible, work with the state agency that oversees the state’s laws to craft a site that is the best fit for that state and its code. It will be a laborious process, but that’s what it takes to create a good, long-lasting network of state-level open government websites.

The Surprisingly Interesting History of the Virginia Code

Virginia Lawyer published an article in its February 2000 issue, by the UVA Law School’s Kent C. Olson, providing a fascinating history of the Code of Virginia. “The Path of Virginia Codification” explains how the Code of 1819 gave way to the Code of 1849, which was in turn replaced by the Code of 1887, then the Code of 1919, and finally the Code of 1950. Each new iteration reflected not just the changes made to the law by intervening General Assembly sessions, but also a gradual rethinking of what a code should look like, how it should be organized, what purpose it should serve, and how it should be assembled.

Looking back, it seems ludicrous that the official collection of state laws would be updated only once a generation. All of the changes made by the biennial meetings of the legislature in the interim needed to be tracked, and a series of slim supplementary volumes had to be consulted to determine how the current law varied, if at all, from the code as last collected and printed.

The conclusion that I draw after reading this is that it’s time for printed legal volumes to disappear, at least as the primary vehicle for disseminating the text of the law. In most (all?) states, the printed edition is the canonical edition of the legal code, and everything else is basically for entertainment purposes only. Many states put a disclaimer to that effect on their code’s website. Illinois, for example:

This site contains provisions of the Illinois Compiled Statutes from databases that were created for the use of the members and staff of the Illinois General Assembly. The provisions have NOT been edited for publication, and are NOT in any sense the “official” text of the Illinois Compiled Statutes as enacted into law. The accuracy of any specific provision originating from this site cannot be assured, and you are urged to consult the official documents or contact legal counsel of your choice. This site should not be cited as an official or authoritative source.

One is tempted to conclude that states cannot long occupy this fantasy world, but the crudeness of most states’ websites for their codes supports the notion that they may be able to spend many happy years in it yet.

The Messiness of Real-World Data

Those of us who work with big data have a tendency to describe working with it in cavalier terms. “Oh, I just grabbed the XML file, wrote a quick parser to turn it into CSV, bulk loaded it into MySQL, laid an API on top of it, and I was done.” The truth is that things very rarely go so well.
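
For illustration, here is roughly what that cavalier version looks like. The happy path really is only a dozen lines, which is why it’s so tempting to believe in it; the element names, the CSV layout, and the database table are all hypothetical.

```python
# The "happy path" pipeline: XML in, CSV out, ready to bulk-load. Element
# names, columns, and the table are hypothetical stand-ins.
import csv
import xml.etree.ElementTree as ET

tree = ET.parse('laws.xml')  # assumes the XML is valid (it won't be)
with open('laws.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['number', 'catchline', 'text'])
    for law in tree.iter('law'):
        writer.writerow([law.get('number'),
                         law.findtext('catchline'),
                         law.findtext('text')])
# ...then LOAD DATA INFILE 'laws.csv' INTO TABLE laws, an API on top, done.
```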

Real-world data is messy. Data doesn’t convert correctly the first time (or, often, the tenth time). File formats are invalid. The provided data turns out to be incomplete. Parser code that was so straightforward when written for the abstract concept of the data quickly turns into a series of conditionals to deal with all of the oddities of the real thing.
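
For instance, a section-number cleanup routine that begins life as a single line can end up looking something like the sketch below. Every special case in it is invented, but each is the kind of thing that real files force on you.

```python
# What the "quick parser" looks like after it meets real data: a pile of
# conditionals. All of these special cases are hypothetical examples.
def clean_section_number(raw):
    number = raw.strip()
    if number.endswith('.'):  # some sections carry a stray trailing period
        number = number[:-1]
    number = number.replace('\u2013', '-')  # en dashes where hyphens belong
    if number.lower().startswith('sec. '):  # one title abbreviates oddly
        number = number[5:]
    if not number:  # a handful of sections arrive unnumbered
        return None
    return number
```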

While I’ve been developing the parser for The State Decoded, it’s become obvious that state laws themselves are too messy to standardize entirely, and the data formats in which states provide those laws are, in turn, too messy to import easily.

As a case study of the messiness of real-world data, here are some of the challenges I’ve encountered in parsing state legal codes.

Encoding Errors

A precious few states provide their codes as bulk data. While it might seem like a real gift to get an XML file of every state law, if the XML is invalid, then that’s really more of a white elephant. One state provided me with SGML that it had, in turn, received from LexisNexis. It was riddled with hundreds of errors and could not be parsed automatically. After hours of attempting to fix the problems by hand, I finally threw in the towel. Weeks later, LexisNexis provided a corrected version, and my work could continue.
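
Before resorting to hand-fixing, it’s worth triaging: parse leniently and count what the parser chokes on, so you know whether you face ten errors or ten thousand. Here is a sketch of that, using lxml’s error log (the filename is hypothetical):

```python
# Triage a broken SGML/XML file: parse leniently, then report every point
# where the parser had to recover, with line and column numbers.
from lxml import etree

def report_errors(path):
    parser = etree.XMLParser(recover=True)
    etree.parse(path, parser)
    errors = list(parser.error_log)
    for error in errors:
        print(f'line {error.line}, column {error.column}: {error.message}')
    print(f'{len(errors)} errors total')

report_errors('state-code.sgml')
```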

Working with big data often means doing pathbreaking work, and sometimes that means that nobody has ever before attempted to do anything with the data sets in question. Assembled by well-meaning but inexperienced people, those data sets may consequently be encoded incorrectly.

Changing Realities

State laws are occasionally restructured, renumbering huge portions of the code. Virginia’s entire criminal code, Title 18.1, became Title 18.2 back in 1975. No redirects exist in 18.1, no pointers to the new location, no sign of what was once there. One must simply know that it changed. Court cases, articles from legal journals, or opinions of attorneys general that cited sections of code within 18.1 are thus either useless or must be passed through a hand-crafted filter to point each citation to the new section number.
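
That hand-crafted filter can be as simple as a lookup table, sketched here. The specific old-to-new mappings shown are hypothetical; the real table has to be assembled by hand, because nothing in the data records where each section went.

```python
# A hand-built table mapping repealed Title 18.1 citations to their Title
# 18.2 successors. These particular mappings are hypothetical examples.
RENUMBERED = {
    '18.1-16': '18.2-26',
    '18.1-44': '18.2-57',
}

def current_citation(citation):
    """Translate a repealed citation to its current equivalent, if known."""
    if citation.startswith('18.1-'):
        return RENUMBERED.get(citation)  # None means no known successor
    return citation
```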

It would be nice if reality would consent to remain static to ease the process of cataloging it. But the world changes, and data reflects those changes. That can make it awfully frustrating to parse and apply that data, but that’s just the price of admission.

Inconsistencies in the Data

At least a few states violate their own standard code structure. A state might divide its code into titles, each title into chapters, and each chapter into sections. Except, sometimes, the chapters are called “articles.” Why? I have no idea. If lawmakers consulted with database developers prior to recodifying their states’ laws, no doubt our legal codes would be properly normalized.
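
One way to cope is to treat the structural level and the label a state uses for it as separate facts, rather than hard-coding a title/chapter/section hierarchy. A sketch, with invented example data:

```python
# Record the depth of a structural unit separately from whatever the state
# happens to call it. The units below are invented examples.
from dataclasses import dataclass

@dataclass
class StructuralUnit:
    depth: int       # 1 = top level, 2 = one level down, and so on
    label: str       # what this state calls it: 'title', 'chapter', 'article'
    identifier: str  # e.g. '18.2'
    name: str

units = [
    StructuralUnit(1, 'title', '12', 'An Example Title'),
    StructuralUnit(2, 'chapter', '3', 'An Ordinary Chapter'),
    StructuralUnit(2, 'article', '7', 'A Chapter That Calls Itself an Article'),
]
```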

These inconsistencies might be illogical, but they’re a fact, and they must be reflected in the final application of the data. This can be particularly frustrating when, as is often the case, the provided bulk data doesn’t record the inconsistencies internally, requiring external information to be gathered and applied to the data.

Missing Data

One state’s code contains periodic parenthetical asides like “See Editor’s note.” What does the editor’s note say? There’s no way to tell: it’s not part of the bulk data or of the state’s official website for its legal code. Those editor’s notes will have to be obtained from the state’s code commission, and will probably arrive as a bunch of Word files attached to an e-mail.
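
In the meantime, the gaps can at least be tracked: scan each section’s text for asides that reference material missing from the bulk data, so the incomplete sections can be listed and chased down later. A sketch (the phrasing matched is hypothetical):

```python
# Flag sections whose text points to an editor's note that isn't in the
# bulk data. The matched phrasing and sample data are hypothetical.
import re

NOTE_REFERENCE = re.compile(r"\(?\s*See Editor's note\.?\s*\)?", re.IGNORECASE)

def sections_missing_notes(sections):
    """Yield the numbers of sections that reference an absent note."""
    for number, text in sections.items():
        if NOTE_REFERENCE.search(text):
            yield number

sections = {'12-34': "It is unlawful to... (See Editor's note.)"}
print(list(sections_missing_notes(sections)))  # ['12-34']
```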

Not all data exists electronically, and not all data that exists electronically exists in a single location. Often, piecing together a meaningful data set requires gathering information from disparate sources, sometimes in awkward ways. And sometimes the last few bits of data just aren’t available, and the data set is going to have to be incomplete.

All of these problems are solvable, in one way or another, but those solutions can be time-consuming. The Pareto principle applies here: one is liable to get 80% of the data set whipped into shape in the first 20% of the time, while the remaining 20% of the data will require the remaining 80% of the time. That first 80% feels magical, with everything just falling into place, but that last 20% is just plain hard work.

Real-world data is messy. Working with big data means cleaning it up.