Version 0.8 Released

Version 0.8 of The State Decoded is now available on GitHub. This is our biggest release to date, in part because it’s a combination of what was meant to be both the v0.8 and the v0.9 releases. That means that it took twice as long to produce this release as was planned, but it was worth it. It comprises 709 Git commits, more changes than were committed for all eight prior versions combined. This is the final major release prior to v1.0. The two biggest improvements are a total overhaul of the user interface with a responsive design and the integration of the Apache Solr search engine. Here is a rundown of the major changes:

New User Interface

John Athayde and Lynn Wallenstein of Meticulous were responsible for dropping the old design and creating a new one from scratch. They’ve rendered the site almost unrecognizably different, with a highly modular, flexible design that looks good on screens of all sizes, is easy to customize, and will continue to grow with the State Decoded project. The layout is coded in SASS, which is handy for a lot of designers, and employs a pluggable template system that makes it easy to drop in custom designs. It even has a nice print stylesheet for folks who still like to have a hard copy in their hand. There’s no overstating what a huge improvement this is.

Search Engine

It’s really insulting to call this a mere search engine. The team at Open Source Connections (John Berryman and Doug Turnbull, specifically) designed an implementation of Apache’s Solr search platform that’s optimized for legal data. Of course, we used this to add a site search engine with spellcheck and live search suggestions, but having legal data indexed by Solr (and the underlying Lucene system) facilitates all sorts of exciting new features in the realms of natural language processing, information retrieval, content recommendations, and machine learning.

Code Base Overhaul

The complexity of The State Decoded and the needs of its users outgrew the code base and data structure that had functioned from v0.1 through v0.7. Long-time State Decoded contributor Bill Hunt took on this project, completing it in just a few weeks, implementing a routing table, moving a lot of functionality into controllers, and routing all queries to a permalinks table. The old approach looks pretty clunky compared to what Bill built.

Setup Simplified

We’ve tested out The State Decoded on the major Linux distributions and hosting platforms, identified all of the changes that were needed to allow it to work smoothly in those environments, and modified the software accordingly. There’s now an automated environmental test suite to make sure that The State Decoded will work properly, and there are multiple paths to accomplishing the same task, so that configuring the server requires as little work as possible. The installation process has fewer steps than ever, as everything that can possibly be automated is automated.

Plus

There’s an entire workflow to handle new editions of codes, bulk downloads are created automatically, a sample XSLT is included for State Decoded XML, there’s a default home page that doesn’t require any customization, it now supports non-unique section identifiers (!), there’s an API method for search, there’s a proper administrative section now, we have an assets repository for the design’s Photoshop files, a sitemap.xml is automatically generated, every law now has Dublin Core embedded, and dozens of other things.

The work done by Bill Hunt, Meticulous, and Open Source Connections wouldn’t be possible without the generous support of The John S. and James L. Knight Foundation, whose funding made it possible to hire them. And Bill’s ongoing work on The State Decoded is courtesy of his employer, The OpenGov Foundation, which now employs him to contribute to the project and to implement The State Decoded in cities and states around the U.S.

Thanks, too, to Chris Birk, Karl Nicholas, Nick Skelsey, Josh Tauberer, and Rostislav Tsiomenko for their suggestions, testing, and contributed code.

First Documentation Release

The first draft of The State Decoded’s documentation is now available. Documentation was being built up piecemeal on a GitHub wiki, but it’s been moved to its own GitHub repository and a dedicated website. The pages are created in a mix of HTML and Markdown (Markdown has difficulty with some of the sample XML), and the site is built in Jekyll, which means that any changes made in the documentation GitHub repository are reflected promptly on the documentation website. This makes it simple for anybody to update the documentation—to fix a mistake, add an example, or even add a whole new section.

There’s a great deal more to be done with the documentation. It needs to be organized, structured narratively, enhanced with illustrations, and simply cover more material. But it’s not bad, and now it’s easy for others to help make it better.

Two Mini-Projects: Subsection Identifier and Definition Scraper

The State Decoded project has spun off a couple of sub-projects, components of the larger project that can be useful for other purposes, and that deserve to stand alone. (Both are found on our GitHub repository.)

The first is Subsection Identifier, which turns theoretically structured text into actually structured text. It is common for documents in outline form (contracts, laws, and other documents that need to be able to cross-reference specific passages) to be provided in a format in which the structural labels flow into the text. For example:

A. The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
B. An NRP that has been appointed by the agency may be dissolved by the agency when:
1. There is no longer controversy associated with the development of the regulation;
2. The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
3. The agency determines that resolution of a controversy is unlikely.

One of the helpful features of The State Decoded is that it breaks up this text, understanding not just that every labelled line can stand alone, but also that the final line, despite being labelled “3,” is actually “B3,” since “3” is a subset of “B.” That functionality has been forked from The State Decoded, and now stands alone as Subsection Identifier, which accepts passages of text and turns them into well-structured text, like so:

(
    [0] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => A
                )

            [prefix] => A.
            [text] => The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
        )

    [1] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                )

            [prefix] => B.
            [text] => An NRP that has been appointed by the agency may be dissolved by the agency when:
        )

    [2] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 1
                )

            [prefix] => B.1.
            [text] => There is no longer controversy associated with the development of the regulation;
        )

    [3] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 2
                )

            [prefix] => B.2.
            [text] => The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
        )

    [4] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 3
                )

            [prefix] => B.3.
            [text] => The agency determines that resolution of a controversy is unlikely.
        )

)
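To make the idea concrete, here is a minimal sketch of the technique in Python. It is not the project's code (Subsection Identifier itself is part of a PHP project), and the function name, the two label patterns, and the dictionary keys borrowed from the output above are assumptions made for illustration. It handles only "A."-style and "1."-style labels; real statutes need many more label families.

```python
import re

# Hypothetical label families, checked in order. Real-world text needs many
# more (lower-case letters, roman numerals, parenthesized labels, and so on).
LABEL_PATTERNS = [
    ("upper", re.compile(r"^([A-Z])\.\s+(.*)$")),
    ("digit", re.compile(r"^(\d+)\.\s+(.*)$")),
]


def identify_subsections(lines):
    """Return one dict per labelled line, with its full prefix hierarchy."""
    stack = []   # (family, label) pairs for the currently open levels
    result = []
    for line in lines:
        line = line.strip()
        for family, pattern in LABEL_PATTERNS:
            match = pattern.match(line)
            if match:
                break
        else:
            continue  # unlabelled line: ignored in this sketch
        label, text = match.groups()
        families = [f for f, _ in stack]
        if family in families:
            # A label family seen before: return to that level and replace it.
            stack = stack[: families.index(family)]
        stack.append((family, label))
        hierarchy = [part for _, part in stack]
        result.append({
            "prefix_hierarchy": hierarchy,
            "prefix": "".join(part + "." for part in hierarchy),
            "text": text,
        })
    return result


sample = [
    "A. The agency may appoint a negotiated rulemaking panel (NRP).",
    "B. An NRP may be dissolved by the agency when:",
    "1. There is no longer controversy;",
    "2. The regulatory action is exempt; or",
    "3. Resolution of a controversy is unlikely.",
]
parsed = identify_subsections(sample)
```

The stack is what lets the last line come out as "B.3." rather than "3.": a digit label that follows "B." is treated as a child of "B" until another upper-case label closes that level.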

The second mini-project is Definition Scraper, which extracts defined terms from passages of text. Many legal documents begin by defining words that are then used throughout the document, and knowing those definitions can be crucial to understanding that document. So it can be helpful to be able to extract a list of terms and their definitions. Definition Scraper need only be handed a passage of text, and it will determine whether it contains defined terms and, if it does, it will return a dictionary of those terms and their definitions.

Running this passage through Definition Scraper:

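“The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.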
A “covered work” means either the unmodified Program or a work based on the Program.

Yields the following two-entry dictionary:

(
    [program] => “The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.
    [covered work] => “covered work” means either the unmodified Program or a work based on the Program.
)
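A rough sketch of how such a scraper might work, in Python rather than the project's PHP. The function name, the list of defining phrases, and the rule that a definition is a paragraph opening with a quoted term are all assumptions for illustration; the actual Definition Scraper may work quite differently.

```python
import re

# Assumed defining phrases; a real scraper would need a much longer list.
DEFINING_PHRASES = ("means", "refers to", "shall mean", "includes")

# An optional leading article, a term in curly or straight quotes, then the rest.
TERM_PATTERN = re.compile(r'^(?:(?:A|An|The)\s+)?[“"]([^”"]+)[”"]\s+(.*)$')


def scrape_definitions(passage):
    """Map each defined term (lower-cased) to the paragraph that defines it."""
    definitions = {}
    for paragraph in passage.splitlines():
        paragraph = paragraph.strip()
        match = TERM_PATTERN.match(paragraph)
        if not match:
            continue
        term, remainder = match.groups()
        if not remainder.startswith(DEFINING_PHRASES):
            continue
        # Drop a leading article inside the quotes, so "The Program" -> "program".
        key = re.sub(r"^(?:a|an|the)\s+", "", term.lower())
        definitions[key] = paragraph
    return definitions


passage = (
    "“The Program” refers to any copyrightable work licensed under this License.\n"
    "A “covered work” means either the unmodified Program or a work based on the Program.\n"
    "This sentence defines nothing."
)
definitions = scrape_definitions(passage)
```

Keying on the lower-cased term without its article is a design choice made here so that later mentions of "the Program" or "a covered work" can be matched against the same dictionary entry.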

Definition Scraper is also a core function of The State Decoded, but warrants becoming its own project because it is so clearly useful for applications outside of the framework of The State Decoded.

The decision to spin off these projects was prompted by a report by the John S. and James L. Knight Foundation, the organization that funds The State Decoded, which evaluated the success of their News Challenge winners. They found several common attributes among the more successful funded projects, including this:

Projects that achieved strong use and adoption of their code often built and released their software in individual components, knowing that certain elements had value and a wide range of uses beyond the main focus of their project.

As development on The State Decoded continues, we may well spin off more mini-projects, if it becomes clear that more components of the overall project could be useful stand-alone tools.

Version 0.6 Released

Version 0.6 of The State Decoded is now available on GitHub. This release is a really exciting one—it establishes a public API for State Decoded sites and creates a standard XML format for importing laws! This is an important release of The State Decoded, one that stands to increase significantly the accessibility of the project to developers, both within the software and without. A total of 23 issues were resolved, nearly all of which are towards those two goals.

Public API

The State Decoded now has a fully fleshed out RESTful, JSON-based API. It has three methods: Law, Structure, and Dictionary. Law provides all available information about a given law. Structure provides all available information about a given structural unit (the various organizational units of legal codes: “titles,” “chapters,” “parts,” etc.). And Dictionary provides the definition (or definitions) for a term within a legal code. The data for these comes directly from the internal API that drives the site, so what’s available publicly is what drives the site privately. In fact, I’m toying with the idea of having the site consume its own API, using internal APIs solely to serve data to the external API, and having every other part of the site get its data from that external API.

For a quick start trying this out on the Virginia site, you can use the trial key, 4l6dd9c124ddamq3 (though don’t build any applications using that key, or they will break when it expires), and see the API documentation to put it to work. For a really quick start, you can just browse the Code of Virginia via the API, check out a list of definitions, or read the text of a law. If you decide that you like what you see, register for a key and put this API to work.

Personally, this is the release that I’ve been waiting for. There’s an extent to which the purpose of The State Decoded project is really just to provide an API for legal codes; the fact that there’s a pretty website atop that API is just icing on the cake.

XML Format

A significant obstacle to implementing The State Decoded has been the need to customize the parser for each installation. Every legal code is different (there are no standards), with some all in one big SGML file, others stored in thousands of XML files, and others still needing to be scraped off of webpages. That necessitated modifying the State Decoded parser to interface the data from the source files with the internal API. That’s really not an obstacle to people and organizations who are serious about implementing The State Decoded. But plenty of people might be serious if they could just try it out first. There’s a huge gradient between “huh, looks interesting” and “I must get my city/state/country laws online!” It’s foolish to assume that people won’t just want to try it out first. After all, that’s how I prefer to get started with software and projects.

The solution was to establish an XML standard for importing legal codes into The State Decoded, to provide a low-barrier-to-entry path in addition to the more complex path. To be clear, this is not an attempt to create an XML standard for legal codes. This is a loosely typed standard, used solely as an import format for The State Decoded. Many legal codes are already stored as XML—that’s the most common file format—so getting those codes into The State Decoded now only requires writing a bit of XSLT. This is a much lower barrier to entry.

The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<law>
	<structure>
		<unit label="" identifier="" order_by="" level=""></unit>
	</structure>
	<section_number></section_number>
	<catch_line></catch_line>
	<order_by></order_by>
	<text>
		<section prefix=""></section>
	</text>
	<history></history>
</law>

Several of those fields are optional, too. There will certainly be legal codes and organizations for which this won’t do the trick—they’ll need to modify the parser to handle some unusual types of data, fields, third-party data sources, etc. But for most people, this will be a big improvement.
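For anyone whose source data isn't XML already, a short script in any language can emit this format. The sketch below uses Python's standard xml.etree.ElementTree module. The element and attribute names come from the sample above; the function name, the sample values, and the choice to reuse identifiers as order_by values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_law_xml(units, section_number, catch_line, sections, history=""):
    """Build one <law> element in the import format shown above."""
    law = ET.Element("law")
    structure = ET.SubElement(law, "structure")
    for level, (label, identifier, name) in enumerate(units, start=1):
        unit = ET.SubElement(structure, "unit", {
            "label": label,
            "identifier": identifier,
            "order_by": identifier,  # assumption: sort by the identifier
            "level": str(level),
        })
        unit.text = name
    ET.SubElement(law, "section_number").text = section_number
    ET.SubElement(law, "catch_line").text = catch_line
    ET.SubElement(law, "order_by").text = section_number
    text = ET.SubElement(law, "text")
    for prefix, body in sections:
        ET.SubElement(text, "section", {"prefix": prefix}).text = body
    ET.SubElement(law, "history").text = history
    return law


# Made-up sample data, not taken from any real legal code.
law = build_law_xml(
    units=[("title", "1", "Example Title"), ("chapter", "2", "Example Chapter")],
    section_number="1-2.3",
    catch_line="Example catch line.",
    sections=[("A", "First subsection."), ("B", "Second subsection.")],
)
xml_text = ET.tostring(law, encoding="unicode")
```

Writing `xml_text` to one file per law gives a set of files in this general shape; check the installation instructions for what the parser actually expects before relying on it.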

The Code of Virginia is available as State Decoded XML, so if you’ve been considering playing with The State Decoded, it just got a whole lot easier to deploy a test site. Just download that XML and follow the installation instructions.

Thanks to Tom MacWright, Andrew Nacin, Daniel Trebbien, and Chad Robinson for their pull requests, wiki edits, and trouble tickets.

Version 0.5 Released

Version 0.5 of The State Decoded is now available on GitHub. This release is full of general enhancements, and some of them are significant. Twenty-four issues were resolved with this release, including some new features, some significant optimizations, some standardization, and further abstraction of functionality to make it easier to implement.

Here are the most interesting changes:

  • All functionality likely to require customization with each implementation now resides in a state-specific file, rather than being mixed in with core functionality.
  • The beginnings of a templating system are in place, allowing images, CSS, and HTML to be packaged together, in the general direction of how WordPress works.
  • A new method has been added to the Law class that simply verifies that a given law exists. Because every law mentioned in a section must be verified to actually exist, this has made page rendering about 3.5 times faster (with the benchmark law, 2,142 milliseconds reduced to 610 milliseconds).
  • Several files have been renamed, in order to prevent customizations from being overwritten with upgrades. This is an important step towards providing an upgrade path between versions.
  • Two bulk download files are automatically generated each time the parser is run—a JSON version of the custom dictionary, and a JSON version of the entire legal code.
  • Much has been done towards standardization generally, so that the project adheres to best practices in PHP and MySQL. While this is of little benefit to the end user, for anybody actually getting their hands dirty with code, it should make things much simpler. There’s a lot more to be done to comply with PEAR coding standards, but that’s underway.
  • Virginia attorney James Steele created a print stylesheet to format laws nicely when he printed them out. He was kind enough to contribute that to the project, and printouts of laws are now vastly improved.
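The existence check in the list above is a good example of a cheap query replacing an expensive one. Here is a minimal sketch of the idea in Python with an in-memory SQLite database standing in for MySQL; the table name, column names, and function name are hypothetical, and the project's actual Law class method may be implemented differently.

```python
import sqlite3


def law_exists(db, section_number):
    """Cheap existence check: select a constant instead of loading the law."""
    row = db.execute(
        "SELECT 1 FROM laws WHERE section = ? LIMIT 1", (section_number,)
    ).fetchone()
    return row is not None


# Hypothetical schema and data, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE laws (section TEXT PRIMARY KEY, catch_line TEXT, text TEXT)")
db.execute("INSERT INTO laws VALUES ('1-2.3', 'Example catch line', 'Example text')")

# Only section numbers that resolve to a real law would be turned into links.
mentioned = ["1-2.3", "9-9.9"]
links = [section for section in mentioned if law_exists(db, section)]
```

The point is that verifying a cross-reference needs only a yes or no, so there is no reason to fetch and assemble the full text, history, and metadata of every law that a section happens to mention.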

Most of these changes are, in one way or another, moving the project towards standardization, automation, and normalization, to make it easier to deploy, maintain, and use. It should all be a lot easier to understand for a programmer diving into it for the first time.

The next release is version 0.6, dedicated to API improvements. That will be comprised of a relatively small number of issues, but they’re big ones: creating a RESTful JSON-based API, and supporting a crudely typed XML input format to simplify the process of parsing new codes. The latter is important, because the present arrangement requires that one know enough PHP to modify the parser to suit their own code’s unique storage and formatting. The idea here is that you can, alternately, use the tools of your choice to create an XML version of that code, and as long as that XML is of the style expected by the parser, it can be imported without having to edit a line of PHP in The State Decoded. Note that v0.6 was supposed to be the release in which the Solr search engine was integrated deeply into the software. That has now been pushed back—it’ll probably be v0.9—in order to accommodate a vendor’s schedule.

Version 0.4 Released

Today, version 0.4 of The State Decoded was tagged on GitHub and bundled up for download, the result of six weeks of work. This release is dedicated (almost) exclusively to enhancements to the dictionary system. Eighteen issues comprise the changes in this release, sixteen of which pertain to the built-in automatic, custom dictionary system, which finds defined terms within legal codes and stores them in a dictionary, using that data to embed contextual definitions that are relevant to each law.

There are a few big changes:

  • The State Decoded comes with a built-in dictionary of general legal terms. Using several different non-copyrighted, government-created legal dictionaries, a collection of nearly 500 terms has been put together, which will help people to understand common legal terms that are rarely defined within legal codes, such as “mutatis mutandis,” “tort,” “pro tem,” and “cause of action.”
  • Dictionary terms are now identified more aggressively, which means that for many states, the size and scope of the custom dictionary is going to expand substantially. In the case of Virginia there was a 49% increase (a leap from 7,681 to 11,504 definitions), a striking difference that could be observed immediately when browsing the site.
  • The problem of nested/overlapping definitions has been solved. When one definition was nested within another (e.g., if we have definitions for both “robbery” and “armed robbery”), then mousing over “robbery” would yield a pair of pop-up definitions, one obscuring the other. Now only the definition for the longest term is displayed under those circumstances.
  • Internal terminology has been standardized. The dictionary and its components were called different things (glossary, definitions, dictionary, terms, etc.) in different places. Now the collection of words is called a “dictionary,” each defined word is a “term,” and the description of what that term means is a “definition.”
  • The retrieval and display of definitions is substantially faster—they take about half the time that they used to. This is a result of optimizing and simplifying the structure of the database table in which definitions are stored.
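The longest-term rule for nested definitions can be sketched in a few lines. This is an illustration in Python, not the project's PHP implementation, and the function name and matching strategy (a single regular expression with terms ordered longest first) are assumptions.

```python
import re


def find_terms(text, terms):
    """Return non-overlapping (start, end, term) spans, preferring longer terms."""
    # Longest-first ordering makes the alternation prefer the longer term when
    # two terms start at the same position (e.g. "armed" and "armed robbery").
    # Terms nested later in a longer term ("robbery" inside "armed robbery")
    # are skipped because finditer scans left to right without overlapping.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0).lower()) for m in pattern.finditer(text)]


sentence = "Armed robbery is a form of robbery."
spans = find_terms(sentence, {"robbery", "armed robbery"})
```

Here the first match covers the whole phrase "Armed robbery", so only one pop-up would be attached to it, while the standalone "robbery" at the end of the sentence still gets its own definition.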

A list of all closed issues is available for those who want specifics. And for those who are suckers for details, this is the first release for which a detailed Git commit log is available, with relatively detailed comments for all 68 commits that comprise this release.

This release is two weeks late, almost entirely because of time spent on a pernicious and difficult parsing bug that, it only occurred to me today, shouldn’t have blocked this release because, while an important problem, it has absolutely nothing to do with definitions. (The problem that is being wrestled with is how to handle subsections of laws that span paragraphs. Easy to describe, difficult to solve, at least for those state codes that pretend that a paragraph and a section are one and the same. I’m looking at you, Virginia.) That issue has been moved back to v0.5, and I’ll go right back to wrestling with it on Monday.

Next up, version 0.5 will be another general-enhancements release. Version 0.6 will be the Solr release—the version in which the popular search software becomes integrated deeply into the project. Version 0.7 will be the API release, where the nascent API gets built out to full functionality and documented properly. Version 0.8 will be the user interface release, in which the design will be overhauled, a responsive design will be implemented, serious work will go into the typography, an intercode navigation system will be implemented, contextual help and explanations will be embedded throughout, and the results of some light UI testing will be incorporated. Version 0.9 will be dedicated to optimizations—making everything go faster and be more fault-tolerant, both through improving the code base and supporting the APC and Varnish caching systems. And, finally, version 1.0 will be the first release in which State Decoded becomes a platform that facilitates the sort of analysis and data exchange that makes this project so full of possibility—things like flexible content export, visualizations, user portfolios of interesting laws, and surely lots of other things.

Version 0.3 Released

Just one week after the release of version 0.2 of The State Decoded comes version 0.3. (You can download it as a 308 KB tarball.) It consists of 23 general enhancements, notably including:

  • The parser that imports legal codes and populates the site has been simplified to be both non-working and a significantly more useful template to be followed to create new state import systems. Previously it was fully functional for importing the Code of Virginia, which was too much detail to serve as a guide.
  • Improved support for and optimization of custom, state-specific functions.
  • Unnecessary chunks of the function library removed, with remaining useful portions of them integrated into other functions.
  • Beginnings of support for APC for variable storage, starting with moving constants into APC.
  • A hook for and sample functionality to turn each law’s history section into a plain-English description of that history, along with links to see the acts of the legislature that made those changes.
  • 404 functionality added for proper error handling of requests for non-existent section numbers and structural units.
  • Added arrow-key based navigation to move to prior and next sections within a single structural container.
  • Provided a sample .htaccess file for supporting a decent URL structure.
  • Moved JavaScript assets out of the general template and into the specific template to eliminate unnecessary code.

As with the prior two releases, this is an alpha release—there’s no installer, documentation, or administrative backend. With this release the gap between the released packages and the version of the software powering Virginia Decoded and Sunshine Statutes (Florida) is smaller than ever, and I’m hopeful that I can port those sites over to run on v0.3 of The State Decoded.

With this release, the project is back on the monthly release schedule that was started in June. A roadmap is emerging for the next few releases. Version 0.4 will be dedicated almost entirely to enhancements to the dictionary system that makes laws self-documenting, and that’s due out on September 1. Version 0.5 will be another general-enhancements release, due out on October 1. Version 0.6 will be the Solr release—the version in which the popular search software becomes integrated deeply into the project, due out on November 1. (Solr functionality, by the way, is made possible by a generous contribution from Open Source Connections, specifically David Dodge, Joseph Featherston, and Kasey McKenna, who recently spent a great deal of time setting up Solr to support The State Decoded.) And version 0.7 will be the API release, where the nascent API gets built out to full functionality and documented properly, and that’s due out on December 1.

Version 0.2 Released

Version 0.2 of the State Decoded is now available for download from the project’s GitHub page. (Or you can grab it directly as a 308 KB tarball.) It consists of 21 enhancements, notably including:

  • Ongoing work to “de-Virginia-ize” the codebase, necessary because the software was developed originally just for the Code of Virginia. This includes full abstraction of the structure storage system, moving away from Virginia-specific nomenclature (“title,” “chapter,” “code,” etc.).
  • Significant optimizations, including the dynamic creation of a SQL view to access structural data and moving to using section IDs instead of section numbers.
  • A parser for law histories.
  • The establishment of a general metadata table to allow the storage and display of arbitrary types of information on a per-law basis.
  • Initial steps towards integrating Solr into the project.
  • Support for storing multiple versions of the same law (e.g., both the 2011 and 2012 revisions) simultaneously.

As with v0.1, this version is an alpha release: there’s no installer, documentation, or administrative backend. It’s only for the brave or curious. Virginia Decoded and Sunshine Statutes (Florida) are not running on this release but, instead, a modified version. Each release will get closer to providing all of the functionality of these live sites with the flexibility and abstraction to provide the same functions for other states, and I remain optimistic that v0.3 will be the release that can be installed on these live sites, so that I can eat my own dog food.

The First Software Release

Last Thursday, the source code for the State Decoded project went up on GitHub, from which it can be downloaded. It’s merely version 0.1 of the product, not at all pretty, but it’s an important milestone that mustn’t go unacknowledged.

The State Decoded started off as Virginia Decoded, a project that I spent nights and weekends on for a year or so. The Knight Foundation’s $165,000 grant funds taking that Virginia-specific code and abstracting it sufficiently to apply to other states, cities, or even countries. In the three months that this has been my full-time job, much of my time has been spent on the very specific task of de-Virginia-ing the code, so that it can work anywhere.

This v0.1 release is the result of that streamlining. It required some significant architectural changes. The biggest change was providing support for the widely varying structures of legal codes. The Code of Virginia is broken into titles, which are broken into chapters, which are broken into articles, and those are made up of individual sections.* Accordingly, I had tables in the database for each of these structures. As a result, the software could only work for legal codes that used the same three-tiered structural system. That had to be tossed out and rewritten. There were a series of changes of this nature, all of which simplified and normalized the software’s functionality.

Anybody looking to launch their own implementation of the State Decoded with this v0.1 release would be disappointed by the awkwardness of the process. There’s no installer, instructions, or clever administration system. But it is functional, structurally intact, and extremely informative for anybody interested in putting it to work.

I’m marching towards v0.2, scheduled for release one month from now, and working on getting Virginia Decoded using the live release of the State Decoded software, instead of the State Decoded software being merely derived from Virginia Decoded. The two should converge somewhere around v0.3, and that’s the next major developmental milestone in this project.

* The official SGML file of the Code of Virginia has no representation of articles and, as a result, they are not employed on Virginia Decoded. That is not a permanent problem, but that is the present state of things.

The Road Ahead

With yesterday’s launch of Virginia Decoded, there are suddenly a lot of people who would love to set up The State Decoded for their own state, something that isn’t possible just yet. This calls for an explanation of what the plan is.

Building Virginia Decoded was an evenings-and-weekends hobby for me starting in the summer of 2010. I learned the structure of the Code of Virginia as I went, and built the site explicitly to mirror that structure. Some friends who were alpha testing the site in late 2010 insisted that it could be used in other states. I applied to the John S. and James L. Knight Foundation for the funding to overhaul the Virginia Decoded code base, abstract it enough that it could support the widely varying structure of legal codes throughout the United States, and turn it into a proper open source project. In June the Knight Foundation named the State Decoded project one of the winners of the 2011 News Challenge. The funding came through a few days before the end of 2011, and I was able to get started on the project two weeks ago.

Launching Virginia Decoded was easy, because I’d created the site long before getting started on The State Decoded.

The next task is to scrub the Virginia Decoded source of all material that shouldn’t be released publicly (passwords, API keys, etc.), at which point it can all go up on GitHub. (You can find Virginia Decoded’s parser on GitHub already.)

Then comes the real work, which is eliminating all of the functionality that is fundamentally Virginian, and replacing it with more flexible functionality. For instance, the Code of Virginia is broken into titles, which are broken into chapters, which are broken into sections (each section is a single law). But California’s laws are broken down into codes, which are broken into divisions, which are broken into chapters, which are broken into articles, which are broken into sections. As a result, California’s laws just won’t work in the software’s existing framework, which is premised on the assumption that codes are divided into three levels, and that those levels are called “titles,” “chapters,” and “sections.” This and other, similar problems are wholly solvable; they’ll just require some reflection and some time. Luckily, solving those problems is my full-time job for the foreseeable future, courtesy of the Knight Foundation.
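One way to picture a more flexible approach is to stop hard-coding level names and instead describe each law's position as an ordered list of units, however many there are. The following Python sketch is only an illustration of that idea under assumed names (`Unit`, `describe`); it is not a description of how the software is or will be built, and the identifiers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str       # what this level is called, e.g. "title" or "division"
    identifier: str  # this unit's number or name within its parent


def describe(path):
    """Render a structural path of any depth as a readable breadcrumb."""
    return " > ".join(f"{unit.label.capitalize()} {unit.identifier}" for unit in path)


# Made-up identifiers; only the level names come from the text above.
virginia = [Unit("title", "1"), Unit("chapter", "2")]
california = [
    Unit("code", "Example"),
    Unit("division", "1"),
    Unit("chapter", "2"),
    Unit("article", "3"),
]

breadcrumbs = [describe(virginia), describe(california)]
```

Because nothing in `describe` knows how many levels there are or what they are called, the same code serves a two-level path above a Virginia section and a four-level path above a California one.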

When the State Decoded code base is sufficiently abstracted to work across states, that’s when things will get fun.

Eager to get started for your own state? Watch the project on GitHub, check out the parser, and contact me to let me know what state you’re interested in and in what capacity you want to help make that happen.