Two Mini-Projects: Subsection Identifier and Definition Scraper

The State Decoded project has spun off a couple of sub-projects, components of the larger project that can be useful for other purposes, and that deserve to stand alone. (Both are found on our GitHub repository.)

The first is Subsection Identifier, which turns theoretically structured text into actually structured text. It is common for documents in outline form (contracts, laws, and other documents that need to be able to cross-reference specific passages) to be provided in a format in which the structural labels flow into the text. For example:

A. The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
B. An NRP that has been appointed by the agency may be dissolved by the agency when:
1. There is no longer controversy associated with the development of the regulation;
2. The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
3. The agency determines that resolution of a controversy is unlikely.

One of the helpful features of The State Decoded is that it breaks up this text, understanding not just that every labelled line can stand alone, but also that the final line, despite being labelled “3,” is actually “B3,” since “3” is a subsection of “B.” That functionality has been forked from The State Decoded and now stands alone as Subsection Identifier, which accepts passages of text and turns them into well-structured text, like so:

    [0] => stdClass Object
            [prefix_hierarchy] => stdClass Object
                    [0] => A

            [prefix] => A.
            [text] => The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.

    [1] => stdClass Object
            [prefix_hierarchy] => stdClass Object
                    [0] => B

            [prefix] => B.
            [text] => An NRP that has been appointed by the agency may be dissolved by the agency when:

    [2] => stdClass Object
            [prefix_hierarchy] => stdClass Object
                    [0] => B
                    [1] => 1

            [prefix] => B.1.
            [text] => There is no longer controversy associated with the development of the regulation;

    [3] => stdClass Object
            [prefix_hierarchy] => stdClass Object
                    [0] => B
                    [1] => 2

            [prefix] => B.2.
            [text] => The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or

    [4] => stdClass Object
            [prefix_hierarchy] => stdClass Object
                    [0] => B
                    [1] => 3

            [prefix] => B.3.
            [text] => The agency determines that resolution of a controversy is unlikely.
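
The general idea can be sketched in a few lines of Python. This is not the project's code, only a minimal illustration under stated assumptions: every subsection starts on its own line, labels are single letters or numbers followed by a period or closing parenthesis, and a label of a kind already seen (uppercase, lowercase, or digit) returns to that level, while a new kind opens a deeper level. The function names are made up for this example, and Roman numerals are not handled.

```python
import re

# A label is a single letter or a number, followed by "." or ")", then text.
LABEL = re.compile(r"^\s*\(?([A-Za-z]|\d+)[.)]\s+(.*)$")


def label_kind(label):
    """Classify a label so that labels of the same kind share a level."""
    if label.isdigit():
        return "digit"
    return "upper" if label.isupper() else "lower"


def identify_subsections(text):
    """Return one dict per labelled line, with its full prefix hierarchy."""
    stack = []    # (kind, label) pairs for the current position in the outline
    results = []
    for line in text.splitlines():
        match = LABEL.match(line)
        if not match:
            # Unlabelled text is treated as a continuation of the prior entry.
            if results and line.strip():
                results[-1]["text"] += " " + line.strip()
            continue
        label, body = match.groups()
        kind = label_kind(label)
        kinds = [k for k, _ in stack]
        if kind in kinds:
            # Same kind seen before: climb back up to that level.
            del stack[kinds.index(kind):]
        stack.append((kind, label))
        hierarchy = [l for _, l in stack]
        results.append({
            "prefix_hierarchy": hierarchy,
            "prefix": ".".join(hierarchy) + ".",
            "text": body,
        })
    return results
```

Run against the five-line example above, this sketch yields the prefixes A., B., B.1., B.2., and B.3. Real legal text is messier, which is why the heuristic here should be read as a starting point rather than a description of how Subsection Identifier works internally.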


The second mini-project is Definition Scraper, which extracts defined terms from passages of text. Many legal documents begin by defining words that are then used throughout the document, and knowing those definitions can be crucial to understanding that document. So it can be helpful to be able to extract a list of terms and their definitions. Definition Scraper need only be handed a passage of text, and it will determine whether it contains defined terms and, if it does, return a dictionary of those terms and their definitions.

Running this passage through Definition Scraper:

“The Program” refers to any copyrightable work licensed under this License.
A “covered work” means either the unmodified Program or a work based on the Program.

Yields the following two-entry dictionary:

    [program] => “The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.
    [covered work] => “covered work” means either the unmodified Program or a work based on the Program.
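
A rough sense of how such extraction can work is given by the Python sketch below. It is not Definition Scraper's actual logic. It assumes that a defined term appears in quotation marks immediately before "means" or "refers to," splits sentences naively on periods and semicolons, and keeps only the single sentence containing the term. The function name and the list of trigger phrases are assumptions made for illustration.

```python
import re

# A quoted term followed by a defining verb. Only two trigger phrases are
# covered here; a real scraper would need many more.
DEFINITION = re.compile(r'[“"](?P<term>[^”"]+)[”"]\s+(?:means|refers to)\b')


def scrape_definitions(text):
    """Return a dict mapping each defined term to the sentence defining it."""
    definitions = {}
    # Naive sentence split: break after "." or ";" followed by whitespace.
    for sentence in re.split(r"(?<=[.;])\s+", text.strip()):
        match = DEFINITION.search(sentence)
        if not match:
            continue
        # Normalize the key: lowercase, with any leading article removed.
        term = re.sub(r"^(the|a|an)\s+", "", match.group("term").lower())
        # Keep the text from the opening quotation mark onward.
        definitions.setdefault(term, sentence[match.start():])
    return definitions
```

Given the two-sentence passage above, this produces the keys "program" and "covered work," each mapped to its defining sentence.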

Definition Scraper is also a core function of The State Decoded, but warrants becoming its own project because it is so clearly useful for applications outside of the framework of The State Decoded.

The decision to spin off these projects was prompted by a report by the John S. and James L. Knight Foundation, the organization that funds The State Decoded, which evaluated the success of their News Challenge winners. They found several common attributes among the more successful funded projects, including this:

Projects that achieved strong use and adoption of their code often built and released their software in individual components, knowing that certain elements had value and a wide range of uses beyond the main focus of their project.

As development on The State Decoded continues, we may well spin off more mini-projects, if it becomes clear that more components of the overall project could be useful stand-alone tools.

Version 0.6 Released

Version 0.6 of The State Decoded is now available on GitHub. This release is a really exciting one: it establishes a public API for State Decoded sites and creates a standard XML format for importing laws! It stands to increase significantly the accessibility of the project to developers, both within the software and without. A total of 23 issues were resolved, nearly all of them in service of those two goals.

Public API

The State Decoded now has a fully fleshed out RESTful, JSON-based API. It has three methods: Law, Structure, and Dictionary. Law provides all available information about a given law. Structure provides all available information about a given structural unit (the various organizational units of legal codes: “titles,” “chapters,” “parts,” etc.). And Dictionary provides the definition (or definitions) for a term within a legal code. The data for these comes directly from the internal API that drives the site; what’s available publicly is what drives the site privately. In fact, I’m toying with the idea of having the site consume its own API, using internal APIs solely to serve data to the external API, and having every other part of the site get its data from that external API.

For a quick start trying this out on the Virginia site, you can use the trial key, 4l6dd9c124ddamq3 (though don’t build any applications using that key, or they will break when it expires), and see the API documentation to put it to work. For a really quick start, you can just browse the Code of Virginia via the API, check out a list of definitions, or read the text of a law. If you decide that you like what you see, register for a key and put this API to work.

Personally, this is the release that I’ve been waiting for. There’s an extent to which the purpose of The State Decoded project is really just to provide an API for legal codes; the fact that there’s a pretty website atop that API is just icing on the cake.

XML Format

A significant obstacle to implementing The State Decoded has been the need to customize the parser for each installation. Every legal code is different (there are no standards), with some all in one big SGML file, others stored in thousands of XML files, and others still needing to be scraped off of webpages. That necessitated modifying the State Decoded parser to interface the data from the source files with the internal API. That’s really not an obstacle to people and organizations who are serious about implementing The State Decoded. But plenty of people might become serious if they could just try it out first. There’s a huge gradient between “huh, looks interesting” and “I must get my city/state/country laws online!” It’s foolish to assume that people won’t just want to try it out first. After all, that’s how I prefer to get started with software and projects.

The solution was to establish an XML standard for importing legal codes into The State Decoded, to provide a low-barrier-to-entry path in addition to the more complex path. To be clear, this is not an attempt to create an XML standard for legal codes. This is a loosely typed standard, used solely as an import format for The State Decoded. Many legal codes are already stored as XML—that’s the most common file format—so getting those codes into The State Decoded now only requires writing a bit of XSLT. This is a much lower barrier to entry.

The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
		<unit label="" identifier="" order_by="" level=""></unit>
		<section prefix=""></section>

Several of those fields are optional, too. There will certainly be legal codes and organizations for which this won’t do the trick—they’ll need to modify the parser to handle some unusual types of data, fields, third-party data sources, etc. But for most people, this will be a big improvement.
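
For codes that are not already in XML, the same elements can be produced with a general-purpose language instead of XSLT. The Python sketch below builds the two elements shown above using the standard library. The attribute names come from the sample. Everything else is an assumption made for illustration: the helper names, the sample values, and the choice to put the unit's name and the section's text inside the elements. Any wrapper elements the importer expects are not shown here.

```python
import xml.etree.ElementTree as ET


def build_unit(label, identifier, order_by, level, name):
    """Build a <unit> element describing one structural unit of a code."""
    unit = ET.Element("unit", {
        "label": label,            # e.g. "title" or "chapter"
        "identifier": identifier,  # the unit's number within the code
        "order_by": order_by,      # value used for sorting
        "level": level,            # depth in the hierarchy
    })
    unit.text = name               # assumption: the unit's name is the content
    return unit


def build_section(prefix, text):
    """Build a <section> element holding one labelled passage of a law."""
    section = ET.Element("section", {"prefix": prefix})
    section.text = text
    return section


# Illustrative values only; they are not taken from any real code.
unit = build_unit("title", "1", "1", "1", "Example Title")
section = build_section("A", "Example text of subsection A.")
print(ET.tostring(unit, encoding="unicode"))
print(ET.tostring(section, encoding="unicode"))
```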

The Code of Virginia is available as State Decoded XML, so if you’ve been considering playing with The State Decoded, it just got a whole lot easier to deploy a test site. Just download that XML and follow the installation instructions.

Thanks to Tom MacWright, Andrew Nacin, Daniel Trebbien, and Chad Robinson for their pull requests, wiki edits, and trouble tickets.

No Love from LexisNexis

This is what the table of contents looks like in LexisNexis’s printed edition of the Code of Virginia:

[Image: Lexis Table of Contents]

I was a bit stunned the first time I saw this. It’s just word soup. There’s simply no effort to make it legible. No thought has gone into this. There is, in short, no love.

I feel like, as a culture, we basically understand how to make tables of contents. Right? Grabbing a few books off my desk, more or less at random, I thought I’d compare Lexis’s table of contents to those of others. Here’s The Chicago Manual of Style:

[Image: Chicago Table of Contents]

Designing with Web Standards, by Jeffrey Zeldman and Ethan Marcotte:

[Image: Zeldman Table of Contents]

And Robert Bringhurst’s The Elements of Typographic Style:

[Image: Bringhurst Table of Contents]

These are all different, but via various small design cues they all manage to accomplish the same thing: they make it easy for somebody to browse through the contents of the text and locate the specific section that they need. Microsoft Word, right out of the box, will happily render a table of contents in styles reminiscent of all of these, with minimal effort.

LexisNexis isn’t even trying. I can’t pretend to know why. But with this as the current state of affairs in the presentation of legal information, it’s trivial for The State Decoded—or anybody with a copy of Word—to improve upon it.

Version 0.5 Released

Version 0.5 of The State Decoded is now available on GitHub. This release is full of general enhancements, and some of them are significant. Twenty-four issues were resolved with this release, including some new features, some significant optimizations, some standardization, and further abstraction of functionality to make it easier to implement.

Here are the most interesting changes:

  • All functionality likely to require customization with each implementation now resides in a state-specific file, rather than being mixed in with core functionality.
  • The beginnings of a templating system are in place, allowing images, CSS, and HTML to be packaged together, in the general direction of how WordPress works.
  • A new method has been added to the Law class that simply verifies that a given law exists. Because every law mentioned in a section must be verified to exist, this has made page rendering about 3.5 times faster (with the benchmark law, 2,142 milliseconds reduced to 610 milliseconds).
  • Several files have been renamed, in order to prevent customizations from being overwritten with upgrades. This is an important step towards providing an upgrade path between versions.
  • Two bulk download files are automatically generated each time the parser is run—a JSON version of the custom dictionary, and a JSON version of the entire legal code.
  • Much has been done towards standardization generally, so that the project adheres to best practices in PHP and MySQL. While this is of little benefit to the end user, for anybody actually getting their hands dirty with code, it should make things much simpler. There’s a lot more to be done to comply with PEAR coding standards, but that’s underway.
  • Virginia attorney James Steele created a print stylesheet to format laws nicely when he printed them out. He was kind enough to contribute that to the project, and printouts of laws are now vastly improved.

Most of these changes are, in one way or another, moving the project towards standardization, automation, and normalization, to make it easier to deploy, maintain, and use. It should all be a lot easier to understand for a programmer diving into it for the first time.

The next release is version 0.6, dedicated to API improvements. That will consist of a relatively small number of issues, but they’re big ones: creating a RESTful JSON-based API, and supporting a crudely typed XML input format to simplify the process of parsing new codes. The latter is important, because the present arrangement requires that one know enough PHP to modify the parser to suit their own code’s unique storage and formatting. The idea here is that you can, alternatively, use the tools of your choice to create an XML version of that code, and as long as that XML is of the style expected by the parser, it can be imported without having to edit a line of PHP in The State Decoded. Note that v0.6 was supposed to be the release in which the Solr search engine was integrated deeply into the software. That has now been pushed back (it’ll probably be v0.9) in order to accommodate a vendor’s schedule.

Version 0.4 Released

Today, version 0.4 of The State Decoded was tagged on GitHub and bundled up for download, the result of six weeks of work. This release is dedicated (almost) exclusively to enhancements to the dictionary system. Eighteen issues comprise the changes in this release, sixteen of which pertain to the built-in automatic, custom dictionary system, which finds defined terms within legal codes and stores them in a dictionary, using that data to embed contextual definitions that are relevant to each law.

There are a few big changes:

  • The State Decoded comes with a built-in dictionary of general legal terms. Using several different non-copyrighted, government-created legal dictionaries, a collection of nearly 500 terms has been put together, which will help people to understand common legal terms that are rarely defined within legal codes, such as “mutatis mutandis,” “tort,” “pro tem,” and “cause of action.”
  • Dictionary terms are now identified more aggressively, which means that for many states, the size and scope of the custom dictionary is going to expand substantially. In the case of Virginia there was a nearly 50% increase (a leap from 7,681 to 11,504 definitions), a striking difference that could be observed immediately when browsing the site.
  • The problem of nested/overlapping definitions has been solved. Previously, when one definition was nested within another (e.g., if we have definitions for both “robbery” and “armed robbery”), mousing over “robbery” would yield a pair of pop-up definitions, one obscuring the other. Now only the definition for the longest term is displayed under those circumstances.
  • Internal terminology has been standardized. The dictionary and its components were called different things (glossary, definitions, dictionary, terms, etc.) in different places. Now the collection of words is called a “dictionary,” each defined word is a “term,” and the description of what that term means is a “definition.”
  • The retrieval and display of definitions is substantially faster—they take about half the time that they used to. This is a result of optimizing and simplifying the structure of the database table in which definitions are stored.
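
The longest-term rule for nested definitions can be illustrated with a short Python sketch. This is one common way to get that behavior, not necessarily how The State Decoded does it: the terms are sorted longest first and joined into a single regular expression, so that at any position the longest matching term wins and the shorter term nested inside it is never matched separately. The function name is made up for this example.

```python
import re


def find_terms(text, terms):
    """Return (position, matched text) pairs, preferring the longest term."""
    # Longest first, so "armed robbery" is tried before "robbery".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    # finditer never returns overlapping matches, so a term nested inside a
    # longer match is not reported a second time.
    return [(match.start(), match.group(0)) for match in pattern.finditer(text)]


matches = find_terms(
    "Armed robbery is a felony, and robbery is too.",
    ["robbery", "armed robbery"],
)
print([text for _, text in matches])
```

Here the first match is the whole phrase "Armed robbery," and the only standalone match for "robbery" is the later one.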

A list of all closed issues is available for those who want specifics. And for those who are suckers for details, this is the first release for which a detailed Git commit log is available, with relatively detailed comments for all 68 commits that comprise this release.

This release is two weeks late, almost entirely because of time spent on a pernicious and difficult parsing bug that, it only occurred to me today, shouldn’t have blocked this release because, while an important problem, it has absolutely nothing to do with definitions. (The problem that is being wrestled with is how to handle subsections of laws that span paragraphs. Easy to describe, difficult to solve, at least for those state codes that pretend that a paragraph and a section are one and the same. I’m looking at you, Virginia.) That issue has been moved back to v0.5, and I’ll go right back to wrestling with it on Monday.

Next up, version 0.5 will be another general-enhancements release. Version 0.6 will be the Solr release—the version in which the popular search software becomes integrated deeply into the project. Version 0.7 will be the API release, where the nascent API gets built out to full functionality and documented properly. Version 0.8 will be the user interface release, in which the design will be overhauled, a responsive design will be implemented, serious work will go into the typography, an intercode navigation system will be implemented, contextual help and explanations will be embedded throughout, and the results of some light UI testing will be incorporated. Version 0.9 will be dedicated to optimizations—making everything go faster and be more fault-tolerant, both through improving the code base and supporting the APC and Varnish caching systems. And, finally, version 1.0 will be the first release in which State Decoded becomes a platform that facilitates the sort of analysis and data exchange that makes this project so full of possibility—things like flexible content export, visualizations, user portfolios of interesting laws, and surely lots of other things.

Typeface Authority

With the design process for The State Decoded underway, we’re putting a lot of thought into typography. Helpful to this process has been both Ruth Anne Robbins’ “Painting with print: Incorporating concepts of typographic and layout design into the text of legal writing documents” and Derek H. Kiernan-Johnson’s “Telling Through Type: Typography and Narrative in Legal Briefs.”

Both of those papers are conceptual in nature, so they’re complemented nicely by Errol Morris’ two-part series [1, 2] about the results of a quiz that he ran on the New York Times website, ostensibly measuring readers’ optimism. In fact, he was measuring the impact of different typefaces on readers’ responses. Those who doubt that a typeface could have much of an impact on the credulity of a reader should consider the effect of Comic Sans, which Morris discovered (unsurprisingly) correlated strongly with incredulity on the part of readers. Of the six typefaces that he tested (Baskerville, Comic Sans, Computer Modern, Georgia, Helvetica, and Trebuchet), Baskerville proved the most persuasive. The effect was small, but significant.

This is the sort of consideration that is clearly lacking in the present rendering of laws, both online and in print. (Typographically, LexisNexis’s printed state codes are a train wreck.) It’s also precisely the consideration that will set apart those sites based on The State Decoded, or anybody who cares to employ the project’s stylesheets. There will be more news about this ongoing design work in the weeks ahead.

Version 0.3 Released

Just one week after the release of version 0.2 of The State Decoded comes version 0.3. (You can download it as a 308 KB tarball.) It consists of 23 general enhancements, notably including:

  • The parser that imports legal codes and populates the site has been simplified so that it no longer works out of the box but is a significantly more useful template to follow when creating new state import systems. Previously it was fully functional for importing the Code of Virginia, which was too much detail to serve as a guide.
  • Improved support for and optimization of custom, state-specific functions.
  • Unnecessary chunks of the function library removed, with remaining useful portions of them integrated into other functions.
  • Beginnings of support for APC for variable storage, starting with moving constants into APC.
  • A hook for and sample functionality to turn each law’s history section into a plain-English description of that history, along with links to see the acts of the legislature that made those changes.
  • 404 functionality added for proper error handling of requests for non-existent section numbers and structural units.
  • Added arrow-key based navigation to move to prior and next sections within a single structural container.
  • Provided a sample .htaccess file for supporting a decent URL structure.
  • Moved JavaScript assets out of the general template and into the specific template to eliminate unnecessary code.

As with the prior two releases, this is an alpha release—there’s no installer, documentation, or administrative backend. With this release the gap between the released packages and the version of the software powering Virginia Decoded and Sunshine Statutes (Florida) is smaller than ever, and I’m hopeful that I can port those sites over to run on v0.3 of The State Decoded.

With this release, the project is back on the monthly release schedule that was started in June. A roadmap is emerging for the next few releases. Version 0.4 will be dedicated almost entirely to enhancements to the dictionary system that makes laws self-documenting, and that’s due out on September 1. Version 0.5 will be another general-enhancements release, due out on October 1. Version 0.6 will be the Solr release—the version in which the popular search software becomes integrated deeply into the project, due out on November 1. (Solr functionality, by the way, is made possible by a generous contribution from Open Source Connections, specifically David Dodge, Joseph Featherston, and Kasey McKenna, who recently spent a great deal of time setting up Solr to support The State Decoded.) And version 0.7 will be the API release, where the nascent API gets built out to full functionality and documented properly, and that’s due out on December 1.

Version 0.2 Released

Version 0.2 of The State Decoded is now available for download from the project’s GitHub page. (Or you can grab it directly as a 308 KB tarball.) It consists of 21 enhancements, notably including:

  • Ongoing work to “de-Virginia-ize” the codebase, necessary because the software was developed originally just for the Code of Virginia. This includes full abstraction of the structure storage system, moving away from Virginia-specific nomenclature (“title,” “chapter,” “code,” etc.).
  • Significant optimizations, including the dynamic creation of a SQL view to access structural data and moving to using section IDs instead of section numbers.
  • A parser for law histories.
  • The establishment of a general metadata table to allow the storage and display of arbitrary types of information on a per-law basis.
  • Initial steps towards integrating Solr into the project.
  • Support for storing multiple versions of the same law (e.g., both the 2011 and 2012 revisions) simultaneously.

As with v0.1, this version is an alpha release: there’s no installer, documentation, or administrative backend. It’s only for the brave or curious. Virginia Decoded and Sunshine Statutes (Florida) are not running on this release but, instead, a modified version. Each release will get closer to providing all of the functionality of these live sites with the flexibility and abstraction to provide the same functions for other states, and I remain optimistic that v0.3 will be the release that can be installed on these live sites, so that I can eat my own dog food.

Sunshine Statutes Goes into Public Alpha

The Florida implementation of The State Decoded has launched as a public alpha test. Sunshine Statutes resulted from strong interest in the project from the Florida Society of News Editors and the First Amendment Foundation, especially Rick Hirsch, the Managing Editor of the Miami Herald. Within hours of the John S. and James L. Knight Foundation announcing the $165,000 grant that funds the State Decoded project, Rick was insisting that Florida was the perfect state to start things off, and he was right about that. Open data hacker Michael Tahani did the heavy lifting of creating the parser, which reads the XML of the Florida Statutes and turns it into a format familiar to the State Decoded’s software.

The resulting site is rather beyond a proof of concept, but surely not finished. (Hence the “alpha test” moniker.) Some statutes with particularly complex structures are missing some text, and not all statute histories are being parsed correctly, but we’ll be ticking down the list of fixes and getting everything repaired soon enough. (Every statute has a link back to its listing on the Florida legislature’s website, making it easy for folks to see the official version of the text.) Once all known content-related bugs are fixed, it’ll enter “beta” status, and the dire warnings can be stripped away.

In the few days since we announced Sunshine Statutes, there’s been an outpouring of offers of help from Floridians. Putting together a site like this, and keeping it going, is more like a barn-raising than a monolithic construction project. Any other folks who are moved to get involved are welcome to contact us. The folks at the FSNE and the FAF would surely love the assistance, and there’s certainly a lot of work to be done.

Creating Data Where There is None

Robb Shecter is puzzling through how to deal with the California Code’s curious lack of titles. Most state codes provide a title for each law (known as a “catch line” in most states), such as “Enforcement of child labor law,” “Fees for filing documents or issuing certificates,” or “Money derived from forest reserve.” Not California’s. Robb provides the example of California’s § 459, the law prohibiting burglary. One must read through the law to know what it does. This, of course, makes it very difficult to navigate through the California Code.

The question that Robb asks is what we are to do about this. The problem is abstract for me—I have no immediate prospects of working on the California Code, but Robb has it online now, so the problem is very real for him.

The reason that California has been able to get by with such an odd arrangement is that private legal vendors, like West and LexisNexis, write their own titles for laws. Most attorneys surely use the terminology provided by those companies, some perhaps unaware that those are not official titles. Those titles are copyrighted by those vendors, though, and cannot be used for projects like Robb’s or The State Decoded. This means that we must be able to generate our own titles.

Here is the conceptual solution that I arrived at for California some months ago, which I share here in hopes that it might do others some good. I have not implemented this, so while in theory it makes sense, I cannot say for sure that it’ll work.

Like many states, California maintains an annual index of all legislation that has come before their legislature. (This is the 2011–2012 index, for example.) This allows people to look up all bills pertaining to, for instance, retirement, and see the following listing:

continuing care retirement communities, AB 748, 1698
early distribution penalty waiver, AB 558, 2656
employer-sponsored retirement plans, SB 1234
golden state retirement savings trust, SB 1234
rollover funds, tax-free: medical and long-term care premiums, SJR 21
secure choice retirement savings trust, california, SB 1234
public retirement systems. See name of particular retirement system (e.g., PUBLIC EMPLOYEES’ RETIREMENT SYSTEM).
unemployment compensation benefits, AB 2310

This looks to me like a rich source of titles.

The process is straightforward. First, match up all legislation with the existing law that it proposes to amend. Then, find every entry for all of that legislation in the index of legislation. The description in the index becomes the title of the law. For those laws that have multiple candidate descriptions (either because they’re in the index repeatedly, multiple bills propose to amend them in a given year, or there are many years of attempted amendments), the words that appear most frequently in those descriptions can be used to automatically assemble a title.
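
The last step, assembling a title from the most frequent words, might look something like the following Python sketch. Like the idea itself, it is untested against real data. It assumes the matching work has already been done, so that the input is simply the list of index descriptions collected for one law, and it uses a tiny hand-picked stopword list. The function name, the stopword list, and the three-word limit are all assumptions for illustration.

```python
import re
from collections import Counter

# A deliberately small stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "or", "see"}


def candidate_title(descriptions, max_words=3):
    """Assemble a title from the words most common across the descriptions."""
    counts = Counter()
    first_seen = {}  # order in which each word was first encountered
    for description in descriptions:
        for word in re.findall(r"[a-z]+", description.lower()):
            if word in STOPWORDS:
                continue
            counts[word] += 1
            first_seen.setdefault(word, len(first_seen))
    # Pick the most frequent words, breaking ties by first appearance.
    top = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))[:max_words]
    # Present them in the order they first appeared, which reads more naturally.
    top.sort(key=first_seen.get)
    return " ".join(top).capitalize()


# The three index entries for SB 1234 from the listing above.
print(candidate_title([
    "employer-sponsored retirement plans",
    "golden state retirement savings trust",
    "secure choice retirement savings trust, california",
]))
```

For those three descriptions the result is "Retirement savings trust," which is a plausible catch line. With fewer or more scattered descriptions the output will be rougher.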

This is bound to lead to some goofy titles. And some laws have not had bills introduced that would amend them for decades, and so information about them would not be available in bulk. But in my experience, the laws that most interest people are the ones that legislators attempt to amend, so titles would be provided for those laws that are most liable to be read.

What of the rest of the laws, left untitled by this first method? Statistically improbable phrases (SIPs) are a good backup method. A phrase that occurs in a law but is very rarely found in the rest of California’s laws is liable to be a decent candidate for its title. Again, potentially goofy titles could result, and I have not tested this, but theoretically it could work pretty well. SIPs are displayed online for some books, and I think those illustrate the range of results that one could expect from them. For instance, Nora Ephron’s newly re-popular “I Feel Bad About My Neck” has two SIPs: “serial monogamy” and “cabbage strudel.” The former is a not-unreasonable summation of the book. The latter is obviously pretty unreasonable.
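
A very simple version of the SIP idea can be sketched in Python as follows. This is only an illustration of the concept, not a tested title generator: it treats every two-word sequence as a candidate phrase and scores each by how often it appears in the law relative to how often it appears in the rest of the corpus. The function names, the two-word phrase length, and the add-one smoothing are assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def bigrams(text):
    """Return every pair of adjacent words in the text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def improbable_phrases(law_text, other_laws, top_n=3):
    """Rank the law's phrases by how unusual they are in the other laws."""
    corpus_counts = Counter()
    for text in other_laws:
        corpus_counts.update(bigrams(text))
    law_counts = Counter(bigrams(law_text))
    # Frequent in this law and rare elsewhere gives a high score. Adding one
    # to the corpus count avoids dividing by zero for phrases unique to the law.
    scores = {
        phrase: count / (1 + corpus_counts[phrase])
        for phrase, count in law_counts.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]
```

A real implementation would want longer phrases, stopword handling, and a better statistical measure, but even this crude ratio shows how a phrase repeated in one law and absent elsewhere rises to the top.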

Some experimentation is going to be necessary to arrive at a decent system for generating titles for California’s laws. Ideally, whoever creates them would put them up on Google Docs for some collaborative editing, and release the resulting text under an open license, so that, at last, we will all have titles for all of the laws in the California Code.

This problem is, not incidentally, emblematic of a routine problem with state codes. It seems like they’re all missing something, some core piece of data that would make them far more useful. Each of these will require its own patch, its own work-around, to render those laws widely accessible to the general public. We’re all taking it one state at a time.