Bulk Downloads of Five State Codes

A very real obstacle to putting up state code websites is getting a copy of that state’s laws. For example, there’s a New Jersey group that wants to set up The State Decoded for their state. But, like most states, New Jersey doesn’t provide bulk downloads—it’s not possible to simply get a raw copy of the files. The backup option is what’s known as “screen-scraping”—having software load every single law on the official state law website, one by one, and copy the laws from there. This is a terrible solution, but it’s all that’s available in most U.S. states. The New Jersey statutes website is distinctly un-scrapeable. I don’t know that it’s impossible, but it would be an unpleasant task.

Today, Carl Malamud of Public.Resource.org tweeted the news that he’s got five new state codes online as bulk data:

In addition to bulk machine-readable files, they’re also available in a variety of file formats on Archive.org. They join the Maryland and Washington D.C. codes that he’s already made available as bulk downloads. (Maryland Decoded is up now, and the Open Law DC project has a great site for their code, with a State Decoded implementation under development that’ll be the subject of a hackathon at Saturday’s National Day of Civic Hacking.)

Now the onus is on folks in Arkansas, Colorado, Georgia, Idaho, and Mississippi to step up to the plate and put this data to work. Who’s going to implement The State Decoded in these states?

Maryland Decoded

A new State Decoded site launched today: Maryland Decoded. A project of the OpenGov Foundation, the site is doing some innovative stuff on the still-under-development platform. For instance, they’re crowd-sourcing “catch lines”—the titles that most states apply to their laws. Maryland does not have catch lines, so instead of having a law titled “Murder in the First Degree,” it simply has GCR § 2-201. The solution? Anybody can suggest a catch line, and the site will gradually build up its own set of catch lines.

Every state presents its own set of challenges and opportunities. The OpenGov Foundation is capitalizing on the opportunities to overcome the challenges and helping to improve The State Decoded for those who will follow in their path.

Washington D.C. and the Work Ahead

On Greater Greater Washington, Tom MacWright recently wrote a blog entry highlighting the problems of access to the Washington D.C. Code. There is, first, a legal obstacle: Washington D.C. claims copyright over their laws, which is to say that it is illegal to reproduce them without permission of the city. Then, second, what is perhaps a more significant obstacle: they outsource the maintenance of their legal code.

The city of Washington D.C. long ago started paying WestLaw—and now LexisNexis—to turn the D.C. Council’s bills into laws. As a result, they now have neither the knowledge nor the infrastructure to maintain their own laws. The only way that D.C. can find out what their laws say is to pay LexisNexis to tell them. This is consequently true for the public, as well. If a resident of D.C.—like MacWright—wants to know what the law says, there’s no sense in asking (or FOIAing) the city, because the city has outsourced the process so completely that they know nothing.

MacWright has a few options for finding out what the law says. The first is to travel to a library on each occasion that he wants to know something (assuming he can find one that has a current copy of the DC Code), and read it there. The second is to buy a copy, for $867.00. And the third is to use the DC Code website, maintained by WestLaw, which is every bit as awful as any other state code website.

So how is the D.C. Code to get the State Decoded treatment? How can a digital copy be imported into the software, for the general public benefit? It can’t be FOIAed from D.C. Council, since they don’t have it. It’s clearly impractical to scan in 25 volumes of hardbound books. Normally that would leave scraping the website, but WestLaw’s website has a EULA that prohibits copying material off of the website. WestLaw has been hired to do for the Washington D.C. government what it cannot or will not do for itself—post laws to the web—and the copyright restrictions that WestLaw chooses to impose are a legal barrier that prevents that material from being reused.

Normally, this would be the end of the road—Washington D.C. would have cut off their code from being improved (or even reused) in any way by third parties. In this case, though, the story has a different ending. Public.Resource.Org has taken the surprising and admirable tack of purchasing all of the volumes of the D.C. Code, slicing them up, scanning them in, OCRing them, and distributing them for free.

DC Code Assembly Line

Deglued bundles at the bottom right await their turn on the two high-speed scanners. When they are completed, the bundles are placed at the upper right to await a QA pass.

All of the volumes can be downloaded as PDFs or, via the Internet Archive, in nearly any other file format one can think of. For those who would like a print copy for a more reasonable price, print-on-demand service Lulu sells each volume for just $12.

Lest the motivations of Public.Resource.Org be unclear, a “Proclamation of Digitization” accompanies the release, citing a pair of Supreme Court rulings (“the authentic exposition and interpretation of the law, which, binding every citizen, is free for publication to all, whether it is a declaration of unwritten law, or an interpretation of a constitution or a statute”) and declaring that “any assertion of copyright by the District of Columbia or other parties on the District of Columbia Code is declared to be NULL AND VOID as a matter of law and public policy as it is the right of every person to read, know, and speak the laws that bind them.” The organization mailed out elaborate packages, containing portions of the D.C. Code, to announce its availability. (You can see my own unboxing photos.)

All that remains is for somebody to marry this source of data with The State Decoded as, indeed, somebody is already talking about doing. The D.C. Council or WestLaw may not be happy about this—it’s quite possible that one or both entities will take legal action to halt this—but I’m confident that it will be found that the law supports making those very laws public.

Two Mini-Projects: Subsection Identifier and Definition Scraper

The State Decoded project has spun off a couple of sub-projects, components of the larger project that can be useful for other purposes, and that deserve to stand alone. (Both are found on our GitHub repository.)

The first is Subsection Identifier, which turns theoretically structured text into actually structured text. It is common for documents in outline form (contracts, laws, and other documents that need to be able to cross-reference specific passages) to be provided in a format in which the structural labels flow into the text. For example:

A. The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
B. An NRP that has been appointed by the agency may be dissolved by the agency when:
1. There is no longer controversy associated with the development of the regulation;
2. The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
3. The agency determines that resolution of a controversy is unlikely.

One of the helpful features of The State Decoded is that it breaks up this text, understanding not just that every labelled line can stand alone, but also that the final line, despite being labelled “3,” is actually “B3,” since “3” is a subset of “B.” That functionality has been forked from The State Decoded, and now stands alone as Subsection Identifier, which accepts passages of text and turns them into well-structured text, like so:

(
    [0] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => A
                )

            [prefix] => A.
            [text] => The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
        )

    [1] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                )

            [prefix] => B.
            [text] => An NRP that has been appointed by the agency may be dissolved by the agency when:
        )

    [2] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 1
                )

            [prefix] => B.1.
            [text] => There is no longer controversy associated with the development of the regulation;
        )

    [3] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 2
                )

            [prefix] => B.2.
            [text] => The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
        )

    [4] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 3
                )

            [prefix] => B.3.
            [text] => The agency determines that resolution of a controversy is unlikely.
        )

)
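
Driving the class takes just a few lines of PHP. What follows is a hedged sketch: the class, property, and method names are assumptions for illustration, and the project’s README on GitHub documents the actual interface.

<?php
require_once 'class.SubsectionIdentifier.inc.php';

$parser = new SubsectionIdentifier();

// The raw outline text, one labelled line per subsection.
$parser->text = 'A. The agency may appoint a negotiated rulemaking panel (NRP) '
    . "if a regulatory action is expected to be controversial.\n"
    . 'B. An NRP that has been appointed by the agency may be dissolved by the '
    . "agency when:\n"
    . '1. There is no longer controversy associated with the development of '
    . "the regulation;\n"
    . '2. The agency determines that the regulatory action is either exempt or '
    . "excluded from the requirements of the Administrative Process Act; or\n"
    . '3. The agency determines that resolution of a controversy is unlikely.';

$parser->parse();

// An array of objects, one per subsection, each aware of its full hierarchy.
print_r($parser->structured);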

The second mini-project is Definition Scraper, which extracts defined terms from passages of text. Many legal documents begin by defining words that are then used throughout the document, and knowing those definitions can be crucial to understanding that document. So it can be helpful to be able to extract a list of terms and their definitions. Definition Scraper need only be handed a passage of text; it will determine whether that passage contains defined terms and, if it does, return a dictionary of those terms and their definitions.

Running this passage through Definition Scraper:

“The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.
A “covered work” means either the unmodified Program or a work based on the Program.

Yields the following two-entry dictionary:

(
    [program] => “The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.
    [covered work] => “covered work” means either the unmodified Program or a work based on the Program.
)
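
Using it from PHP is similarly simple. Again, a hedged sketch: the class and member names below are invented for illustration, and the GitHub repository documents the real interface.

<?php
require_once 'class.DefinitionScraper.inc.php';

$scraper = new DefinitionScraper();
$scraper->text = '"The Program" refers to any copyrightable work licensed '
    . 'under this License. Each licensee is addressed as "you". "Licensees" '
    . 'and "recipients" may be individuals or organizations. A "covered work" '
    . 'means either the unmodified Program or a work based on the Program.';

// scrape() returns true when the passage contains defined terms.
if ($scraper->scrape() === true) {
    // A term => definition dictionary, like the one shown above.
    print_r($scraper->definitions);
}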

Definition Scraper is also a core function of The State Decoded, but warrants becoming its own project because it is so clearly useful for applications outside of the framework of The State Decoded.

The decision to spin off these projects was prompted by a report by the John S. and James L. Knight Foundation, the organization that funds The State Decoded, which evaluated the success of their News Challenge winners. They found several common attributes among the more successful funded projects, including this:

Projects that achieved strong use and adoption of their code often built and released their software in individual components, knowing that certain elements had value and a wide range of uses beyond the main focus of their project.

As development on The State Decoded continues, we may well spin off more mini-projects, if it becomes clear that more components of the overall project could be useful stand-alone tools.

Version 0.6 Released

Version 0.6 of The State Decoded is now available on GitHub. This release is a really exciting one—it establishes a public API for State Decoded sites and creates a standard XML format for importing laws! This is an important release of The State Decoded, one that stands to increase significantly the accessibility of the project to developers, both within the software and without. A total of 23 issues were resolved, nearly all of which are towards those two goals.

Public API

The State Decoded now has a fully fleshed-out RESTful, JSON-based API. It has three methods: Law, Structure, and Dictionary. Law provides all available information about a given law. Structure provides all available information about a given structural unit (the various organizational units of legal codes—“titles,” “chapters,” “parts,” etc.). And Dictionary provides the definition (or definitions) for a term within a legal code. The data for these comes directly from the internal API that drives the site—what’s available publicly is what drives the site privately. In fact, I’m toying with the idea of having the site consume its own API, using internal APIs solely to serve data to the external API, and having every other part of the site get its data from that external API.

For a quick start trying this out on the Virginia site, you can use the trial key, 4l6dd9c124ddamq3 (though don’t build any applications using that key, or they will break when it expires), and see the API documentation to put it to work. For a really quick start, you can just browse the Code of Virginia via the API, check out a list of definitions, or read the text of a law. If you decide that you like what you see, register for a key and put this API to work.
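
To give a sense of how little code is involved, here is a hedged sketch of calling the Law method from PHP. The hostname is the Virginia site’s, but the endpoint path and the response field names are assumptions for illustration; the API documentation is authoritative.

<?php
$api_key = '4l6dd9c124ddamq3'; // the trial key above; it will eventually expire

// Hypothetical endpoint: request everything known about Va. Code § 18.2-32.
$url = 'http://vacode.org/api/law/18.2-32/?key=' . $api_key;

$law = json_decode(file_get_contents($url));
if (isset($law->catch_line)) {
    // Field names are guesses, mirroring the import XML format described below.
    echo $law->section_number . ': ' . $law->catch_line . "\n";
}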

Personally, this is the release that I’ve been waiting for. There’s an extent to which the purpose of The State Decoded project is really just to provide an API for legal codes; the fact that there’s a pretty website atop that API is just icing on the cake.

XML Format

A significant obstacle to implementing The State Decoded has been the need to customize the parser for each installation. Every legal code is different—there are no standards—with some all in one big SGML file, others stored in thousands of XML files, and others still needing to be scraped off of webpages. That necessitated modifying the State Decoded parser to interface the data from the source files with the internal API. That’s really not an obstacle to people and organizations who are serious about implementing The State Decoded. But plenty of people might be serious if they could just try it out first. There’s a huge gradient between “huh, looks interesting” and “I must get my city/state/country laws online!” It’s foolish to assume that people won’t want to just try it out first. After all, that’s how I prefer to get started with software and projects.

The solution was to establish an XML standard for importing legal codes into The State Decoded, to provide a low-barrier-to-entry path in addition to the more complex path. To be clear, this is not an attempt to create an XML standard for legal codes. This is a loosely typed standard, used solely as an import format for The State Decoded. Many legal codes are already stored as XML—that’s the most common file format—so getting those codes into The State Decoded now only requires writing a bit of XSLT. This is a much lower barrier to entry.

The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<law>
	<structure>
		<unit label="" identifier="" order_by="" level=""></unit>
	</structure>
	<section_number></section_number>
	<catch_line></catch_line>
	<order_by></order_by>
	<text>
		<section prefix=""></section>
	</text>
	<history></history>
</law>

Several of those fields are optional, too. There will certainly be legal codes and organizations for which this won’t do the trick—they’ll need to modify the parser to handle some unusual types of data, fields, third-party data sources, etc. But for most people, this will be a big improvement.
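
For those not working from existing XML (where a bit of XSLT is the natural tool), a few lines of PHP’s DOM extension will assemble this format just as well. A minimal sketch follows, with sample values loosely modeled on a Virginia law; the values are illustrative, not authoritative.

<?php
$doc = new DOMDocument('1.0', 'utf-8');
$doc->formatOutput = true;

$law = $doc->appendChild($doc->createElement('law'));

// The structural unit(s) containing this law, outermost first.
$structure = $law->appendChild($doc->createElement('structure'));
$unit = $structure->appendChild($doc->createElement('unit', 'Crimes and Offenses Generally'));
$unit->setAttribute('label', 'title');
$unit->setAttribute('identifier', '18.2');
$unit->setAttribute('order_by', '18.2');
$unit->setAttribute('level', '1');

$law->appendChild($doc->createElement('section_number', '18.2-32'));
$law->appendChild($doc->createElement('catch_line', 'First and second degree murder defined; punishment.'));
$law->appendChild($doc->createElement('order_by', '18.2-32'));

// The text of the law, one <section> element per subsection.
$text = $law->appendChild($doc->createElement('text'));
$section = $text->appendChild($doc->createElement('section', 'Murder, other than capital murder, is murder of the first degree.'));
$section->setAttribute('prefix', '');

// An invented sample history line.
$law->appendChild($doc->createElement('history', '1975, c. 14; 2004, c. 459.'));

$doc->save('18.2-32.xml');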

The Code of Virginia is available as State Decoded XML, so if you’ve been considering playing with The State Decoded, it just got a whole lot easier to deploy a test site. Just download that XML and follow the installation instructions.

Thanks to Tom MacWright, Andrew Nacin, Daniel Trebbien, and Chad Robinson for their pull requests, wiki edits, and trouble tickets.

No Love from LexisNexis

This is what the table of contents looks like in LexisNexis’s printed edition of the Code of Virginia:

Lexis Table of Contents

I was a bit stunned the first time I saw this. It’s just word soup. There’s simply no effort to make it legible. No thought has gone into this. There is, in short, no love.

I feel like, as a culture, we basically understand how to make tables of contents. Right? Grabbing a few books off my desk, more or less at random, I thought I’d compare Lexis’s table of contents to those of others. Here’s The Chicago Manual of Style:

Chicago Table of Contents

Designing with Web Standards, by Jeffrey Zeldman and Ethan Marcotte:

Zeldman Table of Contents

And Robert Bringhurst’s The Elements of Typographic Style:

Bringhurst Table of Contents

These are all different, but via various small design cues they all manage to accomplish the same thing: they make it easy for somebody to browse through the contents of the text and locate the specific section that they need. Microsoft Word, right out of the box, will happily render a table of contents in styles reminiscent of all of these, with minimal effort.

LexisNexis isn’t even trying. I can’t pretend to know why. But with this as the current state of affairs in the presentation of legal information, it’s trivial for The State Decoded—or anybody with a copy of Word—to improve upon it.

Version 0.5 Released

Version 0.5 of The State Decoded is now available on GitHub. This release is full of general enhancements, and some of them are significant. Twenty-four issues were resolved with this release, including some new features, some significant optimizations, some standardization, and further abstraction of functionality to make it easier to implement.

Here are the most interesting changes:

  • All functionality likely to require customization with each implementation now resides in a state-specific file, rather than being mixed in with core functionality.
  • The beginnings of a templating system are in place, allowing images, CSS, and HTML to be packaged together, in the general direction of how WordPress works.
  • A new method has been added to the Law class that simply verifies that a given law exists—a check that must be run for every law cited within a section’s text. This has led to a 3.5× improvement in page rendering times (with the benchmark law, 2,142 milliseconds reduced to 610 milliseconds). A hedged sketch of the idea follows this list.
  • Several files have been renamed, in order to prevent customizations from being overwritten with upgrades. This is an important step towards providing an upgrade path between versions.
  • Two bulk download files are automatically generated each time the parser is run—a JSON version of the custom dictionary, and a JSON version of the entire legal code.
  • Much has been done towards standardization generally, so that the project adheres to best practices in PHP and MySQL. While this is of little benefit to the end user, for anybody actually getting their hands dirty with code, it should make things much simpler. There’s a lot more to be done to comply with PEAR coding standards, but that’s underway.
  • Virginia attorney James Steele created a print stylesheet to format laws nicely when he printed them out. He was kind enough to contribute that to the project, and printouts of laws are now vastly improved.
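
The existence check behind that speed-up is conceptually simple. Here is a hedged sketch of the idea only; the method, table, and column names are assumptions, not the project’s actual code.

<?php
class Law
{
    protected $db; // a PDO connection, injected at construction

    public function __construct(PDO $db)
    {
        $this->db = $db;
    }

    /*
     * Cheaply confirm that a law exists: fetch a single primary key rather
     * than hydrating the entire law, since this runs once per cited law.
     */
    public function exists($section_number)
    {
        $statement = $this->db->prepare(
            'SELECT id FROM laws WHERE section = :section LIMIT 1'
        );
        $statement->execute(array(':section' => $section_number));
        return ($statement->fetchColumn() !== false);
    }
}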

Most of these changes are, in one way or another, moving the project towards standardization, automation, and normalization, to make it easier to deploy, maintain, and use. It should all be a lot easier to understand for a programmer diving into it for the first time.

The next release is version 0.6, dedicated to API improvements. That will comprise a relatively small number of issues, but they’re big ones: creating a RESTful JSON-based API, and supporting a crudely typed XML input format to simplify the process of parsing new codes. The latter is important, because the present arrangement requires that one know enough PHP to modify the parser to suit their own code’s unique storage and formatting. The idea here is that you can, alternately, use the tools of your choice to create an XML version of that code, and as long as that XML is of the style expected by the parser, it can be imported without having to edit a line of PHP in The State Decoded. Note that v0.6 was supposed to be the release in which the Solr search engine was integrated deeply into the software. That has now been pushed back—it’ll probably be v0.9—in order to accommodate a vendor’s schedule.

Version 0.4 Released

Today, version 0.4 of The State Decoded was tagged on GitHub and bundled up for download, the result of six weeks of work. This release is dedicated (almost) exclusively to enhancements to the dictionary system. Eighteen issues comprise the changes in this release, sixteen of which pertain to the built-in automatic, custom dictionary system, which finds defined terms within legal codes and stores them in a dictionary, using that data to embed contextual definitions that are relevant to each law.

There are a few big changes:

  • The State Decoded comes with a built-in dictionary of general legal terms. Using several different non-copyrighted, government-created legal dictionaries, a collection of nearly 500 terms has been put together, which will help people to understand common legal terms that are rarely defined within legal codes, such as “mutatis mutandis,” “tort,” “pro tem,” and “cause of action.”
  • Dictionary terms are now identified more aggressively, which means that for many states, the size and scope of the custom dictionary is going to expand substantially. In the case of Virginia there was a 49% increase (a leap from 7,681 to 11,504 definitions), a striking difference that could be observed immediately when browsing the site.
  • The problem of nested/overlapping definitions has been solved. When one definition was nested within another (e.g., if we have definitions for both “robbery” and “armed robbery”), then mousing over “robbery” would yield a pair of pop-up definitions, one obscuring the other. Now only the longest matching term’s definition is displayed under those circumstances; a minimal sketch of that rule appears after this list.
  • Internal terminology has been standardized. Previously, the dictionary and its components were called different things (glossary, definitions, dictionary, terms, etc.) in different places. Now the collection of words is called a “dictionary,” each defined word is a “term,” and the description of what that term means is a “definition.”
  • The retrieval and display of definitions is substantially faster—they take about half the time that they used to. This is a result of optimizing and simplifying the structure of the database table in which definitions are stored.
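
The longest-match rule mentioned above is easy to express in code. A minimal sketch of the idea (not the project’s actual implementation): sorting terms longest-first lets “armed robbery” claim its text before “robbery” can create a nested pop-up.

<?php
$terms = array('robbery', 'armed robbery');

// Sort longest-first, so longer terms claim their text before shorter ones.
usort($terms, function ($a, $b) {
    return strlen($b) - strlen($a);
});

$text = 'Armed robbery is punishable by imprisonment.';
foreach ($terms as $term) {
    // Wrap the term in <dfn>, skipping text already inside a <dfn> element.
    $text = preg_replace(
        '/\b(' . preg_quote($term, '/') . ')\b(?![^<]*<\/dfn>)/i',
        '<dfn>$1</dfn>',
        $text
    );
}

echo $text; // <dfn>Armed robbery</dfn> is punishable by imprisonment.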

A list of all closed issues is available for those who want specifics. And for those who are suckers for details, this is the first release for which a detailed Git commit log is available, with relatively detailed comments for all 68 commits that comprise this release.

This release is two weeks late, almost entirely because of time spent on a pernicious and difficult parsing bug that, it only occurred to me today, shouldn’t have blocked this release because, while an important problem, it has absolutely nothing to do with definitions. (The problem that is being wrestled with is how to handle subsections of laws that span paragraphs. Easy to describe, difficult to solve, at least for those state codes that pretend that a paragraph and a section are one and the same. I’m looking at you, Virginia.) That issue has been moved back to v0.5, and I’ll go right back to wrestling with it on Monday.

Next up, version 0.5 will be another general-enhancements release. From there, the roadmap looks like this:

  • Version 0.6 will be the Solr release—the version in which the popular search software becomes integrated deeply into the project.
  • Version 0.7 will be the API release, where the nascent API gets built out to full functionality and documented properly.
  • Version 0.8 will be the user interface release, in which the design will be overhauled, a responsive design will be implemented, serious work will go into the typography, an intercode navigation system will be implemented, contextual help and explanations will be embedded throughout, and the results of some light UI testing will be incorporated.
  • Version 0.9 will be dedicated to optimizations—making everything go faster and be more fault-tolerant, both through improving the code base and supporting the APC and Varnish caching systems.
  • And, finally, version 1.0 will be the first release in which State Decoded becomes a platform that facilitates the sort of analysis and data exchange that makes this project so full of possibility—things like flexible content export, visualizations, user portfolios of interesting laws, and surely lots of other things.

Typeface Authority

With the design process for The State Decoded underway, we’re putting a lot of thought into typography. Helpful to this process has been both Ruth Anne Robbins’ “Painting with print: Incorporating concepts of typographic and layout design into the text of legal writing documents” and Derek H. Kiernan-Johnson’s “Telling Through Type: Typography and Narrative in Legal Briefs.”

Both of those papers are conceptual in nature, so they’re complemented nicely by Errol Morris’ two-part series [1, 2] about the results of a quiz that he ran on the New York Times website, ostensibly measuring readers’ optimism. In fact, he was measuring the impact of different typefaces on readers’ responses. Those who doubt that a typeface could have much of an impact on the credulity of a reader should consider the effect of Comic Sans, which Morris discovered (unsurprisingly) correlated strongly with incredulity on the part of readers. Of the six typefaces that he tested (Baskerville, Comic Sans, Computer Modern, Georgia, Helvetica, and Trebuchet), Baskerville proved the most persuasive. The effect was small, but significant.

This is the sort of consideration that is clearly lacking in the present rendering of laws, both online and in print. (Typographically, LexisNexis’s printed state codes are a train wreck.) It’s also precisely the consideration that will set apart those sites based on The State Decoded, or anybody who cares to employ the project’s stylesheets. There will be more news about this ongoing design work in the weeks ahead.

Version 0.3 Released

Just one week after the release of version 0.2 of The State Decoded comes version 0.3. (You can download it as a 308 KB tarball.) It consists of 23 general enhancements, notably including:

  • The parser that imports legal codes and populates the site has been simplified: it is no longer a working parser, but it now serves as a far more useful template for creating new state import systems. Previously it was fully functional for importing the Code of Virginia, which was too much detail to serve as a guide.
  • Improved support for and optimization of custom, state-specific functions.
  • Unnecessary chunks of the function library removed, with remaining useful portions of them integrated into other functions.
  • Beginnings of support for APC for variable storage, starting with moving constants into APC (a hedged sketch of the idea follows this list).
  • A hook for and sample functionality to turn each law’s history section into a plain-English description of that history, along with links to see the acts of the legislature that made those changes.
  • 404 functionality added for proper error handling of requests for non-existent section numbers and structural units.
  • Added arrow-key based navigation to move to prior and next sections within a single structural container.
  • Provided a sample .htaccess file for supporting a decent URL structure.
  • Moved JavaScript assets out of the general template and into the specific template to eliminate unnecessary code.
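
On the APC item, a hedged sketch of the idea: compute the site’s constants once, stash them in APC’s shared memory, and let subsequent requests fetch them instead of rebuilding them. The constant names here are invented for illustration.

<?php
if (extension_loaded('apc') && ($constants = apc_fetch('site_constants')) !== false) {
    // A later request: constants come straight out of shared memory.
    foreach ($constants as $name => $value) {
        define($name, $value);
    }
} else {
    // First request (or no APC): build the constants, then cache them.
    $constants = array(
        'SITE_TITLE' => 'Virginia Decoded',
        'EDITION_YEAR' => 2012,
    );
    foreach ($constants as $name => $value) {
        define($name, $value);
    }
    if (extension_loaded('apc')) {
        apc_store('site_constants', $constants);
    }
}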

As with the prior two releases, this is an alpha release—there’s no installer, documentation, or administrative backend. With this release the gap between the released packages and the version of the software powering Virginia Decoded and Sunshine Statutes (Florida) is smaller than ever, and I’m hopeful that I can port those sites over to run on v0.3 of The State Decoded.

With this release, the project is back on the monthly release schedule that was started in June. A roadmap is emerging for the next few releases. Version 0.4 will be dedicated almost entirely to enhancements to the dictionary system that makes laws self-documenting, and that’s due out on September 1. Version 0.5 will be another general-enhancements release, due out on October 1. Version 0.6 will be the Solr release—the version in which the popular search software becomes integrated deeply into the project, due out on November 1. (Solr functionality, by the way, is made possible by a generous contribution from Open Source Connections, specifically David Dodge, Joseph Featherston, and Kasey McKenna, who recently spent a great deal of time setting up Solr to support The State Decoded.) And version 0.7 will be the API release, where the nascent API gets built out to full functionality and documented properly, and that’s due out on December 1.