Version 0.8 Released

Version 0.8 of The State Decoded is now available on GitHub. This is our biggest release to date, in part because it’s a combination of what was meant to be both the v0.8 and the v0.9 releases. That means that it took twice as long to produce this release as was planned, but it was worth it. It’s comprised of 709 Git commits, more changes than were committed for all prior eight versions combined. This is the final major release prior to v1.0. The two biggest improvements are a total overhaul of the user interface with a responsive design and the integration of the Apache Solr search engine. Here is a rundown of the major changes:

New User Interface

John Athayde and Lynn Wallenstein of Meticulous were responsible for dropping the old design and creating a new one from scratch. They’ve rendered the site almost unrecognizably different, with a highly modular, flexible design that looks good on screens of all sizes, is easy to customize, and will continue to grow with the State Decoded project. The layout is coded in SASS, which is handy for a lot of designers, and employs a pluggable template system that makes it easy to drop in custom designs. It even has a nice print stylesheet for folks who still like to have a hard copy in their hand. There’s no overstating what a huge improvement this is.

Search Engine

It’s really insulting to call this a mere search engine. The team at Open Source Connections (John Berryman and Doug Turnbull, in particular) designed an implementation of Apache’s Solr search platform that’s optimized for legal data. Of course, we used this to add a site search engine with spellcheck and live search suggestions, but having legal data indexed by Solr (and the underlying Lucene system) facilitates all sorts of exciting new features in the realms of natural language processing, information retrieval, content recommendations, and machine learning.

Code Base Overhaul

The complexity of The State Decoded and the needs of its users outgrew the code base and data structure that had functioned from v0.1 through v0.7. Long-time State Decoded contributor Bill Hunt took on this project, completing it in just a few weeks, implementing a routing table, moving a lot of functionality into controllers, and routing all queries to a permalinks table. The old approach looks pretty clunky compared to what Bill built.

Setup Simplified

We’ve tested out The State Decoded on the major Linux distributions and hosting platforms, identified all of the changes that were needed to allow it to work smoothly in those environments, and modified the software accordingly. There’s now an automated environmental test suite to make sure that The State Decoded will work properly, and there are multiple paths to accomplishing the same task, so that configuring the server requires as little work as possible. The installation process has fewer steps than ever, as everything that can possibly be automated is automated.

Plus

There’s an entire workflow to handle new editions of codes, bulk downloads are created automatically, a sample XSLT is included for State Decoded XML, there’s a default home page that doesn’t require any customization, it now supports non-unique section identifiers (!), there’s an API method for search, there’s a proper administrative section now, we have an assets repository for the design’s Photoshop files, a sitemap.xml is automatically generated, every law now has Dublin Core embedded, and dozens of other things.

The work done by Bill Hunt, Meticulous, and Open Source Connections wouldn’t be possible without the generous support of The John S. and James L. Knight Foundation, whose funding made it possible to hire them. And Bill’s ongoing work on The State Decoded is courtesy of his employer, The OpenGov Foundation, who now employs him to contribute to the project and to implement The State Decoded in cities and states around the U.S.

Thanks, too, to Chris Birk, Karl Nicholas, Nick Skelsey, Josh Tauberer, and Rostislav Tsiomenko for their suggestions, testing, and contributed code.

Documentation as a Continuous Process

The plan was simple: we’d create The State Decoded, and then we’d write the documentation for it. The documentation would be a big HTML file or something, and we’d include a copy with the download, and also host a copy on statedecoded.com. That’s a pretty standard model for open source software, and it seemed best to emulate what others did.

That was a terrible plan.

Documentation, as it turns out, is a lot like software. Documentation is a process, not a product. This realization wasn’t born of a single ah-ha moment, but instead out of the needs made obvious during the process of creating The State Decoded.

At first, project documentation was a few bare-bones notes on a GitHub wiki, so that folks would know how to install the software. But then we established a State Decoded XML format, for storing laws, which required a lot of documentation for internal use—that went up on the wiki, too. When we added an API to the software, that definitely needed documentation, and so that was another page on the wiki.

Things were getting unwieldy. As more people started contributing to the project, not only did more people need to understand how the software worked, but it also became hard to keep up with improvements to the code. It’s easy to understand how software works when you wrote all of it, but it’s rather harder when big portions of it were authored by others.

Perhaps most important, it needed to be easier for early adopters to try out The State Decoded before its v1.0 release, and that meant providing documentation that was sufficient for that task. Releasing software throughout the development process, in the open source style, means that documentation needs to be released throughout, too.

In short, open source software requires open source documentation.

There are some fine services available for hosting software documentation, but I didn’t have to cast around long to find a clear path forward for The State Decoded: GitHub Pages. Several people recommended it, so it seemed worth trying. GitHub Pages allows HTML and Markdown stored on GitHub to be published to a public website, using Jekyll, an open source project. As with any GitHub repository, others can make pull requests to propose modifications to those web pages, or be authorized to make any modifications that they like. When those changes are accepted or saved, they’re reflected on the public website. Thanks to third-party services like Prose.io, somebody need not even know how to use Git or GitHub, or even HTML or Markdown, in order to edit the documentation. This is a marked improvement over wiki-hosted documentation. GitHub Pages is GitHub’s

GitHub’s setup process walked me through creating a State Decoded documentation repository, to which I moved the files from the wiki. Minutes later, the project had a documentation website.

In the intervening few months, the documentation has been modified every time that a notable change was made, hundreds of times in all. That said, it’s still basically a dressed-up version of the wiki-based mess of a year ago. It needs to be restructured, ordered, and edited. But now it’s on a platform that facilitates those improvements, by making it part of my desktop development workflow.

As long as The State Decoded remains an actively maintained project—for years to come, I hope—it needs to have documentation that reflects that. That requires documentation that changes with the software, and that is part of the same ecosystem as the software itself. Hosting software documentation on GitHub and publishing it via Jekyll is the right path for The State Decoded, and I think it’s a route that other open source projects would be wise to follow, too.

New Site: San Francisco Decoded

Today the OpenGov Foundation launched San Francisco Decoded, their State Decoded-powered website that puts the laws of San Francisco online. The site is possible because the San Francisco Mayor’s Office of Civic Innovation provided the raw text of their laws, which the growing team at OpenGov used as the raw material to power the website. Bravo to SF CIO Jay Nath, his team, and the folks at OpenGov for their great work in making this happen.

Don’t Let the Perfect Be the Enemy of the Good

The basic concept behind The State Decoded is both simple and obvious: create a platform to display laws in a nice, understandable way, using the data already present in those laws. So why hadn’t anybody done it before?

Because it’s too hard to do perfectly.

The State Decoded is not perfect, by design. Its definition scraper may never identify 100% of defined words within legal codes, because legislators are inconsistent about how they write laws. Cross-references will not always be identified and linked, because legislators are inconsistent about how they write those, too. Some laws’ hierarchical structures may not be indented properly, because they’re labelled inconsistently. The State Decoded’s interface with court ruling APIs may never return only the court rulings that affect a specific law, as opposed to rulings that merely mention a law, because courts don’t provide metadata.
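To make the cross-reference problem concrete, here’s a rough sketch in Python. It is not The State Decoded’s actual parser (which is written in PHP); the pattern, the sample sentences, and the link format are all assumptions invented for illustration. It links a citation written one way and silently misses the same citation written another way.

```python
import re

# Hypothetical pattern: a section symbol followed by an identifier like 18.2-32.
# Real legal codes use many more citation styles than this one.
SECTION_REF = re.compile(r"§\s*(\d+(?:\.\d+)?-\d+(?:\.\d+)?)")


def link_cross_references(text: str) -> str:
    """Wrap each recognized section citation in a link (URL scheme is assumed)."""
    return SECTION_REF.sub(
        lambda m: f'<a href="/{m.group(1)}/">{m.group(0)}</a>', text
    )


consistent = "as defined in § 18.2-32"
inconsistent = "as defined in section 18.2-32 of this title"

print(link_cross_references(consistent))    # the citation becomes a link
print(link_cross_references(inconsistent))  # unchanged: the pattern never matches
```

Every new citation style means another pattern, and there’s always one more style out there. That’s the sense in which 99% is achievable and 100% is not.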

I’m not sure that it’s technologically possible, at present, to solve these and a dozen other problems inherent in The State Decoded. Some very bright minds have looked at creating a State Decoded-like system over the past few years, and decided against it because of insurmountable obstacles. They were right about the scope of the problems, but I think they were wrong to conclude that they shouldn’t proceed anyway, and create a system that’s 99% of the way there.

Just look at a few state code websites. I picked a few out at random: South Carolina, Missouri, New York, and Maine. These are awful. Just terrible. The percentage of citizens who are capable of navigating and understanding these is a rounding error. Having embedded definitions for 99% of legal terms, cross-references linked 99% of the time, and linked court cases that are good-not-great—that’s all far, far better than what the citizens of these states have access to right now.

Sir Robert Alexander Watson-Watt, the radar pioneer who created England’s system to detect approaching Luftwaffe, said of England’s radar system: “give them the third best to go on with; the second best comes too late and the best never comes.”

The State Decoded is the third best system. The second best doesn’t exist yet, but I look forward to somebody creating it and obviating The State Decoded. The best may never come. And that’s OK.

How You Can Help

I received an e-mail the other day from somebody asking how he could contribute to the development of The State Decoded. As a rule, this is a sign that I’m doing something wrong. In the spirit of addressing that, here is a list of relatively self-contained, interesting, diverse features that await addition to The State Decoded, that you or somebody you know might be interested in creating, for folks of all levels of technical knowledge and many fields of expertise.

Create the Functionality to Add Laws to a Portfolio

Site users ought to be able to keep track of laws that are of interest to them. Using the browser’s localStorage.setItem / localStorage.removeItem, provide the functionality to let people add laws to a portfolio, and then create a page where people can see a list of the laws in their portfolio.
Issue #30

Support Memcached and/or ElastiCache

Provide an option in config.inc.php to provide configuration information to connect to the object cache of choice, and modify class.Law.inc.php to cache laws within the cache upon reading them or, if already cached, read them from there, rather than the database. (Presumably laws should be cached in Memcached upon being requested, rather than pre-loading Memcached full of all laws, since most legal codes aren’t liable to fit within a reasonable amount of server memory.)
Issue #263
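The cache-on-request approach described above can be sketched like this. It’s in Python rather than the project’s PHP, and everything in it is an assumption made for illustration: the DictCache class stands in for a real Memcached client, and get_law, fake_db, and the key format are hypothetical names, not anything in class.Law.inc.php.

```python
from typing import Callable, Dict, Optional


class DictCache:
    """Stand-in for a Memcached client, backed by a plain dict (assumption)."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        return self._store.get(key)

    def set(self, key: str, value: dict) -> None:
        self._store[key] = value


def get_law(section: str, cache: DictCache,
            load_from_db: Callable[[str], dict]) -> dict:
    """Return a law from the cache, loading and caching it on a miss."""
    key = f"law-{section}"  # hypothetical key format
    law = cache.get(key)
    if law is None:
        law = load_from_db(section)  # only hit the database on a cache miss
        cache.set(key, law)
    return law


db_reads = []


def fake_db(section: str) -> dict:
    """Hypothetical database lookup that records each call."""
    db_reads.append(section)
    return {"section": section, "text": "sample text"}


cache = DictCache()
get_law("18.2-32", cache, fake_db)
get_law("18.2-32", cache, fake_db)
print(len(db_reads))  # the second request is served from the cache
```

Because laws are only cached when somebody asks for them, the cache fills up with the laws that people actually read, rather than with the whole legal code.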

Establish an Interface for Showing Diffs of a Law

With each new release of a legal code, we add a new edition (tracked in the “editions” table) and add all of those new laws to the “laws” table. Provide the functionality to let somebody look through the various versions of a law over time, or compare two versions to see how they’ve changed.
Issue #363
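As a starting point for comparing two versions, here’s a minimal sketch using Python’s standard difflib module. The project itself is PHP, so this only illustrates the idea; the function name, the edition labels, and the sample law text are all made up.

```python
import difflib


def diff_law(old_text: str, new_text: str,
             old_label: str, new_label: str) -> str:
    """Return a unified diff between two editions of the same law."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)


# Invented sample text for two hypothetical editions of one law.
edition_2012 = "Dogs must be leashed.\nFine: $25."
edition_2013 = "Dogs must be leashed.\nFine: $50."

print(diff_law(edition_2012, edition_2013, "2012 edition", "2013 edition"))
```

A line-based diff like this is crude for legal text, where a single long paragraph may change by one word, so a real interface would probably want word-level comparison.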

Add Word, PDF, and EPUB Export

Add new methods to class.ParserController.inc.php to create, at the time that a legal code is imported, Word, PDF, and EPUB versions of the legal code, and portions thereof. (Realistically, this should be three separate issues, since it’s three separate projects.) Ideally there’d be Word and PDF versions of every law, every structural unit (chapter, title, etc.), and the entire legal code, and then EPUB versions of every structural unit and of the entire legal code.
Issue #50

Provide an Option to Use the OpenDyslexic Font

Create a jQuery-based widget to let somebody enable or disable the use of the OpenDyslexic font, by setting a cookie, and then a jQuery-based widget to toggle the use of that typeface for the body font (article#law) if that cookie is set.
Issue #340

Sync Laws to GitHub

Some folks are pretty psyched about putting laws on GitHub, for various reasons. Create a method that will commit the plain text version of all laws in a given edition of the legal code to a specified GitHub repository, and add the necessary options to `config.inc.php` to enable that.
Issue #161

Provide Vagrant Configurations

There’s an effort underway to create a ready-to-go Vagrant configuration of the project, so that it’s trivial for somebody to set up an implementation of The State Decoded on their own system. This sub-project has its own repository, and a couple of issues logged in its own issue tracker.
Issue #284

Display Related Legal Self-Help Documents

I’ve made a first crack at interfacing with ProBonoNet’s API to gather up a list of all of their free self-help legal documents. This needs to be extended, to store this data in a way that’s available to The State Decoded, and then—here’s the hard part—when somebody looks at a law for which there’s a relevant self-help legal document available, we need to be able to identify that document and display text promoting it. We’ve got the UI elements in place for this, but we just lack the glue that allows us to say “this law about foreclosure is probably related to this guide about what to do if your home is being foreclosed on.” Solr may be a good way to make this match.
Issue #162
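To show the shape of that missing glue, here’s a deliberately naive sketch in Python. It is not Solr and not the ProBonoNet API: it just scores word overlap (Jaccard similarity) between a law and each guide, and the guide titles, summaries, stopword list, and threshold are all invented for illustration.

```python
import re
from typing import Dict, Optional

# A tiny, assumed stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "if", "in", "is", "of", "on", "or",
             "the", "to", "what", "your"}


def keywords(text: str) -> set:
    """Lowercase the text and return its words, minus common stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def best_guide(law_text: str, guides: Dict[str, str],
               threshold: float = 0.1) -> Optional[str]:
    """Return the title of the guide with the highest word overlap, if any."""
    law_words = keywords(law_text)
    best_title, best_score = None, threshold
    for title, summary in guides.items():
        guide_words = keywords(title + " " + summary)
        union = law_words | guide_words
        score = len(law_words & guide_words) / len(union) if union else 0.0
        if score > best_score:
            best_title, best_score = title, score
    return best_title


# Hypothetical guides, keyed by title, with a one-line summary each.
guides = {
    "What to Do If Your Home Is Being Foreclosed On":
        "foreclosure of a home by a mortgage lender",
    "Small Claims Court Basics":
        "filing a claim in small claims court",
}

law = "Notice of foreclosure sale of a home by the mortgage lender"
print(best_guide(law, guides))
```

Plain word overlap can’t tell that “foreclosed” and “foreclosure” are the same idea, which is exactly the kind of thing that Solr’s stemming and relevance scoring are good at.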

Edit, Write, or Propose Changes to Documentation

The State Decoded has some pretty decent documentation that’s under active development, but it would strongly benefit from review by people who aren’t contributors to the project. (People who already know a project in great detail aren’t in a great mindset to write about it in a way that beginners can understand.) The documentation is hosted on GitHub, so pull requests can be made directly, or, for folks who aren’t technical, suggestions or proposed changes can be made in the form of an issue report.
Documentation Repository

This isn’t everything that needs to be done, of course—these are just the interesting, relatively self-contained new features. You can see the complete list of outstanding issues on GitHub, or just the list of new features awaiting creation.

First Documentation Release

The first draft of The State Decoded’s documentation is now available. Documentation was being built up piecemeal on a GitHub wiki, but it’s been moved to its own GitHub repository and a dedicated website. The pages are created in a mix of HTML and Markdown (Markdown has difficulty with some of the sample XML), and the site is built in Jekyll, which means that any changes made in the documentation GitHub repository are reflected promptly on the documentation website. This makes it simple for anybody to update the documentation—to fix a mistake, add an example, or even add a whole new section.

There’s a great deal more to be done with the documentation. It needs to be organized, structured narratively, enhanced with illustrations, and simply cover more material. But it’s not bad, and now it’s easy for others to help make it better.

Version 0.7 Released

Version 0.7 of The State Decoded is now available on GitHub. This is a really meaty release, dedicated entirely to optimizations: it’s faster, more efficient, easier to extend, easier to contribute to, easier to deploy, and easier to navigate. This release is comprised of a whopping 353 Git commits—that’s more than every commit that went into versions 0.1 through 0.6, combined. Here are some of the major changes:

Tuning

Every line of code was reviewed to see how it could be made faster—in tiny ways (instances of stristr() replaced with strstr()—or, better, strpos()) and in large ways (tossing out whole methods and starting again). The indices in MySQL were evaluated and revised, and any PHP error of level E_NOTICE and above was quieted. And the parser’s memory usage has been reduced substantially, making it faster and more efficient to process large legal codes that previously might have strained (or broken) Apache’s per-process memory limit.

Caching

There are now hooks for both APC and Varnish, so that folks running either of those popular caching applications (or, better, both of them) can reap the speed benefits. API keys, all constants, and templates are cached in APC now, with more caching on tap in upcoming releases.

Development Environment

This version was developed substantially within a Vagrant staging environment, resulting in the inevitable optimization of The State Decoded to run within Vagrant. That involved a lot of tiny changes (e.g., respecting port numbers in URLs) that collectively create a smooth experience when developing locally. We have the under-development Vagrant configuration for The State Decoded in its own repository. This version and all future versions will have a Vagrant machine image available for download, to make it trivial to get started. Vagrant is working on a path to deploy Vagrant machine configurations as AWS instances, which is why this is a bandwagon worth hopping on now.

Standardization

Out with HTML Purifier, in with HTML Tidy. Out with MDB2, in with PDO. Out with late-nineties-style commenting and code formatting, in with PEAR-style commenting and code formatting. There was nothing wrong with any of those prior approaches, but it’s best to establish an environment that contributors expect—that makes it easier for folks to contribute code to the project, or customize it for their own website. Also, HTML Tidy and PDO are already installed by default on a great many systems, which simplifies the setup process.

Extensibility

Several steps have been made to facilitate customization. All non-obvious database columns are now commented, there’s infrastructure for inline help text (stored and distributed as JSON), and there’s support for importing, storing, and displaying arbitrary metadata fields alongside the standard data about each law.

New Features

And, of course, we couldn’t resist a few new features. There’s now keyboard navigation within laws and structures, for those power-users who want to flip through laws quickly. There’s baked-in support for Disqus-based commenting on each law page—just enter your site’s Disqus shortname in config.inc.php and you’re up and running. And, finally, there’s bulk generation of both plain text and JSON versions of laws. To what end? Dunno—that’s for you to figure out.

This release involved a lot of work over the course of four months. Bill Hunt has scrubbed in to help as a core contributor, and he’s responsible for a lot of these improvements. Some very helpful bug reports, wiki edits, and pull requests came from Chris Birk and Daniel Trebbien.

Next up: versions 0.8 and 0.9, which will be released soon, and close together. Version 0.8 will be comprised of the UI/UX branch, which only has a handful of small tasks remaining, thanks to months of work by Meticulous (John Athayde and Lynn Wallenstein). And Version 0.9 will be comprised of the Solr branch, and is dedicated to baking Apache Solr into The State Decoded. That’s based on several months of work by the team at OpenSource Connections, who finished up earlier this week—there are only about a dozen outstanding issues, all of which build on OSC’s work to add some valuable new features to The State Decoded.

Bulk Downloads of Five State Codes

A very real obstacle to putting up state code websites is getting a copy of that state’s laws. For example, there’s a New Jersey group that wants to set up The State Decoded for their state. But, like most states, New Jersey doesn’t provide bulk downloads: it’s not possible to simply get a raw copy of the files. The backup option is what’s known as “screen-scraping,” which means having software load every single law on the official state law website, one by one, and copy the laws from there. This is a terrible solution, but it’s all that’s available in most U.S. states. The New Jersey statutes website is distinctly un-scrapeable. I don’t know that it’s impossible, but it would be an unpleasant task.

Today, Carl Malamud of Public.Resource.org tweeted the news that he’s got five new state codes online as bulk data.

In addition to bulk machine-readable files, they’re also available in a variety of file formats on Archive.org. They join the Maryland and Washington D.C. codes that he’s already made available as bulk downloads. (Maryland Decoded is up now, and the Open Law DC project has a great site for their code, with a State Decoded implementation under development that’ll be the subject of a hackathon on Saturday’s National Day of Civic Hacking.)

Now the onus is on folks in Arkansas, Colorado, Georgia, Idaho, and Mississippi to step up to the plate and put this data to work. Who’s going to implement The State Decoded in these states?

Maryland Decoded

A new State Decoded site launched today: Maryland Decoded. A project of the OpenGov Foundation, they’re doing some innovative stuff on the still-under-development platform. For instance, they’re crowd-sourcing “catch lines”—the titles that most states apply to their laws. Maryland does not have catch lines, so instead of having a law titled “Murder in the First Degree,” they simply have GCR § 2-201. Solution? Anybody can suggest a catch line, and they’ll build up their own catch lines, gradually.

Every state presents its own set of challenges and opportunities. The OpenGov Foundation is capitalizing on the opportunities to overcome the challenges and helping to improve The State Decoded for those who will follow in their path.

Washington D.C. and the Work Ahead

On Greater Greater Washington, Tom MacWright recently wrote a blog entry highlighting the problems of access to the Washington D.C. Code. There is, first, a legal obstacle: Washington D.C. claims copyright over their laws, which is to say that it is illegal to reproduce them without permission of the city. Then, second, what is perhaps a more significant obstacle: they outsource the maintenance of their legal code.

The city of Washington D.C. long ago started paying WestLaw—and now LexisNexis—to turn the D.C. Council’s bills into laws. As a result, they now have neither the knowledge nor the infrastructure to maintain their own laws. The only way that D.C. can find out what their laws say is to pay LexisNexis to tell them. This is consequently true for the public, as well. If a resident of D.C.—like MacWright—wants to know what the law says, there’s no sense in asking (or FOIAing) the city, because the city has outsourced the process so completely that they know nothing.

MacWright has a few options to know what the law says. The first is to travel to a library on each occasion that he wants to know something (assuming he can find one that has a current copy of the DC Code), and read it there. The second is that he can buy a copy, for $867.00. And the third is that he can use the DC Code website, maintained by WestLaw, which is every bit as awful as any other state code website.

So how is the D.C. Code to get the State Decoded treatment? How can a digital copy be imported into the software, for the general public benefit? It can’t be FOIAed from D.C. Council, since they don’t have it. It’s clearly impractical to scan in 25 volumes of hardbound books. Normally that would leave scraping the website, but WestLaw’s website has a EULA that prohibits copying material off of the website. WestLaw has been hired to do for the Washington D.C. government what they cannot or will not do for themselves—post laws to the web—and because they choose to impose copyright restrictions, that is a legal barrier preventing that material from being reused.

Normally, this would be the end of the road—Washington D.C. would have cut off their code from being improved (or even reused) in any way by third parties. In this case, though, the story has a different ending. Public.Resource.Org has taken the surprising and admirable tack of purchasing all of the volumes of the D.C. Code, slicing them up, scanning them in, OCRing them, and distributing them for free.

DC Code Assembly Line

Deglued bundles on the bottom right awaiting their turn on the two high-speed scanners. When they are completed, the bundles are put upper right to await a QA pass.

All of the volumes can be downloaded as PDFs or, via the Internet Archive, in nearly any other file format one can think of. For those who would like a print copy for a more reasonable price, print-on-demand service Lulu sells each volume for just $12.

Lest the motivations of Public.Resource.Org be unclear, a “Proclamation of Digitization” accompanies the release, citing a pair of Supreme Court rulings (“the authentic exposition and interpretation of the law, which, binding every citizen, is free for publication to all, whether it is a declaration of unwritten law, or an interpretation of a constitution or a statute”) and declaring that “any assertion of copyright by the District of Columbia or other parties on the District of Columbia Code is declared to be NULL AND VOID as a matter of law and public policy as it is the right of every person to read, know, and speak the laws that bind them.” The organization mailed out elaborate packages, containing portions of the D.C. Code, to announce its availability. (You can see my own unboxing photos.)

All that remains is for somebody to marry this source of data with The State Decoded as, indeed, somebody is already talking about doing. The D.C. Council or WestLaw may not be happy about this—it’s quite possible that one or both entities will take legal action to halt this—but I’m confident that it will be found that the law supports making those very laws public.