How You Can Help

I received an e-mail the other day from somebody asking how he could contribute to the development of The State Decoded. As a rule, this is a sign that I’m doing something wrong. In the spirit of addressing that, here is a list of relatively self-contained, interesting, diverse features that await addition to The State Decoded, that you or somebody you know might be interested in creating, for folks of all levels of technical knowledge and many fields of expertise.

Create the Functionality to Add Laws to a Portfolio

Site users ought to be able to keep track of laws that are of interest to them. Using the browser’s localStorage.setItem / localStorage.removeItem, provide the functionality to let people add laws to a portfolio, and then create a page where people can see a list of the laws in their portfolio.
Issue #30

Support Memcached and/or ElastiCache

Add an option to config.inc.php for the connection details of the object cache of choice, and modify class.Law.inc.php so that a law is read from the cache if it is already there, and otherwise read from the database and then cached. (Presumably laws should be cached in Memcached upon being requested, rather than pre-loading Memcached full of all laws, since most legal codes aren’t liable to fit within a reasonable amount of server memory.)
Issue #263
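To make the read-from-cache-or-database idea concrete, here is a minimal sketch in Python rather than the project's PHP. The LawCache class, the key prefix, and the fetch_from_database function are all hypothetical names of mine, and a plain dict stands in for a Memcached client, so treat this as an illustration of the pattern and not as the project's code.

```python
class LawCache:
    """Read-through cache: look in the cache first, fall back to the database."""

    def __init__(self, cache, fetch_from_database):
        # "cache" is any dict-like object; here it stands in for a Memcached
        # client, which would use its own get/set calls instead.
        self.cache = cache
        self.fetch_from_database = fetch_from_database

    def get_law(self, section_number):
        key = "law-" + section_number  # hypothetical key scheme
        law = self.cache.get(key)
        if law is None:
            # Cache miss: read from the database, then cache on request.
            law = self.fetch_from_database(section_number)
            if law is not None:
                self.cache[key] = law
        return law


# Usage with stand-ins: a dict for the cache, a function for the database.
database_reads = []

def fetch_from_database(section_number):
    database_reads.append(section_number)
    return {"section": section_number, "text": "(text of the law)"}

laws = LawCache(cache={}, fetch_from_database=fetch_from_database)
laws.get_law("2.2-3705.1")  # not cached yet: reads the database, then caches
laws.get_law("2.2-3705.1")  # cached: the database is not touched
print(len(database_reads))
```

Only laws that somebody actually requests end up in the cache, which matches the caching-on-request approach suggested above.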

Establish an Interface for Showing Diffs of a Law

With each new release of a legal code, we add a new edition (tracked in the “editions” table) and add all of those new laws to the “laws” table. Provide the functionality to let somebody look through the various versions of a law over time, or compare two versions to see how they’ve changed.
Issue #363
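As a starting point for the comparison half of this feature, here is a small sketch using Python's standard difflib module. The two law texts and the edition labels are made up for illustration, and a real implementation would pull both versions from the “laws” table by edition.

```python
import difflib

def diff_law_versions(old_text, new_text, old_label="old edition", new_label="new edition"):
    """Return a unified diff (as a list of lines) between two versions of a law."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))

# Hypothetical text of one section in two editions.
old = "The board shall consist of 13 voting members.\nMembers serve four-year terms."
new = "The board shall consist of 12 voting members.\nMembers serve four-year terms."

for line in diff_law_versions(old, new, "2011", "2012"):
    print(line)
```

A line-based diff is the simplest option. Laws are often one long paragraph, so a word-level comparison would probably read better in the interface.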

Add Word, PDF, and EPUB Export

Add new methods to class.ParserController.inc.php to create, at the time that a legal code is imported, Word, PDF, and EPUB versions of the legal code, and portions thereof. (Realistically, this should be three separate issues, since it’s three separate projects.) Ideally there’d be Word and PDF versions of every law, every structural unit (chapter, title, etc.), and the entire legal code, and then EPUB versions of every structural unit and of the entire legal code.
Issue #50

Provide an Option to Use the OpenDyslexic Font

Create a jQuery-based widget to let somebody enable or disable the use of the OpenDyslexic font by setting a cookie, and then a jQuery-based script to switch the body font (article#law) to that typeface if that cookie is set.
Issue #340

Sync Laws to GitHub

Some folks are pretty psyched about putting laws on GitHub, for various reasons. Create a method that will commit the plain text version of all laws in a given edition of the legal code to a specified GitHub repository, and add the necessary options to `config.inc.php` to enable that.
Issue #161

Provide Vagrant Configurations

There’s an effort underway to create a ready-to-go Vagrant configuration of the project, so that it’s trivial for somebody to set up an implementation of The State Decoded on their own system. This sub-project has its own repository, and a couple of issues logged in its own issue tracker.
Issue #284

Display Related Legal Self-Help Documents

I’ve made a first crack at interfacing with ProBonoNet’s API to gather up a list of all of their free self-help legal documents. This needs to be extended, to store this data in a way that’s available to The State Decoded, and then—here’s the hard part—when somebody looks at a law for which there’s a relevant self-help legal document available, we need to be able to identify that document and display text promoting it. We’ve got the UI elements in place for this, but we just lack the glue that allows us to say “this law about foreclosure is probably related to this guide about what to do if your home is being foreclosed on.” Solr may be a good way to make this match.
Issue #162

Edit, Write, or Propose Changes to Documentation

The State Decoded has some pretty decent documentation that’s under active development, but it would strongly benefit from review by people who aren’t contributors to the project. (People who already know a project in great detail aren’t in a great mindset to write about it in a way that beginners can understand.) The documentation is hosted on GitHub, so pull requests can be made directly, or, for folks who aren’t technical, suggestions or proposed changes can be made in the form of an issue report.
Documentation Repository

This isn’t everything that needs to be done, of course—these are just the interesting, relatively self-contained new features. You can see the complete list of outstanding issues on GitHub, or just the list of new features awaiting creation.

Creating Data Where There is None

At WebLaws.org, Robb Shecter is puzzling through how to deal with the California Code’s curious lack of titles. Most state codes provide a title for each law (known as a “catch line” in most states), such as “Enforcement of child labor law,” “Fees for filing documents or issuing certificates,” or “Money derived from forest reserve.” Not California’s. Robb provides the example of California’s § 459, the law prohibiting burglary. One must read through the law to know what it does. This, of course, makes it very difficult to navigate through the California Code.

The question that Robb asks is what we are to do about this. The problem is abstract for me—I have no immediate prospects of working on the California Code, but Robb has it online now, so the problem is very real for him.

The reason that California has been able to get by with such an odd arrangement is that private legal vendors, like West and LexisNexis, write their own titles for laws. Most attorneys surely use the terminology provided by those companies, some perhaps unaware that those are not official titles. Those titles are copyrighted by those vendors, though, and cannot be used for projects like WebLaws.org or The State Decoded. This means that we must be able to generate our own titles.

Here is the conceptual solution that I arrived at for California some months ago, which I share here in hopes that it might do others some good. I have not implemented this, so while in theory it makes sense, I cannot say for sure that it’ll work.

Like many states, California maintains an annual index of all legislation that has come before their legislature. (This is the 2011–2012 index, for example.) This allows people to look up all bills pertaining to, for instance, retirement, and see the following listing:

continuing care retirement communities, AB 748, 1698
pensions—
early distribution penalty waiver, AB 558, 2656
employer-sponsored retirement plans, SB 1234
golden state retirement savings trust, SB 1234
rollover funds, tax-free: medical and long-term care premiums, SJR 21
secure choice retirement savings trust, california, SB 1234
public retirement systems. See name of particular retirement system (e.g., PUBLIC EMPLOYEES’ RETIREMENT SYSTEM).
unemployment compensation benefits, AB 2310

This looks to me like a rich source of titles.

The process is straightforward. First, match up all legislation with the existing law that it proposes to amend. Then, find every entry for all of that legislation in the index of legislation. The description in the index becomes the title of the law. For those laws that have multiple candidate descriptions (either because they’re in the index repeatedly, multiple bills propose to amend them in a given year, or there are many years of attempted amendments), the words that appear most frequently in those descriptions can be used to automatically assemble a title.
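The last step, assembling a title from the most frequent words, can be sketched as follows. This is untested against real data, like the rest of this idea. The candidate_title function, the stopword list, and the choice to keep words in the order they first appear are all assumptions of mine. The three descriptions are the index entries listed above for SB 1234.

```python
from collections import Counter
import re

# A deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}

def candidate_title(descriptions, max_words=4):
    """Build a rough title from the most frequent words across index descriptions."""
    counts = Counter()
    first_seen = {}
    for description in descriptions:
        for word in re.findall(r"[a-z]+", description.lower()):
            if word in STOPWORDS:
                continue
            counts[word] += 1
            first_seen.setdefault(word, len(first_seen))
    # Most frequent words first; ties go to whichever word appeared earlier.
    ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    # Present the chosen words in the order they first appeared.
    top = sorted(ranked[:max_words], key=lambda w: first_seen[w])
    return " ".join(top)

# The three index descriptions attached to SB 1234 in the listing above.
descriptions = [
    "employer-sponsored retirement plans",
    "golden state retirement savings trust",
    "secure choice retirement savings trust, california",
]
print(candidate_title(descriptions, max_words=3))  # prints "retirement savings trust"
```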

This is bound to lead to some goofy titles. And some laws have not had bills introduced that would amend them for decades, and so information about them would not be available in bulk. But in my experience, the laws that most interest people are the ones that legislators attempt to amend, so titles would be provided for those laws that are most liable to be read.

What of the rest of the laws, left untitled by this first method? Statistically improbable phrases (SIPs) are a good backup method. A phrase that occurs in a law that is very rarely found in the rest of California’s laws is liable to be a decent candidate for its title. Again, potentially goofy titles could result, and I have not tested this, but theoretically it could work pretty well. Amazon.com displays SIPs for some of their books, and I think those illustrate the range of results that one could expect from them. For instance, Nora Ephron’s newly re-popular “I Feel Bad About My Neck” has two SIPs: “serial monogamy,” “cabbage strudel.” The former is a not-unreasonable summation of the book. The latter is obviously pretty unreasonable.
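Here is a crude sketch of the SIP idea, to show roughly what the computation involves. It is not Amazon's method, which I don't know the details of. It just scores each two-word phrase in a law by how often it appears there compared with how often it appears in every other law. The function names, the scoring formula, and the sample laws are all invented for illustration.

```python
from collections import Counter
import re

def bigrams(text):
    """Return the two-word phrases in a text, as (word, word) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))

def improbable_phrases(law_id, laws, top_n=3):
    """Rank a law's two-word phrases by how rare they are in the other laws."""
    own = Counter(bigrams(laws[law_id]))
    elsewhere = Counter()
    for other_id, text in laws.items():
        if other_id != law_id:
            elsewhere.update(bigrams(text))
    # Score: occurrences in this law divided by (1 + occurrences elsewhere).
    scored = sorted(own, key=lambda p: (-own[p] / (1 + elsewhere[p]), p))
    return [" ".join(p) for p in scored[:top_n]]

# Invented sample texts, keyed by invented section numbers.
laws = {
    "1": "the owner shall pay the fee",
    "2": "the owner shall file the burglary report burglary report",
}
print(improbable_phrases("2", laws, top_n=1))  # prints ['burglary report']
```

On a real code, one would want to filter out stopwords and try phrases longer than two words, and the goofy results would still need a human eye.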

Some experimentation is going to be necessary to arrive at a decent system for generating titles for California’s laws. Ideally, whoever creates them would put them up on Google Docs for some collaborative editing, and release the resulting text under an open license, so that, at last, we will all have titles for all of the laws in the California Code.

This problem is, not incidentally, emblematic of a routine problem with state codes. It seems like they’re all missing something, some core piece of data that would make them far more useful. Each of these will require its own patch, its own work-around, to render those laws widely accessible to the general public. We’re all taking it one state at a time.

How to Decode Law Histories

A rich source of information about laws is found in the history data that accompanies each law in most states, but you’ve probably never noticed it.

For example, Virginia’s Freedom of Information Act has a series of exemptions spelled out in § 2.2-3705.1 which has a cryptic series of numbers listed below the law, in the section titled “History”:

1999, cc. 485, 518, 703, 726, 793, 849, 852, 867, 868, 881, § 2.1-342.01; 2000, cc. 66, 237, 382, 400, 430, 583, 589, 592, 594, 618, 632, 657, 720, 932, 933, 947, 1006, 1064; 2001, cc. 288, 518, 844, § 2.2-3705; 2002, cc. 87, 155, 242, 393, 478, 481, 499, 522, 571, 572, 633, 655, 715, 798, 830; 2003, cc. 274, 307, 327, 332, 358, 704, 801, 884, 891, 893, 897, 968; 2004, c. 690; 2010, c. 553.

Most people’s eyes glaze right over that. (Really, did you read any of that, or just glance at it and acknowledge “yup, that’s a bunch of stuff that means nothing to me…I’ll just skip that and see what he’s got to say about it”?) What looks like nonsense to most people turns out to be really rich data, which is simply stored in such a way as to render it basically meaningless. Let’s peer inside and see what this means, starting with Virginia.

With Virginia’s history, the first pattern to emerge is that what looks like a long string of numbers is actually broken up into stanzas by semicolons. Here’s the first stanza:

1999, cc. 485, 518, 703, 726, 793, 849, 852, 867, 868, 881, § 2.1-342.01

The first four digits, 1999, are the year in which this section of the code passed into law, at least in its present form. (That was accomplished with then-delegate Chip Woodrum’s HB1985, which overhauled Virginia’s FOIA laws.) And the last item, § 2.1-342.01, is the section number that this section had at the time. (Title 2.1 was recodified as Title 2.2 in 2000, which is when this was given its present section number.) In the middle, that series of three-digit numbers (485, 518, 703, etc.) refers to the portions of the Acts of the General Assembly for that year that created or amended this section of the code. The Acts of the General Assembly are sort of like a changelog for the code (but not exactly like a changelog!), in which all of the legislation that passed the General Assembly that year is ordered by the section of the state code that it affects; when multiple bills affect the same portion of the code, they are combined. It’s the intermediate step between a bill and the final, amended code. So here we can see that there were ten portions of the 1999 Acts of the General Assembly that affected this section of the code.

With this as a key, one can step through each stanza in the history of this Virginia law and understand how and when it changed, if not what the substance of those changes was.
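That key translates fairly directly into code. What follows is a minimal sketch and not the parser that I’m building for The State Decoded. The function name and the record layout are mine, and it handles only the patterns described above (a year, chapter numbers, and an optional section number at the time).

```python
import re

def parse_virginia_history(history):
    """Split a Virginia history string into one record per stanza."""
    records = []
    for stanza in history.strip().rstrip(".").split(";"):
        parts = [part.strip() for part in stanza.split(",")]
        record = {"year": int(parts[0]), "chapters": [], "section_at_time": None}
        for part in parts[1:]:
            if part.startswith("§"):
                # The section number that the law carried in that year.
                record["section_at_time"] = part.lstrip("§ ").strip()
            else:
                # Skip past the "c." or "cc." prefix, if any, and keep the number.
                match = re.search(r"\d+", part)
                if match:
                    record["chapters"].append(int(match.group()))
        records.append(record)
    return records

history = ("1999, cc. 485, 518, 703, 726, 793, 849, 852, 867, 868, 881, "
           "§ 2.1-342.01; 2004, c. 690; 2010, c. 553.")
for record in parse_virginia_history(history):
    print(record["year"], len(record["chapters"]), record["section_at_time"])
```

Run against the first stanza above, this finds the year 1999, ten chapters of the Acts, and the section number 2.1-342.01.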

Presumably it’s written in this manner to save space in the printed volumes, but obviously it no longer makes sense to codify our laws in a manner optimized for printed volumes. We can do better.

I’m developing a parser for The State Decoded for these history sections, so that this cryptic content is replaced with the same material in plain English. By storing this data atomically, it’ll be possible to generate a listing of all laws that were amended in a given year, all laws amended by a given portion of the Acts of the General Assembly, or find laws similar to a given law based on their shared history of being amended within the same portion of the Acts. I’m optimistic that it’ll be possible to connect many state codes’ history records back to individual pieces of legislation, rather than just the legislature’s changelog, which opens up a potential wealth of information. (This can already be seen on Virginia Decoded for all changes from 2006 onward, such as in the “Amendment Attempts” listing on § 2.2-3705.1.)
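To illustrate what storing the data atomically buys us, here is a rough sketch of those three lookups, using in-memory dictionaries where the real thing would use database tables. The function names are mine. The two entries for § 2.2-3705.1 come from its history above, and the other section numbers and their histories are placeholders that I made up.

```python
from collections import defaultdict

def build_indexes(histories):
    """Index sections by year and by (year, chapter) of the Acts of the General Assembly."""
    by_year = defaultdict(set)
    by_chapter = defaultdict(set)
    for section, records in histories.items():
        for year, chapters in records:
            by_year[year].add(section)
            for chapter in chapters:
                by_chapter[(year, chapter)].add(section)
    return by_year, by_chapter

def similar_sections(section, histories, by_chapter):
    """Rank other sections by how many Acts chapters they share with this one."""
    shared = defaultdict(int)
    for year, chapters in histories[section]:
        for chapter in chapters:
            for other in by_chapter[(year, chapter)]:
                if other != section:
                    shared[other] += 1
    return sorted(shared, key=lambda s: (-shared[s], s))

# Each history is a list of (year, [chapters]) pairs.
histories = {
    "2.2-3705.1": [(2004, [690]), (2010, [553])],
    "X-1": [(2004, [690])],                 # placeholder section
    "X-2": [(2004, [690]), (2010, [553])],  # placeholder section
    "X-3": [(2003, [100])],                 # placeholder section
}
by_year, by_chapter = build_indexes(histories)
print(sorted(by_year[2004]))                 # laws amended in 2004
print(sorted(by_chapter[(2010, 553)]))       # laws amended by 2010, c. 553
print(similar_sections("2.2-3705.1", histories, by_chapter))
```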

Incidentally, Florida has the same sort of exemptions to its open records law, in s. 119.071, and its history section looks like this:

s. 4, ch. 75-225; ss. 2, 3, 4, 6, ch. 79-187; s. 1, ch. 82-95; s. 1, ch. 83-286; s. 5, ch. 84-298; s. 1, ch. 85-18; s. 1, ch. 85-45; s. 1, ch. 85-86; s. 4, ch. 85-301; s. 2, ch. 86-11; s. 1, ch. 86-21; s. 1, ch. 86-109; s. 2, ch. 88-188; s. 1, ch. 88-384; s. 1, ch. 89-80; s. 63, ch. 90-136; s. 4, ch. 90-211; s. 78, ch. 91-45; s. 1, ch. 91-96; s. 1, ch. 91-149; s. 90, ch. 92-152; s. 1, ch. 93-87; s. 2, ch. 93-232; s. 3, ch. 93-404; s. 4, ch. 93-405; s. 1, ch. 94-128; s. 3, ch. 94-130; s. 1, ch. 94-176; s. 1419, ch. 95-147; ss. 1, 3, ch. 95-170; s. 4, ch. 95-207; s. 1, ch. 95-320; ss. 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 18, 20, 25, 29, 31, 32, 33, 34, ch. 95-398; s. 3, ch. 96-178; s. 41, ch. 96-406; s. 18, ch. 96-410; s. 1, ch. 98-9; s. 7, ch. 98-137; s. 1, ch. 98-259; s. 2, ch. 99-201; s. 27, ch. 2000-164; s. 1, ch. 2001-249; s. 29, ch. 2001-261; s. 1, ch. 2001-361; s. 1, ch. 2001-364; s. 1, ch. 2002-67; s. 1, ch. 2002-256; s. 1, ch. 2002-257; ss. 2, 3, ch. 2002-391; s. 11, ch. 2003-1; s. 1, ch. 2003-16; s. 1, ch. 2003-100; s. 1, ch. 2003-137; ss. 1, 2, ch. 2003-157; ss. 1, 2, ch. 2004-9; ss. 1, 2, ch. 2004-32; ss. 1, 3, ch. 2004-95; s. 7, ch. 2004-335; s. 4, ch. 2005-213; s. 41, ch. 2005-236; ss. 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, ch. 2005-251; s. 14, ch. 2006-1; s. 1, ch. 2006-158; s. 1, ch. 2006-180; s. 1, ch. 2006-181; s. 1, ch. 2006-211; s. 1, ch. 2006-212; s. 13, ch. 2006-224; s. 1, ch. 2006-284; s. 1, ch. 2006-285; s. 1, ch. 2007-93; s. 1, ch. 2007-95; s. 1, ch. 2007-250; s. 1, ch. 2007-251; s. 1, ch. 2008-41; s. 2, ch. 2008-57; s. 1, ch. 2008-145; ss. 1, 3, ch. 2008-234; s. 1, ch. 2009-104; ss. 1, 2, ch. 2009-150; s. 1, ch. 2009-169; ss. 1, 2, ch. 2009-235; s. 1, ch. 2009-237; s. 1, ch. 2010-71; s. 1, ch. 2010-171; s. 1, ch. 2011-83; s. 1, ch. 2011-85; s. 1, ch. 2011-140; s. 48, ch. 2011-142; s. 1, ch. 2011-201; s. 1, ch. 2011-202.

Wow. I’m not yet entirely clear on what all of that means, but I’m getting there.
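Even without knowing what every piece means, the shape is regular enough to pull apart mechanically: entries separated by semicolons, each with one or more numbers after “s.” or “ss.” and an identifier after “ch.” Here is a sketch that does only that structural split and makes no claim about what the numbers signify. The function name and the record layout are my own.

```python
import re

# One entry looks like "s. 4, ch. 75-225" or "ss. 2, 3, 4, 6, ch. 79-187".
ENTRY = re.compile(r"^ss?\.\s*(?P<sections>[\d,\s]+),\s*ch\.\s*(?P<chapter>[\d-]+)$")

def parse_florida_history(history):
    """Split a Florida history string into entries of section numbers and a chapter."""
    entries = []
    for raw in history.strip().rstrip(".").split(";"):
        match = ENTRY.match(raw.strip())
        if match is None:
            # Real data is messy; fail loudly instead of guessing.
            raise ValueError("unrecognized entry: " + raw.strip())
        sections = [int(number) for number in match.group("sections").split(",")]
        entries.append({"sections": sections, "chapter": match.group("chapter")})
    return entries

sample = "s. 4, ch. 75-225; ss. 2, 3, 4, 6, ch. 79-187; s. 1, ch. 2011-202."
for entry in parse_florida_history(sample):
    print(entry["chapter"], entry["sections"])
```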

Bills are Not a Changelog—Why You Can’t Turn Legislation into Laws

There is a common belief that since laws are the result of legislation, then surely one can automatically assemble an amended version of the code based on the bills that have passed the legislature. This is both a really cool idea and wrong.

Your standard narrative of how a bill becomes law doesn’t really cover what it purports to cover. Usually what’s really being explained is how a bill passes, but not how it becomes law. There’s a whole process between the passage of a bill and the encoding of that bill in a state’s codified laws.

Legislatures pass hundreds or thousands of bills every year, some of which are budgetary, some of which are to define their own rules, some of which pertain to state administration, and some of which are resolutions. The remainder are patches to be applied to law, either proposing a new section or amending an existing one (by either adding or removing material). These patches look familiar to anybody who has seen a prettied-up diff, and it’s wholly logical to figure that amending the state’s laws should simply mean collecting all of the bills that pass and applying those patches to the laws.

The trouble is that these patches are not always the last word on the changes that are going to be made to existing law.

Virginia is the state in which I have the most knowledge of this process, so I’ll provide a few words about my state’s process. The Virginia Code Commission is a tiny state agency, with just seven employees, which is charged with overseeing the code. Their duties are spelled out in Title 30 (General Assembly), Chapter 15 (Virginia Code Commission) of the state code, but the interesting bit is § 30-149 (Authority for minor changes to the Code of Virginia):

The Commission may correct unmistakable printer’s errors, misspellings and other unmistakable errors in the statutes as incorporated into the Code of Virginia, and may make consequential changes in the titles of officers and agencies, and other purely consequential changes made necessary by the use in the statutes of titles, terminology and references, or other language no longer appropriate.

The Commission may renumber, rename, and rearrange any Code of Virginia titles, chapters, articles, and sections in the statutes adopted, and make corresponding changes in lists of chapter, article, and section headings, catchlines, and tables, when, in the judgment of the Commission, it is necessary because of any disturbance or interruption of orderly or consecutive arrangement.

The Commission may correct unmistakable errors in cross-references to Code of Virginia sections and may change cross-references to Code of Virginia sections which have become outdated or incorrect due to subsequent amendment to, revision, or repeal of the sections to which reference is made.

The Commission may omit from the statutes incorporated into the Code of Virginia provisions which, in the judgment of the Commission, are inappropriate in a code, such as emergency clauses, clauses providing for specific nonrecurring appropriations and general repealing clauses.

(TL;DR: The Code Commission can make a lot of changes during that in-between period when a bill has passed, but it’s not quite yet a law.)

It is not unusual for the legislature to amend a bill at the very last minute, without proper review. From the marked-up format of a bill (words crossed out, others inserted), what can emerge is text that is grammatically incorrect or that even contains logical errors. It’s surely a judgment call whether such problems can be fixed by the Virginia Code Commission or whether it will require the General Assembly to fix them, which may well require a delay of nearly a year.

Here, for instance, is a selection of the 35 changes made to the Code of Virginia by Code Commission staff since the 2011 edition was published:

Section | Correction | Date
2.2-311 | Catchline, after “authority of” change “investigation” to “investigators” | 12/1/2011
2.2-515.1 | Second sentence, after “responsibility to” add “(i) establish an address confidentiality program in accordance with § 2.2-515.2, (ii)” and change “programs and shall report” to read “programs, and (iii) report” | 10/25/2011
2.2-2338 | In first paragraph, (i) change “13 voting members” to “12 voting members”; (ii) insert “and” after “Commerce and Trade,”; and (iii) delete “and the Assistant to the Governor for Commonwealth Preparedness” | 8/5/2011
2.2-2699.5 | Subsection B, replace “Assistant to the Governor for Commonwealth Preparedness” with “Secretary of Veterans Affairs and Homeland Security” | 8/5/2011
2.2-4509 | Change “AA by Moody’s” to “Aa by Moody’s” | 10/25/2011
6.2-314 | End of catchline, change “institution” to “institutions” | 10/25/2011
6.2-412 | End of catchline, change “improvement” to “improvements” | 10/25/2011

In Virginia—as in other states, although I don’t know how many—an attempt to use legislation as a changelog for the state code would yield results that would be very convincing-looking, but that would deviate substantially from the official code. Bills must frequently pass through a human filter before they become laws. That’s not something that you can simulate with a Ruby gem or a PEAR package. Some things just require some thought, in ways that can’t yet be automated.

The Messiness of Real-World Data

Those of us who work with big data have a tendency to describe working with it in cavalier terms. “Oh, I just grabbed the XML file, wrote a quick parser to turn it into CSV, bulk loaded it into MySQL, laid an API on top of it, and I was done.” The truth is that things very rarely go so well.

Real-world data is messy. Data doesn’t convert correctly the first time (or, often, the tenth time). File formats are invalid. The provided data turns out to be incomplete. Parser code that was so straightforward when written for the abstract concept of this data quickly turns into a series of conditionals to deal with all of the oddities of the real data.

While I’ve been developing the parser for The State Decoded, it’s become obvious that state laws themselves are too messy to standardize entirely, and the data formats in which states provide those laws are, in turn, too messy to import easily.

As a case study of the messiness of real-world data, here are some of the challenges I’ve encountered in parsing state legal codes.
 

Encoding Errors

A precious few states provide their state codes as bulk data. While it might seem like a real gift to get an XML file of every state law, if the XML is invalid, then that’s really more of a white elephant. One state provided me with SGML that they had, in turn, been provided with by LexisNexis. It was riddled with hundreds of errors, and could not be parsed automatically. After hours of attempting to fix problems by hand, I finally threw in the towel. Weeks later, LexisNexis provided a corrected version, and my work could continue.

Often, working with big data means doing pathbreaking work. The result is that sometimes nobody has ever before attempted to do anything with the data sets in question. Assembled by well-meaning but inexperienced people, those data sets may consequently be encoded incorrectly.

Changing Realities

State laws are occasionally restructured, renumbering huge portions of the code. Virginia’s entire criminal code, Title 18.1, became Title 18.2 about fifteen years ago. No redirects exist in 18.1, no pointers to the new location, no sign of what was. One must simply know that it changed. Court cases, articles from legal journals, or attorneys general’s opinions that cited sections of code within 18.1 are thus either useless or must be passed through a hand-crafted filter to point the citation to the new section number.

It would be nice if reality would consent to remain static to ease the process of cataloging it. But the world changes, and data reflects those changes. That can make it awfully frustrating to parse and apply that data, but that’s just the price of admission.
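A hand-crafted filter of that sort might look something like the sketch below. The hard part is the mapping table, which somebody has to assemble by hand. The two pairs here are invented placeholders, not real renumberings, and the function name and citation pattern are my own assumptions.

```python
import re

# Placeholder mapping: these pairs are invented for illustration only.
OLD_TO_NEW = {
    "18.1-100": "18.2-200",
    "18.1-101": "18.2-201",
}

# Matches citations such as "§ 18.1-100" or "§ 2.2-3705.1".
CITATION = re.compile(r"§\s*(\d+(?:\.\d+)?-\d+(?:\.\d+)?)")

def update_citations(text, mapping):
    """Rewrite cited section numbers that appear in the mapping; leave others alone."""
    def replace(match):
        old = match.group(1)
        return "§ " + mapping.get(old, old)
    return CITATION.sub(replace, text)

print(update_citations("See § 18.1-100 and § 2.2-3705.1.", OLD_TO_NEW))
```

Citations that aren’t in the table pass through unchanged, which is the safe default when the table is incomplete.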

Inconsistencies in the Data

There are at least a few states that violate their standard state code structure. They might structure their code by dividing it into titles, each title into chapters, and each chapter into sections. Except, sometimes, when chapters are called “articles.” Why do they do this? I have no idea. If lawmakers consulted with database developers prior to recodifying their state’s laws, no doubt our legal codes would be normalized properly.

These inconsistencies might be illogical, but they’re how it is, and must be reflected in the final application of the data. This can be particularly frustrating if the provided bulk data doesn’t record these inconsistencies internally, requiring the gathering of external information to be applied to the data, as is often the case.

Missing Data

One state’s code contains periodic parenthetical asides like “See Editor’s note.” What does the editor’s note say? There’s no way to tell—it’s not part of the bulk data or the state’s official website for their legal code. Those editor’s notes will have to be obtained from the state’s code commission, which will probably come in the form of a bunch of Word files attached to an e-mail.

Not all data exists electronically, and not all data that exists electronically exists in a single location. Often, piecing together a meaningful data set requires gathering information from disparate sources, sometimes in awkward ways. And sometimes the last few bits of data just aren’t available, and the data set is going to have to be incomplete.
 

All of these problems are solvable, in one way or another, but those solutions can be time-consuming. The ratio of the Pareto principle applies here: one is liable to get 80% of the data set whipped into shape in the first 20% of the time. The remaining 20% of the data will require the remaining 80% of the time. That first 80% feels magical—everything just falling into place—but that last 20% is just plain hard work.

Real-world data is messy. Working with big data means cleaning it up.