Creating Data Where There is None

At WebLaws.org, Robb Shecter is puzzling through how to deal with the California Code’s curious lack of titles. Most state codes provide a title for each law (known as a “catch line” in most states), such as “Enforcement of child labor law,” “Fees for filing documents or issuing certificates,” or “Money derived from forest reserve.” Not California’s. Robb provides the example of California’s § 459, the law prohibiting burglary. One must read through the law to know what it does. This, of course, makes it very difficult to navigate through the California Code.

The question that Robb asks is what we are to do about this. The problem is abstract for me—I have no immediate prospects of working on the California Code, but Robb has it online now, so the problem is very real for him.

The reason that California has been able to get by with such an odd arrangement is that private legal vendors, like West and LexisNexis, write their own titles for laws. Most attorneys surely use the terminology provided by those companies, some perhaps unaware that those are not official titles. Those titles are copyrighted by those vendors, though, and cannot be used for projects like WebLaws.org or The State Decoded. This means that we must be able to generate our own titles.

Here is the conceptual solution that I arrived at for California some months ago, which I share here in hopes that it might do others some good. I have not implemented this, so while in theory it makes sense, I cannot say for sure that it’ll work.

Like many states, California maintains an annual index of all legislation that has come before its legislature. (This is the 2011–2012 index, for example.) This allows people to look up all bills pertaining to, for instance, retirement, and see the following listing:

continuing care retirement communities, AB 748, 1698
pensions—
early distribution penalty waiver, AB 558, 2656
employer-sponsored retirement plans, SB 1234
golden state retirement savings trust, SB 1234
rollover funds, tax-free: medical and long-term care premiums, SJR 21
secure choice retirement savings trust, california, SB 1234
public retirement systems. See name of particular retirement system (e.g., PUBLIC EMPLOYEES’ RETIREMENT SYSTEM).
unemployment compensation benefits, AB 2310

This looks to me like a rich source of titles.

The process is straightforward. First, match up all legislation with the existing law that it proposes to amend. Then, find every entry for all of that legislation in the index of legislation. The description in the index becomes the title of the law. For those laws that have multiple candidate descriptions (either because they’re in the index repeatedly, multiple bills propose to amend them in a given year, or there are many years of attempted amendments), the words that appear most frequently in those descriptions can be used to automatically assemble a title.
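As a rough illustration of the process just described, here is a minimal Python sketch. It is untested against real California data and rests on assumptions: the bill-to-section mapping is invented (the section identifiers `SECTION-A` and `SECTION-B` are placeholders, not real code sections), the function names are hypothetical, and "the words that appear most frequently" is interpreted as a simple word count with a short stopword list. Only the index descriptions and bill numbers come from the listing above.

```python
import re
from collections import Counter, defaultdict

# Small, assumed stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}


def candidate_descriptions(bill_sections, index_entries):
    """Collect, for each code section, the index descriptions of the
    bills that propose to amend it."""
    candidates = defaultdict(list)
    for description, bills in index_entries:
        for bill in bills:
            for section in bill_sections.get(bill, []):
                candidates[section].append(description)
    return dict(candidates)


def assemble_title(descriptions, max_words=3):
    """Use a lone description as-is; otherwise join the most frequent words.
    Ties are broken by the order in which words were first seen."""
    if len(descriptions) == 1:
        return descriptions[0]
    counts = Counter()
    first_seen = {}
    for description in descriptions:
        for word in re.findall(r"[a-z]+", description.lower()):
            if word in STOPWORDS:
                continue
            counts[word] += 1
            first_seen.setdefault(word, len(first_seen))
    ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    return " ".join(ranked[:max_words])


# Hypothetical mapping of bills to the sections they would amend.
bill_sections = {"SB 1234": ["SECTION-A"], "AB 2310": ["SECTION-B"]}

# Descriptions and bill numbers taken from the index listing above.
index_entries = [
    ("employer-sponsored retirement plans", ["SB 1234"]),
    ("golden state retirement savings trust", ["SB 1234"]),
    ("secure choice retirement savings trust, california", ["SB 1234"]),
    ("unemployment compensation benefits", ["AB 2310"]),
]

candidates = candidate_descriptions(bill_sections, index_entries)
titles = {section: assemble_title(descs) for section, descs in candidates.items()}
```

With this toy input, the placeholder `SECTION-A` gets the assembled title "retirement savings trust" and `SECTION-B` simply inherits "unemployment compensation benefits". A bill that maps to many sections would give every one of them the same candidates, so a real implementation would probably want to weight or filter those bills.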

This is bound to lead to some goofy titles. And some laws have not had bills introduced that would amend them for decades, and so information about them would not be available in bulk. But in my experience, the laws that most interest people are the ones that legislators attempt to amend, so titles would be provided for those laws that are most liable to be read.

What of the rest of the laws, left untitled by this first method? Statistically improbable phrases (SIPs) are a good backup method. A phrase that occurs in a law that is very rarely found in the rest of California’s laws is liable to be a decent candidate for its title. Again, potentially goofy titles could result, and I have not tested this, but theoretically it could work pretty well. Amazon.com displays SIPs for some of their books, and I think those illustrate the range of results that one could expect from them. For instance, Nora Ephron’s newly re-popular “I Feel Bad About My Neck” has two SIPs: “serial monogamy,” “cabbage strudel.” The former is a not-unreasonable summation of the book. The latter is obviously pretty unreasonable.
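To make the SIP idea a little more concrete, here is a small Python sketch. It is not Amazon's method, which I don't know; it swaps in a much cruder stand-in that ranks each law's two-word phrases by how few laws contain them. The function names, the stopword list, and the three toy "laws" are all invented for illustration and are not real statutory text.

```python
import re
from collections import Counter

# Small, assumed stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "any", "is", "of", "or", "the", "to", "who", "with"}


def content_bigrams(text):
    """Two-word phrases built from the text after dropping stopwords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [" ".join(pair) for pair in zip(words, words[1:])]


def improbable_phrases(laws, top_n=2):
    """For each law, return the phrases found in the fewest laws overall.
    Ties go to phrases used more often in the law, then alphabetically."""
    doc_freq = Counter()  # number of laws in which each phrase appears
    per_law = {}
    for section, text in laws.items():
        counts = Counter(content_bigrams(text))
        per_law[section] = counts
        doc_freq.update(counts.keys())
    result = {}
    for section, counts in per_law.items():
        ranked = sorted(counts, key=lambda p: (doc_freq[p], -counts[p], p))
        result[section] = ranked[:top_n]
    return result


# Invented toy text standing in for three laws; not actual statutes.
toy_laws = {
    "SECTION-A": "A person who enters a house with intent to commit theft is guilty of burglary.",
    "SECTION-B": "A person who sets fire to a house is guilty of arson.",
    "SECTION-C": "A person who takes property with intent to commit theft is guilty of larceny.",
}

phrases = improbable_phrases(toy_laws)
```

On this toy input the shared phrase "commit theft" is never chosen, since it appears in two of the three laws, while the first placeholder section ends up with "enters house" and "guilty burglary". That is about as goofy as predicted, and a real corpus would need longer phrases and a proper statistical comparison before the results could be judged.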

Some experimentation is going to be necessary to arrive at a decent system for generating titles for California’s laws. Ideally, whoever creates them would put them up on Google Docs for some collaborative editing, and release the resulting text under an open license, so that, at last, we will all have titles for all of the laws in the California Code.

This problem is, not incidentally, emblematic of a routine problem with state codes. It seems like they’re all missing something, some core piece of data that would make them far more useful. Each of these will require its own patch, its own work-around, to render those laws widely accessible to the general public. We’re all taking it one state at a time.

2 thoughts on “Creating Data Where There is None”

  1. Hey Waldo,

    You wrote a great description of the problem. Your idea about the CA bills is interesting but I think there’s an insurmountable problem: each bill maps to many statutes.

    I like the idea of scavenging names from unconventional sources like these, though, including textual analysis. I’ve thought about doing something like that and then using it as input to a captcha-type voting system. People would be asked to vote up the best candidate title.

    Now, similar to your CA bills idea, I found another small but steady source of titles: The California Law Revision Commission processes a steady stream of CA law, implementing various kinds of fixes and improvements. And I discovered that they create titles for internal use, publishing them in their reports. The content is unfortunately buried in PDFs, and not highly structured, but it’s a possibility. Take a look at this, beginning at page 43: http://clrc.ca.gov/pub/Printed-Reports/Pub235-AR.pdf

  2. Your idea about the CA bills is interesting but I think there’s an insurmountable problem: each bill maps to many statutes.

    I don’t think that’s actually an insurmountable problem. And unlike the rest of my proposal, this bit I say from experience, because I already map legislation to laws on Virginia Decoded, using Richmond Sunlight (which I also run, conveniently :). There are a great many bills that map only to one statute. In Virginia, something like half of all bills that affect the Code of Virginia affect just one section of the code. From an admittedly brief perusal of the California State Legislature’s website, I’ve found examples there of bills that affect just one statute. I think a large sample would be necessary to estimate how many bills affect just one statute, though. It’s quite possible that the total number is too low to be valuable. I also suspect that there are clever solutions waiting to be employed to deal with legislation that affects multiple sections (given a large enough sample of bills that each affect a very small number of sections, for instance), but I sure haven’t thought that through.

    I’ve thought about doing something like that and then using it as input to a captcha-type voting system. People would be asked to vote up the best candidate title.

    Good idea!

    The California Law Revision Commission processes a steady stream of CA law, implementing various kinds of fixes and improvements. And I discovered that they create titles for internal use, publishing them in their reports. The content is unfortunately buried in PDFs, and not highly structured, but it’s a possibility.

    Wow, you’re right, that’s really great stuff! It might be in PDFs, but at least it’s actual text in there, instead of images, so it’s eminently scrape-able. Given a few years of annual reports, I’ll bet you could get a huge chunk of the California Code covered! You might consider asking the California Law Revision Commission if they have a listing of titles, either for the entire code or just for all of the sections that they have had cause to assign titles to. They may well have a big spreadsheet they’d be happy to give you. My experience with the equivalent Virginia organization has been overwhelmingly positive: they’re eager to give me all of the information that I ask for and more. Perhaps California would be as forthcoming?
