Two Mini-Projects: Subsection Identifier and Definition Scraper

The State Decoded project has spun off a couple of sub-projects, components of the larger project that can be useful for other purposes, and that deserve to stand alone. (Both are found on our GitHub repository.)

The first is Subsection Identifier, which turns theoretically structured text into actually structured text. It is common for documents in outline form (contracts, laws, and other documents that need to be able to cross-reference specific passages) to be provided in a format in which the structural labels flow into the text. For example:

A. The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
B. An NRP that has been appointed by the agency may be dissolved by the agency when:
1. There is no longer controversy associated with the development of the regulation;
2. The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
3. The agency determines that resolution of a controversy is unlikely.

One of the helpful features of The State Decoded is that it breaks up this text, understanding not just that every labelled line can stand alone, but also that the final line, despite being labelled “3,” is actually “B3,” since “3” is a subset of “B.” That functionality has been forked from The State Decoded, and now stands alone as Subsection Identifier, which accepts passages of text, and turns them into well-structured text, like such:

(
    [0] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => A
                )

            [prefix] => A.
            [text] => The agency may appoint a negotiated rulemaking panel (NRP) if a regulatory action is expected to be controversial.
        )

    [1] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                )

            [prefix] => B.
            [text] => An NRP that has been appointed by the agency may be dissolved by the agency when:
        )

    [2] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 1
                )

            [prefix] => B.1.
            [text] => There is no longer controversy associated with the development of the regulation;
        )

    [3] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 2
                )

            [prefix] => B.2.
            [text] => The agency determines that the regulatory action is either exempt or excluded from the requirements of the Administrative Process Act; or
        )

    [4] => stdClass Object
        (
            [prefix_hierarchy] => stdClass Object
                (
                    [0] => B
                    [1] => 3
                )

            [prefix] => B.3.
            [text] => The agency determines that resolution of a controversy is unlikely.
        )

)

The second mini-project is Definition Scraper, which extracts defined terms from passages of text. Many legal documents begin by defining words that are then used throughout the document, and knowing those definitions can be crucial to understanding that document. So it can be helpful to be able to extract a list of terms and their definitions. Definition Scraper needs only be handed a passage of text, and it will determine whether it contains defined terms and, if it does, it will return a dictionary of those terms and their definitions.

Running this passage through Definition Scraper:

“The Program” refers to any copyrightable work licensed under this License.
A “covered work” means either the unmodified Program or a work based on the Program.

Yields the following two-entry dictionary:

(
    [program] => “The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.
    [covered work] => “covered work” means either the unmodified Program or a work based on the Program.
)

Definition Scraper is also a core function of The State Decoded, but warrants becoming its own project because it is so clearly useful for applications outside of the framework of The State Decoded.

The decision to spin off these projects was prompted by a report by the John S. and James L. Knight Foundation, the organization that funds The State Decoded, which evaluated the success of their News Challenge winners. They found several common attributes among the more successful funded projects, including this:

Projects that achieved strong use and adoption of their code often built and released their software in individual components, knowing that certain elements had value and a wide range of uses beyond the main focus of their project.

As development on The State Decoded continues, we may well spin off more mini-projects, if it becomes clear that more components of the overall project could be useful stand-alone tools.