The Advanced Research in Government Operations Laboratories, or ARGO Lab for short, is a leading civic data science nonprofit that has been featured by Fast CoExist, the Fox News Smart Cities initiative, and a half dozen newspapers in New York, where it is headquartered. The organization uniquely brings together civic data science and governmental operational expertise to improve public service delivery. Cities typically have a lengthy procurement cycle which solicits bids and develops specifications that tend to assume each technology component must be individually developed. The bid process can be quite expensive, and once a bid has been accepted, it may take two years or longer for a project to reach production. In business, two years encompasses an entire technological cycle in which products and entire systems mature and become obsolete. In contrast to this model, ARGO's team used widely available commercial off-the-shelf technologies to build a project platform and develop a prototype product. This example demonstrates the power of an agile approach which employs interoperable and open technologies.
Consider an age-old problem plaguing New York and most other cities: the pothole. Causing damage to cars and injury to cyclists, potholes resist conventional efforts to detect and repair them. They are widespread and emerge quickly. City public works departments struggle to keep up with them, in large part because of the effort needed to detect and catalog potholes. Varun Adibhatla and his colleagues envisioned a scalable detection system which could measure the entirety of New York City's pothole problem, the first step to solving it. Using the Raspberry Pi as an integration platform, the ARGO team rapidly prototyped a device which detects a pothole and automatically takes a picture while recording its GPS coordinates. Thus was born the Street Quality Identification Device (SQUID). The prototype was developed rapidly between April and May of 2015. With the device secured to an automobile, the ARGO development team was able to build an accurate map of their neighborhood's potholes. The key to this rapid prototyping effort was the Raspberry Pi, a single-board microcomputer not much bigger than a smartphone. It is a low-cost, bare-bones device which runs a lightweight Debian Linux operating system. While not a personal computer, it is an information widget which can manage sensors, control robots, and fit into almost any custom computing and informatics application. Technologies like this are easily acquired and provide an easily programmable and extensible central processor which can gather data from sensors, organize it, and deposit it into municipal information systems with few intermediate steps. As it stands, SQUID provides an example of an alternative to traditional civic procurement and development cycles.
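The core of such a device is simple: watch a stream of GPS-tagged accelerometer readings and flag the jolts. The sketch below illustrates that logic only; the threshold, field names, and data layout are assumptions for illustration, not ARGO's actual SQUID implementation.

```python
# Hypothetical sketch of SQUID-style detection logic; the threshold and
# data layout are illustrative assumptions, not ARGO's implementation.
from dataclasses import dataclass

@dataclass
class PotholeEvent:
    lat: float
    lon: float
    shock_g: float  # vertical acceleration spike, in g

SHOCK_THRESHOLD_G = 1.5  # assumed trigger level

def detect_potholes(samples):
    """Flag GPS-tagged accelerometer samples exceeding the threshold.

    Each sample is a (lat, lon, vertical_g) tuple; on a real device the
    values would stream from a GPS unit and an accelerometer, and each
    hit would also trigger the camera.
    """
    return [PotholeEvent(lat, lon, g)
            for lat, lon, g in samples
            if abs(g) >= SHOCK_THRESHOLD_G]

samples = [
    (40.7128, -74.0060, 0.3),   # smooth pavement
    (40.7130, -74.0062, 2.1),   # sharp jolt: likely pothole
    (40.7133, -74.0065, 0.4),
]
events = detect_potholes(samples)
```

On the Raspberry Pi, the same loop would simply read from sensor drivers instead of a list, which is what makes the platform attractive for this kind of prototyping.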
ARGO has also provided pro-bono assistance to the California Data Collaborative, including a novel partnership with Enigma Technologies, a leading civic data startup that powers the world's largest repository of public data. The tools enabling the integration of that data have been provided pro-bono by Enigma to scale the California Data Collaborative's early success and realize its vision of integrating the entire lifecycle of water data in California.
These case studies provide points of comparison by which development strategies can be evaluated. The USGS and WSWC data exchanges represent traditional information technology projects: large-scale, unitary, top-down systems. The WaDE database attempts to solve the entire water data problem at a single stroke. While the system holds a great deal of promise, that promise has yet to be realized despite a development time of four years. This lengthy development cycle highlights the weaknesses of such a top-down approach. In the fast-moving world of information technology, four years will see entire technologies rise and fall into obsolescence. Thus, many assumptions in a traditional project may become irrelevant, and user needs will likely change. This leads to costly delays as the project is redesigned to fit new baselines. Worse yet, such projects tend to build in features which never get used and often fail to meet user needs, despite earnestly developed plans and the hard work of their design and implementation teams. In contrast, the California Data Collaborative, Seed Consulting, ARGO Labs, and the USGS work to solve large problems one piece at a time using agile methods. These projects offer many lessons on how to create an integrated water data warehouse. The California Data Collaborative and Seed's Waterlog project have created usable resources in very short periods of time. The USGS Water Information System only seems monolithic at first glance because of the system boundary drawn around a large array of interoperable subsystems working together to collect, organize, and aggregate data. The system was not developed overnight, nor was it created and released whole cloth. Rather, it was developed through an iterative process which refined goals at each step and built on the successes of each previous iteration. This continuous improvement lies at the heart of agile methodologies.
Likewise, the California Data Collaborative and the Waterlog project both use iterative and agile development methods to make rapid progress which can be demonstrated to stakeholders. This suggests that the data integration mandated by AB 1755 can be undertaken using agile methodologies which break the overall goal down into smaller tasks and turn out working products that accomplish pieces of the larger goal.
Agile methodologies rely on open source[v] tools, including development frameworks, programming languages, software libraries, applications, and operating systems. The low cost and large development communities associated with open source systems provide advantages which make it easy to get a major data project off the ground quickly. Not only do these systems have a low cost of adoption for new projects, they often work together by design, using standardized data structures such as JSON[vi] and XML[vii]. These standards benefit not only end users of data but development teams as well, allowing even loosely connected partnerships to work effectively together. The California Data Collaborative's partnership with Seed Consulting on a new version of Waterlog demonstrates these advantages.
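The interoperability benefit is concrete: a record serialized as JSON by one team can be read by any partner without a custom parser. A minimal sketch, with hypothetical field names rather than any schema actually used by the Collaborative:

```python
import json

# A hypothetical monthly water-use record; the field names are
# illustrative, not a schema used by the California Data Collaborative.
record = {
    "utility": "Example Water District",
    "month": "2016-06",
    "deliveries_af": 1240.5,   # acre-feet delivered
}

payload = json.dumps(record)   # serialize for exchange over the wire
restored = json.loads(payload) # any JSON-aware partner reads it back
```

Because JSON parsers exist for every mainstream language, loosely coupled teams can exchange such records without agreeing on anything beyond the field names.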
The work of both Seed and the California Data Collaborative demonstrates another essential fact: those working to integrate data cannot wait for all the agencies involved to come to consensus on data schemas, formats, and representations. This is not to say that such consensus cannot be reached. In fact, agencies largely agree on the need for common data structures and protocols and have begun working towards these goals; many have already made significant strides. However, most agencies lack the resources to make profound changes quickly to accommodate the goal of data integration. This means any project which aims to develop an integrated database of water information must take the world as it is rather than as it should be. Currently, at least two state agencies provide data access through a REST interface, as does the USGS. These initial efforts lay the foundation for rapid progress towards further integration.
Some agencies, on the other hand, rely on legacy systems and practices difficult to replace or redesign while retaining operational integrity. In the case of the State Water Project and the Central Valley Project, data is made available through PDF-formatted reports containing tabular data. Those working to integrate water system data cannot wait for these agencies to update their practices. The urgency here is underscored by the need for real-time information on reservoir levels and outputs in estimating water supplies. Fortunately, there exists a variety of tools for scraping data from web pages and for parsing data contained in human-readable formats such as PDF. Web scraping extracts information from web sites using software tools and can be employed on both the SWP and CVP sites to harvest data and populate an integrated water database. Many web scraping tools are available. For instance, the Python[viii] programming language excels at this sort of work and has a large library of third-party modules optimized for the job. For processing text data, Python again offers a diverse assortment of capable modules that power proven tools for integrating public data through agile methodologies, such as Enigma's Parsekit[ix]. Parsekit is a mature tool which facilitates the entire Extract, Transform, and Load process. It scales well and allows developers to do more, faster. Tools like Parsekit diminish the obstacles of divergent data formats and protocols.
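Once the text of a tabular report has been extracted from a PDF, turning it into database-ready records is straightforward string work. The sketch below assumes a hypothetical column layout; the actual SWP and CVP report formats would need their own rules.

```python
# Hypothetical parser for text extracted from a tabular PDF report.
# The column layout shown is illustrative, not the actual SWP/CVP format.
def parse_reservoir_table(text):
    """Turn whitespace-separated 'name storage_af outflow_cfs' rows
    into dictionaries ready to load into a database."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip headers and malformed lines: data rows have exactly
        # three fields and a numeric storage column.
        if len(parts) != 3 or not parts[1].replace(",", "").isdigit():
            continue
        name, storage, outflow = parts
        rows.append({
            "reservoir": name,
            "storage_af": int(storage.replace(",", "")),
            "outflow_cfs": int(outflow.replace(",", "")),
        })
    return rows

report_text = """
Reservoir Storage_AF Outflow_CFS
Oroville 1,500,000 4,200
Shasta 2,100,000 5,800
"""
records = parse_reservoir_table(report_text)
```

A production pipeline would feed this function from a PDF-to-text extraction step and load the resulting dictionaries into the integrated database, which is exactly the Extract, Transform, and Load pattern tools like Parsekit industrialize.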
The primary recommendation of this memo is to work towards data integration using agile techniques and open source software and systems. However, some specific systems recommendations come first. Since California's Department of Water Resources already runs a Tier 3 data center, it makes sense to use that existing capacity to host an integrated water database; alternatively, cloud services such as Amazon Web Services or Liquid Web would easily meet any system requirement at a low price. The Postgres database management system (DBMS) offers features which make it ideally suited to representing data on water use and supply. It supports the data types needed for both scientific modeling and geospatial analysis. An open source DBMS, Postgres can be acquired and maintained at a low cost. It is also highly customizable, enabling developers to modify it to handle any requirement. As a fully compliant SQL[x] database system, Postgres is interoperable with most development frameworks and web servers.
While selecting the proper system components is important, the central recommendation pertains to software practices and culture: agile methodologies. Agile methods work in short bursts called sprints, rapidly planning, building, and testing small components of larger projects. The principal working unit of agile methodologies is the scrum: a cross-functional team of developers, domain experts, analysts, and others working in close collaboration to accomplish a discrete and well-bounded goal. A project manager, or Scrum Master, oversees this team, acting to resolve deadlocks and manage conflict. The Scrum Master also gathers data with which to evaluate progress. Planning, review, and retrospective meetings border the sprints on each side. Planning meetings guide the project, assigning personnel, allocating resources, and creating a project plan. At the end of each sprint, a review meeting with the development team and stakeholders evaluates the sprint's progress. In these meetings, goals are refined and opportunities explored. Each review meeting also helps to build the project's business case through frequent contact with stakeholders, ensuring the development team and stakeholders share the same values and priorities. Retrospective meetings add a reflexive component which forces the team to review its behavior and performance with an eye towards improvement and capacity building. These retrospectives put development teams on a course of continuous improvement, resulting in more productive sprints which accomplish more.
The Dodd Bill stipulates requirements rather than architectures, leaving developers latitude to design specific system architectures. As the overall mission of the bill is to integrate data already being collected and distributed, developers need not stick with conventional database designs. A federated database design seems well suited to the task. Federated databases differ from traditional databases in that they represent a coordinating layer which sits atop other, geographically distributed databases. Businesses commonly use federated databases to link regional databases into a single data store. A federated database system may include a great number of heterogeneous systems differing by operating system, vendor, and data communication protocol. The modern federated approach goes beyond a mere networking of data. Systems such as Postgres, MySQL, SQL Server, and Oracle provide sophisticated tools for incorporating external data and making it seem as though it resides in a single database. The data wrappers provided by most systems do more than pass queries from one system to another: they translate the data schema of one database into that of another. The federating system can then keep a local cache of records, increasing performance, improving reliability, and reducing stress on member systems. This approach promises more than simple speed to implementation. A federated system avoids duplication of effort and the switching costs associated with technology because it integrates and complements existing systems rather than replacing or superseding them. Modern federation tools provide another important advantage: rather than needing to work out special agreements for data sharing among agencies releasing public data, a federated database can operate over web services protocols like REST and SOAP.
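In Postgres, this federation is done with foreign data wrappers. The fragment below is an illustrative sketch only: the server names, credentials, and remote table layout are hypothetical, and running it requires a live Postgres installation with the postgres_fdw extension.

```sql
-- Illustrative postgres_fdw setup; server name, credentials, and the
-- remote table layout are hypothetical.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER swrcb_data
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'data.example.ca.gov', dbname 'water_quality');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER swrcb_data
    OPTIONS (user 'reader', password 'secret');

-- The remote table can now be queried as if it were local.
CREATE FOREIGN TABLE station_samples (
    station_id  text,
    sample_date date,
    result      numeric
) SERVER swrcb_data OPTIONS (schema_name 'public', table_name 'samples');

SELECT station_id, avg(result)
  FROM station_samples
 GROUP BY station_id;
```

The member agency's database is untouched: the federating layer handles connection, schema mapping, and query pass-through, which is precisely why this design avoids switching costs.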
For instance, the USGS makes its extensive data on state, regional, and national water systems available to the public, free of charge, through a REST API over the web. A federated database can thus initiate data exchange over existing, public channels right away. Likewise, other systems not contained in the DWR's Tier 3 data center can be integrated with a minimum of planning and development. Inside the data center, existing database systems can be linked rapidly and with little latency in data communication, allowing a high-performance integrated water database to be implemented quickly, without disruption to existing services, and without delays due to switching over. For these reasons, a federated design will minimize costs while providing an effective solution to the problems of integration and coordination.
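Querying the USGS service is as simple as constructing a URL. The sketch below targets the USGS Water Services instantaneous-values endpoint; the site number shown is an illustrative placeholder, and no network call is made here.

```python
from urllib.parse import urlencode

# Build a query against the USGS Water Services REST API
# (instantaneous-values service). The site number is an illustrative
# placeholder for a real USGS gauging station id.
BASE = "https://waterservices.usgs.gov/nwis/iv/"
params = {
    "format": "json",
    "sites": "11446500",     # example station id
    "parameterCd": "00060",  # 00060 = discharge, cubic feet per second
    "period": "P1D",         # the most recent day of readings
}
url = BASE + "?" + urlencode(params)
# A federated layer would fetch this URL (e.g., with urllib.request)
# and merge the JSON response into the integrated store.
```

Because the channel is plain HTTPS and the payload plain JSON, no special data sharing agreement or private network link is needed to federate this source.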
Conclusion and Future Work
Though integrating data from the lifecycle of California's water may appear daunting, possibilities for rapid progress exist. Small teams employing open source tools and agile methods have already begun to chip away at the larger problem by breaking it down into smaller pieces and attacking each one in rapid development cycles, or sprints. Although agile methodologies run counter to traditional top-down planning, they are well suited to meet the demands currently placed on governments, which face constrained budgets and a public mood demanding greater efficiency and accountability in government projects. Beyond these considerations, the system mandated by the Dodd Bill must contend with the nature of policy development and political processes. Policymakers have a limited attention span and great demands on their time. The legislation calls for the creation of a fund to provide for the development of an integrated data system but does not appropriate money directly. The second, and currently unwritten, part of the Dodd Bill is funding, which will require a further case to be made for appropriations. Even if the money comes from the DWR budget, agency officers must be convinced that investments in the project are well spent. For that reason, system developers must pay close attention to both the political mood and policy development processes. The streams of policies, problems, and politics only rarely come together to create windows of opportunity for new programs. These policy windows seldom stay open long as events move forward and legislative priorities shift. These pressures will force system developers to have something to show for their efforts quickly: if not a complete system, then working components of a larger system must be realized while policymakers and agency heads are focused on the problems the Dodd Bill aims to solve.
Agile development methodologies excel at developing these sorts of products and in creating efficient and accountable teams.
Two teams in particular have achieved notable success using these practices: the California Data Collaborative and Seed Consulting. Since it started in January, the California Data Collaborative has successfully integrated water use data from many utilities and has already created tools for program evaluation, policy analysis, and planning. Seed Consulting has likewise developed a database which collects information on water supply and quality from a number of DWR and SWRCB sources, aggregates it into a single data store, and disseminates it through a REST interface to researchers and water managers. The USGS appears to have used similar approaches in unifying its vast collection of hydrologic data. All three of these projects demonstrate that the integration called for in AB 1755 is feasible and that its benefits can be rapidly realized. The next step will be to create a development roadmap assessing the state's current water information infrastructure and planning the integration work itself. Ongoing research with regard to AB 1755 will include a cost-benefit analysis and an examination of how timely data would impact forecasting and planning.
i PDF or Portable Document Format is a file format which presents text documents consistently regardless of operating system, platform, or device.
ii REST is an architecture for data exchange services on the World Wide Web. API stands for Application Program Interface and is the portal through which users may issue data queries and have those requests fulfilled.
iii In this context, Web Services refer to the protocols by which computers transact data transfers on the World Wide Web.
iv The Simple Object Access Protocol (SOAP) is a data exchange protocol on the World Wide Web which allows computer programs to seamlessly exchange data, acting as a larger, networked software system.
v Open source software is available free of charge and grants the user the right to review the code and modify it for any purpose. Such software is developed through the collaborative efforts of companies, organizations, and individuals.
vi JSON, or JavaScript Object Notation, is a lightweight, human-readable format for exchanging structured data between programs.
vii The eXtensible Markup Language (XML) is a format which describes both a document's data and how that data should be handled. XML often underlies well-known document types such as Microsoft Word files and many web pages.
viii Python is a programming language widely used in science and engineering applications. It has data structures which efficiently handle scientific data, and it has proved capable in a variety of settings ranging from web applications to database programming and data analytics.
ix In computer and data science, parsing denotes the act of extracting data from one system and translating it into the internal data structures of another.
x SQL, or Structured Query Language, is the most commonly used database programming language, used to create, read, update, and delete database records.
Appendix A: California State and Federal Water Data Exchanges