Feasibility and Strategies for Implementing the Open and Transparent Water Data Act (The Dodd Bill)

Executive Summary

This memo examines the feasibility of AB 1755, the Open and Transparent Water Data Act, through analysis of the current situation and case studies. Managing California's water resources during this period of extended and severe drought will require innovative policies, technologies, careful planning, and coordination among local, state, and federal agencies. This, in turn, requires detailed information and data on California's supplies of water and how that water is used in the state. Currently, the provision of this data is fragmented and inconsistent, and the data is often locked in formats incompatible with modern research practices.

AB 1755, the Open and Transparent Water Data Act, also known as the Dodd Bill, aims to remedy the situation by mandating the construction of an integrated database containing water data. At a minimum, the bill calls for a data portal which makes available data on water supplies in a common, open, and well-documented data format. The Dodd Bill calls for the integration of DWR data on groundwater levels, SWRCB data on water quality, operational data from the State Water Project and Central Valley Project, USGS hydrological databases, and data on fish abundance from the California Department of Fish and Wildlife.

Many efforts to provide integrated water data are underway even as the Dodd Bill works its way through the California State Assembly. This memo examines four open data efforts to better understand the challenges of integrating government data and to discover best practices. Seed Consulting's Waterlog, the US Geological Survey's National Water Information System, the Western States Water Council's Water Data Exchange, and the California Data Collaborative are profiled.

Successful projects have employed agile methodologies, open source systems, and open development frameworks. Agile methods work by breaking large problems into discrete tasks which can be accomplished in short periods of time. Thus, agile methods generate modular products whose successes build on one another. For that reason, this memo suggests employing agile methodologies to take on the tasks specified in the Dodd Bill. Further recommendations include using a federated database approach to integrate water data. This approach takes advantage of existing capabilities and systems, making the most efficient use of scarce resources.

Background

The ongoing drought and the forecast of more frequent and persistent droughts throughout the 21st century have forced the State of California to review its water data management practices so it can most effectively steward California's scarce water resources. Notably, the Western Governors' Association, the California Council on Science and Technology, the Delta Stewardship Council, and others have called for improvements in California's water data systems.1

Problem Description

While the State of California collects a great deal of information on water use, supply, and rights, these data sets are often inaccessible and create unnecessary burdens on staff. Too often such data comes in the form of executive summaries or reports. Policy and academic researchers generally cannot use this sort of data for their work without significant effort and cost. This constrains efforts to evaluate and develop policy and severely impedes innovation from academia or the private sector. The data format itself often presents problems. For example, the State Water Project and the Central Valley Project distribute data as PDFi documents. Although humans find such reports easy to read, extracting data for substantive analysis consumes time and resources. Ultimately the effort to extract data is wasteful, as it duplicates work already done by state staff. Agencies and offices in the DWR and SWRCB have begun to address this issue, but a great deal of work remains.

Water does not follow agency or jurisdictional boundaries, and hydrology is dynamic. Thus, understanding the balance of water supply and discharge requires timely, integrated data. Researchers and water system managers should be able to easily determine, for example, the amount of water flowing from streams, lakes, reservoirs, and runoff into state water systems. While much of this data does exist within federal and state agencies, it may only be available as a summary table within a cumbersome web form or a PDF-formatted report. In some instances, it might even take a FOIA request to access the data. Procuring and extracting data in such formats needlessly consumes the time and resources of researchers and water managers. Some researchers interviewed have spent upwards of a year searching for data and waiting for agencies to package and deliver it. The time and effort needed to acquire data hobbles efforts to forecast California's supplies and develop effective conservation programs. The economic costs of these delays include not only the wasted time of professionals researching and managing California's water resources, but also the cost of missed opportunities to detect problems and enact effective policies in a timely manner. Again, agencies have made efforts to open their data, most notably the SWRCB, which has created an open data portal and sponsors events aimed at developing new data collaborations. Collaboration with organizations outside of government will bring research and analytical capacities from industry and academia into planning and policy processes and provide timely, high-quality information to policymakers.

The Dodd Bill

Assembly Bill 1755, the Open and Transparent Water Data Act, calls for integrating the water data collected by state and federal agencies into a common data warehouse accessible to the public, policymakers, and researchers. The legislation calls for integrating DWR data on groundwater levels, SWRCB data on water quality, operational data from the State Water Project and Central Valley Project, USGS hydrological databases, and data on fish abundance from the California Department of Fish and Wildlife. This integration will be accomplished through common data formats and consistent metadata describing the data collected and released. The end goal is an accessible and comprehensive data warehouse of the state's water supply and use, built on common data protocols that support the creation of innovative visualizations of time series and spatial data. The question now is how this will be accomplished. Various projects have attempted to solve the data integration problem both inside and outside of government. Each employs different development methodologies and stems from divergent project management philosophies. Their successes and failures can serve as models for implementing AB 1755. More than that, project methodologies influence choices in technology, personnel selection, and scheduling. Thus, choices in methodology change the cost structure of the project and alter its feasibility.
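
To make the bill's call for common formats and consistent metadata concrete, the short sketch below expresses one observation as a self-describing JSON record. The field names, station code, and values are illustrative assumptions, not a schema prescribed by AB 1755.

    import json

    # A hypothetical water observation in a common, well-documented format.
    # Every field name and value here is an illustrative assumption.
    observation = {
        "station_id": "SHA",                       # hypothetical station code
        "agency": "DWR",
        "parameter": "reservoir_storage",
        "unit": "acre-feet",
        "timestamp": "2016-06-01T00:00:00-08:00",
        "value": 4230000,
        "metadata": {
            "method": "sensor",
            "quality_flag": "provisional",
        },
    }

    print(json.dumps(observation, indent=2))

Records like this are trivially machine-readable, and the metadata travels with the data rather than sitting in a separate report.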

Case Studies

At the present time, many projects to integrate water data are underway, including important water data integration efforts within the California SWRCB and DWR. This memo considers broader inter-agency water data integration efforts, including the SEED Consulting Group's Waterlog project, the USGS National Water Information System, and the Western States Water Council's Water Data Exchange, as examples of ongoing data integration of the kind the Dodd Bill calls for. In addition, this memo examines the model underpinning the early success of the California Data Collaborative's work with urban water use and its synergies with the pioneering civic data science nonprofit ARGO Labs.

Seed Consulting and the Waterlog (SeedCG.org)

The Seed Consulting Group is a nonprofit consulting group founded in 2014 to tackle urgent problems in the environment and public health through partnerships between professionals and nonprofit groups. The professionals in Seed volunteer their time and expertise for projects dealing with environmental, public health, and social problems. One of Seed's projects is Waterlog, which aims to aggregate all publicly available water data into a single, easy-to-use web site providing analytics and data to stakeholders working to solve California's water problems.

Seed built on well-known agile methods and employed a microservice approach. Microservices recognize that the information world moves fast and now uses a wide variety of end-user devices and interfaces. Nor is the data back end monolithic any longer: behind the scenes, data may come from multiple sources run by different organizations. Microservices rest on a four-tier application model. This model breaks apart the traditional, single-structure view of software and re-envisions the application as an interconnected ecosystem of individual entities working together to deliver services. Under this model, complete data services do not have to be developed by a single entity. Rather, they can emerge through collaborative efforts or even through emergent coordination between independent developers taking advantage of opportunities made possible by the work of others. This concept builds on longstanding programming concepts such as abstraction and is made possible by modern open code libraries, frameworks, and data interchange protocols which provide both common programming tools and standard data formats.

Using the microservice model, Seed was able to turn a large and loosely defined problem into a number of actionable and feasible goals. Rather than develop a complete solution, Seed focused on data harvesting and aggregation. Their site employs a REST APIii which provides users a gateway to Waterlog data. This architecture enables researchers to pull data on demand into their own databases through widely used software libraries. Waterlog aggregates data from Department of Water Resources web sites. Rather than develop every analytical tool themselves, the Waterlog service encourages researchers to employ its REST API, allowing them the freedom to pose their own research questions. Currently Seed Consulting is partnering with the California Data Collaborative to create a new version of the Waterlog project which will include data from additional sources and more advanced visualizations.
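
The value of such a gateway is easiest to see in code. The sketch below pulls station data into a Python session with the widely used requests library; the endpoint URL, query parameters, and response fields are hypothetical stand-ins, since the actual Waterlog routes are defined by Seed's documentation.

    import requests

    # Hypothetical Waterlog endpoint; the real routes are documented by Seed.
    URL = "https://api.waterlog.example.org/v1/stations"

    # Request stations for one county as JSON and fail loudly on HTTP errors.
    response = requests.get(URL, params={"county": "Yolo", "format": "json"})
    response.raise_for_status()

    # The field names below are assumptions about the payload shape.
    for station in response.json():
        print(station["name"], station["latest_reading"])

A researcher can run a few lines like these on a schedule and load the results straight into a local database, with no report-reading or manual extraction in between.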

The US Geological Survey’s National Water Information System

The US Geological Survey provides another example of collaboration through data with its National Water Information System (NWIS). In most respects, this service is a traditional, monolithic information system. The USGS collects data from its extensive sensor and satellite network. The site provides a great deal of data about precipitation, groundwater, and stream flows, as well as some data on water use reported by state water departments. The USGS also provides a great many analytical tools and graphs. However, the USGS realizes that researchers and analysts need access to raw data in order to conduct research. Thus, the site provides complete access to data in the National Water Information System through Web Servicesiii, using both a REST API and SOAPiv services. The USGS provides an excellent service for those who need water data, though the site only offers data collected by the USGS itself and the US EPA. Although the USGS only collects and distributes data from federal sources, that data can be readily integrated into research thanks to its Web Service and REST interfaces, providing an invaluable service to researchers, policy analysts, and water managers.
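
As a sketch of what this looks like in practice, the snippet below requests a week of streamflow readings from the NWIS Instantaneous Values service. The gauge number is an illustrative assumption, and the response traversal follows the service's WaterML-style JSON layout; both should be verified against the USGS documentation before use.

    import requests

    URL = "https://waterservices.usgs.gov/nwis/iv/"
    params = {
        "format": "json",
        "sites": "11425500",     # example gauge number (assumption)
        "parameterCd": "00060",  # discharge in cubic feet per second
        "period": "P7D",         # the most recent seven days
    }

    data = requests.get(URL, params=params).json()

    # Walk the nested WaterML-style JSON down to the individual readings.
    series = data["value"]["timeSeries"][0]
    for point in series["values"][0]["value"]:
        print(point["dateTime"], point["value"])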

Western States Water Council’s Water Data Exchange

The Western States Water Council (WSWC) has also launched an effort to aggregate water data for planning and management. Its Water Data Exchange (WaDE) is an ambitious effort to coordinate data from eighteen Western states. WaDE was begun in 2012 in response to the fragmentation of data on the Western United States' water supply. The system was designed to create a centralized, easy-to-access repository of water data in a common format. WaDE, currently in beta test, provides access to water use data from many different states and federal agencies, including the US Geological Survey (USGS) and the California Department of Water Resources. WaDE provides data through a RESTful Web Services interface which employs an XML vocabulary called WaDE XML.

The developers of WaDE made several design decisions based on the political and organizational realities of a cooperative data project. The first was to make the platform open and decentralized. Under this scheme, each state operates a node containing data using its own infrastructure and format. Web Services link each node to the central WaDE repository using a combination of REST queries and Simple Object Access Protocol (SOAP) services. While each state's node operates according to its own internal policy, it must respond to query URLs from the WaDE client to pull data. The program specifies a small set of distinct services which allow for authentication, sharing of data catalogs, and on-demand retrieval of data. Each node shares catalogs of data following protocols resembling the Open Archives Initiative's Protocol for Metadata Harvesting. This process synchronizes data catalogs to provide a central inventory of available data for the exchange. The project is also developing not only an internal data schema which organizes data, but an XML-based markup language for data exchange.

WaDE provides access to its collected data through a portal based on the REST protocol. This provides a very flexible interface for researchers of all kinds. Rather than impose any sort of analytical constraints, WaDE provides the data and its attendant metadata to users through individually crafted REST queries. The developers of WaDE understand that their clients have many needs and that they would be hard pressed to cater to them all. To make the best use of their limited resources, the developers decided to provide an open data gateway using REST. The beta version of WaDE currently includes GIS applications using ArcGIS as well as a web-form-based query tool. However, the project is not complete and only aggregates summary data provided by member states. At this point in time, WaDE remains in development and can only be accessed with an account and password.
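
Because WaDE XML is still evolving, the sketch below uses made-up element names simply to show how a returned document can be consumed with Python's standard library; the actual WaDE vocabulary will differ.

    import xml.etree.ElementTree as ET

    # A stand-in for a WaDE XML response; element and attribute names
    # are illustrative assumptions, not the real WaDE schema.
    sample = """
    <WaterAllocations>
      <Allocation state="CA" beneficialUse="Irrigation" amount="1200" unit="AF"/>
      <Allocation state="NV" beneficialUse="Municipal" amount="300" unit="AF"/>
    </WaterAllocations>
    """

    root = ET.fromstring(sample)
    for allocation in root.findall("Allocation"):
        print(allocation.get("state"), allocation.get("beneficialUse"),
              allocation.get("amount"), allocation.get("unit"))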

California Data Collaborative (CaliforniaDataCollaborative.com)

The California Data Collaborative was founded in January of 2016 as a coalition of municipal water utilities serving more than 3.7 million Californians. The California Data Collaborative's work has already been honored by the White House as part of its March 2016 Water Summit and has yielded impressive results in the form of new analytical tools, policy evaluations, and economic models. The California Data Collaborative achieved these results through the use of agile methodologies, which focus on incremental progress rather than creating a complete solution all at once. Agile practices recognize and internalize the reality that projects change as they progress, and they emphasize frequent contact with stakeholders to ensure that current needs are met. Utilizing these practices, the Data Collaborative was able to divide a large and challenging project into a set of smaller and easier-to-accomplish goals. Through its agile approach, the project was able to exploit emergent opportunities and incorporate them into the larger whole on the fly. More importantly, by meeting frequently with the utility managers who comprised the stakeholder community, the California Data Collaborative was able to build a strong business case for its work and ensure that its products met the needs of its users.

The California Data Collaborative breaks away from traditional modes of documentation (reports, spreadsheets, tables) and focuses on providing data for analysis and effective visualizations using open source tools. In contrast with the State Water Project's documentation of operations, the California Data Collaborative will make its data available through a PostgreSQL database accessed through a secure web interface supporting queries and providing visualization tools for registered users. As a great deal of data in this system is confidential, access will be limited to participating utilities and researchers, using strong security to prevent breaches of privacy. This will let researchers quickly obtain what they need without wading through multiple documents and reports in search of the right data, only to spend further time extracting it and loading it into their own systems. The California Data Collaborative's SCUBA database also standardizes data from multiple sources into a consistent data structure which facilitates comparative research and evaluation.
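
The payoff of a standardized store is that ordinary SQL answers comparative questions directly. The sketch below queries a hypothetical SCUBA instance with the psycopg2 driver; the connection string, table, and column names are assumptions for illustration, not the Collaborative's actual schema.

    import psycopg2

    # Connection details are placeholders; access to SCUBA is restricted.
    conn = psycopg2.connect("dbname=scuba host=db.example.org user=analyst")

    with conn.cursor() as cur:
        # Total metered use per utility since the start of 2016
        # (hypothetical table and column names).
        cur.execute(
            """
            SELECT utility_name, SUM(usage_ccf) AS total_usage
            FROM monthly_usage
            WHERE billing_month >= %s
            GROUP BY utility_name
            ORDER BY total_usage DESC
            """,
            ("2016-01-01",),
        )
        for utility, total in cur.fetchall():
            print(utility, total)

    conn.close()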

ARGO Labs

Advanced Research in Government Operations Laboratories, or ARGO Labs for short, is a leading civic data science nonprofit that has been featured by Fast CoExist, the Fox News Smart Cities initiative, and a half dozen newspapers in New York, where it is headquartered. The organization uniquely brings together civic data science and governmental operations expertise to improve public service delivery. Cities typically have a lengthy procurement cycle which solicits bids and develops specifications that tend to assume each technology component must be individually developed. The bid process can be quite expensive, and once a bid has been accepted, it may take two years or longer for a project to reach production. In business, two years encompasses an entire technology cycle in which products and entire systems mature and become obsolete. In contrast to this model, ARGO's team utilized widely available commercial off-the-shelf technologies to build a project platform and develop a prototype product. This example demonstrates the power of an agile approach which employs interoperable and open technologies.

Consider an age-old problem plaguing New York and most other cities: the pothole. Causing damage to cars and injury to cyclists, potholes resist conventional efforts to detect and repair them. They are widespread and emerge quickly. City works departments struggle to keep up with them, in large part because of the effort needed to detect and catalog potholes. Varun Adibhatla and his colleagues envisioned a scalable detection system which could measure the entirety of New York City's pothole problem, the first step to solving that problem. Using the Raspberry Pi as an integration platform, the ARGO team rapidly prototyped a device which detects a pothole and automatically takes a picture while recording its GPS coordinates. Thus was born the Street Quality Identification Device (SQUID). The prototype was developed rapidly between April and May of 2015. With the device secured to an automobile, the ARGO development team was able to develop an accurate map of their neighborhood's potholes. The key to this rapid prototyping effort was the Raspberry Pi computer, a single-board microcomputer not much bigger than a smartphone. This computer is a low-cost, bare-bones device which runs a lightweight Debian Linux operating system. While not a personal computer, it is an information widget which can manage sensors, control robots, and fit into almost any custom computing and informatics application. Technologies like this can be easily acquired and provide an easily programmable and extensible central processor which can gather data from sensors, organize it, and deposit it into municipal information systems with few intermediate steps. As it stands, SQUID provides an example of alternatives to traditional civic procurement and development cycles.
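
A detection loop of this kind fits in a page of Python. The sketch below is a minimal rendering of the SQUID concept with the hardware calls stubbed out; the threshold value and the helper functions are assumptions standing in for the accelerometer, GPS, and camera drivers a real build would use.

    import time

    BUMP_THRESHOLD = 2.5  # assumed g-force spike suggesting a pothole

    def read_vertical_acceleration():
        return 1.0  # stub: replace with an accelerometer driver call

    def read_gps():
        return (40.7128, -74.0060)  # stub: replace with a GPS module read

    def capture_photo(path):
        pass  # stub: replace with a camera capture call

    # Sample roughly 20 times per second; on a sharp vertical jolt,
    # record the location and photograph the road surface.
    while True:
        if abs(read_vertical_acceleration()) > BUMP_THRESHOLD:
            lat, lon = read_gps()
            capture_photo("/tmp/pothole_%.5f_%.5f.jpg" % (lat, lon))
            print("possible pothole at", lat, lon)
        time.sleep(0.05)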

ARGO has also provided pro bono assistance to the California Data Collaborative, including a novel partnership with Enigma Technologies, a leading civic data startup that powers the world's largest repository of public data. The tools enabling the integration of that data have been provided pro bono by Enigma to scale the California Data Collaborative's early success and realize its vision of integrating the entire lifecycle of water data in California.

Recommendations

These case studies provide points of comparison by which development strategies can be evaluated. The USGS and WSWC data exchanges represent traditional information technology projects: large-scale, unitary, top-down systems. The WaDE database attempts to solve the entire water data problem at a single stroke. While this system holds a great deal of promise, that promise has yet to be realized despite a development time of four years. This lengthy development cycle highlights the weaknesses of a top-down approach. In the fast-moving world of information technology, four years will see entire technologies rise and fall into obsolescence. Thus, many assumptions in a traditional project may become irrelevant, and user needs will likely change. This leads to costly delays as the project is redesigned to fit new baselines. Worse yet, such projects tend to build in features which never get used and often fail to meet user needs, despite earnestly developed plans and the hard work of their design and implementation teams.

In contrast, the California Data Collaborative, Seed Consulting, ARGO Labs, and the USGS work to solve large problems one piece at a time using agile methods. These projects offer many lessons on how to create an integrated water data warehouse. The California Data Collaborative and Seed's Waterlog project have created usable resources in very short periods of time. The USGS National Water Information System only seems monolithic at first glance because of the system boundary drawn around a large array of interoperable subsystems working together to collect, organize, and aggregate data. This system was not developed overnight, nor was it created and released whole cloth. Rather, it was developed through an iterative process which refined goals at each step and built on the successes of each previous iteration. This continuous improvement lies at the heart of agile methodologies. Likewise, the California Data Collaborative and the Waterlog project both use iterative and agile development methods to make rapid progress which can be demonstrated to stakeholders. This suggests that the data integration mandated by AB 1755 can be undertaken using agile methodologies which break the overall goal down into smaller tasks and turn out working products that accomplish pieces of the larger goal.

Agile methodologies rely on open sourcev tools, including development frameworks, programming languages, software libraries, applications, and operating systems. The low cost and large development communities associated with open source systems make it easy to get a major data project off the ground quickly. Not only do these systems have a low adoption cost for new projects, they often work together by design, using standardized data structures such as JSONvi and XMLvii. These standards benefit not only end users of data but development teams as well, allowing even loosely connected partnerships to work effectively together. The California Data Collaborative's partnership with Seed Consulting on a new version of Waterlog demonstrates these advantages.

The work of both Seed and the California Data Collaborative demonstrates another essential fact: those working to integrate data cannot wait for all the agencies involved to come to consensus on data schemas, formats, and representations. This is not to say that such consensus cannot be reached. In fact, agencies largely agree on the need for common data structures and protocols and have begun working towards these goals; many have already made significant strides. However, most agencies lack the resources to make profound changes quickly to accommodate the goal of data integration. This means any project which aims to develop an integrated database of water information must take the world as it is rather than as it should be. Currently, at least two state agencies provide data access through a REST interface, as does the USGS. These initial efforts lay the foundation for rapid progress towards further integration.

Some agencies, on the other hand, rely on legacy systems and practices that are difficult to replace or redesign while retaining operational integrity. In the case of the State Water Project and Central Valley Project, data is made available through PDF-formatted reports containing tabular data. Those working to integrate water system data cannot wait for these agencies to update their practices. The urgency here is underscored by the need for real-time information on reservoir levels and outputs in estimating water supplies. Fortunately, a variety of tools exist for scraping data from web pages and for parsing data contained in human-readable formats such as PDF. Web scraping extracts information from web sites using software tools and can be employed on both the SWP and CVP sites to harvest data and populate an integrated water database. Many tools for web scraping are available. For instance, the Pythonviii programming language excels at this sort of work and has a large library of third-party modules optimized for the job. For processing text data, Python again offers a diverse assortment of capable modules, which can power proven tools for integrating public data through agile methodologies such as Enigma's Parsekit.ix Parsekit is a mature tool which facilitates the entire Extract, Transform, and Load process. Parsekit scales well and allows developers to do more, faster. Tools like Enigma's Parsekit diminish the obstacles of divergent data formats and protocols.
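
As a small illustration of the scraping approach, the sketch below fetches a report page and prints the rows of its first HTML table using requests and BeautifulSoup. The URL is a hypothetical stand-in for an SWP or CVP report page; real pages need page-specific selectors, and PDF reports would call for a PDF-parsing module instead.

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; substitute the actual report page to be harvested.
    URL = "https://water.example.gov/reports/reservoir_daily.html"

    soup = BeautifulSoup(requests.get(URL).text, "html.parser")
    table = soup.find("table")  # grab the first table on the page

    # Flatten each row into a list of cell strings, ready for loading
    # into a database or writing out as CSV.
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        print(cells)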

The primary recommendation of this memo is to work towards data integration using agile techniques and open source software and systems. However, some specific systems recommendations come first. Since California's Department of Water Resources already runs a Tier 3 Data Center, it makes sense to utilize its existing capacity to host an integrated water database. However, cloud services such as Amazon Web Services or Liquid Web would easily meet any system requirement at a low price. The Postgres database management system (DBMS) offers features which make it ideally suited to representing data on water use and supply. It supports the data types needed for both scientific modeling and geospatial analysis. An open source DBMS, Postgres can be acquired and maintained at low cost. It is also highly customizable, enabling developers to modify it to handle any requirement. As a fully compliant SQLx database system, Postgres is interoperable with most development frameworks and web servers.

While selecting the proper system components is important, the central recommendation pertains to software practices and culture, in this case agile methodologies. Agile methods work in short bursts called sprints, rapidly planning, building, and testing small components of larger projects. The principal working unit of agile methodologies is the scrum: a cross-functional team of developers, domain experts, analysts, and others working in close collaboration to accomplish a discrete and well-bounded goal. A project manager, or Scrum Master, oversees this team, acting to resolve deadlocks and manage conflict. The Scrum Master also gathers data with which to evaluate progress. Planning, review, and retrospective meetings border the sprints on each side. Planning meetings guide the project, assigning personnel, allocating resources, and creating a project plan. At the end of each sprint, a review meeting with the development team and stakeholders evaluates the progress of the sprint. In these meetings, goals are refined and opportunities explored. Each review meeting also helps to build the project's business case through frequent contact with stakeholders and ensures the development team and stakeholders share the same values and priorities. Retrospective meetings build in a reflexive component which forces the team to review its behavior and performance with an eye towards improvement and capacity building. These retrospective meetings put development teams on a course of continuous improvement, resulting in more productive sprints which accomplish more.

The Dodd Bill stipulates requirements rather than architectures, leaving developers latitude to design specific system architectures. As the overall mission of the bill is to integrate data already being collected and distributed, developers need not stick with conventional database designs. A federated database design seems well suited to the task. Federated databases differ from traditional databases in that they form a coordinating layer which sits atop other, geographically distributed databases. Businesses commonly use federated databases to link regional databases into a single data store. A federated database system may include a great number of heterogeneous systems differing by operating system, vendor, and data communication protocol. The modern federated approach goes beyond mere networking of data. Systems such as Postgres, MySQL, SQL Server, and Oracle provide sophisticated tools for incorporating external data and making it seem as though it resides in a single database. The data wrappers provided by most systems do more than pass queries from one system to another: they translate data schemas from one database to those of another. The federating system can then keep a local cache of records, increasing performance, improving reliability, and reducing stress on member systems.

This approach promises more than simple speed of implementation. A federated system avoids duplication of effort and the switching costs associated with technology, as it integrates and complements existing systems rather than replacing or superseding them. Modern federation tools provide another important advantage. Rather than needing to work out special data sharing agreements among agencies releasing public data, a federated database can operate over web services protocols like REST and SOAP. For instance, the USGS makes its extensive data on state, regional, and national water systems available to the public, free of charge, using a REST API over the web. Thus, a federated database can initiate data exchange over existing public channels right away. Likewise, other systems not contained in the DWR's Tier 3 Data Center can be integrated with a minimum of planning and development. Inside the data center, existing database systems can be linked rapidly and with little latency in data communication, allowing a high-performance integrated water database to be implemented quickly, without disruption to existing services and without delays due to switching over. For these reasons, a federated design will minimize costs while providing a cost-effective solution to the problems of integration and coordination.
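
As a concrete sketch of federation in Postgres, the snippet below uses the built-in postgres_fdw foreign data wrapper to expose a table from a remote agency database as if it were local. The host names, credentials, and table definition are assumptions for illustration; postgres_fdw itself ships with Postgres.

    import psycopg2

    # Connect to the central hub database (placeholder connection string).
    conn = psycopg2.connect("dbname=water_hub user=admin")
    conn.autocommit = True

    with conn.cursor() as cur:
        # Enable the foreign data wrapper that ships with Postgres.
        cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")

        # Register a remote node (hypothetical SWRCB host and database).
        cur.execute("""
            CREATE SERVER swrcb_node FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'swrcb.example.ca.gov', dbname 'ceden')
        """)
        cur.execute("""
            CREATE USER MAPPING FOR CURRENT_USER SERVER swrcb_node
            OPTIONS (user 'reader', password 'placeholder')
        """)

        # Map a remote table into the hub; queries against water_quality
        # are forwarded to the SWRCB node transparently.
        cur.execute("""
            CREATE FOREIGN TABLE water_quality (
                station_id  text,
                sample_date date,
                analyte     text,
                result      numeric
            ) SERVER swrcb_node OPTIONS (table_name 'water_quality')
        """)

    conn.close()

Once the foreign table exists, an analyst can join it against local tables in a single query, which is precisely the single-database behavior described above.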

Conclusion and Future Work

Though integrating data from the lifecycle of California's water may appear daunting, possibilities for rapid progress exist. Small teams employing open source tools and agile methods have already begun to chip away at the larger problem by breaking it down into smaller pieces and attacking each one in rapid development cycles, or sprints. Although agile methodologies run counter to traditional top-down planning, they are well suited to meet the demands currently placed on governments, which face constrained budgets and a public mood demanding greater efficiency and accountability in government projects. Beyond these considerations, the system mandated by the Dodd Bill must contend with the nature of policy development and political processes. Policymakers have limited attention and great demands placed on their time. The legislation calls for the creation of a fund to provide for the development of an integrated data system, but does not appropriate money directly. The second, and currently unwritten, part of the Dodd Bill is funding, which will require a further case to be made for appropriations. Even if the money comes from the DWR budget, agency officers must be convinced that investments in the project are well spent. For that reason, system developers must pay close attention to both the political mood and policy development processes. The streams of policies, problems, and politics only rarely come together to create windows of opportunity for new programs. These policy windows seldom stay open long as events move forward and legislative priorities shift. These pressures will force system developers to have something to show for their efforts quickly. If not a complete system, then working components of a larger system must be realized while policymakers and agency heads are focused on the problems the Dodd Bill aims to solve. Agile development methodologies excel at developing these sorts of products and at creating efficient and accountable teams.

Two teams in particular have achieved notable success using these practices: the California Data Collaborative and Seed Consulting. Since it started in January 2016, the California Data Collaborative has successfully integrated water use data from many utilities and has already created tools for program evaluation, policy analysis, and planning. Seed Consulting has likewise developed a database which collects information on water supply and quality from a number of DWR and SWRCB sources, aggregates it into a single data store, and disseminates it through a REST interface to researchers and water managers. The USGS appears to have used similar approaches in unifying its vast collection of hydrologic data. All three of these projects demonstrate that the integration called for in AB 1755 is feasible and that its benefits can be rapidly realized. The next step will be to create a development roadmap assessing the state's current water information infrastructure and planning the integration work itself. Ongoing research with regard to AB 1755 will include a cost-benefit analysis and an examination of how timely data would impact forecasting and planning.

Notes

1 http://www.argolabs.org/blog-1/2015/11/30/integrated-water-data-infrastructure-an-idea-whose-time-has-come

i PDF or Portable Document Format is a file format which presents text documents consistently regardless of operating system, platform, or device.

ii REST (Representational State Transfer) is an architectural style for data exchange services on the World Wide Web. API stands for Application Programming Interface, the portal through which users may issue data queries and have those requests fulfilled.

iii In this context, Web Services refer to the protocols by which computers transact data transfers on the World Wide Web.

iv The Simple Object Access Protocol (SOAP) is a data exchange protocol on the World Wide Web which allows computer programs to seamlessly exchange data, acting as a larger, networked software system.

v Open source software is available free of charge and grants the user the right to review the code and modify it for any purpose. Such software is developed through the collaborative efforts of companies, organizations, and individuals.

vi JSON, or JavaScript Object Notation, is a widely used data representation standard which provides the underlying structure for many machine-to-machine data transactions on the World Wide Web. It allows disparate computer programs to share information seamlessly, as though they were parts of the same software system.

vii The eXtensible Markup Language (XML) is a markup format which describes both the content of a document and how to handle it. XML is often the underlying format of well-known document types such as Microsoft Word files and many web pages.

viii Python is a programming language widely used in science and engineering applications. It has data structures which efficiently handle scientific data, and it has also proven capable in a variety of settings ranging from web applications to database programming and data analytics.

ix In computer and data science, parsing denotes the act of extracting data from one system and translating it into the internal data structures of another.

x SQL, or Structured Query Language, is the most commonly used database programming language; it creates, reads, updates, and deletes database records.

Appendix A: California State and Federal Water Data Exchanges

Database Name | Owner/Sponsoring Agency | Type | Description
State Water Project Reservoir Operation | State Water Project | Reservoir and Aqueduct | Operations and flow data available in PDF format. Data in other formats is shared by request; Excel is preferred.
California Environmental Data Exchange Network (CEDEN) | SWRCB | Water Quality | An integrated water quality database bringing together data collected at regional centers and from third-party partners.
California Statewide Groundwater Elevation Monitoring (CASGEM) | DWR | Groundwater | Online portal with more detailed data than the Water Data Library. Shows charts of groundwater elevation over time. Data is organized by county, basin, monitoring entity, or well type as CSV files.
California Water Data Library | DWR | Water Quality and Flow Data | A data library containing the output of all continuous and discrete data. The library is not documented, making it difficult to know exactly what data it contains.
Water Information System | MWD | Urban Water Use |
California Data Exchange Center (CDEC) | California DWR | Rivers and Dams | Provides information on river flow and dam output in a variety of formats; primarily a scriptable CGI using the GET method. Covers river conditions, snowmelt runoff, snowpack, statewide river conditions, and water allocations. Also has detailed forecasting information.
Central Valley Project | US Bureau of Reclamation | Dams and Reservoirs | Operations data for federal projects in California, such as the CVP, in PDF format.
State Department of Fish and Wildlife Fish Abundance | California Department of Fish and Game | Fish Abundance | A straight HTML table report of fish abundance by type. The reports go back to 1967 and extend through last year. The methodology must be known to gauge how current the internal data is; 2016 data may not be released until the year's data is fully collected and analyzed. This will be made clear in interviews.
US Fish and Wildlife Service Biogeographic Information & Observation System | US Fish and Wildlife Service | Fish Abundance and others | A comprehensive database of GIS and time series data on factors relevant to fish and wildlife. Very carefully maintained and well organized, with clear metadata. Note: the California Department of Fish and Game originates this data and the USFWS collects and archives it, so it essentially duplicates the data provided by the state, but it may be the designated repository.
NOAA Fish Abundance | NOAA | Ecosystem Assessment | Assessments based on data from California Department of Fish and Game reports.
USGS Water Information System | USGS | Surface Water, Groundwater, Stream Levels, Land Use, Irrigation, and Biome Data | A searchable database providing a great variety of water data on streams, lakes, reservoirs, land use, precipitation, and biomes. The site provides a set of graphical and analytic tools and disseminates data through a fully documented REST and Web Services API.
California Irrigation Management Information System (CIMIS) | DWR | Irrigation Water Use | Detailed data on the evapotranspiration of irrigated land, available through a Windows application and a REST API.

Appendix B: Summary of Interviews

List of Interview Questions

From May through June 26, the managers of California's water data providers were contacted for interviews to assess their current practices. The questions asked in the interviews are listed below. No special recording or annotation software was employed.

  1. What database types (SQL Server, PostgreSQL, etc.) underlie your data portal?

  2. a. What servers does your organization use?
     b. Are they on site or in the cloud?

  3. a. Who directly does the DBA work for your organization?
     b. How many FTEs directly administer the databases?

  4. What is the budget for your organization's data publishing efforts?

  5. What development practices do you use to manage your data store?

  6. Is your data open to the public?

  7. a. In terms of the open data plan, is there a RESTful API or similar direct access scheme planned for the future?
     b. Why or why not?

  8. a. Do you have a database schema you can share?
     b. Do you have an ingest schema you can share?

Summary of Interview Responses

Greg Smith, California DWR, California Statewide Groundwater Elevation Monitoring (CASGEM), 6/16/2016

CASGEM makes available the data gathered by the state's groundwater monitoring system. They use Oracle and SQL Server as their database platforms, and their servers are hosted on site in the DWR's Tier 3 Data Center. Their DBA work is performed by the DWR's IT Division. Greg does not know the budget for hosting and staffing the databases, as that comes from the DWR's general fund. Nor does he know the number of FTEs needed to support the project; that is handled by the DWR. CASGEM uses agile practices to develop under .NET and Java. CASGEM data is open to the public through a web interface. There are no plans to develop a REST interface due to a lack of resources.

Jeremy McHugh, US Geological Survey, National Water Information System, 6/16/2016

The USGS uses Oracle for its database platform, and hosting is done both on site at the California Science Center's Tier 3 Data Center and off site through Amazon Web Services. Development is done in house; the USGS does not contract out. The budget is around $840,000. Jeremy did not know what development practices were used, but believed it varied by project. The data is open to the public and is made available through REST and SOAP Web Services.

Bekele Temesgen, California DWR, California Irrigation Management Information System (CIMIS), 06/22/2016

CIMIS uses Oracle, GRASS GIS, and SQLite embedded in its web application to store and query data. Hosting is done on site at the DWR's Tier 3 Data Center. The DWR also provides DBA services. CIMIS employs one full-time staff member to develop the web applications. They tend to use a traditional Waterfall development methodology, though they also use agile practices for some projects. Their data is available to the public.

Jarma Bennett, State Water Resources Control Board, California Environmental Data Exchange Network (CEDEN)

CEDEN hosts its data using SQL Server. The database is hosted at the DWR's Tier 3 Data Center and is supported by the DWR's IT Division. Development staffing consists of one DWR IT Division employee working ¼ time and the SWRCB Technology Director working ¾ time. CEDEN will also contract out for application development projects. Jarma did not know what development methodologies were employed. The annual budget for CEDEN is $850,000. They are currently working on developing a REST web interface.