The CaDC Efficiency Explorer is a planning and education tool for local water managers and the wider California water community. This tool is the result of a rapid first assessment of Governor Brown's Executive Order B-37-16, which calls for the development of water use targets customized to the unique conditions of each urban water agency as part of a new, permanent efficiency framework.
For version one of the Efficiency Explorer, the CaDC focused on the residential component of an agency's efficiency target. This target is calculated as the sum of indoor and outdoor residential production budgets, and can vary according to unique agency conditions as well as pending policy decisions. The parameters of these two budgets are displayed in the labeled panels below.
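As a rough illustration of the "sum of two budgets" structure described above, the sketch below adds an indoor budget to an outdoor budget. Only the summation comes from the text. The functional forms, the function and parameter names, and the example values (55 gallons per person per day, a 0.8 plant factor) are assumptions chosen for illustration and are not the Efficiency Explorer's actual parameters.

```python
# One inch of water over one square foot is 144 cubic inches,
# and one US gallon is 231 cubic inches.
GALLONS_PER_SQFT_INCH = 144 / 231


def indoor_budget_gallons(population, gpcd, days):
    """Indoor budget: people x gallons per person per day x days (assumed form)."""
    return population * gpcd * days


def outdoor_budget_gallons(landscaped_area_sqft, et_inches, plant_factor):
    """Outdoor budget: landscaped area x reference ET x plant factor (assumed form)."""
    return landscaped_area_sqft * et_inches * plant_factor * GALLONS_PER_SQFT_INCH


def residential_target_gallons(population, gpcd, days,
                               landscaped_area_sqft, et_inches, plant_factor):
    """Residential target as the sum of the indoor and outdoor budgets."""
    indoor = indoor_budget_gallons(population, gpcd, days)
    outdoor = outdoor_budget_gallons(landscaped_area_sqft, et_inches, plant_factor)
    return indoor + outdoor


# Hypothetical agency and policy scenario for one 30-day month.
total = residential_target_gallons(
    population=1000, gpcd=55, days=30,
    landscaped_area_sqft=1_000_000, et_inches=5.0, plant_factor=0.8,
)
print(f"Illustrative monthly residential target: {total:,.0f} gallons")
```

Changing the per-capita standard or the plant factor in a sketch like this mirrors how a policy scenario would move the target up or down.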
Data quality is an important dimension of this rapid first assessment. This post elaborates on the data quality concerns that arose here and that any subsequent target calculation effort must also contend with.
Distinct Types of Error
There are two distinct senses in which efficiency target calculations can deviate from ground truth: precision and accuracy.
Parameter data used to calculate targets can be imprecise. Imprecision reflects statistical deviations around a true value. The Efficiency Explorer's graphs include gray confidence bands around each agency's calculated target to indicate the imprecision resulting from the compounded statistical error for all parameter data sources. Analogous to the relationship between the darts and the bullseye in figure (a) above, one should expect the ground truth efficiency target values to lie somewhere within the confidence bands (for agencies not flagged as showing evidence of systematic inaccuracy).
While this type of error is important to understand, it is not the focus of this post. For technical details on each component error source, please see our error model. Takeaway: while imprecise target calculations can be further refined, they are useful first approximations of ground truth.
As alluded to above, in certain situations parameter data used to calculate targets can be not only imprecise, but also inaccurate. Inaccuracy reflects a more systematic bias away from ground truth. Figure (b) above graphically illustrates this type of error. Non-random inaccuracies can arise from situations such as the prevalence of large rural residential parcels in certain districts, which would result in systematic overestimation of target calculations in those districts. The prevalence of brown lawns in other districts would result in systematic underestimation of target calculations in those districts. Targets flagged as systematically inaccurate should not be interpreted as useful approximations and have therefore been grayed out on the Efficiency Explorer map.
Roughly 20% of agency target calculations show evidence of this more problematic source of error. The focus of this post is to break down the component data quality concerns leading to systematic inaccuracies.
Data Quality Concerns
There were five data quality concerns identified for our target calculations. These five concerns were evaluated on an agency-by-agency basis. The five concerns are as follows:
- CIMIS Proximity
- Rural Residential Prevalence
- Residential Parcel Accuracy
- Census Place Coverage
- Service Boundary
Let’s explore each in detail.
CIMIS Proximity
Suppliers with greater environmental demands for evaporation and plant transpiration will feel a greater demand for outdoor water production. Reference evapotranspiration captures this demand and scales the outdoor budget accordingly. To measure reference evapotranspiration, data from the Department of Water Resources' CIMIS stations are used to calculate inverse distance-weighted averages of nearby station readings for each supplier as a function of time. However, not all suppliers are within close proximity to CIMIS stations. Suppliers with no CIMIS stations within 20 kilometers, or with obvious intervening obstructions such as mountain ranges, have been flagged for data quality concerns.
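The inverse distance-weighted average and the 20 kilometer proximity flag described above can be sketched as follows for a single time step. This is a minimal illustration, not the CaDC's implementation. The weighting exponent of 2, the use of a single representative point for the supplier, and all function names are assumptions, and the check for intervening obstructions such as mountain ranges is not modeled.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_STATION_DISTANCE_KM = 20.0  # proximity threshold named in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def idw_eto(supplier, stations, power=2.0):
    """Return (weighted reference ET, flagged) for one time step.

    supplier: (lat, lon) of a representative point (assumption).
    stations: list of (lat, lon, eto) readings from nearby stations.
    flagged is True when no station lies within MAX_STATION_DISTANCE_KM.
    """
    if not stations:
        raise ValueError("at least one station reading is required")
    readings = [(haversine_km(supplier[0], supplier[1], lat, lon), eto)
                for lat, lon, eto in stations]
    flagged = min(d for d, _ in readings) > MAX_STATION_DISTANCE_KM
    # A station exactly at the supplier location would get infinite weight,
    # so return its reading directly.
    for d, eto in readings:
        if d == 0:
            return eto, flagged
    weights = [1.0 / d ** power for d, _ in readings]
    weighted_sum = sum(w * eto for w, (_, eto) in zip(weights, readings))
    return weighted_sum / sum(weights), flagged
```

Calling `idw_eto` once per time step with that step's station readings gives a supplier-level series, which matches the "as a function of time" wording above.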
Higher resolution ET data is foundational to being able to accurately assign parcel level ET and thus calculate an accurate, parcel customized water budget. The Department of Water Resources' spatial CIMIS data can provide more robust evapotranspiration measurements at a higher resolution.
Importantly though, according to the latest spatial CIMIS methodology, that measurement incorporates only solar radiation in addition to local in situ sensor measurements. To help improve on those measurements CaDC staff has partnered with researchers at UCLA and NYU CUSP to integrate additional pertinent publicly available data sources such as wind, precipitation and physiography. The latest work on that project is open source (like all of CaDC's projects) and available on the CaDC GitHub. In addition, Moulton Niguel Water District, a founding participant in the CaDC, has developed an applied R&D partnership with Jet Propulsion Laboratory to improve the spatial granularity of evapotranspiration estimates.
Rural Residential Prevalence
Suppliers with more landscaped area may have greater outdoor residential water production demands. Their outdoor residential production budgets are scaled accordingly.
There are ongoing discussions to finalize a definition of landscaped area appropriate for setting an outdoor water production standard. In our calculations, landscaped area is defined as the sum of photosynthetically active turf, and bushes and trees. These data were collected through remote-sensing in partnership with Claremont Graduate University.
While the landscapes of typical suburbs without brown lawns are captured well with the photosynthetically active remote sensing approach used, there generally exists greater data quality uncertainty for rural and wooded areas.
For these reasons, suppliers with a significant percentage of large, green, and un-irrigated residential parcels have been flagged for data quality concerns.
There seems to be an important space for human experts to manually inspect ambiguous areas to reconcile what specific category the area falls into. However, large-scale remote sensing has proved invaluable for addressing water efficiency in a way that can scale statewide. Even in areas where additional work must be done, remote sensing can be used to flag these challenging areas that need that special attention, allowing experts to operate more clinically.
Residential Parcel Accuracy
Administrative boundary data were required to demarcate residential parcels from water-using parcels designated for other land uses. For this iteration, residential parcel data were obtained from the Office of Planning and Research's residential parcel dataset. However, administrative boundary data must be deliberately validated.
If parcel data suggest that there is less than one person per residential parcel in a district, that district's residential landscaped area will likely be systematically overestimated. Conversely, if parcel data indicate that there are too many people per residential parcel, landscaped area measurements will be systematically underestimated. For these reasons, suppliers with an unreasonable average number of people per residential parcel have been flagged for data quality concerns.
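A check of this kind can be sketched in a few lines. The lower bound of one person per parcel comes from the text above. The upper bound of six people per parcel, the function name, and the returned labels are hypothetical placeholders, since the post does not state what counts as "too many".

```python
MIN_PEOPLE_PER_PARCEL = 1.0  # lower bound stated in the text
MAX_PEOPLE_PER_PARCEL = 6.0  # hypothetical upper bound, for illustration only


def parcel_accuracy_flag(population, residential_parcels,
                         min_ratio=MIN_PEOPLE_PER_PARCEL,
                         max_ratio=MAX_PEOPLE_PER_PARCEL):
    """Return a flag label for an implausible people-per-parcel ratio, or None."""
    if residential_parcels <= 0:
        return "no_parcels"
    ratio = population / residential_parcels
    if ratio < min_ratio:
        # Too many parcels relative to population: landscaped area likely overestimated.
        return "likely_overestimate"
    if ratio > max_ratio:
        # Too few parcels relative to population: landscaped area likely underestimated.
        return "likely_underestimate"
    return None
```

For example, a hypothetical supplier serving 900 people across 1,000 residential parcels would be flagged as a likely overestimate.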
OPR’s statewide residential parcel data offers a great starting point that can be built upon. Subsequent target calculations will need to improve parcel data to better align landscape area with customer type.
An approach enabled by CaDC's data sharing model is to match meters to parcels to highlight areas with maintained landscapes that use retail water in a non-manual (read: scalable) way. The current CaDC member agencies--14 as of this publication--have already made the pioneering investment in common data infrastructure to integrate that metered use and parcel data on a voluntary basis.
Census Place Coverage
Census Designated Places were used by Claremont Graduate University to focus computational resources used for landscaped area remote sensing calculations. If a large percentage of a service area's residential parcels are outside of a designated census place, landscaped area estimates will be off with this approach.
The Census Designated Places acted as a filter to focus limited computational resources on imagery tiles most likely to have people. This was an artifact of achieving statewide landscape area estimates for 4% of the State's budget in this initial iteration. In future iterations, this quality concern should be relatively straightforwardly surmountable.
With this in mind, it should be made explicit that whichever tools and data California uses to carry out and explore statewide target calculations should be agnostic to the landscape area data source. Following this design principle, CaDC staff ensured the digital infrastructure built for this pragmatic first approximation of statewide efficiency targets was robust and general enough to smoothly receive iterative refinements of source data for all budget parameters.
Service Boundary
Administrative boundary data were also required to demarcate supplier service areas in order to appropriately allocate landscaped area. For this iteration, data were obtained from the Department of Water Resources' Water Management Planning Tool as well as the California Environmental Health Tracking Program's Water System Map Viewer.
A service area’s boundary data can overlap with neighboring boundaries, which introduces a data quality concern.
Service area boundaries remain an important area for improvement. The California Environmental Health Tracking Program's water system boundary data needed additional processing and selective supplanting using Department of Water Resources' data before it could be used. Encouragingly, the Health Tracking Program seems to have the digital infrastructure in place for local water system personnel-driven iterative improvements toward an authoritative service area boundary dataset.
The five concerns above were combined into an aggregate data quality assessment for each agency's target calculation. For the agency-level evaluations see our agency-by-agency breakdown, and for a more general look at the technical foundation of this work please see the CaDC Statewide Efficiency Explorer Methodology v 1.1.
While the CaDC's rapid first assessment of Governor Brown's long term efficiency framework was the proximate cause of this exploration of the data quality landscape, it is important to keep in mind that the concerns (and their suggested remedies) apply to any subsequent target calculation effort.
The CaDC is currently scoping a version 2.0. This will expand the residential efficiency exploration to commercial, industrial and institutional water use, building on work pioneered by two current CaDC data scientists while at NYU CUSP in a research group led by Constantine Kontokosta. This integrated suite of version 2.0 tools will also look at supply in addition to demand. Stay tuned for more information!
Implementing this framework will inevitably require coordinating disparate data sources maintained by disparate institutions. By working smartly and collaboratively, together we can ensure water reliability no matter what the future holds. We invite all California water utilities to join the winning team and participate in the CaDC today!
CaDC staff is proud to announce the completion of its first rapid assessment of the prospective residential component of statewide efficiency goals described in the implementation framework for Governor Brown’s Executive Order B-37-16. This assessment integrates publicly available evapotranspiration, land use, service area boundary, aerial imagery, population and water production data to estimate residential water efficiency goals for 404 out of CA's 409 major urban water retailers reporting in the latest supplier report. The assessment offers water suppliers a first look at water use compared to residential efficiency goals and illustrates the need for enhanced data sources and additional information.
Supporting the CA water community in planning for the future
This assessment provides a marked improvement over the previous CaDC parcel based methodology, which was previously the best statewide approximation publicly available online. Those calculations are shown via an interactive tool in which water users can input and analyze various policy scenarios. This tool was developed for planning and education purposes as a public service to support the water community in navigating the rapidly evolving statewide policy discussions.
As described in the original grant agreement with the Water Foundation, "This interactive planning tool empowers the California water community to analyze the impact of those prospective efficiency standards under user selected scenarios with varying indoor or outdoor efficiency standards." The CaDC partnership does not take water policy positions as described in the CaDC in depth principles here. The tool also illustrates the requirement for additional accuracy in landscape, population, land use, and weather data as part of an integrated approach for improving these estimated goals in version 2.0.
The open source CaDC efficiency explorer tool is described in greater detail in the statewide efficiency section here. The underlying open source code is available here. The interactive tool leverages ARGO nonprofit public data infrastructure to provide the ability to iteratively improve this initial rapid assessment.
CaDC staff would like to thank the Water Foundation for generously funding this work, Claremont Graduate University for developing landscape area data, the CaDC local utility technical working group for their invaluable insight and CaDC academic partners for their review. The water policy-neutral methodology developed in collaboration with those partners and utilized to estimate residential efficiency goals is available below. (UPDATED see here to download a PDF of the version 1.1 methodology that is also shown below. Also, please see here for a one page statement summarizing the uses of the tool.)
This methodology documentation made great effort to highlight future opportunities for improvement as this endeavor is a rapid first assessment, not a final definitive result. CaDC staff has quantified the expected utility level error statistically and included error bars on the residential goals shown.
That statistical error calculation is available in depth here. Furthermore, CaDC staff is qualitatively analyzing the unique local circumstances that can lead to data quality challenges for all 404 agencies. Those factors include the CIMIS station proximity, administrative area boundary issues, rural residential parcels, and prevalence of local factors making remote sensing difficult.
In addition, CaDC staff is collaborating with the state, the CaDC coalition of local water utilities and the CaDC network of academic, technology and nonprofit partners on improving this underlying land use, service area boundary, landscape area and evapotranspiration data utilized for this initial assessment. The CaDC welcomes the participation of other water suppliers in the coalition to aid in improving data accuracy and improving this tool and other CaDC analytics. Get in touch here to join and stay tuned for future updates!
In the interim, note that feedback and suggestions for improvements in future iterations are appreciated! Please leave your questions and ideas in the comments section below.
UPDATE 6-5-17: Based on CaDC technical working group feedback, the following section has been added to the efficiency explorer tool to provide important data quality considerations. There are two distinct senses in which efficiency goal calculations can deviate from ground truth: precision and accuracy.
Parameter data used to calculate goals can be imprecise. Imprecision reflects deviations around a true value. The Efficiency Explorer's graphs include gray confidence bands around each agency's calculated goal to indicate the imprecision resulting from the compounded statistical error for all parameter data sources. Analogous to the relationship between the darts and the bullseye in figure (a) above, one should expect the ground truth efficiency goal values to lie somewhere within the confidence bands (for agencies not flagged as showing evidence of systematic bias away from accuracy). Imprecise goal calculations are good initial estimates of ground truth, though ones that can be further refined.
As alluded to above, in certain situations parameter data used to calculate goals can be not only imprecise, but also inaccurate. Inaccuracy reflects a more systematic bias away from ground truth. Figure (b) above graphically illustrates this type of error. Non-random inaccuracies can arise from situations such as the prevalence of large rural residential parcels in certain districts, which would result in systematic overestimation of goal calculations in those districts. The prevalence of brown lawns in other districts would result in systematic underestimation of goal calculations in those districts. These types of data quality uncertainties will be elaborated in an upcoming CaDC blog post. Goals flagged as systemically biased away from ground truth have been grayed out on the map and should be interpreted as being potentially inaccurate.
The efficiency explorer methodology linked above has been updated (now version 1.1) to reflect this nuance.
UPDATE 6-23-17: The CaDC efficiency explorer tool has added the following qualifying statement in bold at the landing page for the tool, in addition to a splash screen with additional data quality context prior to utilizing the tool:
"The Efficiency Explorer Tool was developed with publicly available data to offer water managers a first glance at water use compared to potential water efficiency goals. It is for educational and illustrative purposes only. The Efficiency Explorer Tool was not intended and is not able to calculate water agency budgets at a level of accuracy appropriate for establishing policy. Several areas for improvement were identified as this tool was developed and the CaDC is dedicated to working with members and stakeholders to improve the accuracy and precision of this tool."
Please see here for a one page statement summarizing the uses of the tool. The CaDC has also detailed additional data quality considerations with statewide efficiency goal setting.
UPDATE 8-10-17: We have identified a methodological improvement to account for edge cases in our assignment of residential parcels to utilities.
Since our process for determining residential landscaped area involves joining separate parcel datasets—one with landscaped area measurements and one with a land use classification—we include a step to filter out duplicate records in both of these datasets before joining to avoid any possible administrative data quality issues. To achieve this, one can filter on distinct combinations of APN and county, or distinct combinations of APN and a unique supplier identifier; and in most cases the results will be identical.
However, there do exist cases where parcels are associated with two suppliers due to boundary overlaps. We have addressed this administrative data issue by recognizing that in most cases this is a result of wholesaler boundaries subsuming strictly retailer boundaries, and in turn assigning conflicting parcels to the supplier with the smaller overall area. While this handles most cases, there still exists the possibility that one can filter out parcels associated with the smaller supplier prior to the join if one filters on APN and county, rather than APN and a unique supplier identifier. Avoiding the possibility of unwanted pre-join filtering by changing to this latter filter approach is therefore a methodological improvement.
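The difference between the two de-duplication filters can be shown with a small, self-contained sketch. The records, county name, supplier names, and areas below are made up for illustration, and the helper functions are hypothetical rather than the CaDC's code. The example follows the rule stated above: a parcel claimed by two suppliers is assigned to the supplier with the smaller overall area.

```python
def distinct(records, key_fields):
    """Keep the first record seen for each distinct combination of key_fields."""
    seen, kept = set(), []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def assign_to_smaller_supplier(records, supplier_area):
    """Resolve parcels claimed by several suppliers in favor of the smallest one."""
    best = {}
    for record in records:
        key = (record["apn"], record["county"])
        current = best.get(key)
        if (current is None or
                supplier_area[record["supplier_id"]] < supplier_area[current["supplier_id"]]):
            best[key] = record
    return list(best.values())


# Hypothetical data: parcel 001 sits in both a wholesaler and a retailer boundary.
records = [
    {"apn": "001", "county": "Example", "supplier_id": "WHOLESALER"},
    {"apn": "001", "county": "Example", "supplier_id": "RETAILER"},
    {"apn": "001", "county": "Example", "supplier_id": "RETAILER"},  # true duplicate
    {"apn": "002", "county": "Example", "supplier_id": "WHOLESALER"},
]
supplier_area = {"WHOLESALER": 500.0, "RETAILER": 40.0}  # arbitrary area units

# Filtering on APN and county can drop the retailer's record before conflicts are resolved.
by_county = assign_to_smaller_supplier(distinct(records, ("apn", "county")), supplier_area)

# Filtering on APN and supplier keeps one record per supplier, so the rule can apply.
by_supplier = assign_to_smaller_supplier(distinct(records, ("apn", "supplier_id")), supplier_area)

print({r["apn"]: r["supplier_id"] for r in by_county})
print({r["apn"]: r["supplier_id"] for r in by_supplier})
```

In this toy case the first filter leaves parcel 001 with the wholesaler, while the second assigns it to the retailer, which is the edge case the update describes.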
Most importantly: we include this update only for scientific transparency. These edge cases were already included in our +/- 40 percent error estimates and data quality consideration flags. The aggregated landscaped area measurement of only one supplier not already flagged with data quality considerations has changed outside of original error bounds (“Shasta Lake City of”).
The Model Water Efficient Landscape Ordinance ("MWELO") has set standards for efficient outdoor irrigation practices such as less water intensive plants and smart sprinkler systems since 1992.
Now meet Milo: a pup with all the aesthetics in a small package. Like MWELO, Milo achieves a great look with less muss and fuss. Efficiency!
Happy April everyone from all of us at the CaDC!
This April, the CaDC is working with members of a statewide water resources fellowship called CivicSpark to survey thousands of residents about their landscape preferences and outdoor water efficiency. This data will be used to understand what attracts, impedes, or discourages participation in water agencies’ rebate programs and to suggest improvements that could save California millions of gallons of potable water each year.
CivicSpark is a Governor’s Initiative AmeriCorps program and a partnership with the Governor’s Office of Planning and Research. The fellowship, run by the Local Government Commission, builds capacity in government agencies so they can better address water management and climate change issues. In this new partnership, the fellows on this project will recruit and train dozens of volunteers to conduct the door-to-door survey on several days in March and April.
Understanding landscaping decisions is a crucial first step in curbing water waste and changing the way Californians use water. By gathering a bulk of information about residents’ landscaping choices, watering habits and knowledge of state rebate programs, the CaDC will help identify the key factors that stop individuals from converting their lawns or having CA Friendly plants. We will also discover which resources are most useful for residents in adopting more water efficient practices, bolstering our previous research on the effectiveness of rebate programs.
For more detail on the project or to get involved please contact Steven Kerns at firstname.lastname@example.org
Happy 2017! The CaDC coalition is growing and starting the new year off with a bang with two new collaborations! The first is six new interns from New York University’s Center for Urban Science and Progress (where many of CaDC’s core staff worked). Second, CaDC staff has been working with Civic Spark Fellows at EMWD, IEUA, SAWPA along with CaDC technical working group participants to develop a survey on outdoor water use attitudes. Each of these collaborations is described in greater detail below.
CUSP Winter Water Data Internship
The six CUSP interns have been split into teams of two on each of the following three projects.
The CaDC Reservoir explorer tool provides a simple interactive interface to see the levels of California’s reservoirs. The California Data Exchange has a similar visualization for a smaller number of California's reservoirs. The underlying data, however, is opaque and a user clicking on "current data" gets a pdf version for printing rather than actual machine readable data that can be manipulated. This winter sprint will focus on automating data ingestion and conducting usability improvements to the tool.
Evapotranspiration measures the water needs of landscapes and forms a critical data point in Governor Brown's new long term framework for water conservation. This data is measured statewide by highly accurate, though occasionally geospatially sparse, in situ sensor networks. Interpolating that data across California requires wind, precipitation, solar radiation and other publicly available data. Those data sources, however, are fragmented, and integrating them with the appropriate scientific methodology can help improve this important measurement for the people of California. The goal of this three week sprint is to finalize the specific data sources to be integrated and code the first iteration of the parsers to automate the ingestion of those sources.
OWRS is an open data standard designed for analysts, economists, and software developers interested in analyzing water rates. OWRS attempts to fully encode a water utility's rate structure and pricing schedules in a form that is easy to store, share, modify and apply programmatically. Over the course of this winter sprint, CUSP interns will translate the rate structures of CaDC retail water utilities and other major water retailers into the OWRS format. This work marks the start of a comprehensive database of water pricing information to inform revenue stability and water pricing for utilities across the state.
Civic Spark / CaDC Survey on Outdoor Water Use Attitudes
The goal of this project is to understand the landscaping choices of California residents and to identify factors that influence outdoor watering practices. The core components of this work will include developing a household survey and recruiting and training volunteers to conduct the survey statewide. This will occur over the following timeline.
January – February 2017: Recruit volunteers.
February – April 2017: Train volunteers and conduct survey.
May 2017: Review and organize survey responses.
June 2017: Collect and present fellow and volunteer feedback to Inland Empire fellows.
July 2017: Develop transition materials.
The survey questions are still being finalized with partner agencies, though the survey sampling methodology is available on the CaDC GitHub here.
Please let us know if you’re interested in contributing to a project or just dive right into the CaDC GitHub.
This memo examines the feasibility of AB1755, the Open and Transparent Water Data Act, through analysis of the current situation and case studies. Managing California's water resources during this period of extended and severe drought will require innovative policies, technologies, careful planning, and coordination among local, state, and federal agencies. This, in turn, requires detailed information and data on California's supplies of water, and how that water is used in the state. Currently, the provision of this data is fragmented and inconsistent, and the data is often locked in formats that are incompatible with modern research practices.
AB1755, the Open and Transparent Water Data Act, aka The Dodd Bill, aims to remedy the situation by mandating the construction of an integrated database containing water data. At a minimum, the bill calls for a data portal which makes available data on water supplies in a common, open, and well documented data format. The Dodd Bill calls for the integration of DWR data on groundwater levels, SWRCB data on water quality, operational data from the State Water Project and Central Valley Project, USGS hydrological databases, and data on fish abundance from the California Department of Fish and Wildlife.
Many efforts to provide integrated water data are underway even as the Dodd Bill works its way through the California State Assembly. This memo examines four open data efforts to better understand the challenges of integrating government data and discover best practices. Seed Consulting's Waterlog, the US Geological Survey's National Water Information System, Western States Water Council's Water Data Exchange, and the California Data Collaborative are profiled.
Successful projects have employed agile methodologies, open source systems, and open development frameworks. Agile methods work by breaking apart large problems and jobs into discrete tasks which can be accomplished in short periods of time. Thus, agile methods generate modular products whose successes build on one another. For that reason, this memo suggests employing agile methodologies to take on the tasks specified in the Dodd Bill. Further recommendations include using a federated database approach to integrate water data. This approach takes advantage of existing capabilities and systems, making the most efficient use of scarce resources.
The ongoing drought and the forecast of more frequent and persistent droughts throughout the 21st century have forced the State of California to review its water data management practices so it can most effectively steward California’s scarce water resources. Notably the Western Governors’ Association, the California Council of Science and Technology, the Bay Delta Stewardship Council and others have called for improvements in California’s water data systems.1
While the State of California collects a great deal of information on water use, supply, and rights, these data sets are inaccessible and create unnecessary burdens on staff. Too often such data comes in the form of executive summaries or reports. Policy and academic researchers cannot generally use this sort of data for their work without significant effort and cost. This constrains efforts to evaluate and develop policy and severely impedes innovation from academia or the private sector. The data format itself often presents problems. For example, the State Water Project and the Central Valley Project distribute data as PDF documents. Although humans find such reports easy to read, extracting data for substantive analysis consumes time and resources. Ultimately the effort to extract data is wasteful as it duplicates work already done by state staff. Agencies and offices in the DWR and SWRCB have begun to address this issue, but a great deal of work remains.
Water does not follow agency or jurisdiction boundaries and hydrology is dynamic. Thus, understanding the balance of water supply and discharge requires timely integrated data. Researchers and water system managers should be able to easily know, for example, the amount of water coming from streams, lakes, reservoirs, and runoff into state water systems. While much of this data does exist within federal and state agencies, it may only be available as a summary table within a cumbersome web form or a PDF formatted report. It might even take a FOIA request to access the data in some instances. Procuring and extracting data in such formats needlessly consumes the time and resources of researchers and water managers. In addition, some researchers interviewed have spent upwards of a year searching for the data and waiting for agencies to package the data and deliver it. The time and effort needed to acquire data hobbles efforts to forecast California's supplies and develop effective conservation programs. The economic costs of these delays include not only the wasted time of professionals researching and managing California's water resources, but also the cost of missed opportunities to detect problems and enact effective policies in a timely manner. Again, agencies have made efforts to open their data, most notably the SWRCB, which has created an open data portal and sponsors events aimed at developing new data collaborations. Collaboration with organizations outside of government will bring research and analytical capacities from industry and academia to planning policy processes and provide timely, high quality information to policymakers.
At the present time, many projects to integrate water data are underway, including important water data integration efforts within the California SWRCB and DWR. This memo will consider broader inter-agency water data integration efforts including the SEED Consulting Group’s Waterlog Project, the USGS Water Information System and the Western States Water Council’s Water Data Exchange as examples of ongoing data integration towards what the Dodd bill is calling for. In addition, this memo will examine the model underpinning the early success behind California Data Collaborative’s work with urban water use and synergies with pioneering civic data science nonprofit ARGO Labs.
The Dodd Bill
Assembly Bill 1755, the Open and Transparent Water Data Act, calls for integrating the water data collected by state and federal agencies into a common data warehouse accessible by the public, policymakers, and researchers. The legislation calls for integrating DWR data on groundwater levels, SWRCB data on water quality, operational data from the State Water Project and Central Valley Project, USGS hydrological databases, and data on fish abundance from the California Department of Fish and Wildlife. This integration will be accomplished through common data formats and consistent metadata describing the data collected and released. The end goal is to create an accessible and comprehensive data warehouse of the state's water supply and use through common data protocols supporting the creation of innovative visualizations of time series and spatial data. The question now is just how it will be accomplished. Various projects have attempted to solve the data integration problem both inside and outside of government. Each employs a different development methodology and stems from a different project management philosophy. Their successes and failures can serve as models for implementing AB 1755. More than that, project methodologies influence choices in technology, personnel selection, and scheduling. Thus, choices in methodology change the cost structure of the project and alter its feasibility.
Seed Consulting and the Waterlog (SeedCG.org)
The Seed Consulting Group is a nonprofit consulting group founded in 2014 to tackle urgent problems in the environment and public health through partnerships between professionals and nonprofit groups. The professionals in Seed volunteer their time to bring expertise to projects dealing with environmental, public health and social problems. One of Seed’s projects is the Waterlog project, which aims to aggregate all publicly available water data into a single, easy-to-use web site providing analytics and data to stakeholders working to solve California’s water problems. Seed built on well-known agile methods and employed a microservice approach. Microservices recognize that the information world moves fast and now uses a wide variety of end user devices and interfaces. Nor is the data back end monolithic any longer. Behind the scenes, data may come from multiple sources run by different organizations. Microservices rest on a Four Tiered Application Model. This model breaks apart the traditional, single-structure view of software and re-envisions the application as an interconnected ecosystem of individual entities working together to deliver services. Under this model, complete data services do not have to be developed by a single entity. Rather, they can emerge through collaborative efforts or even through emergent coordination between independent developers taking advantage of opportunities made possible by the work of others. This concept builds on longstanding programming concepts such as abstraction and is made possible by modern open code libraries, frameworks, and data interchange protocols which provide both common programming tools and standard data formats. Using the microservice model, Seed was able to turn a large and loosely defined problem into a number of actionable and feasible goals. Rather than develop a complete solution, Seed focused on data harvesting and aggregation.
Their site employs a REST API (ii) which provides users a gateway to Waterlog data. This architecture enables researchers to pull data on demand into their own databases through widely used software libraries. Seed Consulting’s Waterlog aggregates data from Department of Water Resources web sites. Rather than develop every analytical tool themselves, the Waterlog developers encourage researchers to use the REST API, leaving them free to pose their own research questions. Currently Seed Consulting is partnering with the California Data Collaborative to create a new version of the Waterlog project which will include data from additional sources and more advanced visualizations.
The US Geological Survey’s National Water Information System
The US Geological Survey provides another example of collaboration through data with its National Water Information System (NWIS). In most respects, this service is a traditional, monolithic information system. The USGS collects data from its extensive sensor and satellite network. The site provides a great deal of data about precipitation, groundwater, and stream flows as well as some data on water use reported by state water departments. The USGS also provides a great many analytical tools and graphs. However, the USGS realizes that researchers and analysts need access to raw data in order to conduct research. Thus, the site provides complete access to data in the National Water Information System using Web Services (iii), through both a REST API and SOAP (iv) Web Services. The USGS provides an excellent service for those who need water data, but its site only provides data collected by the USGS itself and the US EPA. Although the USGS only collects and distributes data from federal sources, that data can be readily integrated into research thanks to its Web Service and REST interfaces, which makes it an invaluable service for researchers, policy analysts, and water managers.
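The on-demand pull described above can be sketched in Python. This is a minimal sketch, not the USGS's own client: the endpoint URL, the query parameter names (format, sites, parameterCd, period), the meaning of parameter code 00060 and the shape of the JSON payload are assumptions based on the public USGS instantaneous values service and should be checked against the current USGS documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the USGS instantaneous values REST service.
NWIS_IV_URL = "https://waterservices.usgs.gov/nwis/iv/"


def build_nwis_url(site, parameter_cd="00060", period="P1D"):
    """Build a query URL for one site (00060 is assumed to mean streamflow)."""
    query = urlencode({
        "format": "json",
        "sites": site,
        "parameterCd": parameter_cd,
        "period": period,
    })
    return f"{NWIS_IV_URL}?{query}"


def extract_readings(payload):
    """Flatten the assumed JSON layout into (site, timestamp, value) tuples."""
    readings = []
    for series in payload["value"]["timeSeries"]:
        site = series["sourceInfo"]["siteCode"][0]["value"]
        for block in series["values"]:
            for point in block["value"]:
                readings.append((site, point["dateTime"], float(point["value"])))
    return readings


def fetch_readings(site):
    """Download and parse readings for one site (requires network access)."""
    with urlopen(build_nwis_url(site), timeout=30) as response:
        return extract_readings(json.load(response))
```

A researcher would call fetch_readings with a gauge number of interest and load the resulting tuples into a local database. Keeping URL construction and payload parsing in separate functions means each can be checked without touching the network.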
Western States Water Council’s Water Data Exchange
The Western States Water Council (WSWC) has also launched an effort to aggregate water data for planning and management. Their Water Data Exchange (WaDE) is an ambitious effort to coordinate data from eighteen Western states. WaDE was begun in 2012 in response to the fragmentation of data on water supply in the Western United States. The system was designed to create a centralized, easy-to-access repository of water data in a common format. WaDE, currently in beta testing, provides access to data on water use from many different states and federal agencies including the US Geological Survey (USGS) and the California Department of Water Resources. WaDE provides data through a RESTful Web Services interface which employs an XML vocabulary called WaDE XML.
The developers of WaDE made several decisions about design based on the political and organizational realities of a cooperative data project. The first decision was to make the platform open and decentralized. Under this scheme, each state operates a node containing data using its own infrastructure and format. Web Services link each node to the central WaDE repository using a combination of REST queries and Simple Object Access Protocol (SOAP) services. While each state’s node operates according to its own policy internally, it must respond to query URLs from the WaDE client to pull data. The program specifies a small set of distinct services which allow for authentication, sharing of data catalogs, and on-demand retrieval of data. Each node shares catalogs of data following protocols resembling the Open Archives Initiative’s Protocol for Metadata Harvesting. This process synchronizes data catalogs to provide a central inventory of available data for the exchange. The project is developing not only an internal schema which organizes data, but also an XML-based markup language for data exchange. WaDE provides access to its collected data through a portal based on the REST protocol. This provides a very flexible interface for researchers of all kinds. Rather than impose any sort of analytical constraints, WaDE provides the data and its attendant metadata to users through individually crafted REST queries. The developers of WaDE understand that their clients have many needs and that they would be hard pressed to cater to them all. To make the best use of their limited resources, the developers decided to provide an open data gateway using REST. The beta version of WaDE currently includes GIS applications using ArcGIS as well as a web-form based query tool. However, the project is not yet complete and only aggregates summary data provided by member states.
At this point in time, the WaDE remains in development and can only be accessed with an account and password.
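The catalog synchronization described above can be illustrated with a short Python sketch. It is not WaDE's implementation: the node names, the catalog fields (dataset_id, modified) and the functions are hypothetical, and the real exchange runs over web services rather than in-memory lists. The sketch shows only the idea of incremental harvesting, in which the central inventory asks each node for entries changed since the last harvest and keeps the newest record for each dataset.

```python
from datetime import date

# Hypothetical node catalogs; in practice each would sit behind a state's web service.
NODE_CATALOGS = {
    "CA": [
        {"dataset_id": "ca-water-use", "modified": date(2016, 5, 1)},
        {"dataset_id": "ca-allocations", "modified": date(2016, 6, 10)},
    ],
    "UT": [
        {"dataset_id": "ut-water-use", "modified": date(2016, 4, 15)},
    ],
}


def list_records(node, since=None):
    """Stand-in for a node's catalog service: entries modified on or after `since`."""
    return [
        entry for entry in NODE_CATALOGS[node]
        if since is None or entry["modified"] >= since
    ]


def harvest(inventory, nodes, since=None):
    """Merge each node's catalog entries into the central inventory."""
    for node in nodes:
        for entry in list_records(node, since):
            key = (node, entry["dataset_id"])
            known = inventory.get(key)
            # Keep the most recently modified description of each dataset.
            if known is None or entry["modified"] > known["modified"]:
                inventory[key] = entry
    return inventory


inventory = harvest({}, ["CA", "UT"])                           # full harvest
inventory = harvest(inventory, ["CA", "UT"], date(2016, 6, 1))  # incremental pass
```

Because the central inventory stores only catalog entries, each node keeps control of its own data and answers retrieval requests on demand.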
California Data Collaborative (CaliforniaDataCollaborative.com)
The California Data Collaborative was founded in January of 2016 as a coalition of municipal water utilities serving more than 3.7 million Californians. The California Data Collaborative’s work has already been honored by the White House as part of its March 2016 Water Summit and has yielded impressive results in the form of new analytical tools, policy evaluations, and economic models. The California Data Collaborative achieved its results through the use of agile methodologies, which focus on incremental progress rather than creating a complete solution all at once. Agile practices recognize and internalize the reality that projects change as they progress and emphasize frequent contact with stakeholders to ensure that current needs are met. Utilizing these practices, the Data Collaborative was able to divide a large and challenging project into a set of smaller, more easily accomplished goals. Through its agile approach, the project was able to exploit emergent opportunities and incorporate them into the larger whole on the fly. More importantly, by meeting frequently with the utility managers who comprise the stakeholder community, the California Data Collaborative was able to build a strong business case for its work and ensure that the products met the needs of its users.
The California Data Collaborative breaks away from traditional modes of documentation (reports, spreadsheets, tables) and focuses on providing data for analysis and effective visualizations using open source tools. In contrast with the State Water Project's practice of documenting operations in reports, the California Data Collaborative will make its data available through a PostgreSQL database accessed through a secure web interface that supports queries and provides visualization tools for registered users. As a great deal of data in this system is confidential, access will be limited to participating utilities and researchers, with strong security to prevent breaches of privacy. This will let researchers quickly obtain what they need without wading through multiple documents and reports in search of the right data, only to spend further time extracting it and loading it into their own systems. The California Data Collaborative's SCUBA database also standardizes data from multiple sources into a consistent data structure which facilitates comparative research and evaluation.
Advanced Research in Government Operations Laboratories, or ARGO Labs for short, is a leading civic data science nonprofit that has been featured by Fast CoExist, the Fox News Smart Cities initiative, and a half dozen newspapers in New York, where it is headquartered. This organization brings together civic data science and governmental operational expertise to improve public service delivery. Cities typically have a lengthy procurement cycle which solicits bids and develops specifications that tend to assume each technology component must be individually developed. The bid process can also be quite expensive, and once a bid has been accepted, it may take two years or longer for a project to reach production. In business, two years encompasses an entire technological cycle in which products and entire systems mature and become obsolete. In contrast to this model, ARGO's team utilized widely available commercial off-the-shelf technologies to build a project platform and develop a prototype product. This example demonstrates the power of an agile approach which employs interoperable and open technologies.
Consider an age-old problem plaguing New York and most other cities: the pothole. Causing damage to cars and injury to cyclists, potholes resist conventional efforts to detect and repair them. They're widespread and emerge quickly. City works departments struggle to keep up with them in large part because of the effort needed to detect and catalog potholes. Varun Adibhatla and his colleagues envisioned a scalable detection system which could measure the entirety of New York City's pothole problem. This would be the first step to solving that problem. Using the Raspberry Pi as an integration platform, the ARGO team was able to rapidly prototype a device which would detect a pothole and automatically take a picture while recording its GPS coordinates. Thus was born the Street Quality Identification Device (SQUID). The prototype device was developed rapidly between April and May of 2015. With the device secured to an automobile, the ARGO development team was able to develop an accurate map of their neighborhood's potholes. The key to this rapid prototyping effort was the Raspberry Pi computer. This device is a single-board microcomputer not much bigger than a smartphone. It is a low-cost, bare-bones device which runs a lightweight Debian Linux operating system. While not a personal computer, it is an information widget which can manage sensors, control robots, and fit into almost any custom computing and informatics application. Technologies like this can be easily acquired and provide an easily programmable and extensible central processor which can gather data from sensors, organize it, and deposit it into municipal information systems with few intermediate steps. As it stands, the SQUID provides an example of an alternative to traditional civic procurement and development cycles.
ARGO has also provided pro-bono assistance to the California Data Collaborative, including a novel partnership with Enigma Technologies, a leading civic data startup that powers the world’s largest repository of public data. The tools enabling the integration of that data have been provided pro-bono by Enigma to scale the California Data Collaborative’s early success and realize its vision of integrating the entire lifecycle of water data in California.
These case studies provide points of comparison by which development strategies can be evaluated. At first glance, the USGS and WSWC data exchanges represent traditional information technology projects: large-scale, unitary, top-down systems. The WaDE database attempts to solve the entire water data problem at a single blow. While this system holds a great deal of promise, that promise has yet to be realized despite a development time of four years. This lengthy development cycle highlights the weaknesses of such a top-down approach. In the fast-moving world of information technology, four years will see entire technologies rise and fall into obsolescence. Thus, many assumptions in a traditional project may become irrelevant, and user needs will likely change. This leads to costly delays as the project is redesigned to fit new baselines. Worse yet, such projects tend to build in features which never get used and often fail to meet user needs. This comes despite earnestly developed plans and the hard work of their design and implementation teams. In contrast, the California Data Collaborative, Seed Consulting, ARGO Labs, and the USGS work to solve large problems one piece at a time using agile methods. These projects offer many lessons on how to create an integrated water data warehouse. The California Data Collaborative and Seed’s Waterlog project have created usable resources in very short periods of time. The USGS Water Information System only seems monolithic at first glance because of the system boundary drawn around a large array of interoperable subsystems working together to collect, organize and aggregate data. This system was not developed overnight, nor was it created and released whole cloth. Rather, it was developed through an iterative process which refined goals at each step and built on the successes of each previous iteration. This continuous improvement lies at the heart of agile methodologies.
Likewise, the California Data Collaborative and the Waterlog Project both use iterative and agile development methods to make rapid progress which can be demonstrated to stakeholders. This suggests that the data integration mandated by AB 1755 can be undertaken using agile methodologies which break the overall goal down into smaller tasks and turn out working products that accomplish pieces of the larger goal.
Agile methodologies rely on open source (v) tools. These include development frameworks, programming languages, software libraries, applications and operating systems. The low cost and large development communities associated with open source systems provide advantages which make it easy to get a major data project off the ground quickly. Not only do these systems have a low cost of adoption for new projects, they often work together by design, using standardized data structures such as JSON (vi) and XML (vii). These standards provide benefits not only to end users of data but to development teams as well, allowing even loosely connected partnerships to work effectively together. The California Data Collaborative’s partnership with Seed Consulting on a new version of Waterlog demonstrates these advantages.
The work of both Seed and the California Data Collaborative demonstrates another essential fact. Those working to integrate data cannot wait for all the agencies involved to come to consensus on data schemas, formats, and representations. This is not to say that such consensus cannot be reached. In fact, agencies largely agree on the need for common data structures and protocols, and many have already made significant strides towards them. However, most agencies lack the resources to make profound changes quickly to accommodate the goal of data integration. This means any project which aims to develop an integrated database of water information must take the world as it is rather than as it should be. Currently, at least two state agencies provide data access through a REST interface, as does the USGS. These initial efforts lay the foundation for rapid progress towards further integration.
Some agencies, on the other hand, rely on legacy systems and practices that are difficult to replace or redesign while retaining operational integrity. In the case of the State Water Project and Central Valley Project, data is made available through PDF formatted reports containing tabular data. Those working to integrate water system data cannot wait for these agencies to update their practices. The urgency of this is underscored by the need for real time information on reservoir levels and outputs in estimating water supplies. Fortunately, there exist a variety of tools for scraping data from web pages and for parsing data contained in human readable formats such as PDF. Web scraping extracts information from web sites using software tools and can be employed on both the SWP and CVP sites to harvest data and populate an integrated water database. Many tools for web scraping are available. For instance, the Python (viii) programming language excels at this sort of work and has a large library of third party modules optimized for the job. For processing text data, Python again offers a diverse assortment of capable modules that power proven tools for integrating public data through agile methodologies, such as Enigma’s Parsekit (ix). Parsekit is a mature tool which facilitates the entire Extract, Transform, and Load process. Parsekit scales well and allows developers to do more, faster. Tools like Enigma’s Parsekit diminish the obstacles of divergent data formats and protocols.
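To make the parsing step concrete, here is a minimal Python sketch using only the standard library. It assumes the text of a report has already been extracted from the PDF by some other tool. The sample layout, reservoir names, values, column names and the parse_report function are all invented for illustration; they are not the actual SWP or CVP report format, and the sketch does not use Parsekit.

```python
import re

# Hypothetical text as it might look after extraction from a PDF report.
SAMPLE_REPORT = """\
DAILY RESERVOIR STORAGE SUMMARY
Reservoir        Storage (AF)   Elevation (FT)
Example Lake        2,945,000           862.4
Sample Reservoir    1,120,500           481.9
Page 1 of 1
"""

# A data row: a name, then a storage figure with thousands separators, then an elevation.
ROW_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z .]*?)\s{2,}(?P<storage>[\d,]+)\s+(?P<elevation>\d+(?:\.\d+)?)\s*$"
)


def parse_report(text):
    """Return one dict per data row, skipping titles, headers and page footers."""
    records = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # not a data row
        records.append({
            "reservoir": match.group("name").strip(),
            "storage_af": int(match.group("storage").replace(",", "")),
            "elevation_ft": float(match.group("elevation")),
        })
    return records


records = parse_report(SAMPLE_REPORT)
```

The resulting list of dictionaries can be loaded directly into a database table. A pattern like this is brittle by nature, so a production harvester would also need to detect when a report's layout changes.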
The primary recommendation in this memo is to work towards data integration using agile techniques and open source software and systems. However, some specific systems recommendations come first. Since California’s Department of Water Resources already runs a Tier 3 Data Center, it makes sense to utilize its existing capacity to host an integrated water database. However, cloud services such as Amazon Web Services or Liquid Web would easily meet any system requirement at a low price. The Postgres Database Management System (DBMS) offers features which make it ideally suited to representing data on water use and supply. It supports the data types needed for both scientific modeling and geospatial analysis. An open source DBMS, Postgres can be acquired and maintained at a low cost. It is also highly customizable, enabling developers to modify it to handle any requirement. As a fully compliant SQL (x) database system, Postgres is interoperable with most development frameworks and web servers.
While selecting the proper system components is important, the central recommendation pertains to software practices and culture, in this case, agile methodologies. Agile methods work in short bursts called sprints, rapidly planning, building and testing small components of larger projects. The principal working unit of agile methodologies is the scrum. A scrum is a cross-functional team of developers, domain experts, analysts and others working in close collaboration to accomplish a discrete and well bounded goal. A project manager or Scrum Master oversees this team, acting to resolve deadlocks and manage conflict. The Scrum Master also gathers data with which to evaluate progress. Planning, review, and retrospective meetings border the sprints on each side. Planning meetings guide the project, assigning personnel, allocating resources, and creating a project plan. At the end of each sprint, a review meeting with the development team and stakeholders evaluates the progress of the sprint. In these meetings, goals are refined and opportunities explored. Each review meeting also helps to build the project’s business case through frequent contact with stakeholders. The review meetings ensure the development team and stakeholders share the same values and priorities. Retrospective meetings build in a reflexive component which forces the team to review their behavior and performance with an eye towards improvement and capacity building. These retrospective meetings put development teams on a course of continuous improvement, resulting in more productive sprints which accomplish more.
The Dodd Bill stipulates requirements rather than architectures, leaving developers latitude to develop specific system architectures. As the overall mission of the bill is to integrate data already being collected and distributed, developers need not stick with conventional database designs. A federated database design seems well suited to the task of integrating data. Federated databases differ from traditional databases in that they represent a coordinating layer which sits atop other databases which are geographically distributed. Businesses commonly use federated databases to link regional databases into a single data store. A federated database system may include a great number of very heterogeneous systems differing by operating system, vendor, and data communication protocols. The modern federated approach goes beyond a mere networking of data. Systems such as Postgres, MySQL, SQL Server, and Oracle provide sophisticated tools for incorporating external data and making it seem as though it resides in a single database. The data wrappers provided by most systems do more than pass queries from one system to another. They translate the data schema of one database to that of another. The federating system can then keep a local cache of records, increasing performance, improving reliability, and reducing stress on member systems. This approach promises more than simple speed to implementation. A federated system avoids duplication of effort and avoids the switching costs associated with new technology, as it integrates and complements existing systems rather than replacing or superseding them. Modern federation tools provide another important advantage. Rather than needing to work out any special agreement for data sharing among agencies releasing public data, a federated database can operate over web services protocols like REST and SOAP.
For instance, the USGS makes its extensive data on state, regional, and national water systems available to the public, free of charge, using a REST API over the web. Thus, a federated database can simply initiate data exchange over existing and public channels right away. Likewise, other systems not contained in the DWR's Tier 3 Data Center can be integrated with a minimum of planning and development. Inside the data center, existing database systems can be linked rapidly and with little latency in data communication, allowing a high performance integrated water database to be implemented rapidly, without disruption to existing services, and without delays due to switching over. For these reasons, a federated design offers a cost-effective solution to the problems of integration and coordination.
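The wrapper and cache behavior described above can be sketched in a few lines of Python. This illustrates the pattern only; it is not how Postgres or any other DBMS implements its data wrappers. The member systems are simulated with in-memory functions, and every name here (the two fetch functions, the field names, FederatedSource, FederatedDatabase) is hypothetical.

```python
# Two simulated member systems, each with its own schema and units.
def fetch_state_levels():
    return [{"well": "W-1", "depth_ft": 120.0}]


def fetch_federal_levels():
    return [{"site_no": "F-9", "depth_m": 30.0}]


class FederatedSource:
    """Wraps one member system and translates its rows into the common schema."""

    def __init__(self, name, fetch, translate):
        self.name = name
        self.fetch = fetch
        self.translate = translate

    def rows(self):
        return [dict(self.translate(row), source=self.name) for row in self.fetch()]


class FederatedDatabase:
    """Coordinating layer: queries members on demand and caches their answers."""

    def __init__(self, sources):
        self.sources = sources
        self._cache = {}

    def query(self):
        results = []
        for source in self.sources:
            if source.name not in self._cache:  # only contact a member on a cache miss
                self._cache[source.name] = source.rows()
            results.extend(self._cache[source.name])
        return results


FEET_PER_METER = 3.28084

federation = FederatedDatabase([
    FederatedSource("state", fetch_state_levels,
                    lambda r: {"station": r["well"], "depth_ft": r["depth_ft"]}),
    FederatedSource("federal", fetch_federal_levels,
                    lambda r: {"station": r["site_no"],
                               "depth_ft": round(r["depth_m"] * FEET_PER_METER, 1)}),
])
unified = federation.query()
```

Neither member system changes its own schema; the translation lives entirely in the coordinating layer. A real cache would also need an expiry policy so that stale records are refreshed from the member systems.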
Conclusion and Future Work
Though integrating data from the lifecycle of California’s water may appear daunting, possibilities for rapid progress exist. Small teams employing open source tools and agile methods have already begun to chip away at the larger problem by breaking it down into smaller pieces and attacking each one in rapid development cycles, or sprints. Although agile methodologies run counter to traditional top-down planning, they are well suited to meet the demands currently placed on governments. Governments face constrained budgets and a public mood which demands greater efficiency and accountability in government projects. Beyond these considerations, the system mandated by the Dodd Bill must contend with the nature of policy development and political processes. Policymakers have a limited attention span and great demands placed on their time. The legislation calls for the creation of a fund to provide for the development of an integrated data system, but does not appropriate money directly. The second, and currently unwritten, part of the Dodd Bill is funding, which will require a further case to be made for appropriations. Even if the money comes from the DWR budget, agency officers must be convinced that investments in the project are well spent. For that reason, system developers must pay close attention to both the political mood and policy development processes. The streams of policies, problems, and politics only rarely come together to create windows of opportunity for new programs. These policy windows seldom stay open long as events move forward and legislative priorities shift. These pressures will force system developers to have something to show for their efforts quickly. If not a complete system, then working components of a larger system must be realized while policymakers and agency heads are focused on the problems the Dodd Bill aims to solve.
Agile development methodologies excel at developing these sorts of products and in creating efficient and accountable teams.
Two teams in particular have achieved notable success using these practices: the California Data Collaborative and Seed Consulting. Since it started in January, the California Data Collaborative has successfully integrated water use data from many utilities. It has already created tools for program evaluation, policy analysis, and planning. Seed Consulting has, likewise, developed a database which collects information on water supply and quality from a number of DWR and SWRCB sources, aggregating it into a single data store and disseminating it through a REST interface to researchers and water managers. The USGS likewise appears to have used similar approaches in unifying its vast collection of hydrologic data. All three of these projects show that the integration called for in AB 1755 is feasible and that the benefits of such a project can be rapidly realized. The next step will be to create a development roadmap assessing the state’s current water information infrastructure and planning the integration work itself. Ongoing research with regard to AB 1755 will include a cost-benefit analysis and an examination of how timely data would impact forecasting and planning.
i PDF or Portable Document Format is a file format which presents text documents consistently regardless of operating system, platform, or device.
ii REST is an architecture for data exchange services on the World Wide Web. API stands for Application Program Interface and is the portal through which users may issue data queries and have those requests fulfilled.
iii In this context, Web Services refer to the protocols by which computers transact data transfers on the World Wide Web.
iv The Simple Object Access Protocol (SOAP) is a data exchange protocol on the World Wide Web which allows computer programs to seamlessly exchange data, acting as a larger, networked software system.
v Open source software is available for free and grants the user the right to review the code and modify it for any purpose. Such software is developed through the collaborative efforts of companies, organizations, and individuals. Open source software is used in many collaborative efforts.
vii The eXtensible Markup Language (XML) is a text format for structuring data so that programs can exchange it. XML is often the underlying format of well-known document types such as Microsoft Word files and many web pages.
viii Python is a programming language widely used in science and engineering applications. It has data structures which efficiently handle scientific data. Python has also proved capable in a variety of settings ranging from web applications to database programming and data analytics.
ix In computer and data science, parsing denotes the act of extracting data from one system and translating it into the internal data structures of another.
x SQL or Structured Query Language is the most commonly used database programming language which creates, reads, updates and deletes database records.
Appendix A: California State and Federal Water Data Exchanges
State Water Project Reservoir Operation
State Water Project
Reservoir and Aqueduct
Operations and flow data are available in PDF format. The agency will share data in other formats by request and prefers Excel.
California Environmental Data Exchange Network (CEDEN)
An integrated water data quality database bringing together data collected at regional centers and from 3rd party partners.
California Statewide Groundwater Elevation Monitoring (CASGEM)
Online portal with more detailed data than the water data library. Shows charts of groundwater elevation over time. Data is organized by County, Basin, Monitoring Entity, or well type as CSV files.
California Water Data Library
Water Quality and Flow Data
A data library containing output of all continuous and discrete data. The library is not documented, making it difficult to know exactly what data it contains.
Water Information System
Urban Water Use
California Data Exchange Center (CDEC)
Rivers and Dams
This database provides information on river flow and dam output in a variety of formats. It is primarily a scriptable CGI using the GET method. This database has information on river conditions, snow melt run off, snow pack, statewide river conditions, and water allocations. Also has detailed forecasting information.
Central Valley Project
US Bureau of Reclamation
Dams and Reservoirs
Operations data for Federal projects in California like the CVP in PDF format.
State Department of Fish and Wildlife Fish Abundance
California Department of Fish and Game
This is a straight HTML table report of fish abundance by type. The reports go back to 1967 and extend up through last year. The methodology will need to be understood to get a better idea of how current the internal data is. The department may not release 2016 data until it has finished collecting and analyzing all of the year’s data. This will be made clear in interviews.
US Department of Fish and Wildlife Biogeographic Information & Observation System
US Dept. of Fish and Wildlife
Fish Abundance and others
This is a comprehensive database of GIS and time series data on factors relevant to fish and wildlife. It is very carefully maintained and well organized, with clear metadata. Note: the California Department of Fish and Game originates this data and the USFWS collects and archives it. So, this essentially duplicates the data provided by the state, but it might be the designated repository.
NOAA Fish Abundance
Assessments based on data California Department of Fish and Game reports.
USGS Water Information System
Surface Water, Ground Water, Stream Levels, Land Use, Irrigation, and Biome Data.
This searchable database provides a great variety of water data on streams, lakes, reservoirs, land use, precipitation and biome data. The site provides a set of graphical and analytic tools and disseminates data through a fully documented REST and Web Services API.
California Irrigation Management Information System (CIMIS)
Irrigation Water Use
This source contains detailed data on evapotranspiration from irrigated land. Data is made available through a Windows application and through a REST API.
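Several of the sources above are described as scriptable over HTTP GET or REST. As a rough illustration of what that means in practice, the sketch below builds a query URL for the USGS service. The endpoint, the parameter names, the parameter code, and the site number are assumptions that do not come from this report and should be checked against the provider's current documentation before use.

```python
from urllib.parse import urlencode

# Assumption: base URL of the USGS instantaneous-values web service.
USGS_IV_URL = "https://waterservices.usgs.gov/nwis/iv/"


def build_streamflow_url(site: str, period: str = "P7D") -> str:
    """Build a GET URL requesting recent streamflow for one gauge.

    The query parameters below are assumptions for illustration.
    """
    params = {
        "format": "json",          # ask for machine-readable output
        "sites": site,             # gauge identifier (assumed example)
        "parameterCd": "00060",    # assumed code for discharge
        "period": period,          # ISO 8601 duration, here 7 days
    }
    return f"{USGS_IV_URL}?{urlencode(params)}"


# Hypothetical example site number.
url = build_streamflow_url("11447650")
print(url)
```

The resulting URL could then be fetched with `urllib.request.urlopen` and parsed with `json.load`. The same pattern applies to any of the GET-based sources, with a different base URL and parameter set.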
Appendix B: Summary of Interviews
List of Interview Questions
From May through June 26, the managers of California's water data providers were contacted for interviews to assess their current practices. The questions asked in the interviews are listed below. No special recording or annotation software was employed.
What are the database types (SQL, PostgreSQL, etc.) underlying your data portal?
What servers does your organization use?
Are they on site or in the cloud?
Who directly does the DBA work for your organization?
How many FTEs directly administer the databases?
What is the budget for your organization's data publishing efforts?
What development practices do you use to manage your data store?
Is your data open to the public?
In terms of the open data plan, is there a RESTful API or similar direct access scheme planned for the future?
Why or why not?
Do you have a database schema you can share?
Do you have an ingest schema you can share?
Summary of Interview Responses
Greg Smith, California DWR, California Statewide Groundwater Elevation Monitoring (CASGEM), 6/16/2016
CASGEM makes available the data gathered by the state's groundwater monitoring system. They use Oracle and SQL Server as their database platforms, and their servers are hosted on site in the DWR's Tier 3 Data Center. Their DBA work is performed by the DWR's IT Division. He does not know the budget for hosting and staffing the databases, as that comes from the DWR's general fund. Nor does he know the number of FTEs needed to support the project; that is handled by the DWR. CASGEM uses Agile practices to develop under .NET and Java. CASGEM data is open to the public through a web interface. There are no plans to develop a REST interface due to a lack of resources.
Jeremy McHugh, US Geological Survey, National Water Information System, 6/16/2016
The USGS uses Oracle for its database platform, and hosting is done both on site at the California Science Center's Tier 3 Data Center and off site through Amazon Web Services. Development is done in house; the USGS does not contract out. The budget is around $840,000. Jeremy did not know what development practices were used, but believed they varied by project. The data is open to the public and is made available through REST and SOAP web services.
Bekele Temesgen, California DWR, California Irrigation Management Information System (CIMIS), 06/22/2016
CIMIS uses Oracle, GRASS GIS, and SQLite embedded in its web application to store and query data. Hosting is done on site through the DWR's Tier 3 Data Center. The DWR also provides DBA services. CIMIS employs one full-time staff member to develop the web applications. They tend to use a traditional Waterfall development methodology, though they also use Agile practices for some projects. Their data is available to the public.
Jarma Bennett, State Water Resources Control Board, California Environmental Data Exchange Network (CEDEN)
CEDEN hosts its data using SQL Server. The database is hosted at the DWR's Tier 3 Data Center and is supported by DWR's IT Division. Development employs one DWR IT Division employee working ¼ time and the SWRCB Technology Director working ¾ time. CEDEN also contracts out for application development projects. Jarma did not know what development methodologies were employed. The annual budget for CEDEN is $850,000. They are currently working on developing a REST web interface.
Thank you to our speakers, sponsors and lively attendees who made our inaugural Stanford GSB Water Data Summit such a success! Please see below for photos from the event and the updated analytics page for detail on the tools shown during the summit.
Assemblymember Dodd's "Open and Transparent Water Data Act" inventories and asks that various public water data sources be integrated. To show the value of agile approaches, the CaDC partnered with Seed Consulting Group to integrate reservoir levels into a REST API as a small pilot of the broader envisioned water data integration.
Properly formatted, integrated data makes it orders of magnitude easier to visualize reservoir levels than manipulating malformatted HTML tables or machine-illegible PDFs. The resulting visualization can be seen below; it was produced in a few days by our newest data scientist at the CaDC, David Marulli.
The California Data Exchange Center, upon which the underlying Seed REST API was built, has a similar visualization for a smaller number of California's reservoirs. The underlying data, however, is opaque: a user clicking on "current data" gets a PDF version for printing rather than machine-readable data that can be manipulated.
The REST API has room for improvement, such as the ability to query by more than one reservoir at a time. This example is intended to illustrate the value of agile approaches to quickly show value and iterate rather than attempting to bite off everything all at once. A feasibility study of the water data integration proposed by the Dodd bill will be released in the coming weeks.
Objective: Improve water sales forecasting under different rate structures
The California Data Collaborative (“CaDC”) operates a unique inter-utility customer water use data warehouse on behalf of water utilities serving 22 million Californians and is soliciting proposals on the effects of price on water demand.
Water rates play a critical role in funding utility operations, and they are also increasingly used as a tool to encourage conservation in a world of unpredictable supply constraints. Rates, revenue, and demand are tightly coupled and a more granular understanding of how these factors interact across different hydrologic, socioeconomic, and other unique local circumstances will enable water utilities across California to more effectively plan for an uncertain future.
As such, the Collaborative is looking for research proposals that would make use of our data to answer questions such as the following:
How accurate are current rate models at predicting demand changes due to a rate shift?
What prices do customers respond to most strongly: marginal price, average price, or total bill? How do the differences in these effect sizes compare with recent behavioral approaches such as changing how information is presented on a bill?
How does price elasticity change during severe drought conditions and mandatory conservation requirements?
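To make the elasticity questions above concrete, the following is a minimal sketch of one common way to estimate a constant price elasticity: regress the log of quantity on the log of price and read the slope. The data are synthetic and the function name is hypothetical; a real study would control for weather, seasonality, and customer characteristics.

```python
import math


def log_log_elasticity(prices, quantities):
    """Ordinary least squares slope of log(quantity) on log(price).

    Under a constant-elasticity demand model q = a * p**e,
    this slope is an estimate of the elasticity e.
    """
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data generated with a known elasticity of -0.3 (illustrative only).
prices = [2.0, 2.5, 3.0, 3.5, 4.0]
quantities = [100.0 * p ** -0.3 for p in prices]

elasticity = log_log_elasticity(prices, quantities)
print(round(elasticity, 3))
```

Running the same estimate separately on drought and non-drought periods is one simple way to approach the third question, under the strong assumption that nothing else changed between the periods.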
The California Data Collaborative collects and curates customer-level metered water use data from a number of California water utilities. Contingent upon utility approval, access to this data may be granted to external researchers to conduct studies that benefit the entire water community in California.
Current participating utilities include:
- Moulton Niguel Water District (MNWD)
- Irvine Ranch Water District (IRWD)
- Santa Margarita Water District (SMWD)
- Eastern Municipal Water District (EMWD)
- Monte Vista Water District (MVWD)
- Las Virgenes Municipal Water District (LVMWD)
Prospective utility partners include:
- Western Municipal Water District (WMWD)
- City of Sacramento (Sacramento)
- El Toro Water District (ETWD)
- City of Anaheim (Anaheim)
- City of Newport Beach (Newport)
- Los Angeles Department of Water and Power (LADWP)
The data include basic monthly billing information, such as the amount of water consumed and the date of use, along with key contextual attributes such as evapotranspiration, customer class, and in some cases household size and irrigable area. Utility-specific customer classes have been standardized into statewide classifications aligned with the Department of Water Resources (single family residential, multi-family residential, commercial, industrial, irrigation, institutional, and other). In many cases information is also available about which customers participated in water efficiency rebate programs such as turf removal or high-efficiency toilet rebates.
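The class standardization described above can be pictured as a lookup from each utility's own labels to the statewide categories. The sketch below is only an illustration: the raw labels in `CLASS_MAP` are hypothetical and are not taken from any participating utility's billing system.

```python
# Statewide categories as listed in the text.
STATEWIDE_CLASSES = {
    "single family residential",
    "multi-family residential",
    "commercial",
    "industrial",
    "irrigation",
    "institutional",
    "other",
}

# Hypothetical utility-specific labels mapped to statewide classes.
CLASS_MAP = {
    "SFR": "single family residential",
    "RES-1": "single family residential",
    "MFR": "multi-family residential",
    "COMM": "commercial",
    "IND": "industrial",
    "IRR": "irrigation",
    "LANDSCAPE": "irrigation",
    "INST": "institutional",
}


def standardize_class(raw_label: str) -> str:
    """Map a raw utility label to a statewide class, defaulting to 'other'."""
    return CLASS_MAP.get(raw_label.strip().upper(), "other")


print(standardize_class(" sfr "))    # single family residential
print(standardize_class("hydrant"))  # other
```

Defaulting unknown labels to "other" keeps the pipeline running but can hide mapping gaps, so in practice it may be worth logging every unmapped label for review.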
The SCUBA data warehouse is a growing repository, and new contextual attributes are in development such as assessor parcel numbers that allow matching to county assessor property attributes, and census identifiers that allow inclusion of census block and tract statistics.
Example Study Ideas
The following have been identified as areas of interest that may yield insights into the effect of prices and rate structures on water demand. This list is not exhaustive and serves mainly to highlight the types of studies made possible through the unique inter-utility data set maintained by the collaborative.
LVMWD and SMWD recently updated their rates (the links below show their Prop 218 notices)
Natural and Quasi-experiments
Many participating agencies share borders, allowing for the discovery of quasi-experimental setups where factors such as weather, local regulation, and even environmental attitudes may be implicitly controlled for. This allows for more accurate assessment of the impact of rates and rate shifts. Examples:
EMWD / WMWD
IRWD / MNWD / El Toro / SMWD / Newport Beach
LVMWD / LADWP
Utilities sometimes consolidate with one another, resulting in what can be seen as an exogenous shock. One example is IRWD: (http://www.irwd.com/about-us/consolidations)
Orange Park Acres - find effect of “water rate differential” and the ultimate transition to match IRWD rates on July 1, 2015.
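One standard way to analyze setups like these is a difference-in-differences comparison: the change in use for customers exposed to a rate change, minus the change over the same period for comparable neighbors who were not. The sketch below uses made-up numbers and a hypothetical function name; it assumes the two groups would have followed parallel trends without the rate change.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of a treatment effect.

    Each argument is a list of average water use values (any consistent
    unit) for one group in one period.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Made-up monthly use per account, for illustration only.
treated_before = [20.0, 22.0, 21.0]   # e.g. customers facing a rate change
treated_after = [17.0, 18.0, 16.0]
control_before = [19.0, 21.0, 20.0]   # e.g. neighbors across the border
control_after = [18.0, 20.0, 19.0]

effect = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(effect)  # treated use fell by 4, control by 1, so the estimate is -3
```

Bordering agencies are attractive for this design because weather and regional conditions affect both groups, so they largely cancel out of the estimate.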
Testing how customers respond to total bill / average price / marginal price
Compare price elasticities between customers under automated e-billing schemes and those that receive a paper bill to test the effects of information disclosure.
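Under a tiered rate structure the three candidate price signals can differ substantially for the same customer. The following sketch computes all three for a hypothetical rate schedule; the tier limits, prices, and fixed charge are invented for illustration and do not correspond to any utility's actual rates.

```python
def bill_signals(usage, tiers, fixed_charge=0.0):
    """Compute total bill, average price, and marginal price.

    tiers is a list of (upper_limit, unit_price) pairs in ascending order,
    with the last upper_limit set to float("inf").
    Marginal price here means the unit price of the last unit consumed.
    """
    total = fixed_charge
    lower = 0.0
    marginal = tiers[0][1]
    for upper, price in tiers:
        if usage > lower:
            billed_units = min(usage, upper) - lower
            total += billed_units * price
            marginal = price
        lower = upper
    average = total / usage if usage > 0 else float("nan")
    return {
        "total_bill": total,
        "average_price": average,
        "marginal_price": marginal,
    }


# Hypothetical three-tier schedule (units and prices are illustrative).
tiers = [(10.0, 2.0), (20.0, 3.0), (float("inf"), 5.0)]
signals = bill_signals(15.0, tiers, fixed_charge=5.0)
print(signals)
```

In this example the customer pays 3.0 per unit at the margin but about 2.67 per unit on average, which is the kind of gap a study of customer response would exploit.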
Submit an idea for a study
*All studies leveraging CaDC data must make the underlying code available.
Last Thursday and Friday, UC Davis's Center for Water and Energy Efficiency hosted an extremely valuable and necessary workshop at East Bay Municipal Utility District in Oakland on how to safely streamline the way we share water data in California.
One big takeaway echoed over and over again by many participants was that the transaction costs of sharing data through current practices are unnecessarily and often prohibitively high. The current ad hoc and fragmented data sharing process is made challenging both for utilities and prospective analysts through several key barriers:
- Legal -- data sharing agreements are often bespoke for each individual engagement, costing additional time and money.
- Technical -- water data generally isn't standardized and comes in different formats, creating substantial extra legwork for analysts.
- Organizational -- data sharing relies heavily on connections that can be opaque to new entrants to the water world.
These barriers create transaction costs that prevent many utilities and prospective analysts from participating in the "marketplace" of water efficiency analysis. Larger water utilities have the resources to participate in this marketplace yet the vast majority of California's 411 major retailers and thousand plus small water systems simply do not have the requisite resources.
Since launching in January, the California Data Collaborative has grown to 9 utilities yet we're still very much dealing with the early adopters of water data sharing. The big question moving forward is: "How can we streamline water data sharing so all of California's water utilities can easily participate in the marketplace of water efficiency analytics?"
Part of that will be achieved through the economies of scale that the California Data Collaborative offers through our shared service model to lower the per utility cost of hiring civic data science talent.
There are also additional institutional issues, and our academic partner UC Davis CWEE did an excellent job of surfacing and taking leadership on those issues through the workshop last week and by developing a "trust framework." Toward that goal of streamlining water data sharing, we've learned a great deal from our initial work with Data Collaborative utilities on how to standardize data sharing and data transfer procedures.
Please see here for our standard NDA and here for our standard data ingestion worksheet that we use to onboard new utilities and standardize their customer use data into our statewide schema. The nondisclosure agreement does not currently enable data resharing, which is a key aspect of streamlining this "marketplace," yet it does provide a template from a leading water law firm.
Our hope is that by making those documents open it will help others looking to share California's water data and deliver the analyses water managers need to navigate the big water challenges we face as Californians.
EXISTING STATEWIDE WATER USE REPORTING
Currently, California collects utility-level water use metrics through three channels: the State Water Resources Control Board (SWRCB) conservation reporting, Department of Water Resources (DWR) Urban Water Management Plans, and SWRCB Drinking Water Program public water system statistics (PWSS).
These sources provide similar but distinct water use data online for overlapping date ranges. Starting in 2013, SWRCB has reported monthly averages online while also tracking mandatory water reductions and enforcement. DWR has annual usage data from 2010 available online in Excel format, but it is updated only once every five years. Lastly, the public water system data available online tracks annual use (2011-2013) and monthly use (2013-2014) and is updated on an annual basis.
Because each state agency collects data differently, the end result is a compilation of usage information with varying metrics, varying scales, and no context. The absence of standardization and the inconsistent reporting structure significantly hinder and delay attempts to perform rapid and accurate analyses that compare current water use patterns within and across agencies and project future trends and program effectiveness.
Current Statewide Urban Water Use Data World
Local factors such as evapotranspiration, irrigable area, and demographic characteristics need to be integrated with existing usage data to enable “apples to apples” comparisons of urban water use. The California Data Collaborative has worked to supplement existing statewide urban water use reporting with that key context through novel data integration (evapotranspiration, population) and partnering with remote sensing experts (irrigable area).
Adding in key context
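As a rough sketch of what an "apples to apples" comparison can look like once context is attached, the functions below normalize total use by population and outdoor use by evapotranspiration and irrigable area. The function names and the example figures are hypothetical; the constant 0.62 is the approximate number of gallons in one inch of water over one square foot.

```python
GALLONS_PER_SQFT_INCH = 0.62  # approx. gallons in 1 inch of water over 1 sq ft


def gpcd(total_gallons, population, days):
    """Gallons per capita per day over a reporting period."""
    return total_gallons / population / days


def outdoor_use_ratio(outdoor_gallons, et_inches, irrigable_sqft):
    """Outdoor use relative to the water implied by ET over irrigable area.

    A ratio of 1.0 means applied water matched reference evapotranspiration
    over the irrigable area for the period.
    """
    reference_gallons = et_inches * irrigable_sqft * GALLONS_PER_SQFT_INCH
    return outdoor_gallons / reference_gallons


# Hypothetical agency figures for one 30-day month.
print(gpcd(30_000_000, 10_000, 30))                         # 100.0
print(round(outdoor_use_ratio(6_200_000, 5.0, 2_000_000), 2))
```

Two agencies with the same gallons per capita can have very different outdoor ratios if one sits in a hotter microclimate or has larger lots, which is why the contextual variables matter.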
LIMITATIONS OF REPORTING TOTAL AND AVERAGE WATER USE BY UTILITY
The current monthly and annual total production and average water use statistics collected by the state obscure the substantial variation within urban water retailers, whose populations vary from a few thousand to over four million, which sit in vastly different microclimates, and which have widely different land use characteristics. This wide variation prevents meaningful comparisons across agencies, and the data fail to capture temporal and intra-agency variations. The key contextual variables described above (evapotranspiration, irrigable area, and demographics) are what allow for meaningful benchmarking of water use at an aggregate level. For example, utilities often have rebate programs, conservation outreach, and rate shifts ongoing simultaneously. With only total or average water use data available, analyzing what has worked to achieve water efficiency is challenging, if not impossible, at a statewide level.
NEW DATA INFRASTRUCTURE TO POWER A QUANTUM LEAP FORWARD
A coalition of local water utility managers has come together to test the value of integrating and standardizing customer use data by providing their daily metered water use data, along with the key contextual information described above (ET, irrigable area, population), to the Strategic California Urban Water Use Analytics (SCUBA) data warehouse. The increased knowledge from the Data Collaborative analytics is intended to give utilities a radically more rapid view of program effectiveness, cost per program, and how to better reach and respond to customer water use behaviors. In addition, statewide data integration and standardization will allow for comparative analyses of customer usage changes attributed to water conservation initiatives, such as turf rebates and public outreach, as well as projections of future patterns.
This new SCUBA data warehouse will help routinize best-in-class econometric evaluations, such as measuring the impact of service-area-specific rates on each utility's unique customer water demand. These measurements will support water managers in implementing rate structures that both send a conservation price signal and ensure revenue stability with lower water sales. The underlying data infrastructure has broader uses in delivering analytics to water utility managers better, faster, and cheaper to power targeted marketing, program evaluation, and demand forecasting. We'll elaborate on those in upcoming posts, but the underlying value of measuring demand management actions at the customer level is simple. California needs prudent water management to navigate an uncertain future, and, as Peter Drucker's famous quote suggests, “you can’t manage what you can’t measure.”