Code and Data for CAMELS-Chem Concentration-Discharge Analysis


Authors:
Owners: This resource does not have an owner who is an active HydroShare user. Contact CUAHSI (help@cuahsi.org) for information on this resource.
Type: Resource
Storage: The size of this resource is 348.9 MB
Created: Jul 17, 2023 at 11:15 p.m.
Last updated: Jun 05, 2024 at 12:37 p.m.
DOI: 10.4211/hs.eddb06e91a914618a89a63bb2c2774e0
Citation: See how to cite this resource
Content types: Single File Content 
Sharing Status: Published

Abstract

Here we provide the data and R scripts to complete the analyses and create the figures presented in the manuscript titled “Solute export patterns across the contiguous United States” by Kincaid et al. (2024) in Hydrological Processes. Importantly, this resource contains paired solute concentration (C) and discharge (Q) data for 11 solutes from CAMELS-Chem (Sterle et al. 2024; https://doi.org/10.5194/hess-28-611-2024). This relational database was built upon the CAMELS dataset (https://doi.org/10.5194/hess-21-5293-2017), an existing dataset of catchment and hydroclimatic attributes from relatively undisturbed catchments across the contiguous United States. The version of CAMELS-Chem provided here has US Geological Survey (USGS) National Water Information System (NWIS) C and Q data for 506 catchments. C and Q measurements span from 1898 to 2020, with the first paired C-Q sample occurring in 1924. Solutes include aluminum (Al), calcium (Ca), chloride (Cl), dissolved organic C and N (DOC, DON), magnesium (Mg), nitrate (NO3), potassium (K), silica (Si), sodium (Na), and sulfate (SO4). Of note, a shorter version of the CAMELS-Chem database, which spans 1980 to 2018 but includes more stream water quality constituents as well as atmospheric deposition data, is described in Sterle et al. (2024; https://doi.org/10.5194/hess-28-611-2024) and is available for download via HydroShare (http://www.hydroshare.org/resource/841f5e85085c423f889ac809c1bed4ac).

The R scripts and data files provided in this resource are intended to allow users to replicate the tables and figures in the Kincaid et al. manuscript. Specifically, we provide all files needed to complete the analyses coded in the R script 9_analyses_figures_for_manuscript.R. The other R scripts and data files provided should also allow users to replicate intermediate steps in the analyses. See the README file for more details. Analyses provided in the R scripts include: modeling C-Q relationships with the power-law function using data-driven Bayesian segmented regression; conducting hierarchical clustering to group catchments based on catchment attributes; building random forest models to select catchment attribute correlates of C-Q metrics; conducting flow-duration exceedance probability analyses; and general code for figures, tables, and other statistics presented in the Kincaid et al. manuscript.

The metadata for the CAMELS-Chem dataset (camels_chem_all_2022-02-25.csv) is available in camels_chem_metadata.csv.

Coverage

Spatial

Coordinate System/Geographic Projection:
WGS 84 EPSG:4326
Coordinate Units:
Decimal degrees
Place/Area Name:
Contiguous United States
North Latitude
49.6058°
East Longitude
-66.9141°
South Latitude
24.5202°
West Longitude
-125.2734°

Content

README.txt

R workflow for Kincaid et al. 2024 in Hydrological Processes

The code and data files provided in this repository are intended to allow users to replicate the tables and figures in the Kincaid et al. 2024 manuscript. Specifically, we provide all files needed to complete the analysis coded in 9_analyses_figures_for_manuscript.R. Using the other scripts and data files provided, users should also be able to replicate intermediate steps in the analysis. The metadata for the CAMELS-Chem dataset (camels_chem_all_2022-02-25.csv) is available in metadata/camels_chem_metadata.csv.

Table of contents:
1. Initial file directory structure
2. (How to) Fit C-Q using Bayesian Linear and Segmented Regression
3. (How to) Hierarchical clustering of CAMELS catchment attributes
4. (How to) Feature selection for correlates of C-Q model class/archetype and slope value
5. (How to) Flow-duration exceedance probability analysis
6. (How to) Figures, tables, and statistics for manuscript

1. Initial file directory structure:
* Note: these folders and files should be located within an R project
* /code/
o 1_summarize_camels_q.R
o 2_prep_data_for_bayes_mcp.R
o 3_fit_lm_bayes_mcp_high_n_sites.R
o 3_fit_lm_bayes_mcp.R
o 4_classify_cq_pattern.R
o 5_hac_cluster_camels_catchments.R
o 6_random_forest_prep_data.R
o 7_random_forest_feature_select.R
o 8_flow_duration_breakpoint_analysis.R
o 9_analyses_figures_for_manuscript.R
* /code/functions_fit_bayes/
o functions_fit_bayes_with_mcp_high_n_sites.R
o functions_fit_bayes_with_mcp.R
* /data/
o ansi_us_state_codes.csv
o /camels_attributes/
* 8 files: camels_clim.txt to camels_vege.txt
o camels_chem_all_2022-02-25.csv
o camels_cq_breakpoint_flowdur_perc.csv
o camels_hac_clustered_all_attributes.csv
o camels_modClasses_CQparams_imputedCAMELSattrs_forRF.csv
o moatar_etal_2017_wrr_data.csv
o table_of_camels_attributes.csv
o USGS_gauge_info.csv
o usgs_streamflow/
* 18 folders with .txt files with streamflow data
* /fit_results/
o All files ending in CQ_data_and_classification.csv in fit_results
* 11 files, 1 for each solute in the manuscript analysis
o All files ending in metrics.csv in fit_results
* 20 files, 1-2 for each solute in the manuscript analysis
* /results_random_forest/
o All files ending in rf_performance_metrics.csv
* 11 files, 1 for each solute in the manuscript analysis
o feature_importance/
* All files ending in rf_feature_importance.csv
* 11 files, 1 for each solute in the manuscript analysis

2. Fit C-Q using Bayesian Linear and Segmented Regression
1. Summarize USGS discharge data to get minimum and maximum discharge values for each gauge site
a. R script: 
i. 1_summarize_camels_q.R
b. Data files required: 
i. usgs_streamflow folder containing discharge data from CAMELS website
1. Downloaded from the CAMELS website (https://gdex.ucar.edu/dataset/camels/file.html) on 12/14/21
c. Resulting file(s): 
i. USGS_dailyDischarge_range_all_sites.csv
2. Prepare concentration-discharge data from the CAMELS-Chem dataset (one CSV per solute; solute indicated by XX in resulting file name)
a. R script: 
i. 2_prep_data_for_bayes_mcp.R
b. Data files required:
i. camels_chem_all_2022-02-25.csv
ii. USGS_gauge_info.csv
iii. ansi_us_state_codes.csv
iv. USGS_dailyDischarge_range_all_sites.csv
c. Resulting file(s):
i. camels_chem_for_bayes_mcp_XX.csv
1. Note: XX above is the solute abbreviation
3. Repeat previous step for each solute of interest
4. Fit linear regression of log(C) ~ log(Q) with Bayesian analysis, using JAGS via the mcp R package as the frontend
a. R script:
i. Primary R scripts:
1. 3_fit_lm_bayes_mcp.R
a. Try running this script on your data first. If a site has too many paired C-Q measurements (e.g., n > 1000), the fit may exceed the memory capacity of the computer you are using. In that case, I run the high-n sites with a different R script, which calls functions that use fewer iterations and so decrease memory demand.
b. Note: I often create a separate R script file for each solute I run and will include the solute abbreviation in the file name
2. 3_fit_lm_bayes_mcp_high_n_sites.R
a. This is the alternative R script for sites with a large number of paired C-Q measurements (e.g., n > 1000).
ii. Supporting R scripts with functions for fitting the linear regressions:
1. functions_fit_bayes_with_mcp.R
a. Use when running 3_fit_lm_bayes_mcp.R
b. Should be in folder called functions_fit_bayes
2. functions_fit_bayes_with_mcp_high_n_sites.R
a. Use when running 3_fit_lm_bayes_mcp_high_n_sites.R
b. Should be in folder called functions_fit_bayes
b. Data files required:
i. camels_chem_for_bayes_mcp_XX.csv
1. Note: XX above is the solute abbreviation
c. Resulting file(s):
i. XX_fit_param_estimates.csv
ii. XX_fit_comparison_metrics.csv
iii. XX_fit_posterior_draws_subsample.csv
iv. XX_fit_residuals.csv
v. XX_plots_null_fit.pdf
vi. XX_plots_null_chains.pdf
vii. XX_plots_full_fit.pdf
viii. XX_plots_full_chains.pdf
ix. XX_plots_compare_null_to_full.pdf
1. Note: If running the high_n R scripts, the file names above will include high_n_sites after XX_
5. Repeat previous step for each solute of interest
6. Classify the C-Q patterns into 1 of 13 C-Q model classes or archetypes (see Underwood et al. 2017) based on the Bayesian regressions fit previously in 3_fit_lm_bayes_mcp.R
a. R script:
i. 4_classify_cq_pattern.R
1. Note: I often create a separate R script file for each solute I run and will include the solute abbreviation in the file name
b. Data files required:
i. XX_fit_param_estimates.csv
ii. XX_fit_comparison_metrics.csv
iii. camels_chem_for_bayes_mcp_XX.csv
c. Resulting file(s):
i. XX_plots_CQ_classifications.pdf
ii. XX_CQ_data_and_classification.csv
7. Repeat previous step for each solute of interest
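
As a rough illustration of step 4, the sketch below fits the power law C = aQ^b as a straight line in log-log space and then searches for a single breakpoint in log10(Q). It is not the workflow's method: the R scripts use Bayesian regression through the mcp package and JAGS, whereas this sketch substitutes ordinary least squares with a grid search, written in Python with the standard library only. The function names, the min_points setting, and the synthetic data are assumptions made for the example.

```python
import math


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(columns, y):
    """Return (coefficients, sum of squared errors) for y regressed on the given columns."""
    k = len(columns)
    ata = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    aty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    coef = solve(ata, aty)
    fitted = [sum(coef[j] * columns[j][i] for j in range(k)) for i in range(len(y))]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coef, sse


def fit_power_law(q, c):
    """Fit log10(C) = log10(a) + b * log10(Q); coef is [log10(a), b]."""
    x = [math.log10(v) for v in q]
    y = [math.log10(v) for v in c]
    return least_squares([[1.0] * len(x), x], y)


def fit_segmented(q, c, min_points=3):
    """Grid-search one breakpoint in log10(Q) for a two-slope fit joined at the breakpoint."""
    pairs = sorted((math.log10(qi), math.log10(ci)) for qi, ci in zip(q, c))
    x = [p[0] for p in pairs]
    y = [p[1] for p in pairs]
    best = None
    # Candidate breakpoints are observed log10(Q) values, keeping min_points on each side.
    for i in range(min_points, len(x) - min_points):
        bp = x[i]
        hinge = [max(xi - bp, 0.0) for xi in x]  # zero below the breakpoint
        coef, sse = least_squares([[1.0] * len(x), x, hinge], y)
        if best is None or sse < best["sse"]:
            best = {"breakpoint_log10_q": bp, "slope_low": coef[1],
                    "slope_high": coef[1] + coef[2], "sse": sse}
    return best


# Synthetic data (not CAMELS-Chem): flat C below log10(Q) = 1.5, slope -0.5 above it.
log_q = [0.25 * i for i in range(13)]
q_demo = [10 ** v for v in log_q]
c_demo = [10 ** (1.0 - 0.5 * max(v - 1.5, 0.0)) for v in log_q]
result = fit_segmented(q_demo, c_demo)
print(result)
```

Comparing the sum of squared errors from fit_power_law and fit_segmented gives a crude sense of whether a breakpoint improves the fit; the workflow itself compares the null and full Bayesian models instead.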


3. Hierarchical clustering of CAMELS catchment attributes
1. Use hierarchical clustering to cluster CAMELS gauges/catchments using CAMELS attributes
a. R script:
i. 5_hac_cluster_camels_catchments.R
b. Data files required:
i. All files ending in CQ_data_and_classification.csv in fit_results
ii. All .txt files from camels_attributes folder
1. These files were downloaded from https://gdex.ucar.edu/dataset/camels/file.html in March 2022
iii. table_of_camels_attributes.csv
iv. camels_chem_all_2022-02-25.csv
c. Resulting file(s):
i. camels_hac_clustered_all_attributes.csv
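
The clustering step can be sketched as follows, under assumptions this README does not confirm: attributes are z-scored, distances are Euclidean, and clusters are merged by average linkage. The linkage and distance used in 5_hac_cluster_camels_catchments.R are not stated here, so treat those choices, the function names, and the made-up attribute values as placeholders. The sketch is in Python with the standard library only, not the R used by the workflow.

```python
import math
import statistics


def zscore_columns(rows):
    """Standardize each attribute (column) to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(col) for col in cols]
    sds = [statistics.stdev(col) for col in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between the members of two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hac(points, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Made-up values for two hypothetical attributes per catchment (not CAMELS data).
attributes = [
    [0.30, 0.90], [0.40, 0.80], [0.35, 0.85],   # wetter, forested
    [2.00, 0.10], [2.20, 0.05], [1.90, 0.15],   # drier, sparsely forested
]
labels = hac(zscore_columns(attributes), n_clusters=2)
print(labels)
```

Standardizing first keeps attributes with large numeric ranges from dominating the distance calculation.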

4. Feature selection for correlates of C-Q model class/archetype and slope value
1. Prepare data for random forest models
a. R script:
i. 6_random_forest_prep_data.R
b. Data files required:
i. All files ending in CQ_data_and_classification.csv in fit_results
ii. All .txt files from camels_attributes folder
1. These files were downloaded from https://gdex.ucar.edu/dataset/camels/file.html in March 2022
iii. table_of_camels_attributes.csv
iv. camels_chem_all_2022-02-25.csv
c. Resulting file(s):
i. camels_modClasses_CQparams_imputedCAMELSattrs_forRF.csv
2. Train random forest classification (CQ model classes) & regression (CQ slope) models to do feature selection on the CAMELS variables most important for predicting these response variables
a. R script:
i. 7_random_forest_feature_select.R
1. Note: I often create a separate R script file for each solute I run and will include the solute abbreviation in the file name
b. Data files required:
i. camels_modClasses_CQparams_imputedCAMELSattrs_forRF.csv
c. Resulting file(s):
i. XX_rf_hyperparameters.csv
ii. XX_rf_performance_metrics.csv
iii. XX_rf_feature_importance.csv
3. Repeat previous step for each solute of interest

5. Flow-duration exceedance probability analysis
1. Estimate the flow-duration exceedance probabilities at which thresholds/breakpoints occur in the C-Q relationships
a. R script:
i. 8_flow_duration_breakpoint_analysis.R
b. Data files required:
i. usgs_streamflow folder containing discharge data from CAMELS website
1. Downloaded from the CAMELS website (https://gdex.ucar.edu/dataset/camels/file.html) on 12/14/21
ii. All files ending in CQ_data_and_classification.csv in fit_results
c. Resulting file(s):
i. camels_cq_breakpoint_flowdur_perc.csv
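
The idea behind this step can be sketched as below: rank the daily discharge record, convert ranks to exceedance probabilities, and look up the probability at which a C-Q breakpoint discharge falls. The Weibull plotting position (rank / (n + 1)) is an assumption; this README does not say which formula 8_flow_duration_breakpoint_analysis.R uses. The function names and numbers are illustrative, and the sketch is in Python with the standard library only, not the R used by the workflow.

```python
def flow_duration_curve(daily_q):
    """Return (discharge, exceedance percent) pairs, highest flow first.

    Uses the Weibull plotting position, 100 * rank / (n + 1), which is an
    assumption of this sketch rather than a documented choice of the workflow.
    """
    ranked = sorted(daily_q, reverse=True)
    n = len(ranked)
    return [(q, 100.0 * rank / (n + 1)) for rank, q in enumerate(ranked, start=1)]


def exceedance_probability(daily_q, q_value):
    """Percent of time that daily discharge equals or exceeds q_value."""
    n = len(daily_q)
    rank = sum(1 for q in daily_q if q >= q_value)  # number of days at or above q_value
    return 100.0 * rank / (n + 1)


# Illustrative record of 99 daily flows (1, 2, ..., 99) and a breakpoint at Q = 90,
# in the same units as the daily record.
daily_record = list(range(1, 100))
breakpoint_q = 90
print(exceedance_probability(daily_record, breakpoint_q))
```

A breakpoint with a low exceedance percent sits in the high-flow tail of the record, while one near 50 percent sits close to the median flow.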

6. Figures, tables, and statistics for manuscript
1. Create figures and do statistical analyses for the Kincaid et al. 2024 Hydrological Processes manuscript
a. R script:
i. 9_analyses_figures_for_manuscript.R
b. Data files required:
i. camels_chem_all_2022-02-25.csv
ii. All files ending in CQ_data_and_classification.csv in fit_results
iii. All files ending in metrics.csv in fit_results
iv. All .txt files from camels_attributes folder
1. These files were downloaded from https://gdex.ucar.edu/dataset/camels/file.html in March 2022
v. camels_hac_clustered_all_attributes.csv
vi. table_of_camels_attributes.csv
vii. USGS_gauge_info.csv
viii. camels_cq_breakpoint_flowdur_perc.csv
ix. camels_modClasses_CQparams_imputedCAMELSattrs_forRF.csv
x. moatar_etal_2017_wrr_data.csv
1. Moatar et al 2017 WRR supp info data
c. Resulting file(s):
i. table_s1_summary_conc_q_by_solute_and_cluster.csv
ii. table_s3_fpc_q_thresholds.csv
iii. table_s4_all_model_classifications.csv
iv. table_s5_summary_gauges_per_cluster_by_modclass_and_slope.csv
v. table_s5_summary_kw_slopes_by_cluster.csv
vi. table_s6_summary_of_b_cv.csv
vii. table_s7_summary_kw_slopes_by_cluster.csv
viii. table_s8_summary_kw_cvratio_by_cluster.csv
ix. fig_s2_usmap_cv_ratio.png
x. fig_s2_cv_ratio_by_cluster.png
xi. fig_1_prop_modclass_by_constit.png
xii. fig_s4_cq_thresholds_boxplot.png
xiii. fig_2_slopes_and_cv_example.png
xiv. fig_3a_plot_map_clusters_5.png
xv. fig_3b_all_zscores_clusters_5.png
xvi. fig_63a_attribute_values_1.png
xvii. fig_s6a_attribute_values_2.png
xviii. fig_s6b_attribute_cat_dist.png
xix. fig_4a_usmap_modclass.png
xx. fig_4b_modclass_by_cluster_legend.png
xxi. fig_4c_usmap_slope.png
xxii. fig_4d_slope_by_cluster.png
xxiii. fig_s7_mosaic_XX.png
1. One for each solute
xxiv. fig_s8_corr_slope_horiz.png
xxv. fig_5_feat_impt.png

Credits

Funding Agencies

This resource was created using funding from the following sources:
Agency Name Award Title Award Number
National Science Foundation NSF EAR-2012123
National Science Foundation NSF EAR-2012080
National Science Foundation NSF OIA-2033995

How to Cite

Kincaid, D., K. Underwood (2024). Code and Data for CAMELS-Chem Concentration-Discharge Analysis, HydroShare, https://doi.org/10.4211/hs.eddb06e91a914618a89a63bb2c2774e0

This resource is shared under the Creative Commons Attribution CC BY.

http://creativecommons.org/licenses/by/4.0/
