This function assesses the user supplied specifications in the argument, analySpec, and prepares the data (argument df) for analysis. In those cases where the user doesn't supply a needed specification, a basic option is supplied by this function.
Data frame of water quality data
Specifications for analysis
Optional reports about parameters, layers and stations [default = c(0,1,2,3)]
User-supplied dataframe with list of parameters [default = NA]
User-supplied dataframe with list of stations [default = NA]
User-supplied dataframe with list of layers [default = NA]
Returns a list. Use dfr[["df"]] and dfr[["analySpec"]] to extract updated data frame and updated (or created) analySpec. analySpec is a list that includes the following components:
analyTitle - Analysis trend title
parameterFilt - Parameter filter used for data down selection
stationFilt - Station filter used for data down selection
dateFilt - Date filter for data down selection (default = c( as.POSIXct('1984-01-01') , as.POSIXct(Sys.time())))
setTZ - time zone (default = "America/New_York")
layerFilt - Layer filter
layerAggOption - Layer averaging option (default = 0). Other options are: 1: combine "S" & "AP" ("SAP"); 2: combine "B" & "BP" ("BBP"); 3: opt 1 & 2 ("SAP", "BBP"); 4: combine all ("ALL")); 5: combine "S" and "B" ("SB")
obsMin - Minimum number of observations required to allow GAM analysis to proceed for a specific station, parameter, and layer combination (default = 60).
obsMinInter - Minimum number of observations required to allow GAM analysis to proceed for a specific intervention (default = 10).
gamAlpha - Alpha level used GAM analyses (default = 0.05).
censorTrim - Values to apply for trimming data due to too much censoring (default = c(0.5, 0.4)). First argument indicates fraction of observations in a year that are allowed to be censored. Second argument is the fraction of years, starting from the beginning of the record, that are allowed to be "flagged" for too much censoring. A minimum of two years must have too much censoring before data are removed. The default settings can be read as no more than 40 percent of the beginning years of data are allowed to have more than 50 percent censoring before the beginning portion of the record is trimmed. For example, years 1 and 3 of data have more than 50 percent censoring then the first three years of data are trimmed. Similarly, for years 1 and 4, then the first four years are removed. If years 1 and 5 have more than 50 percent censoring the data are kept since 2/5 is not greater than 0.4. Changing this setting to, say, c(0.2,0.4) would require that 80
gamModels - model formulations. See baytrends::loadModels() for simplified approach for selecting which built-in models to include
showGamNumOnPlot - Show gam option (i.e., 0-6) on gam plots (TRUE/FALSE)
gamDiffPeriods - list of time periods (years) used for computing changes (differences). The default options are: full record, 1999/00-present, and 2005/06-present.
gamDiffSeasons - list List of seasons used for sub-annual analyses of computing differences. The default options include the following: All (months 1:12), Spring1 (3:5), Spring2 (months: 4:6)), Summer1 (months: 6:9)), Summer2 (months: 7:9)), SAV1 (months: 4:10)), SAV2 (months: 3:5,9:11)), Winter (months: 1:2)), and Fall (months: 10:12))).
gamDiffNumChgYrs - number of years to use in computing differences.
gamPenalty - allow the user to set the mgcv::gam select argument to TRUE, FALSE, or baytrend algorithm default (default = NA). When the default option is specified, then the mgcv::gam select argument is set to FALSE when none of the data are censored; otherwise (when some censoring exists in the data set) the select argument is set to TRUE
gamPenaltyCrit - edf and F-stat values used to flag ANOVA table results (default = c(1, 9e9))
gamCoeffDeltaMaxCrit - convergence criteria for expectation maximization (default = 1e-6)
gamFlw_Sal.Wgt.Perc - percentiles of flow [or salinity] to use for computing flow [salinity] averaged result (default = c(0.05, 0.25, 0.50, 0.75, 0.95))
gamLegend - default settings for gam figure legend
idVar - primary key for data frame returned as df
depVarList - data frame of dependent variables (useful for setting up gam analyses in for loops)
stationList - data frame of stations (useful for setting up gam analyses in for loops)
layerList - data frame of layers (useful for setting up gam analyses in for loops)
The supplied data frame, df, is a data frame with the variables station, date, and layer included along with multiple additional columns for a variety of water quality variables structured as numeric fields or survival::Surv objects. An example data frame, dataCensored, is included with baytrends as an example.
The argument, analySpec, is a list that includes basic specifications for performing GAM analyses. The components in analySpec are identified below. The user may create analySpec (which can include all or some of the below components; and pass the user-supplied analySpec to this function. Or, the user may accept the default argument and allow analysisOrganizeData to create and return analySpec. If the user passes a user-supplied analySpec, then analysisOrganizeData will "fill in" required arguments not provided by the user. The user can also adjust analySpec after it is returned from analysisOrganizeData although requirements for down selecting the data frame, df, or aggregating data by layer would need to be passed to analysisOrganizeData.
The default setting for the argument report is to provide tabular summary reports about the parameters, stations, and layers to be analyzed. Setting report=NA will suppress the tabular summary reports.
The user can supply their own parameterList, stationMasterList, or layerLukup data sets; or the user can use the example data frames included with baytrends.
The following steps are performed by analysisOrganizeData:
1) Review user supplied specifications through the analySpec argument. Fill in with default values. For example, if the user doesn't specify a dataframe with a list of stations, parameters, or layers, then the built-in data frames, baytrends::stationMasterList, baytrends::parameterList, and baytrends::layerLukup are used. Some other default values include the following: date range (1/1/1984-present), parameter list (all parameters in data set parameterList), layers (all layers in data set layerLukup), layer aggregation option (0, no aggregation), minimum number of observations to perform a GAM analysis (60), GAM formulas (Linear Trend with Seasonality, Non-linear Trend with Seasonality, and Non-linear trend with Seasonality (plus Interactions)), GAM alpha level for plots and output (0.05), periods for multi-time period analyses using GAM (Full Record, 1999/00-Present, and 2005/06-Present), and seasons for sub-annual analyses using GAM (All, Spring1, Spring2, Summer1, Summer2, SAV1, SAV2, Winter, and Fall.)
2) Based on the settings in analySpec, the data frame df, is down selected based on parameters, stations, dates, and layers. A dependent variable list (depVarList) is created that includes variable descriptions, units, and log transformation selections. A station list (stationList) is created that includes the station ID, a selected USGS gage for correlating flow, and latitude/longitude.
3) Aggregate data layers. If analySpec$layerAggOption is equal to 0, then there is no aggregation. The analySpec$layerAggOption of 1 would result in averaging (mean) surface and above pycnocline data. In this example, records with layer = "S" and layer = "AP" are relabeled as "SAP". Other layerAggOption values are 2) "B"&"BP"; 3) "S"&"AP" and "B"&"BP"; 4) all layers; and 5) "S"&"B", respectively. A layer list (layerList) is created and returned.
4) Data are then averaged (mean) by date.
5) Date features are added. Columns for year, day of year (doy), decimal year (dyear), and month are added based on date. Note that the doy is based on a 366 day calendar regardless of leap year.
6) Reports on the number of records (0), parameters (1), layers (2) and stations (3) can be controlled with the reports option.
# run analysis relying on default specifications, examine analySpec for
# default options
dfr <- analysisOrganizeData(dataCensored)
#>
#>
#>
#>
#>
#> ### Record Count
#>
#>
#>
#>
#> `Beginning Number of Records: 13062`
#>
#>
#>
#> `Number of Records After Processing: 13062`
#>
#>
#>
#>
#> ### Parameters
#>
#>
#>
#>
#> *Table: List of Parameters.*
#>
#>
#>
#>
#>
#> Dep. Var. Parameter Name Units Log Tran. GAM Dep. Var.
#> --------- ---------------------------- ----- --------- -------------
#> secchi Secchi Depth m FALSE secchi
#> chla Chlorophyll a (Corrected) ug/L TRUE lnchla
#> do Dissolved Oxygen mg/L FALSE do
#> tn Total Nitrogen mg/L TRUE lntn
#> tp Total Phosphorus mg/L TRUE lntp
#> po4 Orthophosphorus mg/L TRUE lnpo4
#> tdp Total Dissolved Phosphorus mg/L TRUE lntdp
#> no23 Nitrite + Nitrate mg/L TRUE lnno23
#> nh4 Ammonium mg/L TRUE lnnh4
#> tdn Total Dissolved Nitrogen mg/L TRUE lntdn
#> din Dissolved Inorganic Nitrogen mg/L TRUE lndin
#> salinity Salinity ppt FALSE salinity
#> tss Total Suspended Solids mg/L TRUE lntss
#> wtemp Water Temperature deg C FALSE wtemp
#>
#> ### Layers
#>
#>
#>
#>
#> *Table: List of Layers.*
#>
#>
#>
#>
#>
#> Layer ID Layer Name
#> -------- ----------------
#> S Surface
#> AP Above Pycnocline
#> BP Below Pycnocline
#> B Bottom
#>
#> ### Models
#>
#>
#>
#>
#> *Table: List of Models.*
#>
#>
#>
#>
#>
#> Option Model
#> ------ ----------------------------------------------
#> 0 Linear Trend with Seasonality
#> 1 Non-linear Trend with Seasonality
#> 2 Non-linear trend with Seas+Int
#> 3 Non-linear trend with Seas+Int. & Intervention
#> 4 Non-linear trend with Seas+Int. & Hydro Adj
#>
#> ##
#>
#> ### Stations
#>
#>
#>
#>
#> *Table: List of Stations.*
#>
#>
#>
#>
#>
#> Station ID Latitude Longitude CB 92 Seg. Flow Adj. Gage Mth. Group
#> ---------- -------- --------- ---------- -------------- ----------
#> CB3.3C 38.9960 -76.3597 CB3MH 01578310 MD-Main
#> CB4.1C 38.8259 -76.3994 CB4MH 01578310 MD-Main
#> CB5.4 37.8001 -76.1747 CB5MH_VA 01578310 VA-All
#> TF5.5 37.3126 -77.2328 JMSTF1 02035000 VA-All
#> EE2.1 38.6549 -76.2643 CHOMH1 01491000 MD-Trib
#> EE3.0 38.2809 -76.0103 FSBMH 01578310 MD-Trib
#> TF2.2 38.6907 -77.1111 POTTF_MD 01646500 MD-Potomac
#> LE2.2 38.1576 -76.5980 POTMH_MD 01646500 MD-Potomac
df <- dfr[["df"]]
analySpec <- dfr[["analySpec"]]
# analyze bottom dissolved oxygen at 2 stations using only data from 1/1/1995-12/31/2015
analySpec <-list()
analySpec$parameterFilt <- c('do')
analySpec$layerFilt <- c('B')
analySpec$stationFilt <- c('CB3.3C', 'CB5.4')
analySpec$dateFilt <- as.POSIXct(c("1995-01-01", "2015-12-31"))
dfr <- analysisOrganizeData(dataCensored, analySpec)
#>
#>
#>
#>
#>
#> ### Record Count
#>
#>
#>
#>
#> `Beginning Number of Records: 13062`
#>
#>
#>
#> `Number of Records After Processing: 608`
#>
#>
#>
#>
#> ### Parameters
#>
#>
#>
#>
#> *Table: List of Parameters.*
#>
#>
#>
#>
#>
#> Dep. Var. Parameter Name Units Log Tran. GAM Dep. Var.
#> --------- ---------------- ----- --------- -------------
#> do Dissolved Oxygen mg/L FALSE do
#>
#> ### Layers
#>
#>
#>
#>
#> *Table: List of Layers.*
#>
#>
#>
#>
#>
#> Layer ID Layer Name
#> -------- ----------
#> B Bottom
#>
#> ### Models
#>
#>
#>
#>
#> *Table: List of Models.*
#>
#>
#>
#>
#>
#> Option Model
#> ------ ----------------------------------------------
#> 0 Linear Trend with Seasonality
#> 1 Non-linear Trend with Seasonality
#> 2 Non-linear trend with Seas+Int
#> 3 Non-linear trend with Seas+Int. & Intervention
#> 4 Non-linear trend with Seas+Int. & Hydro Adj
#>
#> ##
#>
#> ### Stations
#>
#>
#>
#>
#> *Table: List of Stations.*
#>
#>
#>
#>
#>
#> Station ID Latitude Longitude CB 92 Seg. Flow Adj. Gage Mth. Group
#> ---------- -------- --------- ---------- -------------- ----------
#> CB3.3C 38.9960 -76.3597 CB3MH 01578310 MD-Main
#> CB5.4 37.8001 -76.1747 CB5MH_VA 01578310 VA-All
df <- dfr[["df"]]
analySpec <- dfr[["analySpec"]]