This function assesses the user supplied specifications in the argument, analySpec, and prepares the data (argument df) for analysis. In those cases where the user doesn't supply a needed specification, a basic option is supplied by this function.

analysisOrganizeData(
  df,
  analySpec = list(),
  reports = c(0, 1, 2, 3, 4),
  parameterList = NA,
  stationMasterList = NA,
  layerLukup = NA
)

Arguments

df

Data frame of water quality data

analySpec

Specifications for analysis

reports

Optional reports about parameters, layers and stations [default = c(0,1,2,3)]

parameterList

User-supplied dataframe with list of parameters [default = NA]

stationMasterList

User-supplied dataframe with list of stations [default = NA]

layerLukup

User-supplied dataframe with list of layers [default = NA]

Value

Returns a list. Use dfr[["df"]] and dfr[["analySpec"]] to extract updated data frame and updated (or created) analySpec. analySpec is a list that includes the following components:

analyTitle - Analysis trend title

parameterFilt - Parameter filter used for data down selection

stationFilt - Station filter used for data down selection

dateFilt - Date filter for data down selection (default = c( as.POSIXct('1984-01-01') , as.POSIXct(Sys.time())))

setTZ - time zone (default = "America/New_York")

layerFilt - Layer filter

layerAggOption - Layer averaging option (default = 0). Other options are: 1: combine "S" & "AP" ("SAP"); 2: combine "B" & "BP" ("BBP"); 3: opt 1 & 2 ("SAP", "BBP"); 4: combine all ("ALL")); 5: combine "S" and "B" ("SB")

obsMin - Minimum number of observations required to allow GAM analysis to proceed for a specific station, parameter, and layer combination (default = 60).

obsMinInter - Minimum number of observations required to allow GAM analysis to proceed for a specific intervention (default = 10).

gamAlpha - Alpha level used GAM analyses (default = 0.05).

censorTrim - Values to apply for trimming data due to too much censoring (default = c(0.5, 0.4)). First argument indicates fraction of observations in a year that are allowed to be censored. Second argument is the fraction of years, starting from the beginning of the record, that are allowed to be "flagged" for too much censoring. A minimum of two years must have too much censoring before data are removed. The default settings can be read as no more than 40 percent of the beginning years of data are allowed to have more than 50 percent censoring before the beginning portion of the record is trimmed. For example, years 1 and 3 of data have more than 50 percent censoring then the first three years of data are trimmed. Similarly, for years 1 and 4, then the first four years are removed. If years 1 and 5 have more than 50 percent censoring the data are kept since 2/5 is not greater than 0.4. Changing this setting to, say, c(0.2,0.4) would require that 80

gamModels - model formulations. See baytrends::loadModels() for simplified approach for selecting which built-in models to include

showGamNumOnPlot - Show gam option (i.e., 0-6) on gam plots (TRUE/FALSE)

gamDiffPeriods - list of time periods (years) used for computing changes (differences). The default options are: full record, 1999/00-present, and 2005/06-present.

gamDiffSeasons - list List of seasons used for sub-annual analyses of computing differences. The default options include the following: All (months 1:12), Spring1 (3:5), Spring2 (months: 4:6)), Summer1 (months: 6:9)), Summer2 (months: 7:9)), SAV1 (months: 4:10)), SAV2 (months: 3:5,9:11)), Winter (months: 1:2)), and Fall (months: 10:12))).

gamDiffNumChgYrs - number of years to use in computing differences.

gamPenalty - allow the user to set the mgcv::gam select argument to TRUE, FALSE, or baytrend algorithm default (default = NA). When the default option is specified, then the mgcv::gam select argument is set to FALSE when none of the data are censored; otherwise (when some censoring exists in the data set) the select argument is set to TRUE

gamPenaltyCrit - edf and F-stat values used to flag ANOVA table results (default = c(1, 9e9))

gamCoeffDeltaMaxCrit - convergence criteria for expectation maximization (default = 1e-6)

gamFlw_Sal.Wgt.Perc - percentiles of flow [or salinity] to use for computing flow [salinity] averaged result (default = c(0.05, 0.25, 0.50, 0.75, 0.95))

gamLegend - default settings for gam figure legend

idVar - primary key for data frame returned as df

depVarList - data frame of dependent variables (useful for setting up gam analyses in for loops)

stationList - data frame of stations (useful for setting up gam analyses in for loops)

layerList - data frame of layers (useful for setting up gam analyses in for loops)

Details

The supplied data frame, df, is a data frame with the variables station, date, and layer included along with multiple additional columns for a variety of water quality variables structured as numeric fields or survival::Surv objects. An example data frame, dataCensored, is included with baytrends as an example.

The argument, analySpec, is a list that includes basic specifications for performing GAM analyses. The components in analySpec are identified below. The user may create analySpec (which can include all or some of the below components; and pass the user-supplied analySpec to this function. Or, the user may accept the default argument and allow analysisOrganizeData to create and return analySpec. If the user passes a user-supplied analySpec, then analysisOrganizeData will "fill in" required arguments not provided by the user. The user can also adjust analySpec after it is returned from analysisOrganizeData although requirements for down selecting the data frame, df, or aggregating data by layer would need to be passed to analysisOrganizeData.

The default setting for the argument report is to provide tabular summary reports about the parameters, stations, and layers to be analyzed. Setting report=NA will suppress the tabular summary reports.

The user can supply their own parameterList, stationMasterList, or layerLukup data sets; or the user can use the example data frames included with baytrends.

The following steps are performed by analysisOrganizeData:

1) Review user supplied specifications through the analySpec argument. Fill in with default values. For example, if the user doesn't specify a dataframe with a list of stations, parameters, or layers, then the built-in data frames, baytrends::stationMasterList, baytrends::parameterList, and baytrends::layerLukup are used. Some other default values include the following: date range (1/1/1984-present), parameter list (all parameters in data set parameterList), layers (all layers in data set layerLukup), layer aggregation option (0, no aggregation), minimum number of observations to perform a GAM analysis (60), GAM formulas (Linear Trend with Seasonality, Non-linear Trend with Seasonality, and Non-linear trend with Seasonality (plus Interactions)), GAM alpha level for plots and output (0.05), periods for multi-time period analyses using GAM (Full Record, 1999/00-Present, and 2005/06-Present), and seasons for sub-annual analyses using GAM (All, Spring1, Spring2, Summer1, Summer2, SAV1, SAV2, Winter, and Fall.)

2) Based on the settings in analySpec, the data frame df, is down selected based on parameters, stations, dates, and layers. A dependent variable list (depVarList) is created that includes variable descriptions, units, and log transformation selections. A station list (stationList) is created that includes the station ID, a selected USGS gage for correlating flow, and latitude/longitude.

3) Aggregate data layers. If analySpec$layerAggOption is equal to 0, then there is no aggregation. The analySpec$layerAggOption of 1 would result in averaging (mean) surface and above pycnocline data. In this example, records with layer = "S" and layer = "AP" are relabeled as "SAP". Other layerAggOption values are 2) "B"&"BP"; 3) "S"&"AP" and "B"&"BP"; 4) all layers; and 5) "S"&"B", respectively. A layer list (layerList) is created and returned.

4) Data are then averaged (mean) by date.

5) Date features are added. Columns for year, day of year (doy), decimal year (dyear), and month are added based on date. Note that the doy is based on a 366 day calendar regardless of leap year.

6) Reports on the number of records (0), parameters (1), layers (2) and stations (3) can be controlled with the reports option.

Examples

# run analysis relying on default specifications, examine analySpec for
# default options
dfr <- analysisOrganizeData(dataCensored)
#> 
#> 
#>  
#> 
#> 
#> ### Record Count
#> 
#> 
#>  
#> 
#> `Beginning Number of Records:  13062`
#> 
#>  
#> 
#> `Number of Records After Processing:  13062`
#> 
#>  
#> 
#> 
#> ### Parameters
#> 
#> 
#>  
#> 
#> *Table: List of Parameters.*
#> 
#>  
#> 
#> 
#> 
#> Dep. Var.  Parameter Name                Units  Log Tran.  GAM Dep. Var.
#> ---------  ----------------------------  -----  ---------  -------------
#> secchi     Secchi Depth                  m      FALSE      secchi       
#> chla       Chlorophyll a (Corrected)     ug/L   TRUE       lnchla       
#> do         Dissolved Oxygen              mg/L   FALSE      do           
#> tn         Total Nitrogen                mg/L   TRUE       lntn         
#> tp         Total Phosphorus              mg/L   TRUE       lntp         
#> po4        Orthophosphorus               mg/L   TRUE       lnpo4        
#> tdp        Total Dissolved Phosphorus    mg/L   TRUE       lntdp        
#> no23       Nitrite + Nitrate             mg/L   TRUE       lnno23       
#> nh4        Ammonium                      mg/L   TRUE       lnnh4        
#> tdn        Total Dissolved Nitrogen      mg/L   TRUE       lntdn        
#> din        Dissolved Inorganic Nitrogen  mg/L   TRUE       lndin        
#> salinity   Salinity                      ppt    FALSE      salinity     
#> tss        Total Suspended Solids        mg/L   TRUE       lntss        
#> wtemp      Water Temperature             deg C  FALSE      wtemp        
#> 
#> ### Layers
#> 
#> 
#>  
#> 
#> *Table: List of Layers.*
#> 
#>  
#> 
#> 
#> 
#> Layer ID  Layer Name      
#> --------  ----------------
#> S         Surface         
#> AP        Above Pycnocline
#> BP        Below Pycnocline
#> B         Bottom          
#> 
#> ### Models
#> 
#> 
#>  
#> 
#> *Table: List of Models.*
#> 
#>  
#> 
#> 
#> 
#> Option  Model                                         
#> ------  ----------------------------------------------
#>      0  Linear Trend with Seasonality                 
#>      1  Non-linear Trend with Seasonality             
#>      2  Non-linear trend with Seas+Int                
#>      3  Non-linear trend with Seas+Int. & Intervention
#>      4  Non-linear trend with Seas+Int. & Hydro Adj   
#> 
#> ## 
#> 
#> ### Stations
#> 
#> 
#>  
#> 
#> *Table: List of Stations.*
#> 
#>  
#> 
#> 
#> 
#> Station ID  Latitude  Longitude  CB 92 Seg.  Flow Adj. Gage  Mth. Group
#> ----------  --------  ---------  ----------  --------------  ----------
#> CB3.3C       38.9960   -76.3597  CB3MH       01578310        MD-Main   
#> CB4.1C       38.8259   -76.3994  CB4MH       01578310        MD-Main   
#> CB5.4        37.8001   -76.1747  CB5MH_VA    01578310        VA-All    
#> TF5.5        37.3126   -77.2328  JMSTF1      02035000        VA-All    
#> EE2.1        38.6549   -76.2643  CHOMH1      01491000        MD-Trib   
#> EE3.0        38.2809   -76.0103  FSBMH       01578310        MD-Trib   
#> TF2.2        38.6907   -77.1111  POTTF_MD    01646500        MD-Potomac
#> LE2.2        38.1576   -76.5980  POTMH_MD    01646500        MD-Potomac
df        <- dfr[["df"]]
analySpec <- dfr[["analySpec"]]

# analyze bottom dissolved oxygen at 2 stations using only data from 1/1/1995-12/31/2015
analySpec <-list()
analySpec$parameterFilt <- c('do')
analySpec$layerFilt     <- c('B')
analySpec$stationFilt   <- c('CB3.3C', 'CB5.4')
analySpec$dateFilt      <- as.POSIXct(c("1995-01-01", "2015-12-31"))
dfr <- analysisOrganizeData(dataCensored, analySpec)
#> 
#> 
#>  
#> 
#> 
#> ### Record Count
#> 
#> 
#>  
#> 
#> `Beginning Number of Records:  13062`
#> 
#>  
#> 
#> `Number of Records After Processing:  608`
#> 
#>  
#> 
#> 
#> ### Parameters
#> 
#> 
#>  
#> 
#> *Table: List of Parameters.*
#> 
#>  
#> 
#> 
#> 
#> Dep. Var.  Parameter Name    Units  Log Tran.  GAM Dep. Var.
#> ---------  ----------------  -----  ---------  -------------
#> do         Dissolved Oxygen  mg/L   FALSE      do           
#> 
#> ### Layers
#> 
#> 
#>  
#> 
#> *Table: List of Layers.*
#> 
#>  
#> 
#> 
#> 
#> Layer ID  Layer Name
#> --------  ----------
#> B         Bottom    
#> 
#> ### Models
#> 
#> 
#>  
#> 
#> *Table: List of Models.*
#> 
#>  
#> 
#> 
#> 
#> Option  Model                                         
#> ------  ----------------------------------------------
#>      0  Linear Trend with Seasonality                 
#>      1  Non-linear Trend with Seasonality             
#>      2  Non-linear trend with Seas+Int                
#>      3  Non-linear trend with Seas+Int. & Intervention
#>      4  Non-linear trend with Seas+Int. & Hydro Adj   
#> 
#> ## 
#> 
#> ### Stations
#> 
#> 
#>  
#> 
#> *Table: List of Stations.*
#> 
#>  
#> 
#> 
#> 
#> Station ID  Latitude  Longitude  CB 92 Seg.  Flow Adj. Gage  Mth. Group
#> ----------  --------  ---------  ----------  --------------  ----------
#> CB3.3C       38.9960   -76.3597  CB3MH       01578310        MD-Main   
#> CB5.4        37.8001   -76.1747  CB5MH_VA    01578310        VA-All    
df        <- dfr[["df"]]
analySpec <- dfr[["analySpec"]]