Load/Clean CSV and TXT Data File — loadData • baytrends

Load a comma-delimited (*.csv) or tab-delimited (*.txt) file and perform some rudimentary data cleaning.

loadData(
  file = NA,
  folder = ".",
  pk = NA,
  remDup = TRUE,
  remNAcol = TRUE,
  remNArow = TRUE,
  convDates = TRUE,
  tzSel = "America/New_York",
  commChar = "#",
  naChar = NA
)

Arguments

file: file (can use wildcards, e.g., "*.csv")
folder: folder (i.e., directory to look in; can use a relative path)
pk: vector of columns that form the primary key for the data set
remDup: logical field indicating whether duplicate rows are deleted
remNAcol: logical field indicating whether columns with all NA are deleted
remNArow: logical field indicating whether rows with all NA are deleted
convDates: vector or logical field indicating whether date-like columns should be converted to POSIXct format (see details)
tzSel: time zone to use for date conversions (default: "America/New_York")
commChar: character for comment line to be skipped
naChar: characters to treat as NA

Value

Returns a data frame.

Details

This function reads in a single comma-delimited (*.csv) or tab-delimited (*.txt) file using either utils::read.table or utils::read.csv, depending on the file extension. If the file argument contains a wildcard (e.g., file='*.csv'), the function identifies the most recently modified csv or txt file in the folder and imports it.
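The wildcard lookup and the extension-based choice of reader can be sketched in base R as follows. This is only an illustration of the behavior described above, not the package's actual implementation; pick_latest_file and read_by_extension are hypothetical helper names, and the demo files are created in a temporary folder.

```r
# Hypothetical helpers for illustration; these are not baytrends functions.

# Return the most recently modified file in `folder` that matches the wildcard.
pick_latest_file <- function(file, folder = ".") {
  pattern <- utils::glob2rx(file)  # convert the wildcard to a regular expression
  candidates <- list.files(folder, pattern = pattern, full.names = TRUE)
  if (length(candidates) == 0) {
    stop("no file matching '", file, "' found in '", folder, "'")
  }
  mtimes <- file.info(candidates)$mtime
  candidates[which.max(as.numeric(mtimes))]
}

# Choose the reader from the file extension.
read_by_extension <- function(path, commChar = "#") {
  ext <- tolower(tools::file_ext(path))
  if (ext == "csv") {
    utils::read.csv(path, comment.char = commChar, stringsAsFactors = FALSE)
  } else if (ext == "txt") {
    utils::read.table(path, header = TRUE, sep = "\t",
                      comment.char = commChar, stringsAsFactors = FALSE)
  } else {
    stop("unsupported file extension: ", ext)
  }
}

# Demo: two csv files, the second one modified more recently.
demo_dir <- file.path(tempdir(), "loadData_wildcard_demo")
dir.create(demo_dir, showWarnings = FALSE)
old_file <- file.path(demo_dir, "old.csv")
new_file <- file.path(demo_dir, "new.csv")
writeLines(c("# comment line", "station,value", "A1,1"), old_file)
writeLines(c("# comment line", "station,value", "B2,2"), new_file)
Sys.setFileTime(old_file, Sys.time() - 3600)  # pretend old.csv is an hour old
Sys.setFileTime(new_file, Sys.time())

latest <- pick_latest_file("*.csv", folder = demo_dir)
wildcard_df <- read_by_extension(latest)
print(basename(latest))
print(wildcard_df)
```

In this sketch the comment line is dropped by passing commChar through to the comment.char argument of the reader, which is one plausible way to obtain the skipping behavior described for commChar.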

Some specific features of this function include the following:

1. Leading '0's in character strings that would otherwise be trimmed and treated as numeric variables (e.g., USGS flow gages, state and county FIPS codes) are preserved. To use this functionality effectively, enclose such values in quotes in the spreadsheet (e.g., "01578310"); when exported to csv or txt files, the field then appears in triple quotes (e.g., """01578310"""). Any column read in as integer is converted to numeric.

2. Rows and columns with no data (i.e., all NA) are deleted unless default settings for remNAcol and remNArow are changed to FALSE.

3. Completely duplicate rows are deleted unless default setting for remDup is changed to FALSE.

4. Rows beginning with '#' are skipped unless commChar is set to "".

5. If a primary key (either single or multiple columns) is selected, the function enforces the primary key by deleting duplicate entries based on the primary key. Columns corresponding to the primary key (when specified) are moved to the first columns.

6. If convDates is a vector (e.g., c('beginDate', 'endDate')), then a date conversion is attempted for the corresponding columns found in the input file. If TRUE, then a date conversion is attempted for all columns found in the input file with 'date' in the name. If FALSE, no date conversion is attempted.
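The cleaning steps in items 1 through 5 can be approximated in base R as shown below. This is a minimal sketch on made-up data, not the code loadData actually runs. In particular, keeping the first row for each primary key value is an assumption, since the text above does not say which duplicate is retained.

```r
# Toy data: one all-NA column, one all-NA row, one exact duplicate row,
# and two different rows that share the same primary key value ("A1").
raw <- data.frame(
  value   = c(1.5, 1.5, 2.0, 3.0, NA),
  count   = c(1L, 1L, 2L, 3L, NA),
  empty   = NA,
  station = c("A1", "A1", "A1", "B2", NA),
  stringsAsFactors = FALSE
)
pk <- "station"  # primary key column(s)

clean <- raw

# Item 2: drop columns, then rows, that contain only NA.
clean <- clean[, colSums(!is.na(clean)) > 0, drop = FALSE]
clean <- clean[rowSums(!is.na(clean)) > 0, , drop = FALSE]

# Item 3: drop rows that are complete duplicates of an earlier row.
clean <- clean[!duplicated(clean), , drop = FALSE]

# Item 5: enforce the primary key (assumption: the first occurrence is kept)
# and move the key columns to the front.
clean <- clean[!duplicated(clean[pk]), , drop = FALSE]
clean <- clean[c(pk, setdiff(names(clean), pk))]

# Item 1 (last sentence): convert integer columns to numeric (double).
int_cols <- vapply(clean, is.integer, logical(1))
clean[int_cols] <- lapply(clean[int_cols], as.numeric)

rownames(clean) <- NULL
print(clean)
```

After these steps the toy data frame has the columns station, value, and count, with one row per station.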

Some other common time zones include the following: America/New_York, America/Chicago, America/Denver, America/Los_Angeles, America/Anchorage, America/Honolulu, America/Jamaica, America/Managua, America/Phoenix, America/Metlakatla
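The three modes of convDates and the effect of tzSel can be illustrated with a small base R sketch. The helper convert_dates is hypothetical and is not the package's implementation. It assumes ISO-style date strings (YYYY-MM-DD) and a case-insensitive match on 'date' in the column name; the formats and matching rules that loadData itself uses are not stated above.

```r
# Hypothetical helper for illustration; not a baytrends function.
convert_dates <- function(df, convDates = TRUE, tzSel = "America/New_York") {
  if (is.character(convDates)) {
    # Vector of names: convert only the named columns that exist in df.
    cols <- intersect(convDates, names(df))
  } else if (isTRUE(convDates)) {
    # TRUE: convert every column with 'date' in its name
    # (assumption: the match ignores case).
    cols <- names(df)[grepl("date", names(df), ignore.case = TRUE)]
  } else {
    # FALSE: convert nothing.
    cols <- character(0)
  }
  for (col in cols) {
    df[[col]] <- as.POSIXct(df[[col]], tz = tzSel)
  }
  df
}

dates_raw <- data.frame(
  station   = c("A1", "B2"),
  beginDate = c("2020-01-15", "2020-07-01"),
  endDate   = c("2020-02-15", "2020-08-01"),
  stringsAsFactors = FALSE
)

all_dates <- convert_dates(dates_raw, convDates = TRUE)
one_date  <- convert_dates(dates_raw, convDates = "beginDate",
                           tzSel = "America/Chicago")
no_dates  <- convert_dates(dates_raw, convDates = FALSE)

str(all_dates)
str(one_date)
```

The time zone names are those from the Olson (tz) database, as listed by OlsonNames() in base R.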

A brief table reporting the results of the import is printed.

Note that columns containing only F, T, FALSE, or TRUE are stored as logical fields.
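Both the logical-field note and the leading-zero behavior in item 1 follow from how utils::read.csv infers column types. The sketch below uses read.csv directly on inline text (it does not call loadData) to show what happens to an unquoted code, a triple-quoted code, and a column of T/F values. Stripping the embedded quotes with gsub is an assumption about how the preserved value might be cleaned up, not a description of the package internals.

```r
# Unquoted codes: read.csv infers integer, so the leading zero is lost.
# A column holding only T/F is inferred as logical.
plain <- utils::read.csv(
  text = "gage,flag\n01578310,T\n01491000,F",
  stringsAsFactors = FALSE
)
str(plain)

# Triple-quoted codes: the doubled quotes inside the quoted field are read
# as literal quote characters, so the column stays character.
quoted <- utils::read.csv(
  text = 'gage,flag\n"""01578310""",T\n"""01491000""",F',
  stringsAsFactors = FALSE
)
str(quoted)

# Assumption: remove the literal quote characters to recover the code
# with its leading zero intact.
quoted$gage <- gsub('"', "", quoted$gage, fixed = TRUE)
print(quoted)
```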