Load data for ampvis2 functions

This function reads an OTU-table and corresponding sample metadata, and returns a list for use in all ampvis2 functions. It is therefore required to load data with amp_load before any other ampvis2 functions can be used.

amp_load(
  otutable,
  metadata = NULL,
  taxonomy = NULL,
  fasta = NULL,
  tree = NULL,
  pruneSingletons = FALSE,
  removeAbsentOTUs = TRUE,
  otutable_OTUcolname = c("OTU", "ASV", "#OTU ID"),
  taxonomy_OTUcolname = c("OTU", "ASV", "#OTU ID"),
  ...
)

Arguments

otutable: (required) File path, data frame, or a phyloseq-class object. OTU-table with the read counts of all OTUs. Rows are OTUs and columns are samples; otherwise the table must be transposed first. The taxonomy of the OTUs can be placed anywhere in the table and will be extracted by name (Kingdom/Domain -> Species). If a file path is provided, the file will be read with either fread or read_excel, depending on the file type. Compressed files (zip, bzip2, gzip) are supported unless the file is an excel file (bzip2 and gzip require data.table 1.14.3 or later). Can also be a path to a BIOM file, which will then be parsed using the biomformat package, so both the JSON and HDF5 versions of the BIOM format are supported.
metadata: (recommended) Data frame, matrix, or file path. Sample metadata with any information about the samples. The first column must contain sample IDs matching those in the otutable. If none is provided, dummy metadata will be created. A file path must point to a delimited text file or an excel file, which will be read with fread or read_excel, respectively. Compressed files (zip, bzip2, gzip) are supported unless the file is an excel file (bzip2 and gzip require data.table 1.14.3 or later). If otutable is a BIOM file that contains sample metadata, the metadata argument takes precedence if provided. (default: NULL)
taxonomy: (recommended) Data frame, matrix, or file path. Taxonomy table where rows are OTUs and columns are up to 7 levels of taxonomy named Kingdom/Domain -> Species. If taxonomy is also present in otutable, the taxonomy in otutable is discarded and only this table is used. A file path must point to a delimited text file or an excel file, which will be read with fread or read_excel, respectively. Compressed files (zip, bzip2, gzip) are supported unless the file is an excel file (bzip2 and gzip require data.table 1.14.3 or later). Can also be a path to a .sintax taxonomy table from a USEARCH analysis pipeline, in which case the file extension must be .sintax. bzip2 and gzip compression are currently NOT supported for the sintax format. (default: NULL)
fasta: (optional) Path to a FASTA file with reference sequences for all OTU's in the OTU-table. (default: NULL)
tree: (optional) Path to a phylogenetic tree file which will be read using read.tree, or an object of class "phylo". (default: NULL)
pruneSingletons: (logical) Remove OTUs only observed once in all samples. (default: FALSE)
removeAbsentOTUs: (logical) Remove OTUs with 0 abundance in all samples. Absent OTUs are rarely present in the input data itself, but can occur when some samples are removed because of a mismatch between samples in the OTU-table and sample metadata. (default: TRUE)
otutable_OTUcolname: Character vector with the name(s) of the column in the otutable that contains the OTUs/ASVs.
taxonomy_OTUcolname: Character vector with the name(s) of the column in the taxonomy that contains the OTUs/ASVs.
...: (optional) Additional arguments are passed on to any of the file reader functions used.
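
To make the removeAbsentOTUs description more concrete, here is a minimal base R sketch of how an absent OTU can arise once a sample is dropped. The toy table, the sample names S1 to S3, and the filtering step are hypothetical illustrations only; they are not the internal implementation of amp_load.

```r
# Hypothetical toy OTU-table: rows are OTUs, columns are samples
otus <- data.frame(
  S1 = c(10, 0, 5),
  S2 = c(3, 0, 0),
  S3 = c(0, 7, 2),
  row.names = c("OTU_1", "OTU_2", "OTU_3")
)

# Assume S3 has no matching row in the metadata and is therefore dropped
kept <- otus[, c("S1", "S2"), drop = FALSE]

# OTU_2 was only observed in S3, so it now has 0 reads in all remaining samples
absent <- rowSums(kept) == 0
names(absent)[absent]

# Dropping such rows mimics the effect described for removeAbsentOTUs = TRUE
filtered <- kept[!absent, , drop = FALSE]
rownames(filtered)
```

With removeAbsentOTUs = FALSE, rows like OTU_2 would presumably stay in the table with zero counts everywhere.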

Value

A list of class "ampvis2" with 3 to 5 elements.

Details

The amp_load function validates and corrects the provided data frames in different ways to make them suitable for the rest of the ampvis2 functions. For this to work properly, the provided data frames must meet the requirements described in the following sections. If a phyloseq-class object is provided, the metadata, taxonomy, fasta, and tree arguments are ignored, as these are expected to be contained in the phyloseq object.

The OTU-table

The OTU-table contains information about the OTUs, their read counts in each sample, and optionally their assigned taxonomy. The provided OTU-table must be a data frame with the following requirements:

The rows are OTU IDs and the columns are samples.
The last 7 columns are optionally the corresponding taxonomy assigned to the OTUs, named "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species".
The OTU IDs are expected to be in either the row names of the data frame or in a column called "OTU", "ASV", or "#OTU ID". Otherwise the function will stop with a message.
The column names of the data frame are the sample IDs, exactly matching those in the metadata (plus the taxonomy columns named Kingdom -> Species, if present).
Generally avoid special characters and spaces in row and column names.
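
As an illustration of the requirements above, the following base R sketch builds a small OTU-table by hand and checks its layout. The two rows reuse values from the example data shown further down this page; the checks themselves are hypothetical sanity checks and not part of ampvis2.

```r
tax_levels <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# check.names = FALSE keeps sample IDs that start with a digit unchanged
otutable <- data.frame(
  OTU = c("OTU_5", "OTU_6"),
  `16SAMP_3893` = c(560, 906),
  `16SAMP_3913` = c(339, 766),
  Kingdom = c("k__Bacteria", "k__Bacteria"),
  Phylum = c("p__Chloroflexi", "p__Firmicutes"),
  Class = c("c__Anaerolineae", "c__Bacilli"),
  Order = c("o__Anaerolineales", "o__Lactobacillales"),
  Family = c("f__Anaerolineaceae", "f__Carnobacteriaceae"),
  Genus = c("g__Candidatus Villogracilis", "g__Trichococcus"),
  Species = c("s__", "s__"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Everything that is neither the OTU ID column nor taxonomy is a sample column
sample_cols <- setdiff(colnames(otutable), c("OTU", tax_levels))

# Hypothetical sanity checks mirroring the listed requirements
stopifnot(
  "OTU" %in% colnames(otutable),
  identical(tail(colnames(otutable), 7), tax_levels),
  all(vapply(otutable[sample_cols], is.numeric, logical(1)))
)
sample_cols
```

Without check.names = FALSE, data.frame() would rewrite names such as 16SAMP_3893 into syntactically valid ones, and the sample IDs would no longer match those in the metadata.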

A minimal example is available with data("example_otutable").

The metadata

The metadata contains additional information about the samples, for example where each sample was taken, date, pH, treatment etc., which is used to compare and group the samples during analysis. The amount of information in the metadata is unlimited and it can contain any number of columns (variables). There are, however, a few requirements:

The sample IDs must be in the first column. These sample IDs must exactly match those in the OTU-table.
Column classes matter: categorical variables should be loaded as character or factor (as.character() or as.factor()), and continuous variables as numeric (as.numeric()). See below.
Generally avoid special characters and spaces in row and column names.

If, for example, a column is named "Year" and the entries are simply entered as numbers (2011, 2012, 2013 etc.), then R will automatically treat these as numerical values (as.numeric()) and therefore the column as a continuous variable, although it is really a categorical variable and should be loaded as.factor() or as.character() instead. This has consequences for the analysis, as R treats the two differently. Therefore either use the colClasses = argument when loading a csv file or col_types = when loading an excel file, or manually adjust the column classes afterwards with, e.g., metadata$Year <- as.character(metadata$Year).
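
The Year example above can be sketched in base R as follows. The inline CSV text and the sample IDs S1 to S3 are made up for illustration; read.csv() and its colClasses argument are standard base R.

```r
# Hypothetical metadata as inline CSV text
csv_text <- "SampleID,Year\nS1,2011\nS2,2012\nS3,2013"

# Default import: Year is parsed as a number, i.e. a continuous variable
meta_default <- read.csv(text = csv_text)
is.numeric(meta_default$Year)

# Option 1: set the column class while reading
meta_typed <- read.csv(text = csv_text, colClasses = c(Year = "character"))
class(meta_typed$Year)

# Option 2: convert the column after loading
meta_default$Year <- as.character(meta_default$Year)
class(meta_default$Year)
```

Both options end with Year stored as character, so it would be handled as a categorical variable.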

The amp_load function will automatically use the sample IDs in the first column as row names, but it is important to also have an actual column with sample IDs, so that it is possible to, e.g., group by that column during analysis. Any unmatched samples between the otutable and metadata will be removed with a warning.
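
To preview which samples would be affected by this matching before calling amp_load, a check along the following lines may help. This is a hypothetical base R sketch with made-up sample IDs; it only illustrates the matching idea and is not how amp_load is implemented.

```r
# Hypothetical sample IDs from the OTU-table columns and the metadata
otu_samples <- c("S1", "S2", "S3")
metadata <- data.frame(
  SampleID = c("S2", "S3", "S4"),
  Plant = c("Aalborg E", "Aalborg W", "Aalborg W"),
  stringsAsFactors = FALSE
)

# Samples present in both tables are the ones that can be analysed
shared <- intersect(otu_samples, metadata$SampleID)

# Samples present in only one of the two tables would be removed
only_in_otutable <- setdiff(otu_samples, metadata$SampleID)
only_in_metadata <- setdiff(metadata$SampleID, otu_samples)

# Row names taken from the first column, while the SampleID column is kept
rownames(metadata) <- metadata$SampleID
matched <- metadata[shared, , drop = FALSE]
matched
```

Here S1 and S4 are the unmatched samples, and the SampleID column remains available for grouping even though the IDs are also used as row names.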

A minimal example is available with data("example_metadata").

Author

Kasper Skytte Andersen ksa@bio.aau.dk

Mads Albertsen MadsAlbertsen85@gmail.com

Examples


library(ampvis2)
if (FALSE) {
# Load data by either giving file paths or by passing already loaded R objects
### example load with file paths
d <- amp_load(
  otutable = "path/to/otutable.tsv",
  metadata = "path/to/metadata.xlsx",
  taxonomy = "path/to/taxonomy.txt"
)

### example load with R objects
# Read the OTU-table as a data frame. It is important to set check.names = FALSE
myotutable <- read.delim("data/otutable.txt", check.names = FALSE)

# Read the metadata, probably an excel sheet (read_excel is from the readxl package)
mymetadata <- readxl::read_excel("data/metadata.xlsx", col_names = TRUE)

# Read the taxonomy
mytaxonomy <- read.csv("data/taxonomy.csv", check.names = FALSE)

# Combine the data with amp_load()
d <- amp_load(
  otutable = myotutable,
  metadata = mymetadata,
  taxonomy = mytaxonomy,
  pruneSingletons = FALSE,
  fasta = "path/to/fastafile.fa", # optional
  tree = "path/to/tree.tree" # optional
)

### Load a phyloseq object
d <- amp_load(physeq_object)

### Show a short summary about the data by simply typing the name of the object in the console
d
}

### Minimal example metadata:
data("example_metadata")
example_metadata
#> # A tibble: 8 × 4
#>   SampleID    Plant     Date                 Year
#>   <chr>       <chr>     <dttm>              <dbl>
#> 1 16SAMP_3893 Aalborg E 2014-02-06 00:00:00  2014
#> 2 16SAMP_3913 Aalborg E 2014-07-03 00:00:00  2014
#> 3 16SAMP_3941 Aalborg E 2014-08-19 00:00:00  2014
#> 4 16SAMP_3946 Aalborg E 2014-11-13 00:00:00  2014
#> 5 16SAMP_3953 Aalborg W 2014-02-04 00:00:00  2014
#> 6 16SAMP_4591 Aalborg W 2014-05-05 00:00:00  2014
#> 7 16SAMP_4597 Aalborg W 2014-08-18 00:00:00  2014
#> 8 16SAMP_4603 Aalborg W 2014-11-12 00:00:00  2014

### Minimal example otutable:
data("example_otutable")
example_otutable
#>        16SAMP_3893 16SAMP_3913 16SAMP_3941 16SAMP_3946 16SAMP_3953 16SAMP_4591
#> OTU_1           23          15         273          51         127         190
#> OTU_2          675         565         331         411         430         780
#> OTU_3          780         733         405         199        1346        1114
#> OTU_4          272         233        1434         256         736        1338
#> OTU_5          560         339         509         598         223         145
#> OTU_6          906         766         133         390         232        1458
#> OTU_7          297         218         418         130        1354         198
#> OTU_8           28           8         155          72         156         101
#> OTU_9            0           0           9           0          19          25
#> OTU_10         373         256          19         415          43         102
#>        16SAMP_4597 16SAMP_4603     Kingdom            Phylum
#> OTU_1          220          83 k__Bacteria    p__Chloroflexi
#> OTU_2          699         820 k__Bacteria p__Actinobacteria
#> OTU_3         1630         112 k__Bacteria p__Actinobacteria
#> OTU_4         1224         564 k__Bacteria p__Proteobacteria
#> OTU_5          212        1619 k__Bacteria    p__Chloroflexi
#> OTU_6          560         287 k__Bacteria     p__Firmicutes
#> OTU_7          283         116 k__Bacteria p__Actinobacteria
#> OTU_8          151          25 k__Bacteria    p__Nitrospirae
#> OTU_9           58           0 k__Bacteria  p__Bacteroidetes
#> OTU_10          73         138 k__Bacteria  p__Bacteroidetes
#>                        Class                 Order                Family
#> OTU_1              c__SJA-15           o__C10_SB1A           f__C10_SB1A
#> OTU_2      c__Actinobacteria      o__Micrococcales f__Intrasporangiaceae
#> OTU_3      c__Acidimicrobiia   o__Acidimicrobiales    f__Microthricaceae
#> OTU_4  c__Betaproteobacteria      o__Rhodocyclales     f__Rhodocyclaceae
#> OTU_5        c__Anaerolineae     o__Anaerolineales    f__Anaerolineaceae
#> OTU_6             c__Bacilli    o__Lactobacillales  f__Carnobacteriaceae
#> OTU_7      c__Acidimicrobiia   o__Acidimicrobiales    f__Microthricaceae
#> OTU_8          c__Nitrospira      o__Nitrospirales     f__Nitrospiraceae
#> OTU_9    c__Sphingobacteriia o__Sphingobacteriales     f__Saprospiraceae
#> OTU_10   c__Sphingobacteriia o__Sphingobacteriales     f__Saprospiraceae
#>                              Genus         Species    OTU
#> OTU_1     g__Candidatus Amarilinum             s__  OTU_1
#> OTU_2              g__Tetrasphaera             s__  OTU_2
#> OTU_3     g__Candidatus Microthrix             s__  OTU_3
#> OTU_4             g__Dechloromonas             s__  OTU_4
#> OTU_5  g__Candidatus Villogracilis             s__  OTU_5
#> OTU_6              g__Trichococcus             s__  OTU_6
#> OTU_7     g__Candidatus Microthrix             s__  OTU_7
#> OTU_8                g__Nitrospira s__sublineage I  OTU_8
#> OTU_9                 g__QEDR3BF09             s__  OTU_9
#> OTU_10                     g__MK04             s__ OTU_10

### Minimal example taxonomy:
data("example_taxonomy")
example_taxonomy
#>            Kingdom            Phylum                 Class
#> OTU_1  k__Bacteria    p__Chloroflexi             c__SJA-15
#> OTU_2  k__Bacteria p__Actinobacteria     c__Actinobacteria
#> OTU_3  k__Bacteria p__Actinobacteria     c__Acidimicrobiia
#> OTU_4  k__Bacteria p__Proteobacteria c__Betaproteobacteria
#> OTU_5  k__Bacteria    p__Chloroflexi       c__Anaerolineae
#> OTU_6  k__Bacteria     p__Firmicutes            c__Bacilli
#> OTU_7  k__Bacteria p__Actinobacteria     c__Acidimicrobiia
#> OTU_8  k__Bacteria    p__Nitrospirae         c__Nitrospira
#> OTU_9  k__Bacteria  p__Bacteroidetes   c__Sphingobacteriia
#> OTU_10 k__Bacteria  p__Bacteroidetes   c__Sphingobacteriia
#>                        Order                Family                       Genus
#> OTU_1            o__C10_SB1A           f__C10_SB1A    g__Candidatus Amarilinum
#> OTU_2       o__Micrococcales f__Intrasporangiaceae             g__Tetrasphaera
#> OTU_3    o__Acidimicrobiales    f__Microthricaceae    g__Candidatus Microthrix
#> OTU_4       o__Rhodocyclales     f__Rhodocyclaceae            g__Dechloromonas
#> OTU_5      o__Anaerolineales    f__Anaerolineaceae g__Candidatus Villogracilis
#> OTU_6     o__Lactobacillales  f__Carnobacteriaceae             g__Trichococcus
#> OTU_7    o__Acidimicrobiales    f__Microthricaceae    g__Candidatus Microthrix
#> OTU_8       o__Nitrospirales     f__Nitrospiraceae               g__Nitrospira
#> OTU_9  o__Sphingobacteriales     f__Saprospiraceae                g__QEDR3BF09
#> OTU_10 o__Sphingobacteriales     f__Saprospiraceae                     g__MK04
#>                Species    OTU
#> OTU_1              s__  OTU_1
#> OTU_2              s__  OTU_2
#> OTU_3              s__  OTU_3
#> OTU_4              s__  OTU_4
#> OTU_5              s__  OTU_5
#> OTU_6              s__  OTU_6
#> OTU_7              s__  OTU_7
#> OTU_8  s__sublineage I  OTU_8
#> OTU_9              s__  OTU_9
#> OTU_10             s__ OTU_10

# load example data
d <- amp_load(
  otutable = example_otutable,
  metadata = example_metadata,
  taxonomy = example_taxonomy
)

# show a summary of the data
d
#> ampvis2 object with 3 elements. 
#> Summary of OTU table:
#>      Samples         OTUs  Total#Reads    Min#Reads    Max#Reads Median#Reads 
#>            8           10        32246         2522         5451         3839 
#>    Avg#Reads 
#>      4030.75 
#> 
#> Assigned taxonomy:
#>  Kingdom   Phylum    Class    Order   Family    Genus  Species 
#> 10(100%) 10(100%) 10(100%) 10(100%) 10(100%) 10(100%)   1(10%) 
#> 
#> Metadata variables: 4 
#>  SampleID, Plant, Date, Year