Load and combine data for use with other mmgenome2 functions

Loads, validates and combines multiple aspects of metagenome data into one dataframe for use with all mmgenome2 functions, including scaffold assembly sequences, scaffold coverage, essential genes, taxonomy, and more.

mmload(
  assembly,
  coverage = NULL,
  essential_genes = NULL,
  taxonomy = NULL,
  additional = NULL,
  kmer_pca = FALSE,
  kmer_BH_tSNE = FALSE,
  kmer_size = 4L,
  verbose = TRUE,
  ...
)

Arguments

assembly	(required) A character string with the path to the assembly FASTA file, or the assembly itself if it has already been loaded with `readDNAStringSet`.
coverage	(required) A path to a folder to scan for coverage files, or otherwise a named `vector`, a `data.frame`, or a `list` of these containing the coverage of each scaffold. The prefix `"cov_"` will be prepended to all coverage column names in the output so that `mmstats` and `mmplot_cov_profiles` know which columns are coverage columns. `vector`: If provided as a vector, the elements must be named by scaffold names that exactly match those of the assembly. `data.frame`: If provided as a dataframe, the first column must contain scaffold names that exactly match those of the assembly, and any additional column(s) contain the coverage of each scaffold. `list`: If provided as a list, it must contain any number of `vector`s or `data.frame`s as described above. If names are assigned to the objects in the list, they will be used as column names in the output (this does not apply to dataframes with more than 2 columns, however). `path`: If a path to a folder is provided, all files with filenames ending with `"_cov"` will be loaded (by the `fread` function) into a list of `data.frame`s and treated as if a `list` of `data.frame`s had been provided. The filenames (stripped of the extension and `"_cov"`) will then be used as column names in the output. Note: only the first 2 columns of the loaded files will be used!
essential_genes	Either a path to a CSV file (comma-delimited ",") containing the essential genes, or a 2-column dataframe with scaffold names in the first column and gene IDs in the second. Can contain duplicates. (Default: `NULL`)
taxonomy	A dataframe containing taxonomy assigned to the scaffolds. The first column must contain the scaffold names. (Default: `NULL`)
additional	A dataframe containing any additional data. The first column must contain the scaffold names. (Default: `NULL`)
kmer_pca	(Logical) Perform Principal Components Analysis of kmer nucleotide frequencies (kmer size defined by `kmer_size`) of each scaffold and merge the scores of the 3 most significant axes. (Default: `FALSE`)
kmer_BH_tSNE	(Logical) Calculate Barnes-Hut t-Distributed Stochastic Neighbor Embedding (B-H t-SNE) representations of kmer nucleotide frequencies (kmer size defined by `kmer_size`) using `Rtsne` and merge the result. Additional arguments may be required for success (passed on through `...`); refer to the documentation of `Rtsne`. The calculation can run in parallel, so setting `num_threads` to the number of available cores may greatly reduce the calculation time for large data. (Default: `FALSE`)
kmer_size	The kmer size (k) used when `kmer_pca = TRUE` or `kmer_BH_tSNE = TRUE`. The default corresponds to tetramers (`k = 4`). (Default: `4L`)
verbose	(Logical) Whether to print status messages during the loading process. (Default: `TRUE`)
...	Additional arguments are passed on to `Rtsne`.
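
To illustrate what `kmer_pca = TRUE` refers to, the following is a minimal base-R sketch that computes tetranucleotide frequencies per scaffold and reduces them to three principal components. It is not necessarily how `mmload` computes its result; the toy sequences and the helpers `random_seq` and `kmer_freq` are hypothetical and exist only for this illustration.

```r
# Hypothetical toy scaffolds; real input would come from the assembly FASTA.
set.seed(42)
bases <- c("A", "C", "G", "T")
random_seq <- function(n, prob) {
  paste(sample(bases, n, replace = TRUE, prob = prob), collapse = "")
}
scaffolds <- c(
  s1 = random_seq(2000, c(0.40, 0.10, 0.10, 0.40)),
  s2 = random_seq(2000, c(0.38, 0.12, 0.12, 0.38)),
  s3 = random_seq(2000, c(0.15, 0.35, 0.35, 0.15)),
  s4 = random_seq(2000, c(0.25, 0.25, 0.25, 0.25)),
  s5 = random_seq(2000, c(0.10, 0.40, 0.40, 0.10))
)

k <- 4L
# All 4^k possible kmers, used as fixed factor levels so that every
# scaffold yields a frequency vector of the same length and order.
kmer_grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
all_kmers <- apply(kmer_grid, 1, paste, collapse = "")

# Relative frequency of every overlapping kmer in one sequence.
kmer_freq <- function(dna, k, kmer_levels) {
  starts <- seq_len(nchar(dna) - k + 1L)
  kmers <- substring(dna, starts, starts + k - 1L)
  counts <- table(factor(kmers, levels = kmer_levels))
  as.numeric(counts) / sum(counts)
}

# Rows are scaffolds, columns are kmers.
freq_matrix <- t(vapply(
  scaffolds, kmer_freq, numeric(length(all_kmers)),
  k = k, kmer_levels = all_kmers
))
colnames(freq_matrix) <- all_kmers

# PCA on the frequency matrix; keep the scores of the first 3 axes.
pca <- prcomp(freq_matrix, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]
```

Here `scores` holds one row per scaffold and one column per retained axis, which is the kind of per-scaffold table that could be merged with the other data.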

Value

A dataframe (tibble) compatible with other mmgenome2 functions.
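
Because coverage columns carry the `"cov_"` prefix, they can be picked out of the result by name. The sketch below uses a small mock data frame in place of real `mmload` output; the column names and values in `mm_mock` are hypothetical and only assume the prefix convention described under the `coverage` argument.

```r
# Hypothetical stand-in for the dataframe returned by mmload().
mm_mock <- data.frame(
  scaffold = c("scaffold_1", "scaffold_2", "scaffold_3"),
  cov_sampleA = c(12.5, 80.1, 3.3),
  cov_sampleB = c(10.2, 1.4, 45.0),
  phylum = c("Proteobacteria", NA, "Chloroflexi"),
  stringsAsFactors = FALSE
)

# Select every column whose name starts with the "cov_" prefix.
cov_cols <- grep("^cov_", names(mm_mock), value = TRUE)

# Mean coverage per sample across all scaffolds.
mean_cov <- colMeans(mm_mock[, cov_cols, drop = FALSE])
```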

Author

Kasper Skytte Andersen ksa@bio.aau.dk

Examples

if (FALSE) {
library(mmgenome2)
mm <- mmload(
  assembly = "path/to/assembly.fa",
  coverage = list(
    nameofcoverage1 = read.csv("path/to/coveragetable1.csv", header = TRUE),
    nameofcoverage2 = read.csv("path/to/coveragetable2.csv", header = TRUE)
  ),
  essential_genes = "path/to/ess_genes.txt",
  verbose = TRUE
)
mm
}
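
The `coverage` argument also accepts in-memory objects. As a minimal sketch of the shapes described under Arguments, the following builds a named vector, a 2-column dataframe, and a named list combining both. The scaffold names, coverage values, and list names (`sampleA`, `sampleB`) are hypothetical, and the final `mmload` call is not run because it needs a real assembly whose scaffold names match.

```r
# Hypothetical scaffold names; they must exactly match those in the assembly.
scaffold_names <- c("scaffold_1", "scaffold_2", "scaffold_3")

# Named vector: one coverage value per scaffold, named by scaffold.
cov_vector <- c(scaffold_1 = 12.5, scaffold_2 = 80.1, scaffold_3 = 3.3)

# Dataframe: scaffold names in the first column, coverage in the second.
cov_df <- data.frame(
  scaffold = scaffold_names,
  coverage = c(10.2, 1.4, 45.0),
  stringsAsFactors = FALSE
)

# Named list mixing both forms; per the argument description, the list
# names are used for the coverage column names in the output.
coverage_list <- list(sampleA = cov_vector, sampleB = cov_df)

# Cheap sanity checks before handing the list to mmload().
stopifnot(
  setequal(names(cov_vector), scaffold_names),
  setequal(cov_df[[1]], scaffold_names),
  !is.null(names(coverage_list))
)

if (FALSE) {
  library(mmgenome2)
  mm <- mmload(
    assembly = "path/to/assembly.fa",
    coverage = coverage_list
  )
}
```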