Loads, validates, and combines multiple aspects of metagenome data, including scaffold assembly sequences, scaffold coverage, essential genes, taxonomy, and more, into one dataframe for use with all mmgenome2 functions.

mmload(
  assembly,
  coverage = NULL,
  essential_genes = NULL,
  taxonomy = NULL,
  additional = NULL,
  kmer_pca = FALSE,
  kmer_BH_tSNE = FALSE,
  kmer_size = 4L,
  verbose = TRUE,
  ...
)

Arguments

assembly

(required) A character string with the path to the assembly FASTA file, or an assembly already loaded with readDNAStringSet.

coverage

(required) A path to a folder to scan for coverage files, or otherwise a named vector, a data.frame, or a list of these containing the coverage of each scaffold. The prefix "cov_" will be prepended to all coverage column names in the output so that mmstats and mmplot_cov_profiles know which columns are coverage columns.

vector

If provided as a vector, the elements of the vector must be named by the scaffold names exactly matching those of the assembly.

data.frame

If provided as a dataframe, the first column must contain the scaffold names exactly matching those of the assembly, and any additional column(s) contain coverage of each scaffold.

list

If provided as a list, it may contain any number of vectors or data.frames as described above. If names are assigned to the objects in the list, they will be used as column names in the output (this does not apply to dataframes with more than 2 columns, however).

path

If a path to a folder is provided, all files with filenames ending with "_cov" will be loaded (by the fread function) into a list of data.frames and treated as if a list of data.frames had been provided. The filenames (stripped of the extension and "_cov") will then be used as column names in the output. Note: only the first 2 columns of the loaded files will be used!
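The in-memory forms described above can be sketched with base R as follows. The scaffold names, sample names, and coverage values are invented for illustration, and the final mmload() call is left commented out because it needs a real assembly file.

```r
# Hypothetical scaffold names; they must match the assembly exactly.
scaffolds <- c("scaffold_1", "scaffold_2", "scaffold_3")

# 1) Named vector: the element names are the scaffold names.
cov_vector <- c(12.5, 3.1, 40.2)
names(cov_vector) <- scaffolds

# 2) data.frame: scaffold names in the first column, coverage after it.
cov_df <- data.frame(
  scaffold = scaffolds,
  coverage = c(10.0, 2.7, 35.9),
  stringsAsFactors = FALSE
)

# 3) Named list mixing both forms. Per the description above, the list
#    names are used as column names for these 2-column inputs.
cov_list <- list(sampleA = cov_vector, sampleB = cov_df)

# mm <- mmload(assembly = "path/to/assembly.fa", coverage = cov_list)
```

Any one of cov_vector, cov_df, or cov_list could be passed as the coverage argument.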

essential_genes

Either a path to a CSV file (comma-delimited ",") containing the essential genes, or a 2-column dataframe with scaffold names in the first column and gene IDs in the second. Can contain duplicates. (Default: NULL)
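A minimal sketch of the 2-column dataframe form. The scaffold names, gene IDs, and column names are assumptions made up for this example; only the column order is taken from the description above.

```r
# Hypothetical scaffold names and gene IDs, for illustration only.
# Column 1 holds scaffold names, column 2 holds gene IDs.
ess_genes <- data.frame(
  scaffold = c("scaffold_1", "scaffold_1", "scaffold_2", "scaffold_3"),
  gene_id = c("PF00318", "PF00164", "PF00318", "PF00318"),
  stringsAsFactors = FALSE
)

# Duplicates are allowed: a scaffold may carry several genes, and the
# same gene may occur on several scaffolds.
ess_genes
```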

taxonomy

A dataframe containing taxonomy assigned to the scaffolds. The first column must contain the scaffold names. (Default: NULL)

additional

A dataframe containing any additional data. The first column must contain the scaffold names. (Default: NULL)

kmer_pca

(Logical) Perform Principal Components Analysis of kmer nucleotide frequencies (kmer size defined by kmer_size) of each scaffold and merge the scores of the 3 most significant axes. (Default: FALSE)
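To illustrate the idea behind kmer_pca, here is a base R sketch that counts tetramer frequencies per sequence and keeps the first three principal components. It is not taken from the mmgenome2 source; the toy sequences, the helper kmer_freq, and the choice of relative frequencies without scaling are all assumptions, and mmload may count or normalise differently.

```r
# Toy sequences standing in for scaffolds (hypothetical data).
seqs <- c(
  s1 = "ACGTACGTACGTTTGACCATGACGTAGCTAGCTAGGATCCA",
  s2 = "GGGCCCGGGCCCGCGCGGCCGGCGCGCCCGGGCGCGGGCCC",
  s3 = "ATATATTTAAATATTATAATTTAAATATATTTAAATTATAT",
  s4 = "ACGGTCAGTCCGATTGCAAGCTTGACCGTAGGCTAACGTGA",
  s5 = "TTGACCGGTTAACCGGATCGATCGGCTAGCTAATCGGCTAA"
)

k <- 4L

# All 4^k possible kmers over the DNA alphabet, in a fixed order.
bases <- c("A", "C", "G", "T")
all_kmers <- do.call(
  paste0,
  expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
)

# Count overlapping kmers in one sequence; return relative frequencies.
kmer_freq <- function(s, k, kmers) {
  starts <- seq_len(nchar(s) - k + 1L)
  found <- substring(s, starts, starts + k - 1L)
  counts <- table(factor(found, levels = kmers))
  as.numeric(counts) / sum(counts)
}

# One row per sequence, one column per kmer.
freqs <- t(vapply(
  seqs, kmer_freq, numeric(length(all_kmers)),
  k = k, kmers = all_kmers
))
colnames(freqs) <- all_kmers

# PCA on the frequency matrix; keep the first three components.
pca <- prcomp(freqs, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]
scores
```

Each row of scores corresponds to one sequence, which is the shape of result that could be merged into a per-scaffold dataframe.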

kmer_BH_tSNE

(Logical) Calculate Barnes-Hut t-Distributed Stochastic Neighbor Embedding (B-H t-SNE) representations of kmer nucleotide frequencies (kmer size defined by kmer_size) using Rtsne and merge the result. Additional arguments may be required for success (passed on through ...); refer to the documentation of Rtsne. The calculation runs in parallel, so setting num_threads to the number of available cores may greatly reduce the calculation time for large data. (Default: FALSE)
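A hedged sketch of a direct Rtsne call with num_threads set, run on a random matrix that stands in for a scaffold-by-kmer frequency matrix. The data and the perplexity value are assumptions for illustration, and the call is skipped if the Rtsne package is not installed.

```r
set.seed(42)

# Random stand-in for a scaffold-by-kmer frequency matrix (hypothetical data).
freqs <- matrix(runif(100 * 16), nrow = 100)

# Number of available cores; fall back to 1 if it cannot be detected.
n_threads <- max(1L, parallel::detectCores(), na.rm = TRUE)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  tsne <- Rtsne::Rtsne(
    freqs,
    dims = 2,
    perplexity = 30,           # must satisfy 3 * perplexity < nrow(freqs) - 1
    check_duplicates = FALSE,
    num_threads = n_threads
  )
  head(tsne$Y)                 # one row of 2-D coordinates per input row
}

# With mmload, the same arguments would be passed through "...", e.g.
# mmload(assembly = "path/to/assembly.fa", kmer_BH_tSNE = TRUE,
#        perplexity = 30, num_threads = n_threads)
```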

kmer_size

The kmer frequency size (k) used when kmer_pca = TRUE or kmer_BH_tSNE = TRUE. The default is tetramers (k = 4). (Default: 4)

verbose

(Logical) Whether to print status messages during the loading process. (Default: TRUE)

...

Additional arguments are passed on to Rtsne.

Value

A dataframe (tibble) compatible with other mmgenome2 functions.

Author

Kasper Skytte Andersen ksa@bio.aau.dk

Examples

if (FALSE) {
  library(mmgenome2)
  mm <- mmload(
    assembly = "path/to/assembly.fa",
    coverage = list(
      nameofcoverage1 = read.csv("path/to/coveragetable1.csv", header = TRUE),
      nameofcoverage2 = read.csv("path/to/coveragetable2.csv", header = TRUE)
    ),
    essential_genes = "path/to/ess_genes.txt",
    verbose = TRUE
  )
  mm
}