This is a brief guide explaining the basics of how to use mmgenome2
for genome binning and extraction from metagenomes. It is meant to only demonstrate fundamental principles and operations. For more detailed documentation of all mmgenome2 functions, their arguments, as well as example code, please refer to the Functions page. First, install mmgenome2
as described on the frontpage, and then download the example data from figshare and unzip the contained files into a folder of choice. Preloaded data is also available with data("mmgenome2")
.
#download and unzip example data
tempfile <- tempfile("mmgenome2", fileext = ".zip")
download.file(
url = "https://ndownloader.figshare.com/files/11094290",
destfile = tempfile,
quiet = TRUE
)
unzip(tempfile, exdir = "data/")
mmgenome2
requires at minimum two types of data; the assembly and at minimum one coverage profile. Any number of coverage profiles can be loaded at once. Other than the assembly and coverage profiles, optional data includes essential genes, taxonomy, small subunit taxonomic classification, and anything else that could be relevant. The only requirement is that the data is in a row-by-column format, so that the first column contains scaffold names matching those in the assembly, and then any number of columns containing relevant information for the binning process. Metagenome data often consists of numerous files that first have to be loaded into R. The various files are loaded with the mmload
function, which checks, adjusts, and combines the data into a single data frame, here called mm
, which will be used in all other mmgenome2
functions:
library(mmgenome2)
mm <- mmload(
assembly = "data/assembly.fa",
coverage = "data/",
essential_genes = "data/essential.csv",
taxonomy = "data/tax.csv",
additional = read.csv("data/phylopythias.csv"),
verbose = TRUE,
kmer_pca = FALSE,
kmer_BH_tSNE = FALSE
)
This loads all the files into a single data frame object called mm
. By giving a folder path to coverage
, mmload
expects all files in the folder that end with "_cov"
to be coverage profiles, and will automatically load them all and name them by filename. By default the scaffold lengths as well as the GC content of each scaffold are calculated. Notice that it is also possible to perform Principal Components Analysis and t-SNE on tetra-nucleotide frequencies by setting kmer_pca = TRUE
and kmer_BH_tSNE = TRUE
, which will then also be available in the output data frame mm
to aid the binning process. Any additional data can be provided with additional
.
By simply typing the name of the object in the console we get a brief overview of the data frame:
mm
## # A tibble: 97,285 x 18
## scaffold length gc cov_C13.11.14 cov_C13.11.25 cov_C13.12.03 cov_C14.01.09
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 8264 57.8 1.44 53.6 0 0.066
## 2 2 1027 57.0 0.625 24.2 0 0
## 3 3 1665 55.9 13.5 434. 0.166 0.177
## 4 4 9056 35.9 0.01 23.4 0 0
## 5 5 3343 64.0 3.20 16.4 0 0
## 6 6 98207 39.1 0.00966 24.5 3.29 9.85
## 7 7 6480 63.0 2.61 19.2 1.46 12.3
## 8 8 15790 61.7 2.78 21.2 1.62 10.3
## 9 9 1403 70.4 85.1 192. 0 0
## 10 10 2018 70.2 50.3 101. 0 0
## # … with 97,275 more rows, and 11 more variables: geneID <chr>, phylum <chr>,
## # count <chr>, all.phylum.assignments <chr>, class <chr>,
## # all.class.assignments <chr>, pps_phylum <chr>, pps_class <chr>,
## # pps_order <chr>, pps_family <chr>, pps_genus <chr>
The mmstats
function can also calculate some basic statistics about the data:
mmstats(mm)
## General stats
## Scaffolds 97 285
## N50 5 291
## Length.total 331 908 376
## Length.max 1 446 979
## Length.mean 3 411.71
## Length.min 1 000
## weighted_GC_mean 52.22
## cov_C13.11.14 2.62
## cov_C13.11.25 16.06
## cov_C13.12.03 2.91
## cov_C14.01.09 11.58
## Ess.genes.total 7 138
## Ess.genes.unique 109
In mmgenome2
there are currently 4 different plotting functions; mmplot
, mmplot_pairs
, mmplot_cov_profiles
, and mmnetwork
. mmplot
can for example be used to generate a differential coverage plot with any two coverage profiles from the mm
data frame, for example cov_C13.11.25
and cov_C14.01.09
. It is often a good idea to apply a log10 scale to the axes, or to adjust the axis limits to get a symmetrical plot or to zoom in at selected regions. By default the axes are automatically scaled. In the following plot the scaffolds are colored by taxonomy, but we could also color by GC content or any other variable in mm
:
mmplot(mm,
x = "cov_C13.11.25",
y = "cov_C14.01.09",
color_by = "phylum",
x_scale = "log10",
y_scale = "log10")
In both mmplot
and mmnetwork
there is incorporated a locator feature, which makes it possible to interactively click in the plot and capture the exact positions of the mouse clicks on the axes plotted. This is enabled by setting locator = TRUE
and is used to extract scaffolds within a region of interest, reproducibly. The selection of points can then be highlighted as a polygon by the selection
argument:
selection <- data.frame(cov_C13.11.25 = c(7.2, 16.2, 25.2, 23.3, 10.1),
cov_C14.01.09 = c(47, 77, 52.8, 29.5, 22.1))
mmplot(mm,
x = "cov_C13.11.25",
y = "cov_C14.01.09",
min_length = 3000,
color_by = "phylum",
x_scale = "log10",
x_limits = c(1, 100),
y_scale = "log10",
y_limits = c(1, 100),
#locator = TRUE, #uncomment to use the locator and return a selection
selection = selection)
To extract all scaffolds within the selection polygon (including those not shown in the plot if a min_length
has been set) from mm
use mmextract
and provide the selection in the same way as with mmplot(mm, ..., selection = selection)
, and then save in a new object mm_subset1
:
mm_subset1 <- mmextract(mm, selection = selection)
## 61 scaffolds (or 1.34% of the scaffolds in mm, weighted by length) remain after 97224 of 97285 scaffolds have been filtered.
Then the extracted set of scaffolds can be plotted separately to get a closer look (no log10 scaled axes or min_length
set this time):
mmplot(mm_subset1,
x = "cov_C13.11.25",
y = "cov_C14.01.09",
color_by = "phylum")
To look for any eventual contaminants or to have a look at the scaffolds at different coverage profiles we can use the mmplot_pairs
function to plot more than two coverage profiles at the same time as well as any other variables. To see which variables that are available to plot, either type str(mm)
or simply just the name of the object in the console again (it is recommended to only plot continuous/numerical variables like coverage or GC content etc.). Note that this time no log10 scale is applied to the axes and the points are not scaled by scaffold size, but is set to a fixed size 3
instead:
str(mm)
## tibble[,18] [97,285 × 18] (S3: tbl_df/tbl/data.frame)
## $ scaffold : chr [1:97285] "1" "2" "3" "4" ...
## $ length : int [1:97285] 8264 1027 1665 9056 3343 98207 6480 15790 1403 2018 ...
## $ gc : num [1:97285] 57.8 57 55.9 35.9 64 ...
## $ cov_C13.11.14 : num [1:97285] 1.443 0.625 13.536 0.01 3.205 ...
## $ cov_C13.11.25 : num [1:97285] 53.6 24.1 434.1 23.4 16.4 ...
## $ cov_C13.12.03 : num [1:97285] 0 0 0.166 0 0 ...
## $ cov_C14.01.09 : num [1:97285] 0.066 0 0.177 0 0 ...
## $ geneID : chr [1:97285] NA NA NA NA ...
## $ phylum : chr [1:97285] NA NA NA NA ...
## $ count : chr [1:97285] NA NA NA NA ...
## $ all.phylum.assignments: chr [1:97285] NA NA NA NA ...
## $ class : chr [1:97285] NA NA NA NA ...
## $ all.class.assignments : chr [1:97285] NA NA NA NA ...
## $ pps_phylum : chr [1:97285] "Proteobacteria" NA "Proteobacteria" NA ...
## $ pps_class : chr [1:97285] "Gammaproteobacteria" NA "Betaproteobacteria" NA ...
## $ pps_order : chr [1:97285] NA NA "Hydrogenophilales" NA ...
## $ pps_family : chr [1:97285] NA NA "Hydrogenophilaceae" NA ...
## $ pps_genus : chr [1:97285] NA NA NA NA ...
mmplot_pairs(mm_subset1,
variables = c("cov_C13.11.14",
"cov_C13.11.25",
"cov_C13.12.03",
"cov_C14.01.09",
"gc"),
color_by = "phylum",
x_scale = "log10",
y_scale = "log10",
alpha = 0.4,
size_scale = 0.7,
textsize = 4)
#Second extraction from the initial extraction Using different coverage profiles than in the initial extraction, for example cov_C13.11.14
and cov_C13.12.03
, it may be easier to spot and remove contaminants. Note that we now override the previous selection
with different coordinates on different axes:
selection <- data.frame(cov_C13.12.03 = c(3.48, 4.95, 6.97, 13.6, 15.7, 9.68, 4.48),
cov_C13.11.14 = c(0.407, 1.72, 2.92, 1.45, 0.264, 0.17, 0.163))
mmplot(mm_subset1,
x = "cov_C13.12.03",
y = "cov_C13.11.14",
x_scale = "log10",
y_scale = "log10",
color_by = "phylum",
selection = selection)
Now we can extract the scaffolds into a second subset of scaffolds, let’s save it as mm_subset2
:
mm_subset2 <- mmextract(mm_subset1, selection = selection)
## 45 scaffolds (or 98.46% of the scaffolds in mm, weighted by length) remain after 16 of 61 scaffolds have been filtered.
Until now we have only extracted scaffolds based on coverage profiles. Using paired-end reads we can look for scaffolds that may have had a different coverage (fx due to repeats) than the current set of scaffolds, even though they may be related. This is visualised by the mmnetwork
function, which creates a network plot (generated by igraph) of connected scaffolds. First we need to load the paired-end data:
paired_ends <- read.csv("data/paired_ends.csv")
mmnetwork(mm_subset2,
network = paired_ends,
color_by = "phylum",
min_connections = 1)
The thickness of the lines between the scaffolds is then scaled by the number of connections between the scaffolds according to the paired-end reads. Only the scaffolds in mm_subset2
are plotted, and these may have further connections to other scaffolds that are not in the current subset, for example due to repeats and low-abundant microdiversity. To find additional connected scaffolds in the full dataset mm
we use the mmexpand_network
function. Here we set the minimum connections to 1 and only direct connections are extracted (meaning no connections of connections):
mm_subset2_exp <- mmexpand_network(mm = mm,
scaffolds = mm_subset2,
network = paired_ends,
min_connections = 1,
include_connections = "direct")
## The provided 45 scaffolds had 36 additional direct connections to other scaffolds in mm.
## Total: 81 scaffolds (or 2.24% of the scaffolds in mm, weighted by length).
As the message states, there is 36 additional connections to the 45 scaffolds in mm_subset2
, which are now included in the new object mm_subset2_exp
. This has also introduced 3 groups of scaffolds which seem unrelated to the genome bin, as seen below. To further refine the genome bin we can now plot the expanded subset again to identify these contaminants and again use the locator to remove them:
selection <- data.frame(x = c(0.521, -0.43, 3.028, 7.376, 10.709, 13.462, 14.622, 14.216, 10.94, 3.238),
y = c(3.443, 10.425, 14.678, 13.155, 9.664, 6.934, 6.649, 4.586, 0.936, 0.945))
p <- mmnetwork(mm_subset2_exp,
network = paired_ends,
color_by = "phylum",
#locator = TRUE, #uncomment to mark a selection to highlight and extract
min_connections = 1,
selection = selection)
p
mm_subset3 <- p$data_in_selection #there is no separate mmextract() function for network plots
Beware that with mmnetwork
there is no separate extract function like mmextract
is for mmplot
. This is because the axes in a network plot are synthetic and are not directly contained within the data like for example coverage profiles are. Therefore, to extract all scaffolds in the selected subspace in the network plot it is instead available in the returned plot object. If for example the network plot is assigned to the object p
, then the subset data in the selection is available with p$data_in_selection
. The above selection then removed 20 scaffolds and there is now 61 scaffolds remaining:
mm_subset3
## # A tibble: 61 x 18
## scaffold length gc cov_C13.11.14 cov_C13.11.25 cov_C13.12.03 cov_C14.01.09
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 180 22093 74.0 0.298 12.0 7.34 37.9
## 2 181 2970 68.0 0.34 13.9 7.37 38.6
## 3 227 190826 65.4 0.348 13.0 6.91 38.5
## 4 253 296715 70.2 1.06 15.2 7.00 41.1
## 5 315 218862 71.7 0.366 12.6 7.17 38.8
## 6 343 27203 57.6 0.235 10.7 5.07 32.0
## 7 386 164942 68.2 0.359 12.3 6.73 37.2
## 8 396 7448 72.6 0.329 13.0 7.59 41.0
## 9 423 100199 68.7 1.40 15.3 6.71 40.0
## 10 431 446220 72.4 0.409 12.6 6.80 37.9
## # … with 51 more rows, and 11 more variables: geneID <chr>, phylum <chr>,
## # count <chr>, all.phylum.assignments <chr>, class <chr>,
## # all.class.assignments <chr>, pps_phylum <chr>, pps_class <chr>,
## # pps_order <chr>, pps_family <chr>, pps_genus <chr>
#Final trimming and evaluation Using again the coverage profiles we can identify correct repeats and remove additional contamination with the locator. Here the sizes of the scaffolds are set to a fixed value by fixed_size = 5
:
selection = data.frame(cov_C14.01.09 = c(19, 15.5, 69.1, 103, 106, 54.9),
cov_C13.11.25 = c(7.28, 17.2, 163, 165, 21.4, 7.76))
mmplot(mm_subset3,
x = "cov_C14.01.09",
y = "cov_C13.11.25",
x_scale = "log10",
y_scale = "log10",
x_limits = c(1, 200),
y_limits = c(1, 200),
#locator = TRUE,
color_by = "phylum",
fixed_size = 5,
selection = selection)
And then the final bin is extracted and saved into a final object bin1
:
bin1 <- mmextract(mm_subset3,
selection = selection)
## 44 scaffolds (or 96.61% of the scaffolds in mm, weighted by length) remain after 17 of 61 scaffolds have been filtered.
To evaluate genome completeness and to get an overview of the basic information about bin1
it may be useful to have a look at the output of mmstats
of the bin:
mmstats(bin1)
## General stats
## Scaffolds 44
## N50 199 460
## Length.total 4 386 010
## Length.max 454 223
## Length.mean 99 682.05
## Length.min 1 241
## weighted_GC_mean 71.11
## cov_C13.11.14 0.56
## cov_C13.11.25 13.36
## cov_C13.12.03 6.94
## cov_C14.01.09 38.8
## Ess.genes.total 108
## Ess.genes.unique 105
We can then highlight the final genome bin in a differential coverage plot of the complete metagenome by setting highlight_scaffolds = bin1
:
mmplot(mm,
x = "cov_C14.01.09",
y = "cov_C13.11.25",
x_scale = "log10",
y_scale = "log10",
highlight_scaffolds = bin1)
#Export the scaffolds in the genome bin from the assembly Now that we are happy with the genome bin we can export the sequences of the scaffolds from the assembly to a separate FASTA file using mmexport
and repeat the analysis to make another bin:
mmexport(bin1,
assembly = assembly,
file = "bins/bin1.fa")
After each genome extraction it is often a good idea to clear the current working environment to ensure a clean workflow with no accidental mixing of objects. To make sure that the same scaffolds are not used for different bins, the mmscanbins
function can be used to scan a bin folder for FASTA files and return a list with all scaffold names found in those files. This is useful to highlight already extracted scaffolds in plots or to simply identify them and remove them from other bins.