R: Map taxonomic names to NCBI or GTDB taxonomy

map_taxa {chem16S}

R Documentation

Map taxonomic names to NCBI or GTDB taxonomy

Description

Maps taxonomic names to NCBI (RefSeq) or GTDB taxonomy by automatic matching of taxonomic names, with manual mappings for some groups.

Usage

  map_taxa(taxacounts = NULL, refdb = "GTDB_220", taxon_AA = NULL, quiet = FALSE)

Arguments

`taxacounts`	data frame with taxonomic names and abundances
`refdb`	character, name of reference database (‘GTDB_220’ or ‘RefSeq_206’)
`taxon_AA`	data frame, amino acid compositions of taxa, used to bypass `refdb` specification
`quiet`	logical, suppress printed messages?

Details

This function maps taxonomic names to the NCBI (RefSeq) or GTDB taxonomy. taxacounts should be a data frame generated by either read_RDP or ps_taxacounts. Input names are made by combining the taxonomic rank and name with an underscore separator (e.g. ‘genus_Escherichia/Shigella’). Input names are then matched to the taxa listed in ‘taxon_AA.csv.xz’ found under ‘RefDB/RefSeq_206’ or ‘RefDB/GTDB_220’. The protein and organism columns in these files hold the rank and taxon name extracted from the RefSeq or GTDB database. Only exactly matching names are automatically mapped.
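The name construction and exact matching described above can be sketched with toy data. This is a minimal illustration, not the package's implementation: the column names rank and name in the stand-in for taxacounts are assumptions, and the toy taxon_AA only mimics the protein and organism columns of ‘taxon_AA.csv.xz’.

```r
# Toy stand-in for taxacounts (column names 'rank' and 'name' are assumptions)
taxacounts <- data.frame(
  rank = c("genus", "genus", "family"),
  name = c("Escherichia", "Gp1", "Ruminococcaceae"),
  sample1 = c(10, 5, 3),
  stringsAsFactors = FALSE
)

# Toy stand-in for the reference table: 'protein' holds the rank,
# 'organism' holds the taxon name
taxon_AA <- data.frame(
  protein = c("genus", "genus", "family"),
  organism = c("Acidobacterium", "Escherichia", "Oscillospiraceae"),
  stringsAsFactors = FALSE
)

# Combine rank and name with an underscore separator
input_names <- paste(taxacounts$rank, taxacounts$name, sep = "_")
ref_names <- paste(taxon_AA$protein, taxon_AA$organism, sep = "_")

# Exact matching: row number in the reference table, or NA if no match
imap <- match(input_names, ref_names)
imap
```

In this sketch only ‘genus_Escherichia’ has an exact match; the other two names stay NA unless a manual mapping is applied first.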

For mapping to the NCBI (RefSeq) taxonomy, some group names are manually mapped as follows (see Dick and Tan, 2023):

RDP training set	NCBI
genus_Escherichia/Shigella	genus_Escherichia
phylum_Cyanobacteria/Chloroplast	phylum_Cyanobacteria
genus_Marinimicrobia_genera_incertae_sedis	species_Candidatus Marinimicrobia bacterium
class_Cyanobacteria	phylum_Cyanobacteria
genus_Spartobacteria_genera_incertae_sedis	species_Spartobacteria bacterium LR76
class_Planctomycetacia	class_Planctomycetia
class_Actinobacteria	phylum_Actinobacteria
order_Rhizobiales	order_Hyphomicrobiales
genus_Gp1	genus_Acidobacterium
genus_Gp6	genus_Luteitalea
genus_GpI	genus_Nostoc
genus_GpIIa	genus_Synechococcus
genus_GpVI	genus_Pseudanabaena
family_Family II	family_Synechococcaceae
genus_Subdivision3_genera_incertae_sedis	family_Verrucomicrobia subdivision 3
order_Clostridiales	order_Eubacteriales
family_Ruminococcaceae	family_Oscillospiraceae
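
One way to picture the manual mappings in the table above is as a lookup applied to input names before exact matching. The sketch below uses a hypothetical named vector, manual, holding three rows copied from the table; how map_taxa stores and applies these mappings internally is not shown here.

```r
# Hypothetical lookup with three rows from the table above
# (names are RDP training set names, values are NCBI names)
manual <- c(
  "genus_Escherichia/Shigella" = "genus_Escherichia",
  "genus_Gp1" = "genus_Acidobacterium",
  "family_Ruminococcaceae" = "family_Oscillospiraceae"
)

input_names <- c(
  "genus_Escherichia/Shigella",
  "genus_Bacillus",
  "family_Ruminococcaceae"
)

# Replace only the names that have a manual mapping; leave the rest unchanged
is_manual <- input_names %in% names(manual)
mapped_names <- input_names
mapped_names[is_manual] <- unname(manual[input_names[is_manual]])
mapped_names
```

Names without an entry in the lookup, such as ‘genus_Bacillus’ here, pass through unchanged and rely on exact matching alone.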

To avoid manual mapping, GTDB can be used for both taxonomic assignments and reference proteomes. Taxonomic assignments based on 16S rRNA sequences from GTDB can be made using training files for the RDP Classifier (doi:10.5281/zenodo.7633099) or DADA2 (doi:10.5281/zenodo.2541238); make sure to choose the appropriate GTDB version. Example files created using the RDP Classifier are provided under ‘extdata/RDP-GTDB_220’. An example dataset created with DADA2 is data(mouse.GTDB_214); this is a phyloseq-class object that can be processed with functions described at physeq.

Set quiet to TRUE to suppress printing of messages about manual mappings, the most abundant unmapped groups, and the overall percentage of mapped names.

Value

Integer vector with length equal to the number of rows of taxacounts. Values are row numbers in the data frame generated by reading taxon_AA.csv.xz, or NA for no matching taxon. The attributes unmapped_groups and unmapped_percent hold the input names of unmapped groups and their percentages of the total classification count.
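
As a rough sketch of how a return value with this structure might be used, the code below builds a toy object by hand rather than calling map_taxa. The group names, percentages, and reference table are invented for illustration only.

```r
# Toy return value: row numbers in the reference table, NA for unmapped rows
# (all values and attribute contents here are made up)
map <- structure(
  c(2L, NA, 1L, NA),
  unmapped_groups = c("genus_Unknown1", "family_Unknown2"),
  unmapped_percent = c(10, 2.5)
)

# Toy reference table standing in for the contents of taxon_AA.csv.xz
taxon_AA <- data.frame(
  protein = c("genus", "family"),
  organism = c("Escherichia", "Oscillospiraceae"),
  stringsAsFactors = FALSE
)

# Look up reference rows for the mapped entries only
is_mapped <- !is.na(map)
matched <- taxon_AA[map[is_mapped], ]
matched

# Total percentage of classifications that were not mapped
total_unmapped <- sum(attr(map, "unmapped_percent"))
total_unmapped
```

Dropping NA entries before indexing avoids rows of NA in the result; keeping them may be preferable when the output has to stay aligned with the rows of taxacounts.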

References

Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85: 1338–1355. doi:10.1007/s00248-022-01988-9

Examples

# Partial mapping from RDP training set to NCBI taxonomy
file <- system.file("extdata/RDP/SMS+12.tab.xz", package = "chem16S")
RDP <- read_RDP(file)
map <- map_taxa(RDP, refdb = "RefSeq_206")
# About 24% of classifications are unmapped
sum(attributes(map)$unmapped_percent)

# 100% mapping from GTDB training set to GTDB taxonomy
file <- system.file("extdata/RDP-GTDB_220/SMS+12.tab.xz", package = "chem16S")
RDP.GTDB <- read_RDP(file)
map.GTDB <- map_taxa(RDP.GTDB)
stopifnot(all.equal(sum(attributes(map.GTDB)$unmapped_percent), 0))

[Package chem16S version 1.1.0-5 Index]