chem16S-package {chem16S} | R Documentation
Chemical metrics for microbial communities
Description
Functions and data to calculate chemical metrics for reference proteomes for microbial (archaeal and bacterial) communities. Amino acid compositions of community reference proteomes are generated by combining reference proteomes of taxa (derived from GTDB or RefSeq) with taxonomic classifications of 16S rRNA gene sequences.
Details
- read_RDP
Read and filter an RDP Classifier table
- map_taxa
Map RDP Classifier assignments to the NCBI taxonomy
- get_metrics
Get chemical metrics for community reference proteomes
- get_metadata
Example function for retrieving sample metadata
- plot_metrics
Plot metrics with symbols and colors based on metadata
- physeq
Functions designed to analyze phyloseq-class objects
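The first three functions above are typically chained: read a classifier table, map its taxa to a reference database, then compute metrics. The following is a minimal sketch rather than a definitive recipe. It assumes that chem16S is installed, that get_metrics accepts the mapping as its second argument, and that "RefSeq_206" is a valid value for a refdb argument (inferred from the RefDB/RefSeq_206 directory name). The block is skipped if the package is not available.

```r
# Sketch of a read -> map -> metrics sequence.
# Assumptions: the refdb argument name and its value "RefSeq_206".
if (requireNamespace("chem16S", quietly = TRUE)) {
  # Example RDP Classifier output shipped in extdata/RDP (Bison Pool)
  file <- system.file("extdata/RDP/SMS+12.tab.xz", package = "chem16S")
  # Read and filter the classifier table
  RDP <- chem16S::read_RDP(file)
  # Map RDP taxon names to the NCBI (RefSeq) taxonomy
  map <- chem16S::map_taxa(RDP, refdb = "RefSeq_206")
  # Compute chemical metrics for the community reference proteomes
  metrics <- chem16S::get_metrics(RDP, map, refdb = "RefSeq_206")
  print(head(metrics))
}
```

The same sequence should apply to the files in extdata/RDP-GTDB_220 if the reference database is switched to the GTDB-based one.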
History:
Work begun in 2021. The combination of RefSeq reference proteomes with taxonomic abundances to compute community-level chemical metrics was described by Dick and Tan (2023). chem16S originated as code in the JMDplots package (https://github.com/jedick/JMDplots).
Development in 2022. Dick and Meng (2023) compared community ZC with redox potential measurements from local to global scales. The term “community reference proteomes” was first applied, and chem16S was split into a separate package.
Late 2022. GTDB r207 was added as a reference database.
June–July 2023. Integration with phyloseq and addition of vignettes: Chemical metrics of reference proteomes, Integration of chem16S with phyloseq, and Plotting two chemical metrics. Default reference database changed to GTDB r207.
April 2024. Updated to GTDB r214.
July 2024. Updated to GTDB r220.
Options set in package chem16S
chem16S sets an option using the global options mechanism in R. This option will be set when package chem16S (or its namespace) is loaded if not already set.
- manual_mappings
A data frame of mappings between RDP and NCBI (RefSeq) taxonomies, which is read from ‘extdata/manual_mappings.csv’. The columns include RDP.rank, RDP.name, NCBI.rank, NCBI.name, and notes. This option is made available so the user can modify the manual mappings used by map_taxa at runtime.
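As an illustration of modifying this option at runtime, the sketch below appends one row to the current mappings using base R's options mechanism. The taxon names are placeholders invented for the example, and the code assumes that the data frame has exactly the five columns listed above; if the real data frame has additional columns, the new row would need them too.

```r
# Hypothetical extra mapping; the taxon names are placeholders only
new_row <- data.frame(
  RDP.rank = "genus", RDP.name = "ExampleGenusA",
  NCBI.rank = "genus", NCBI.name = "ExampleGenusB",
  notes = "illustrative entry only",
  stringsAsFactors = FALSE
)

# Current mappings: a data frame if chem16S has been loaded, otherwise NULL
current <- getOption("manual_mappings")

# Append the new row (rbind drops a NULL argument) and store the result
updated <- rbind(current, new_row)
options(manual_mappings = updated)

# Inspect the last row of the option as it is now set
print(tail(getOption("manual_mappings"), 1))
```

Because the option is only set on load if not already set, a value assigned before loading the package would be left in place.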
Files in RefDB/RefSeq_206
NOTE: None of the ‘*.R’ files in the ‘extdata’ directories are included in the package submitted to CRAN; see GitHub or Zenodo for these files.
This directory contains two sets of files: 1) scripts to process source RefSeq sequence files to generate amino acid compositions of species-level reference proteomes and taxonomic names; 2) script and output for amino acid compositions of higher-level taxa. The files are based on RefSeq release 206 of 2021-05-21 (O'Leary et al., 2016).
- ‘README.txt’
Description of steps to generate reference proteomes of species-level taxa (including downloads and shell commands).
- ‘gencat.sh’
Helper script to extract microbial protein records from the RefSeq catalog.
- ‘genome_AA.R’
R code to sum the amino acid compositions of all proteins for each bacterial, archaeal, and viral species in the NCBI Reference Sequence database. NOTE: To save space in this package, the output file (‘genome_AA.csv’) is stored in the RefDB/RefSeq_206 directory of the JMDplots package on GitHub (https://github.com/jedick/JMDplots). The first five columns are: protein (“refseq”), organism (taxonomic id), ref (organism name), abbrv (empty), chains (number of protein sequences for this organism). Columns 6 to 25 have the counts of amino acids.
- ‘taxonomy.R’
R code for processing taxonomic IDs; the output file is ‘taxonomy.csv’. The columns are NCBI taxonomic ID (taxid), and names at different taxonomic ranks (species, genus, family, order, class, phylum, superkingdom).
- ‘taxon_AA.R’
Functions to create the file listed below:
- ‘taxon_AA.csv.xz’
Average amino acid composition of reference proteomes for all species in each genus, family, order, class, phylum, and superkingdom.
Files in RefDB/GTDB_220
- ‘taxon_AA.R’
Functions to process GTDB source files (Parks et al., 2022) and produce the following output file:
- ‘taxon_AA.csv.xz’
Average amino acid composition of reference proteomes for all species in each genus, family, order, class, phylum, and domain. In both this file and the corresponding file for RefSeq (see above), the protein, organism, ref, and abbrv columns contain the rank, taxon name, number of species used to generate the amino acid composition of this taxon, and parent taxon. chains is 1, denoting a single polypeptide chain, so the amino acid composition represents the average per-protein amino acid composition in this taxon, and the sum of amino acid counts is the average protein length.
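To make the column layout concrete, the sketch below builds a single made-up row in this format and recovers the average protein length as the sum of the amino acid counts. The taxon names and counts are invented, and the three-letter amino acid column names are an assumption; only the positions (five leading columns followed by 20 amino acid columns) come from the description above.

```r
# Assumed amino acid column names (three-letter abbreviations)
aa <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu",
        "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr")

# Leading columns: rank, taxon name, number of species, parent taxon, chains
lead <- data.frame(
  protein = "genus", organism = "ExampleGenus", ref = 12,
  abbrv = "ExampleFamily", chains = 1,
  stringsAsFactors = FALSE
)

# Invented per-protein average counts: 15 of each residue
counts <- as.data.frame(as.list(setNames(rep(15, 20), aa)))
taxon_AA <- cbind(lead, counts)

# Columns 6 to 25 hold the amino acid counts; their sum is the
# average protein length for this taxon
avg_length <- rowSums(taxon_AA[, 6:25])
print(avg_length)
```

With real data, the same rowSums call could be applied to a data frame read from ‘taxon_AA.csv.xz’, provided the columns are laid out as described.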
Files in extdata/metadata
- ‘BGPF13.csv’
Metadata for Heart Lake Geyser Basin, Yellowstone (Bowen De León et al., 2013).
- ‘HLA+16.csv’
Metadata for the Baltic Sea (Herlemann et al., 2016).
- ‘SMS+12.csv’
Metadata for Bison Pool, Yellowstone (Swingley et al., 2012).
Files in extdata/RDP
Output of RDP Classifier with the default training set.
- ‘pipeline.R’
Pipeline for sequence data processing (uses external programs fastq-dump, vsearch, seqtk, RDP Classifier). This was used to make the files in both ‘RDP’ and ‘RDP-GTDB_220’ (the latter with GTDB = TRUE in the script).
- ‘BGPF13.tab.xz’
Heart Lake Geyser Basin.
- ‘HLA+16.tab.xz’
Baltic Sea.
- ‘SMS+12.tab.xz’
Bison Pool.
Files in extdata/RDP-GTDB_220
Output of RDP Classifier trained with 16S rRNA sequences from GTDB release 220 (doi:10.5281/zenodo.7633099).
- ‘BGPF13.tab.xz’
Heart Lake Geyser Basin.
- ‘HLA+16.tab.xz’
Baltic Sea.
- ‘SMS+12.tab.xz’
Bison Pool.
Files in extdata/DADA2-GTDB_214
Identification and taxonomic classification of sequences using DADA2 with GTDB r214.
- ‘FEN+22’
Analysis of data from Fonseca et al. (2022) for marine sediment from the Humboldt Sulfuretum. ‘pipeline.R’ has the commands used to process the 16S rRNA gene sequence data and was adapted by Jeffrey Dick from the DADA2 pipeline tutorial (Callahan, 2020). ‘SraRunInfo.csv’ was obtained from the NCBI Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA251688). ‘sample_data.csv’ has data obtained from NCBI BioSample records for BioProject PRJNA251688. ‘*.png’ are several plots created while running the DADA2 pipeline. ‘ps_FEN+22.rds’ contains the phyloseq object (including otu_table, sample_data, and refseq objects) created at the end of the DADA2 pipeline.
- ‘ZFZ+23’
Analysis of data from Zhang et al. (2023) for hot springs in the Qinghai-Tibet Plateau. ‘pipeline.R’ has the commands used to process the 16S rRNA gene sequence data and was adapted by Jeffrey Dick from the DADA2 pipeline tutorial (Callahan, 2020). ‘SraRunInfo.csv’ was obtained from the NCBI Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA860942). ‘sample_data.csv’ has data obtained from NCBI BioSample records for BioProject PRJNA860942. ‘*.png’ are several plots created while running the DADA2 pipeline. ‘ps_ZFZ+23.rds’ contains the phyloseq object (including otu_table, sample_data, and refseq objects) created at the end of the DADA2 pipeline.
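The saved phyloseq objects can be loaded with base R's readRDS. The sketch below is hedged in two ways: the file path is assembled from the directory and file names given above and is an assumption about where the file sits in the installed package, and the code only runs if both chem16S and phyloseq are installed and the file is found.

```r
# Load a saved phyloseq object if the packages and the file are available.
# Assumption: the .rds file is at this path inside the installed package.
if (requireNamespace("chem16S", quietly = TRUE) &&
    requireNamespace("phyloseq", quietly = TRUE)) {
  rds <- system.file("extdata/DADA2-GTDB_214/FEN+22/ps_FEN+22.rds",
                     package = "chem16S")
  # system.file() returns "" when the file does not exist
  if (nzchar(rds)) {
    ps <- readRDS(rds)
    # Basic dimensions of the object
    print(phyloseq::nsamples(ps))
    print(phyloseq::ntaxa(ps))
    # First rows of the sample metadata
    print(head(phyloseq::sample_data(ps)))
  }
}
```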
References
Bowen De León K, Gerlach R, Peyton BM, Fields MW. 2013. Archaeal and bacterial communities in three alkaline hot springs in Heart Lake Geyser Basin, Yellowstone National Park. Frontiers in Microbiology 4: 330. doi:10.3389/fmicb.2013.00330
Callahan B. 2020. DADA2 Pipeline Tutorial (1.16). https://benjjneb.github.io/dada2/tutorial.html, accessed on 2023-06-14.
Dick JM, Meng D. 2023. Community- and genome-based evidence for a shaping influence of redox potential on bacterial protein evolution. mSystems 8(3): e00014-23. doi:10.1128/msystems.00014-23
Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85: 1338–1355. doi:10.1007/s00248-022-01988-9
Fonseca A, Espinoza C, Nielsen LP, Marshall IPG, Gallardo VA. 2022. Bacterial community of sediments under the Eastern Boundary Current System shows high microdiversity and a latitudinal spatial pattern. Frontiers in Microbiology 13: 1016418. doi:10.3389/fmicb.2022.1016418
Herlemann DPR, Lundin D, Andersson AF, Labrenz M, Jürgens K. 2016. Phylogenetic signals of salinity and season in bacterial community composition across the salinity gradient of the Baltic Sea. Frontiers in Microbiology 7: 1883. doi:10.3389/fmicb.2016.01883
O'Leary NA et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research 44: D733–D745. doi:10.1093/nar/gkv1189
Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, Hugenholtz P. 2022. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research 50: D785–D794. doi:10.1093/nar/gkab776
Swingley WD, Meyer-Dombard DR, Shock EL, Alsop EB, Falenski HD, Havig JR, Raymond J. 2012. Coordinating environmental genomics and geochemistry reveals metabolic transitions in a hot spring ecosystem. PLOS One 7(6): e38108. doi:10.1371/journal.pone.0038108
Zhang H-S, Feng Q-D, Zhang D-Y, Zhu G-L, Yang L. 2023. Bacterial community structure in geothermal springs on the northern edge of Qinghai-Tibet plateau. Frontiers in Microbiology 13: 994179. doi:10.3389/fmicb.2022.994179