
util.fasta {CHNOSZ}

R Documentation

Functions for Reading FASTA Files and Downloading from UniProt

Description

Search the header lines of a FASTA file, read protein sequences from a file, count numbers of amino acids in each sequence, and download sequences from UniProt.

Usage

  read.fasta(file, iseq = NULL, ret = "count", lines = NULL, 
    ihead = NULL, start=NULL, stop=NULL, type="protein", id = NULL)
  count.aa(seq, start=NULL, stop=NULL, type="protein")

Arguments

`file`	character, path to FASTA file
`iseq`	numeric, which sequences to read from the file
`ret`	character, type of return value: ‘count’ (amino acid counts), ‘seq’ (sequences), or ‘fas’ (FASTA-format lines)
`lines`	list of character, supply the lines here instead of reading them from file
`ihead`	numeric, which lines are headers
`start`	numeric, position in sequence to start counting
`stop`	numeric, position in sequence to stop counting
`type`	character, sequence type (protein or DNA)
`id`	character, value to be used for `protein` in output table
`seq`	character, amino acid sequence of a protein

Details

read.fasta is used to retrieve entries from a FASTA file. Use iseq to select the sequences to read (the default is all sequences). The return format depends on the value of ret. The default ‘count’ returns a data frame of amino acid counts (the data frame can be given to add.protein in order to add the proteins to thermo()$protein), ‘seq’ returns a list of sequences, and ‘fas’ returns a list of lines extracted from the FASTA file, including the headers (this can be used e.g. to generate a new FASTA file with only the selected sequences).

If the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to "").

When ret is ‘count’, the names of the proteins in the resulting data frame are parsed from the header lines of the file, unless id is provided. If id is not given and a UniProt FASTA header is detected (regular expression "\|......\|.*_"), the information there (accession, name, organism) is split into the protein, abbrv, and organism columns of the resulting data frame.
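
To illustrate the header detection described above, here is a minimal base R sketch that applies the quoted regular expression to two header lines. The headers are hypothetical (made up for illustration), and the field splitting is only one possible approach, not necessarily how read.fasta parses headers internally.

```r
# Hypothetical header lines, invented for this illustration
headers <- c(
  ">sp|P12345|NAME_ORGANISM Example protein OS=Example organism",
  ">my_protein some description"
)

# The regular expression quoted in Details, with backslashes doubled
# because it is written as an R string
pattern <- "\\|......\\|.*_"
is_uniprot <- grepl(pattern, headers)
is_uniprot
# [1]  TRUE FALSE

# One possible way to pull fields out of the matching header
# (an assumption for illustration, not the package's own code)
fields <- strsplit(sub("^>", "", headers[is_uniprot]), "|", fixed = TRUE)[[1]]
accession <- fields[2]                      # "P12345"
entry <- strsplit(fields[3], " ")[[1]][1]   # "NAME_ORGANISM"
```

The second header contains an underscore but no "|" characters, so it is not treated as a UniProt header by this pattern.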

count.aa counts the occurrences of each amino acid or nucleic-acid base in a sequence (seq). For amino acids, the columns in the returned data frame are in the same order as thermo()$protein. The matching of letters is case-insensitive. A warning is generated if any character in seq, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations. start and/or stop can be provided to count a fragment of the sequence (extracted using substr). If only one of start or stop is present, the other defaults to 1 (start) or the length of the sequence (stop).
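
The effect of start and stop can be sketched as follows. The base R lines show which fragment substr selects; the count.aa calls are guarded so that the block also runs where CHNOSZ is not installed. The variable names are chosen here for illustration only.

```r
seq <- "GGSGG"

# Fragment selected by start = 2, stop = 4
frag_seq <- substr(seq, 2, 4)            # "GSG"
# With only start given, stop defaults to the length of the sequence
tail_seq <- substr(seq, 3, nchar(seq))   # "SGG"

if (requireNamespace("CHNOSZ", quietly = TRUE)) {
  # Counts for the fragment "GSG"
  frag <- CHNOSZ::count.aa(seq, start = 2, stop = 4)
  print(frag[, c("G", "S")])
  # Counts for the fragment "SGG" (stop omitted)
  tail_frag <- CHNOSZ::count.aa(seq, start = 3)
  print(tail_frag[, c("G", "S")])
  # Matching is documented as case-insensitive, so lower case
  # input should give the same counts as "GGSGG"
  lower <- CHNOSZ::count.aa("ggsgg")
  print(lower[, c("G", "S")])
}
```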

Value

read.fasta returns a list of sequences or lines (for ret equal to ‘seq’ or ‘fas’, respectively), or a data frame with amino acid compositions of proteins (for ret equal to ‘count’) with columns corresponding to those in thermo()$protein.

count.aa returns a data frame with the counts of each amino acid or nucleic-acid base in the sequence.
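
The three return types can be compared with a short sketch like the one below, which uses the example file shipped with the package. Writing the ‘fas’ lines with writeLines is one way to produce a new FASTA file holding only the selected entry; the temporary output file and the variable names are assumptions made for this example.

```r
if (requireNamespace("CHNOSZ", quietly = TRUE)) {
  file <- system.file("extdata/protein/EF-Tu.aln", package = "CHNOSZ")

  # List of sequences, one element per entry in the file
  seqs <- CHNOSZ::read.fasta(file, ret = "seq")

  # Lines (header and sequence) of the first entry only
  fas <- CHNOSZ::read.fasta(file, iseq = 1, ret = "fas")

  # Data frame of amino acid counts for the first entry (ret = "count")
  aa <- CHNOSZ::read.fasta(file, iseq = 1)

  # Save the selected entry to a new FASTA file; unlist() flattens
  # the returned list into a plain character vector of lines
  out <- tempfile(fileext = ".fasta")
  writeLines(unlist(fas), out)
}
```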

Examples


## Reading a protein FASTA file
# The path to the file
file <- system.file("extdata/protein/EF-Tu.aln", package = "CHNOSZ")
# Read the sequences, and print the first one
read.fasta(file, ret = "seq")[[1]]
# Count the amino acids in the sequences
aa <- read.fasta(file)
# Compute lengths (number of amino acids)
protein.length(aa)

## Not run: 
## Count amino acids in a sequence
count.aa("GGSGG")
# Warnings are issued for unrecognized characters
atest <- count.aa("WhatAmIMadeOf?")
# There are 3 "A" (alanine)
atest[, "A"]

## End(Not run)

[Package CHNOSZ version 2.1.0 Index]