Alien Index Calculator

The number of available genomes continues to grow exponentially, necessitating the development of efficient tools for quickly processing these data as well as generating and testing hypotheses.

Horizontal gene transfer (HGT) — the transfer of genetic material from one organism to another through a process other than reproduction — is one source of innovation that can result in the rapid acquisition of genes that contribute to ecologically important traits.

However, screening an entire genome for possible HGT is a challenging task. Why?

  • Phylogenetic trees are time-consuming to construct and interpret.
  • Accurate predictions are contingent on sufficient taxon sampling.
  • The signature of HGT may be complex, with multiple donors and recipients.
  • The majority of genomes are not completely sequenced and are often represented instead by dozens to thousands of assembly contigs. In fragmented genomes, true HGT can be difficult to differentiate from assembly contamination.

How it Works

We developed a BLAST-based algorithm to identify genes with significantly better BLAST hits to distantly related organisms than to closely related ones and thereby identify suspected HGT events as well as flag likely assembly contamination.

Gene(s) of interest are first searched against our custom RefSeq database. The sequence headers in this database have been reformatted to be more taxonomically informative. See the section on Downloading Database Files below.

With tab-delimited BLAST output in hand, the Alien Index (AI) score can then be calculated. The AI score requires specification of two NCBI taxonomy IDs representing the last common ancestors (LCAs) of two taxonomic lineages:

  • The recipient lineage into which possible HGT events may have occurred. This can be as specific as the query itself (i.e., to identify foreign genes unique to that genome) or as broad as a larger clade (i.e., to identify genes that may have been horizontally acquired since the LCA of several related species).
  • The group lineage representing an even larger clade that encompasses the recipient lineage and its sister clade(s).
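The two lineages split every BLAST hit into three categories: hits inside the recipient lineage, hits inside the group lineage but outside the recipient lineage, and hits outside the group lineage. The following is a minimal sketch of that partition, not the calculator's actual code. It assumes each subject's lineage is already available as a list of NCBI taxonomy IDs from species to root; the function name, the enum, and the small integer IDs are all hypothetical placeholders.

```python
from enum import Enum


class HitClass(Enum):
    RECIPIENT = "recipient"  # inside the recipient lineage
    GROUP = "group"          # inside the group lineage, outside the recipient lineage
    OUTSIDE = "outside"      # outside the group lineage


def classify_hit(lineage, recipient_taxid, group_taxid):
    """Classify one BLAST subject by its taxonomic lineage.

    lineage: iterable of taxonomy IDs from the subject's species up to the root
             (assumed input format, not taken from the calculator itself).
    """
    ids = set(lineage)
    # Check the recipient first: it is nested inside the group lineage.
    if recipient_taxid in ids:
        return HitClass.RECIPIENT
    if group_taxid in ids:
        return HitClass.GROUP
    return HitClass.OUTSIDE


# Placeholder IDs for illustration only: 100 = recipient LCA, 10 = group LCA, 1 = root.
print(classify_hit([101, 100, 10, 1], recipient_taxid=100, group_taxid=10))
print(classify_hit([201, 200, 10, 1], recipient_taxid=100, group_taxid=10))
print(classify_hit([301, 300, 30, 1], recipient_taxid=100, group_taxid=10))
```

The three calls fall into the recipient, group, and outside categories, respectively.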

The AI score is given by the following formula: AI = nbsO - nbsG, where nbsO is the normalized bitscore of the best hit to a species outside of the group lineage and nbsG is the normalized bitscore of the best hit to a species within the group lineage (skipping all hits to the recipient lineage). Normalized bitscores are calculated as the bitscore of the best high-scoring pair to the subject sequence divided by the best bitscore possible for the query sequence (i.e., the bitscore of the query aligned to itself).
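As a worked illustration of the formula, the sketch below computes the AI from three bitscores. The function names are hypothetical, and treating a missing hit as a bitscore of 0 is an assumption made here (it keeps the score within -1 to +1), not something the calculator is documented to do.

```python
def normalized_bitscore(bitscore, self_bitscore):
    """Bitscore divided by the query's self-alignment bitscore."""
    if self_bitscore <= 0:
        raise ValueError("self_bitscore must be positive")
    return bitscore / self_bitscore


def alien_index(best_outside, best_group, self_bitscore):
    """AI = nbsO - nbsG.

    best_outside:  best bitscore to a species outside the group lineage
    best_group:    best bitscore inside the group lineage, recipient hits skipped
    self_bitscore: bitscore of the query aligned to itself
    Pass 0.0 when a category has no hit (assumption for this sketch).
    """
    nbs_o = normalized_bitscore(best_outside, self_bitscore)
    nbs_g = normalized_bitscore(best_group, self_bitscore)
    return nbs_o - nbs_g


# Hypothetical bitscores: 450/600 - 150/600 = 0.75 - 0.25
print(alien_index(best_outside=450.0, best_group=150.0, self_bitscore=600.0))  # 0.5
```

A query with no hit outside the group lineage and a perfect hit inside it would score -1 under the same assumption.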

AI is greater than zero if the query sequence has a better hit to a species outside of the group lineage than within it. Therefore, an AI score > 0 can be suggestive of either HGT or contamination. Note that the original Alien Index developed by Gladyshev et al. (2008) was based on relative E-values and ranged from -460 to +460, with AI > 45 considered a putative HGT candidate. However, by converting the AI to a bitscore-based metric, our results are not impacted by BLAST version, database size, or computer hardware. The bitscore-based AI score ranges from -1 to +1, and we often impose AI score cutoffs of ≥0.05 or ≥0.1 depending on the project.
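Applying a cutoff amounts to a simple filter over per-query AI scores. The sketch below assumes the scores are held in a dictionary keyed by query ID; the function name, gene names, and values are hypothetical.

```python
def flag_candidates(ai_scores, cutoff=0.05):
    """Return (query_id, AI) pairs with AI >= cutoff, highest AI first."""
    hits = [(query, ai) for query, ai in ai_scores.items() if ai >= cutoff]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for four query genes.
scores = {"geneA": 0.50, "geneB": 0.07, "geneC": -0.20, "geneD": 0.02}
print(flag_candidates(scores, cutoff=0.05))  # [('geneA', 0.5), ('geneB', 0.07)]
print(flag_candidates(scores, cutoff=0.1))   # [('geneA', 0.5)]
```

Flagged queries are candidates for either HGT or contamination, so they still need follow-up inspection.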

Installation

The Alien Index Calculator is designed to run on Mac OS X or Linux.

  1. Download dependencies

  2. Download the Alien Index Calculator here, or use wget:

    wget https://www.wisecaverlab.com/s/AlienIndexCalculator_preview1.tgz
    tar xzf AlienIndexCalculator_preview1.tgz

  3. Download Database Files

    cd AlienIndexCalculator_preview1
    diamond makedb --in refseq.fa --db refseq

To Run

Double click on the AlienIndex binary to open the program.

We update the custom database periodically (approximately every 6 months), but if necessary, you may email Jen to request that the database be updated. See also the custom RefSeq GitHub page for more information on how the custom database differs from the NCBI release and for information on updating the database yourself. The scripts are designed to use the PBS queue on Purdue's Research Computing Community Clusters, so they would likely require some modification to run elsewhere.