Gegenees Fragmented Aligner version 3.1

Please look for updates at www.gegenees.org and report any problems to bo.segerman@sva.se or info@gegenees.org

Installation

Windows

Download and extract the Windows version of Gegenees. If you have a 64-bit environment, download the 'x_86_64' version; if you have a 32-bit environment, download the 'x86' version. Note that even if your Windows version is 64-bit, the installed Java version may be 32-bit. To check this, open a command prompt (search for cmd in the Windows menu) and type 'java -version'. If Java is installed, its version will be reported. The Java version must be 1.8.x or higher. If Java is 64-bit, the version information contains a statement about '64-Bit'. If Java stops and returns 'exit code 13' when you try to start Gegenees, you are probably trying to run a 64-bit Gegenees version with a 32-bit Java version. Gegenees is started by double-clicking the Gegenees program located under the eclipse folder. The commonly used Oracle version of Java is no longer free. If you want, you may use an OpenJDK version instead. See 'https://openjdk.java.net/' or the Azul Zulu build of OpenJDK, 'https://www.azul.com/downloads/zulu/'.
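The Java check described above can also be scripted. The following is only a minimal sketch and not part of Gegenees: the function name 'parse_java_version' and the sample text are made up for illustration, and the sketch assumes the usual layout of 'java -version' output (a quoted version string and, for 64-bit builds, the text '64-Bit').

```python
import re

def parse_java_version(output):
    """Return (major_version, is_64_bit) from 'java -version' text, or None."""
    match = re.search(r'version "([^"]+)"', output)
    if match is None:
        return None
    parts = re.split(r'[._+-]', match.group(1))
    # Old numbering "1.8.0_201" means Java 8; new numbering "11.0.2" means Java 11.
    if parts[0] == "1" and len(parts) > 1:
        major = int(parts[1])
    else:
        major = int(parts[0])
    return major, "64-Bit" in output

# Hypothetical sample of what 'java -version' may report.
sample = (
    'openjdk version "11.0.2" 2019-01-15\n'
    'OpenJDK Runtime Environment 18.9 (build 11.0.2+9)\n'
    'OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)\n'
)
print(parse_java_version(sample))
```

Note that 'java -version' writes to standard error, so when capturing real output, read e.g. subprocess.run(["java", "-version"], capture_output=True, text=True).stderr. A major version of 8 or higher corresponds to the '1.8.x or higher' requirement.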

Macintosh

Download and uncompress the Macintosh version of Gegenees. Gegenees requires Java version 1.8.x or higher. Note that on Macintosh the command-line version of Java is not always the same as the graphical version. Gegenees uses the command-line version. To check your Java version, open a terminal window and type 'java -version'. If Java is installed in the command-line environment, version information will be displayed. The version must be 1.8.x or higher. If you need to install or update your command-line version of Java, download and install the latest Java Development Kit (JDK) version. Note that it must be the JDK version, not the JRE version, for it to become accessible from the command line. Gegenees is started by double-clicking the Gegenees app. The commonly used Oracle version of Java is no longer free. If you want, you may use an OpenJDK version instead. See 'https://openjdk.java.net/' or the Azul Zulu build of OpenJDK, 'https://www.azul.com/downloads/zulu/'. To install the 'openjdk.java.net' version, download it and uncompress it (e.g., with the Archive Utility app or in the terminal with a command such as 'tar xf openjdk-11+28_osx-x64_bin.tar.gz'). Finally, move the folder to the correct location. In a terminal, from the directory containing the extracted folder, use something like 'sudo mv jdk-11.jdk /Library/Java/JavaVirtualMachines/'.

Note that the 'app' is a directory containing the executables. You may open the directory by right-clicking it and selecting 'Show Package Contents'. The actual program is located in 'Contents/MacOS'. If it has lost its execute permission (e.g., on a USB stick), the app will not open. To check this, open a terminal, type 'cd ' (including a space), drag the 'MacOS' directory to the terminal and press Enter. Then type 'ls -l'. The Gegenees file should have the x (execute) flag set (-rwxrwxrwx is OK, -rw-rw-rw- is not OK). You may set the execute flag by typing something like 'chmod a+rwx Gegenees'.
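The same permission check can be sketched in Python. This is an illustration only: the helper name 'ensure_executable' is hypothetical, and unlike 'chmod a+rwx' it only adds the execute bits (similar to 'chmod a+x') and leaves the read/write bits unchanged.

```python
import os
import stat

def ensure_executable(path):
    """Add execute permission for user, group and others if any is missing.

    Returns True if the file mode was changed, False if it was already OK.
    """
    mode = os.stat(path).st_mode
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if mode & exec_bits == exec_bits:
        return False
    os.chmod(path, mode | exec_bits)
    return True
```

It would be called with the path to the Gegenees file inside 'Contents/MacOS' (or, on Linux, the Gegenees program under the eclipse folder).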

Linux

Download and extract the Linux version of Gegenees. If you have a 64-bit environment, download the 'x_86_64' version; if you have a 32-bit environment, download the 'x86' version. Note that even if your Linux version is 64-bit, the installed Java version may be 32-bit. To check this, open a terminal and type 'java -version'. If Java is installed, its version will be reported. The Java version must be 1.8.x or higher. If Java is 64-bit, the version information contains a statement about '64-Bit'. If Java stops and returns 'exit code 13' when you try to start Gegenees, you are probably trying to run a 64-bit Gegenees version with a 32-bit Java version. Gegenees is started by double-clicking the Gegenees program located under the eclipse folder. If the Gegenees program has been transferred via another filesystem (e.g., Windows or a USB key), the execute flag may have been lost. Right-click the Gegenees file, select Properties and Permissions, and check the 'Allow executing file as program' checkbox (or, in the terminal, use e.g. 'chmod a+rwx Gegenees').

Getting started

The workspace

The first time you start Gegenees, you need to select a workspace directory (File->Select workspace...). If you have no previous Gegenees workspace, you may create and select an empty directory. The workspace directory will contain your databases with genomic sequences (a subdirectory named 'database' or starting with 'database_') and comparisons (subdirectories starting with 'comparison_'). If you use the FTP function in the database manager, a directory called ftp will be created. This directory holds the listings from NCBI. You may have several different workspaces for your different projects. If you press the 'Projects' tab, you can see the name of the active workspace at the bottom of the window. Paths containing spaces may cause problems for command-line programs such as BLAST. Gegenees therefore needs a workspace path containing NO spaces (in both the workspace directory name and all parent directories).
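A candidate workspace path can be checked for spaces before it is selected. The sketch below is not part of Gegenees; the function name and the example paths are hypothetical. The same check applies to the BLAST path.

```python
from pathlib import PurePosixPath, PureWindowsPath

def components_with_spaces(path, windows=False):
    """Return the path components (directory names) that contain a space."""
    path_type = PureWindowsPath if windows else PurePosixPath
    return [part for part in path_type(path).parts if " " in part]

# Hypothetical example paths.
print(components_with_spaces("/home/user/my projects/gegenees_ws"))
print(components_with_spaces("/home/user/gegenees_ws"))
```

An empty list means the path is safe to use; any listed component must be renamed or avoided.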

BLAST

Gegenees depends on the standalone executable version of NCBI BLAST. Some versions of NCBI BLAST have had problems with multithreading under some operating systems (mainly Windows). This version of Gegenees has been tested with the latest BLAST version. NCBI BLAST can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Download the version that matches your operating system (Windows: win64.tar.gz or, alternatively, the installer-wrapped win64.exe version; Macintosh: macosx.tar.gz; Linux: linux.tar.gz). Extract the archive file and place it in a path that does NOT contain spaces. Note that spaces are NOT allowed anywhere in the BLAST path (the directory itself and all parent directories).

Windows: You may need a tool such as 7-Zip to unpack the archive. If you do not have permission to unpack or install BLAST on your computer, you may extract the files on another computer and move them to the computer of interest.

Macintosh: It should be possible to uncompress the tar.gz file with e.g. the Archive Utility app. Alternatively, use the terminal extraction method in a terminal window (see Linux below).

Linux: Extract with the Archive Manager or in a terminal with a command such as 'tar -zxvf ncbi-blast-2.6.0...tar.gz'.

On all operating systems, the BLAST path (which is the path to the 'bin' directory of the extracted directory structure) must then be specified in the 'File->Configure Blast path...' dialog of Gegenees.

Databases

In a workspace there is a directory called 'database'. This is the 'default database', which stores genome sequences that can be used to set up a comparison. Each genome is stored in a directory named after the genome, containing one or several GenBank-formatted files. The directory name must not contain spaces and ends with a 'type tag' surrounded by two '-' characters on each side (e.g., '--Complete_Genome--' or '--Contig--'). Gegenees will eventually create an 'info.geg' file with some statistics on the genome. It is also possible to have additional databases, each of which is a directory whose name begins with 'database_'. The databases are handled by the 'Database Manager', which can be launched from the Data->Database manager menu command. You may also handle genomes in a file explorer (Windows File Explorer/Mac Finder/Linux Files). The genomes may be in '.gz' format and will then be uncompressed when used (better support for compressed storage is coming). When starting a new comparison, the available genomes come from the currently active database. If you need to change database, do it through the database manager. To ensure correct formatting, genomes (e.g., in FASTA format) may be imported into a database using the database manager.
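The naming convention for genome directories can be illustrated with a small check. This is a sketch under assumptions, not Gegenees code: the function name and the genome name are hypothetical, and the sketch assumes that type tags consist only of letters, digits and underscores, as in the two examples above.

```python
import re

# <genome name without whitespace>--<type tag>--
GENOME_DIR = re.compile(r'(?P<name>\S+?)--(?P<tag>\w+)--')

def parse_genome_dir(dirname):
    """Return (genome_name, type_tag), or None if the name does not fit."""
    match = GENOME_DIR.fullmatch(dirname)
    if match is None:
        return None
    return match.group("name"), match.group("tag")

# Hypothetical directory names.
print(parse_genome_dir("Bacillus_anthracis_Ames--Complete_Genome--"))
print(parse_genome_dir("My genome--Contig--"))  # contains a space: rejected
```

Importing genomes through the database manager remains the safer route, since it takes care of the formatting itself.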

The Projects tab

The Projects tab lists all comparisons in the current workspace directory. Selecting a comparison updates the contents of the other tabs to reflect the selected comparison. In an empty workspace, no comparisons are listed. A new comparison can be started with File->New Comparison... The path of the current workspace is shown on the bottom line.

The Run Alignments Tab

When a new comparison is made (the New comparison wizard is completed), you will end up on this tab. You start the comparison by pressing 'start'. The log window and the status bar show the progress of the alignment. After some initial work, the log window will be dominated by messages such as 'Thread=12/625 (P=12) blast producing: G8_G17.result' and 'Thread=6/625 (P=12) done: OK!'. In this case, 625 BLAST commands need to be run (each comparing all fragments from one genome against the unfragmented sequence of another genome). The P value shows how many parallel threads are executing BLAST commands. It is based on the number of processor cores reported by the system and can be limited on the preferences page. Occasionally a thread may not produce a valid BLAST result file in reasonable time; the thread is then killed and that particular BLAST job is started again after all other threads have finished. A few restarts are tolerable, but a large number of failures/restarts indicates that something is wrong. After some post-processing, the log window will reload its contents and start with a row 'COMPLETED!'. The data can then be explored in the other tabs. After the comparison has completed, the Run Alignments tab has little function except for viewing the log file (a reduced version of the log file is loaded). The log file can also be viewed in a text editor. It is called 'logfilecomparison.txt' and is located in the 'alignment_analysis' directory of the comparison directory.
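If you want to follow progress from the log file outside Gegenees, the thread messages quoted above can be parsed with a regular expression. This is a rough sketch: it is based only on the two example messages shown in this section, the field names are made up, and reading the first number as a job number out of the total is an assumption.

```python
import re

LOG_LINE = re.compile(
    r'Thread=(\d+)/(\d+) \(P=(\d+)\) (blast producing|done): (.+)'
)

def parse_log_line(line):
    """Return a dict describing a thread message, or None for other lines."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "number": int(match.group(1)),    # assumed: job number
        "total": int(match.group(2)),     # total number of BLAST commands
        "parallel": int(match.group(3)),  # the P value
        "event": match.group(4),
        "detail": match.group(5).strip(),
    }

print(parse_log_line("Thread=12/625 (P=12) blast producing: G8_G17.result"))
print(parse_log_line("Thread=6/625 (P=12) done: OK!"))
```

Other log formats (for example restart messages) are not covered and simply return None.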

 

The Included Genomes Tab

This tab lists the genomes in the comparison. Information about the genomes can be displayed, and the comparison can be modified by adding or removing genomes. Such a change requires a new round of calculations to be started in the 'Run Alignments' tab.

The Group settings Tab

This tab allows the included genomes to be assigned to the 'background' or the 'target' group. The Signature tabs use this information. A signature represents what is conserved in the target group but absent from the background group. A reference genome can also be selected, which defines the coordinate system when the signature is plotted graphically.

The heatmap Tab

In this tab, a heatmap based on the average fragment similarity between the genomes is displayed. The color profile can be changed and the rows/columns can be sorted by similarity. Autosort sorts the heatmap table by similarity. A new sort order is given a name and stored so that it can be reloaded. A threshold can also be set so that fragments with poor alignment are filtered out. This allows the 'core genomes' to be compared. However, if too much of the genome is filtered out, the data become unreliable. How much of the genome is included at a certain threshold level can be investigated by changing from 'show score' to 'show core'. Heatmaps can be exported in different file formats for use in other programs.

The Signature graph Tab

Note, genomes must be grouped into target and background groups and a reference genome in the target group must be selected to use this function (Group settings tab).

In this tab, the selected reference genome is plotted, with 'biomarker score' values plotted above and below it. The default upward plot is the (max/min) biomarker score, which means that the max value in the background group (the worst false positive) and the min value in the target group (the worst false negative) are used to calculate the score. This is the most stringent way to look at signatures: a high score requires full conservation in the target group and no trace of the sequence in the background group. The default downward plot is (max/average), which uses the average of the background group to calculate the score (less stringent). Even lower stringency is obtained if (average/average) is selected.

If you select a single genome as the target group, you will highlight what is unique for this genome.

The Signature Table Tab

In this tab, fragments within a chosen range of biomarker scores can be selected and analyzed. Type the range you want to investigate and then press 'Show fragments'. You may select fragments and display the actual sequence with 'Show sequence'.

Database manager

The database manager can be launched through the 'Data' menu. It shows the content of the current database and allows you to change database and create new databases. Genomes can be selected and then, e.g., renamed, deleted or copied between databases. Genomes can be imported from the filesystem (e.g., if you have FASTA files).

You can also start an FTP client window that helps you download genomes from the NCBI FTP site. The FTP client downloads a list file with the available genomes, saves it in the 'FTP' directory and re-uses it until you press 'Reload FTP content'. The bacteria lists are large, and keeping a local copy saves time. Depending on your computer and OS, the table showing the FTP content may become slow (the Linux version seemed slower during testing). You may limit the number of genomes shown by typing a filter text and then pressing Apply. This search works as 'genome name contains filter text'. If you want 'genome name starts with filter text' behaviour, start the filter text with a '^'. The filter '^A' shows all genomes beginning with an A.
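The filter behaviour described above can be summarized in a few lines. This is a sketch of the described rule only, not the actual implementation; the function name and genome names are hypothetical, and case-sensitive matching is an assumption.

```python
def matches_filter(genome_name, filter_text):
    """'^text' means 'starts with text'; anything else means 'contains text'."""
    if filter_text.startswith("^"):
        return genome_name.startswith(filter_text[1:])
    return filter_text in genome_name

# Hypothetical genome names.
names = ["Acinetobacter_baumannii", "Bacillus_anthracis", "Bacillus_cereus"]
print([n for n in names if matches_filter(n, "^A")])
print([n for n in names if matches_filter(n, "anthracis")])
```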

You may choose to download additional file types as well. This should only be used if you need these files for other purposes. Export functions are being developed.

You may select Overwrite mode, which erases earlier versions of a genome and replaces them with the new version.

You may select 'Perform download with minimal progress log', which may be used in large downloads to minimize the risk that the user interface interferes with the download process.

You may select the skip-unzip step after download to keep the files in '.gz' format. Better support for keeping the genomes in '.gz' format all the time is being developed.

If you have an unreliable connection, the FTP transfer may 'hang'. A control thread will detect this and restart the current transfer after about five minutes.

Troubleshooting

Troubleshooting of general installation problems is described in the Installation section.

 

Problems during comparisons:

'EXITED WITH ERROR!!! (error 0)' is an error code indicating that BLAST was not found or cannot be executed. The current version checks that a BLAST path is set before the new comparison wizard can be started, so this error is probably less frequent, but it can still occur if there is a problem with BLAST. Other BLAST exit codes include: 1 = error in query sequence (fragment), 2 = error in database sequence (unfragmented genome), 3 = error in BLAST engine.

If the comparison starts OK but then begins to produce error 0 messages, this may be related to the system limit on the maximum number of simultaneously open files. Especially if you run small genomes with many threads, the Java garbage collector may not keep up with closing temporary files. Increasing this limit may solve the problem.
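On Linux and Macintosh, the current open-file limit can be inspected from Python with the standard 'resource' module (not available on Windows). This is a small sketch for inspection only; the function name is made up, and how the limit is raised depends on your system.

```python
import resource

def open_file_limits():
    """Return the (soft, hard) limits on open files for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
```

In a terminal, 'ulimit -n' reports the same soft limit for the current shell.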

 

The meaning of the values in the heatmap

The score you get in Gegenees is a percentage, but it is not directly translatable to percent nucleotide identity. It is the average of the fragments' BLAST scores, each expressed as a percentage of the score the fragment would yield against itself (at 100% identity). This value may drop somewhat faster than the actual percent nucleotide identity, especially when short fragments are used. This is because some fragments will not produce a hit and will then end up with a zero score, which pulls down the overall average. On the other hand, the Gegenees score will separate phylogenetic groups at much lower values than nucleotide identity would. Using thresholds reduces this phenomenon.
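The effect of zero-score fragments on the average can be illustrated with a small calculation. This is a simplified sketch and not the Gegenees implementation: the function name and the numbers are invented, and the way the threshold is applied here (dropping fragments whose normalized score is below a fraction) is an assumption.

```python
def average_similarity(fragment_scores, self_scores, threshold=0.0):
    """Average normalized fragment score, as a percentage.

    fragment_scores[i]: BLAST score of fragment i against the other genome
                        (0 if the fragment produced no hit).
    self_scores[i]:     score of fragment i against itself.
    threshold:          fraction in [0, 1]; fragments with a normalized score
                        below it are left out (assumed threshold behaviour).
    """
    normalized = [s / self_s for s, self_s in zip(fragment_scores, self_scores)]
    kept = [n for n in normalized if n >= threshold]
    if not kept:
        return 0.0
    return 100.0 * sum(kept) / len(kept)

# Invented example: nine fragments align at 95% of their self-score, one has no hit.
scores = [950] * 9 + [0]
self_scores = [1000] * 10
print(average_similarity(scores, self_scores))
print(average_similarity(scores, self_scores, threshold=0.2))
```

In this invented example the single missing fragment lowers the average from about 95 to about 85.5, while a threshold of 0.2 removes it from the calculation again.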

The heatmap may in some cases be asymmetric. This can arise if most of the fragments in one genome are found in the other, but not the other way around. This can be due to genome reduction during evolution or the presence of large plasmids. It can also be caused by contamination in the assembly (meaning that a second, contaminating genome has been assembled in addition to the genome studied, usually with lower coverage).

Please send feedback comments and suggestions to bo.segerman@sva.se or info@gegenees.org

 

About Fragment size

A small fragment size combined with a small step size was named 'accurate' in previous versions. This referred to the possibility of finding small signature regions in genomes (e.g., unique regions for PCR primer design). In most cases the 500/500 setting is faster and may even be better (it produces fewer zero-score fragments). Exceptions may be if you are looking for small signature regions or if you are working with very small genomes.

Fast run:

If you are working with very large datasets or want a fast answer, you may set a large step size (e.g., fragment size 500, step size 5000). This will speed up the comparison and produce a heatmap based on sampling the genomes (in this case around 10% of each genome). However, this makes signature analysis impossible.
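The relation between fragment size, step size and the sampled share of a genome can be sketched as follows. This only illustrates the arithmetic: the function names are hypothetical, and how Gegenees treats the last, incomplete part of a sequence is not covered here (the sketch simply ignores it).

```python
def fragment_starts(genome_length, fragment_size, step_size):
    """Start positions of full-length fragments taken every step_size bases."""
    return list(range(0, genome_length - fragment_size + 1, step_size))

def sampled_fraction(fragment_size, step_size):
    """Approximate share of the genome covered by the fragments."""
    return min(1.0, fragment_size / step_size)

print(sampled_fraction(500, 5000))         # fast run: about 10% of the genome
print(sampled_fraction(500, 500))          # 500/500 setting: full coverage
print(fragment_starts(12000, 500, 5000))   # invented 12000 bp sequence
```

With a step size larger than the fragment size, the regions between fragments are never compared, which is why signature analysis is not possible in a fast run.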