Download and uncompress the Macintosh version of Gegenees. Gegenees requires Java version 1.8.x or higher. Note that on Macintosh the command-line version of Java is not always the same as the graphical version, and Gegenees uses the command-line version. To check your Java version, open a terminal window and type 'java -version'. If Java is installed in the command-line environment, version information will be displayed. The version must be 1.8.x or higher. If you need to install or update your command-line version of Java, download and install the latest Java Development Kit (JDK). It must be the JDK, not the JRE, for Java to become accessible from the command line. Gegenees is started by double-clicking the Gegenees app.
The commonly used Oracle version is no longer free. If you prefer, you may use an OpenJDK version instead: see 'https://openjdk.java.net/' or the Azul Zulu build of OpenJDK at 'https://www.azul.com/downloads/zulu/'. To install the 'openjdk.java.net' version, download it and uncompress it (e.g. with the Archive Utility app, or in the terminal with a command such as 'tar xf openjdk-11+28_osx-x64_bin.tar.gz'). Finally, move the folder to the correct location: in a terminal, from the directory containing the extracted folder, use something like 'sudo mv jdk-11.jdk /Library/Java/JavaVirtualMachines/'.
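The version check can also be scripted. The following is a minimal Python sketch, not part of Gegenees; the function names are hypothetical. It assumes that 'java -version' prints a line containing 'version "..."' and that a leading '1.' marks the older numbering scheme, where '1.8' means Java 8.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Older scheme: "1.8.0_201" means Java 8. Newer scheme: "11.0.2" means Java 11.
    if first == 1 and second is not None:
        return int(second)
    return first


def java_is_new_enough(version_output, minimum=8):
    """True if the reported version is at least `minimum` (8 corresponds to 1.8.x)."""
    major = parse_java_major(version_output)
    return major is not None and major >= minimum


def read_java_version():
    """Run `java -version`; return its text, or "" if java is not on the PATH."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    # `java -version` normally writes to stderr, so both streams are returned.
    return result.stderr + result.stdout


if __name__ == "__main__":
    print("Java OK" if java_is_new_enough(read_java_version()) else "Java missing or too old")
```

Running the script in the same terminal where you would type 'java -version' tests the command-line Java, which is the one Gegenees uses.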
Note that the 'app' is a directory containing the executables. You may open the directory by right-clicking it and selecting 'Show Package Contents'. The actual program is located in 'Contents/MacOS'. If it has lost its execute permission (e.g. on a USB stick), the app will not open. To check this, open a terminal, type 'cd ' (including a space), drag the 'MacOS' directory to the terminal and press Enter. Then type 'ls -l'. The Gegenees file should have the x (execute) flag set (-rwxrwxrwx is OK, -rw-rw-rw- is not OK). You may set the execute flag by typing something like 'chmod a+rwx Gegenees'.
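The same permission check can be done from a script. This is a small Python sketch with hypothetical function names. Note that it only adds the execute bits (comparable to 'chmod a+x'), which is narrower than the 'chmod a+rwx' command mentioned above.

```python
import os
import stat

# The three execute bits: user, group and others (the 'x' columns in `ls -l`).
EXECUTE_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def is_executable(path):
    """True if at least one execute bit is set on the file."""
    return bool(os.stat(path).st_mode & EXECUTE_BITS)


def add_execute_bits(path):
    """Add execute permission for user, group and others, keeping other bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | EXECUTE_BITS)
```

For example, calling is_executable on the 'Gegenees' file inside 'Contents/MacOS' tells you whether the launcher can be started at all.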
Gegenees depends on the standalone executable version of NCBI BLAST. Some versions of NCBI BLAST have had problems with multithreading under some operating systems (mainly Windows). This version of Gegenees has been tested with the latest BLAST version. NCBI BLAST can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Download the version that matches your operating system (Windows: win64.tar.gz or, alternatively, the installer-wrapped win64.exe version; Macintosh: macosx.tar.gz; Linux: linux.tar.gz). Extract the archive file and place it in a path that does NOT contain spaces. Note that spaces are NOT allowed anywhere in the BLAST path (the directory and all parent directories).
Windows: You may need a tool such as 7-Zip to unpack the archive. If you do not have the permissions or means to unpack or install BLAST on your computer, you may extract the files on another computer and move them to the computer of interest.
Macintosh: It should be possible to uncompress the tar.gz file with e.g. the Archive Utility app. Alternatively, use the terminal extraction method (see Linux below).
Linux: Extract with the Archive Manager, or in a terminal with a command such as 'tar -zxvf ncbi-blast-2.6.0...tar.gz'. The BLAST path (which is the path to the 'bin' directory of the extracted directory structure) must then be specified in the 'File->Configure Blast path...' dialog of Gegenees.
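Before entering the BLAST path in Gegenees, it can be sanity-checked. The Python sketch below is not part of Gegenees and its function name is hypothetical. It also assumes that 'blastn' is a suitable program to look for; 'blastn' is part of BLAST+, but which BLAST programs Gegenees actually calls is not stated here.

```python
import os


def blast_path_problems(bin_dir, program="blastn"):
    """Return a list of problems found with a candidate BLAST 'bin' directory."""
    problems = []
    full_path = os.path.abspath(bin_dir)
    # Spaces are not allowed anywhere in the path, parent directories included.
    if " " in full_path:
        problems.append("path contains a space")
    if not os.path.isdir(full_path):
        problems.append("directory does not exist")
    else:
        # On Windows the executable carries an .exe suffix.
        candidates = [program, program + ".exe"]
        if not any(os.path.isfile(os.path.join(full_path, name)) for name in candidates):
            problems.append("no " + program + " executable found")
    return problems
```

An empty list means that no problem was detected by these particular checks; it does not prove that BLAST will run.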
When a new comparison has been set up (the New comparison wizard is completed), you will end up on this tab. You start the comparison by pressing 'start'. The log window and the status bar show the progress of the alignment. After some initial work, the log window will be dominated by messages such as 'Thread=12/625 (P=12) blast producing: G8_G17.result' and 'Thread=6/625 (P=12) done: OK!'. In this case there are 625 BLAST commands that need to be run (each comparing all fragments from one genome to the unfragmented sequence of another genome). The P value shows how many parallel threads are executing BLAST commands. It is based on the number of processor cores reported by the system and can be limited on the preferences page. Occasionally a thread may not produce a valid BLAST result file in reasonable time; the thread is then killed, and that particular BLAST job is started again after all other threads have finished. A few restarts are tolerable, but a large number of failures or restarts indicates that something is wrong. After some post-processing, the log window will reload its contents and start with the row 'COMPLETED!'. The data can then be explored in the other tabs. After the comparison has completed, the 'Run Alignments' tab has little function except for viewing the log file (a reduced version of the log file will be loaded). The log file can also be viewed in a text editor. It is called 'logfilecomparison.txt' and is located in the 'alignment_analysis' directory of the comparison directory.
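If you want to follow progress from the log file rather than the log window, the two message types quoted above can be parsed with a regular expression. This Python sketch assumes that all progress lines follow exactly the two quoted patterns; the field names are hypothetical, and other line formats in the log are simply ignored.

```python
import re
from typing import NamedTuple, Optional

# Matches e.g. "Thread=12/625 (P=12) blast producing: G8_G17.result"
LOG_LINE = re.compile(r"Thread=(\d+)/(\d+) \(P=(\d+)\) (blast producing|done): (.+)")


class LogEvent(NamedTuple):
    index: int      # number before the slash
    total: int      # total number of BLAST commands
    parallel: int   # the P value (parallel threads)
    kind: str       # "blast producing" or "done"
    detail: str     # result file name or status text


def parse_log_line(line: str) -> Optional[LogEvent]:
    """Parse one progress line; return None for lines in any other format."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    index, total, parallel, kind, detail = match.groups()
    return LogEvent(int(index), int(total), int(parallel), kind, detail.strip())


def count_finished(lines):
    """Count the 'done' messages in an iterable of log lines."""
    finished = 0
    for line in lines:
        event = parse_log_line(line)
        if event is not None and event.kind == "done":
            finished += 1
    return finished
```

Applied to the lines of 'logfilecomparison.txt', count_finished gives a rough idea of how many of the BLAST commands have completed so far.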
In this tab, a heatmap based on the average fragment similarity between the genomes is displayed. The color profile can be changed and the rows/columns can be sorted by similarity. Autosort sorts the heatmap table by similarity. A new sort order is given a name and stored so that it can be reloaded. A threshold can also be set so that fragments with poor alignment are filtered out. This allows the 'core genomes' to be compared. However, if too much of the genome is filtered out, the data become unreliable. How much of the genome is included at a certain threshold level can be investigated by changing from 'show score' to 'show core'. Heatmaps can be exported in different file formats for use in other programs.
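The effect of a threshold can be illustrated with a simplified Python sketch. It assumes one list of per-fragment similarity values (in percent) for a single pair of genomes, and it assumes that a fragment is kept when its value is at or above the threshold. How Gegenees applies the threshold internally may differ, and the function name is hypothetical.

```python
def thresholded_average(fragment_scores, threshold):
    """Return (average score of kept fragments, fraction of fragments kept).

    fragment_scores: per-fragment similarity values in percent (0-100).
    threshold: fragments scoring below this value are filtered out.
    """
    kept = [score for score in fragment_scores if score >= threshold]
    core_fraction = len(kept) / len(fragment_scores) if fragment_scores else 0.0
    average = sum(kept) / len(kept) if kept else 0.0
    return average, core_fraction
```

With the values [100, 90, 0, 10] and a threshold of 20, the average of the kept fragments is 95 but only half of the fragments remain, which is the trade-off between 'show score' and 'show core' described above.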
Note, genomes must be grouped into target and background groups and a reference genome in the target group must be selected to use this function (Group settings tab).
In this tab, the selected reference genome is plotted, with 'biomarker score' values plotted above and below it. The default upward plot is the (max/min) biomarker score, which means that the max value in the background group (the worst false positive) and the min value in the target group (the worst false negative) are used to calculate the score. This is the most stringent way to look at signatures: a high score requires full conservation in the target group and no trace of the sequence in the background group. The default downward plot is (max/average), which uses the average of the background group to calculate the score (less stringent). Even lower stringency is obtained if (average/average) is selected.
If you select a single genome as the target group, you will highlight what is unique for this genome.
In this tab, fragments with different ranges of biomarker score can be sorted out and analyzed. Type the range you want to investigate and then press show fragments. You may select fragments and display the actual sequence by 'Show sequence'.
The database manager can be launched through the 'Data' menu. It shows the content of the current database and allows you to change database and create new databases. Genomes can be selected and, e.g., renamed, deleted or copied between databases. Genomes can be imported from the file system (e.g., if you have FASTA files).
You can also start an FTP client window that helps you download genomes from the NCBI FTP site. The FTP client downloads a list file with the available genomes, saves it in the 'FTP' directory and re-uses it until you press 'Reload FTP content'. The bacteria lists are large, and keeping a local copy saves time. Depending on your computer and OS, the table showing the FTP content may become slow (the Linux version seemed slower during testing). You may limit the number of genomes shown by typing a filter text and then pressing Apply. This search will be 'Genomename contains filtertext'. If you want 'Genomename starts with filtertext' behaviour, start the filter text with a '^'. The filter '^A' shows all genomes beginning with an A.
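The two filter behaviours can be summarised in a short Python sketch. It assumes that matching is case-sensitive, which is not stated above. The function names and the example genome names are illustrative only.

```python
def matches_filter(genome_name, filter_text):
    """Apply the filter rule: a leading '^' means 'starts with', otherwise 'contains'."""
    if filter_text.startswith("^"):
        return genome_name.startswith(filter_text[1:])
    return filter_text in genome_name


def apply_filter(genome_names, filter_text):
    """Return the genome names that pass the filter, in their original order."""
    return [name for name in genome_names if matches_filter(name, filter_text)]


if __name__ == "__main__":
    names = ["Acinetobacter_baumannii", "Bacillus_anthracis", "Yersinia_pestis_A1122"]
    print(apply_filter(names, "^A"))  # names starting with "A"
    print(apply_filter(names, "A"))   # names containing "A" anywhere
```

In this example '^A' keeps only the first name, while 'A' also keeps the third name because it contains an 'A' further along.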
You may choose to download additional file types as well. This should only be used if you need these files for other purposes. Export functions are being developed.
You may select Overwrite mode, which erases earlier versions of the genome and replaces them with the new version.
You may select 'Perform download with minimal progress log', which may be used in large downloads to minimize the risk that the user interface interferes with the download process.
You may choose to skip the unzip step after download, to keep the files in '.gz' format. Better support for keeping the genomes in '.gz' format all the time is being developed.
If you have an unreliable connection, the FTP transfer may 'hang'. A control thread will detect this and restart the current transfer after about five minutes.
General installation troubleshooting is described in the Installation section.
Problems during comparisons:
'EXITED WITH ERROR!!! (error 0)' is an error code indicating that BLAST was not found or cannot be executed. The current version checks that it has a BLAST path before the New comparison wizard can be started, so this error is probably less frequent, but it can still occur if there is a problem with BLAST. Other BLAST exit codes include: 1 = error in query sequence (fragment), 2 = error in database sequence (unfragmented genome), 3 = error in BLAST engine.
If the comparison starts OK but then begins to produce error 0 messages, it may be related to the system setting for the maximum number of simultaneously open files. Especially if you run small genomes with many threads, the Java garbage collector may not keep up with the closing of temporary files. Increasing this limit may solve the problem.
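On Linux and Macintosh, the current limit on open files can be inspected as in the Python sketch below. It uses the standard 'resource' module, which is not available on Windows. The values reported are those of the Python process, which inherits them from the terminal it was started in, so they correspond to what 'ulimit -n' shows there. Whether Gegenees sees the same limit depends on how it is launched, which is an assumption here.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on simultaneously open files for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def describe_limit(value):
    """Render a limit value as text, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", describe_limit(soft))
    print("hard limit:", describe_limit(hard))
```

The soft limit is the one that is enforced; it can usually be raised up to the hard limit without administrator rights.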
The scores you get in Gegenees are percentages, but they are not directly translatable to percent nucleotide identity. A score is the average of the fragments' BLAST scores, each expressed as a percentage of the score the fragment would yield against itself (at 100% identity). This value may drop somewhat faster than the actual percent nucleotide identity, especially when short fragments are used. This is because some fragments will not produce a hit and will then end up with a zero score, which pulls down the overall average. On the other hand, the Gegenees score will separate phylogenetic groups at much lower values than nucleotide identity would. Using thresholds reduces this phenomenon.
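The way zero-score fragments pull down the average can be shown with a small Python sketch. It follows the description above (each fragment's BLAST score divided by its self-score, then averaged), but the function name and the numbers are purely illustrative and are not Gegenees output.

```python
def average_normalized_score(fragment_scores, self_scores):
    """Average of per-fragment BLAST scores, each as a percentage of its self-score.

    fragment_scores: score of each fragment against the other genome (0 if no hit).
    self_scores: score each fragment would get against itself (100% identity).
    """
    if len(fragment_scores) != len(self_scores) or not fragment_scores:
        raise ValueError("need two non-empty lists of equal length")
    percents = [100.0 * score / self_score
                for score, self_score in zip(fragment_scores, self_scores)]
    return sum(percents) / len(percents)


if __name__ == "__main__":
    # Three fragments at 95% of their self-score and one fragment without a hit.
    print(average_normalized_score([950, 950, 950, 0], [1000, 1000, 1000, 1000]))
```

In this made-up example three of four fragments score 95%, yet the single fragment without a hit brings the average down to 71.25%.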
The heatmap may in some cases be asymmetric. This can arise if most of the fragments in one genome are found in the other, but not the other way around. The cause can be reductive genome evolution or the presence of large plasmids. It can also be contamination in the assembly (meaning that a second, contaminating genome has been assembled in addition to the genome studied, usually with lower coverage).
Please send feedback comments and suggestions to firstname.lastname@example.org or email@example.com
A small fragment size combined with a small step size was called 'accurate' in previous versions. This referred to the possibility of finding small signature regions in genomes (e.g., unique regions for PCR primer design). In most cases the 500/500 setting is faster and may even be better (it produces fewer zero-score fragments). Exceptions may be if you are looking for small signature regions or if you are working with very small genomes.
If you are interested in very large datasets or want a fast answer, you may set a large step size (e.g. fragment size 500, step size 5000). This will speed up the comparison and produce a heatmap based on sampling the genomes (in this case around 10% of each genome). However, this will make signature analysis impossible.
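The relation between fragment size, step size and the sampled part of a genome can be worked out with a short Python sketch. It assumes that fragments start at position 0 and then at every step size, and that an incomplete fragment at the end of the sequence is dropped. How Gegenees treats the sequence end is not stated here, and the function names are hypothetical.

```python
def fragment_starts(genome_length, fragment_size, step_size):
    """Start positions of all complete fragments along a sequence."""
    return list(range(0, genome_length - fragment_size + 1, step_size))


def sampled_fraction(fragment_size, step_size):
    """Approximate fraction of the genome covered by fragments.

    If the step is not larger than the fragment, fragments touch or overlap
    and the whole genome is covered.
    """
    return min(1.0, fragment_size / step_size)


if __name__ == "__main__":
    print(sampled_fraction(500, 5000))             # the 500/5000 example above
    print(len(fragment_starts(50000, 500, 5000)))  # fragments in a 50 kb sequence
```

With fragment size 500 and step size 5000 the sampled fraction is 0.1, matching the 10% mentioned above, while the 500/500 setting covers the whole genome without overlap.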