Fragmented all-against-all comparison

Concept

A fragmented all-against-all comparison analyzes genomes by fragmenting them and comparing all pieces with all genomes. A phylogenetic dataset can be extracted from this all-against-all approach. It is also possible to group the genomes and search for unique genomic regions that have a high specificity towards a "target group".

 

Create a fragmented comparison

To create a fragmented all-all comparison, click on “New” and then “Fragmented all-all comparison” in the menu bar.

New dialog

A wizard will help you set up the comparison by first letting you choose a name and some alignment settings. The resolution of the alignment is controlled by two parameters, the "Fragment size" and the "Sliding step size". The fragment size represents the "scanning window size" and it should be smaller than the genomic region you anticipate finding in the analysis. For bacteria, we recommend settings of either 200/100 (fragment size/sliding step size), which is more accurate, or 500/500, which is much faster and usually sufficient. Smaller fragment sizes and sliding step sizes make the calculations more demanding. When working with viruses and small sequences, smaller settings may be needed. It is also possible to use tblastx, which compares sequences on the translated level (i.e. as amino acids). This is much more demanding and the datasets should be smaller, but if the sequences are phylogenetically far apart, this may be a useful operating mode. Then select the genomes from your database to include in the comparison and click “Finish”. Clicking “Finish” takes you to the analysis perspective.
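To illustrate how the fragment size and the sliding step size determine the resolution, the following is a minimal Python sketch of sliding-window fragmentation. It is not the program's own implementation: the function names are hypothetical, and dropping a trailing fragment that is shorter than the fragment size is an assumption made here for simplicity.

```python
def fragment_starts(sequence_length, fragment_size, step_size):
    """Return 0-based start positions of full-length sliding-window fragments.

    Assumption: a trailing piece shorter than fragment_size is dropped.
    """
    if fragment_size <= 0 or step_size <= 0:
        raise ValueError("fragment_size and step_size must be positive")
    last_start = sequence_length - fragment_size
    # range() is empty when the sequence is shorter than one fragment
    return list(range(0, last_start + 1, step_size))


def fragment_sequence(sequence, fragment_size, step_size):
    """Return (start, fragment) pairs for a sequence string."""
    starts = fragment_starts(len(sequence), fragment_size, step_size)
    return [(s, sequence[s:s + fragment_size]) for s in starts]


# A 1000 bp sequence with the 200/100 setting gives overlapping fragments,
# while 500/500 gives far fewer, non-overlapping fragments.
print(len(fragment_starts(1000, 200, 100)))  # 9
print(fragment_starts(1000, 500, 500))       # [0, 500]
```

With 200/100 every position is covered by up to two fragments, which is why that setting resolves shorter regions but requires more BLAST comparisons than 500/500.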

Start alignment

When you wish to start the alignment process, click the start button. The calculation progress will be shown, and the log window in the right part will show messages about what is happening. Typically, a lot of conversion and preparation messages appear first. Then a BLAST list is created and executed in parallel "threads". Each "thread" should typically take no more than a few minutes to complete. The number of simultaneously calculating threads is indicated, as well as the current thread number and the total number of threads to be run. It is possible to send a pause signal and then resume the calculation later. After all the threads have been run, some data analysis is performed and the alignment is then complete. Once an alignment is completed, it is possible to analyze the data in the other tabs in the analysis perspective. An alignment is represented on the hard drive by a folder with the prefix "alignment_analysis".

 

Analyze a fragmented comparison

The Included genomes tab

The Included genomes tab lists all genomes that are included in the alignment. It is also possible to add genomes to, or remove genomes from, a finished alignment by clicking on the "modify comparison" button at the bottom of the window.

Included genomes

 

The Group settings tab

There are two ways of defining groups in a fragmented all-all comparison, and one of them is by using the Group settings tab.

Group settings

Several group settings can be created from the same dataset (e.g. different subtypes) with the "New.." or "Make a copy..." buttons.

 

The Heat plot tab

The heat plot tab gives a phylogenomic overview of the data. The values shown are the average normalized BLAST scores of all fragments. It is also possible to define threshold values, meaning that fragments falling under the threshold are not used to calculate the average similarity value. This gives a better phylogenetic signal since the similarity value is only based on conserved genetic material (the core genome). It is also possible to see how large the core genome is at the specified threshold (select "show core" instead of "show score"). The "color profile" of the heat plot can be changed so that differences are highlighted as well as possible for the particular dataset. The number of decimals shown can also be changed. The genomes are sorted alphabetically, which is often sufficient, but other sorting options are built into the heat plot. It is possible to move a genome or a group of genomes with the "Move selection to row" field, or by right-clicking the selection and choosing "move" from the context menu. The target and background group settings can also be modified from the right-click context menu. If "Drag and drop, single sorting" is selected, genomes can be dragged with the mouse, one by one. The sort is saved, and if several sorts are wanted, new ones can be created with the "new..." button. There is also an "autosort" function that tries to minimize the score distances between the rows. There is also an export button that allows export of the phylogenomic data.

Heat plot

If a remote genome is already in the local database, it will be colored red in the remote list.
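The thresholded average shown in the heat plot, together with the core genome size, can be sketched in Python as follows. This is a simplified illustration rather than the program's actual code: the function name is hypothetical, the per-fragment scores are assumed to be already normalized to the range 0 to 1, and the core size is expressed here as the fraction of fragments kept.

```python
def similarity_and_core(normalized_scores, threshold=0.0):
    """Average the per-fragment scores of one genome against another.

    Fragments scoring below `threshold` are excluded from the average.
    Returns (average_score, core_fraction); the average is None when
    no fragment reaches the threshold.
    """
    kept = [s for s in normalized_scores if s >= threshold]
    if not kept:
        return None, 0.0
    return sum(kept) / len(kept), len(kept) / len(normalized_scores)


scores = [1.0, 0.75, 0.25, 0.0]  # hypothetical fragment scores
print(similarity_and_core(scores))                 # (0.5, 1.0)
print(similarity_and_core(scores, threshold=0.5))  # (0.875, 0.5)
```

In the example, raising the threshold removes the two poorly conserved fragments, so the similarity value rises while the core shrinks to half of the fragments.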

 

The Score overview tab

The score overview tab shows a graphic representation of the "biomarker scores". Biomarker scores are score values that rank all genomic regions (fragments) in how discriminating they are for the target group in terms of conservation (no false negatives) and uniqueness (no false positives in the background). There are three types of scores with different stringency:

The biomarker scores are drawn graphically. It is possible to compare two types of scores by drawing one upwards (from the coordinate axis) and a second downwards. The graphical view is split into two rows in order to make optimal use of the computer screen (do not confuse the upper row with the upwards score; there are upwards scores in both rows). Draft genomes can be excluded from the calculations, since they sometimes lack regions, which in some cases may disturb the analysis. When the mouse moves over the graph, the sub-sequence, fragment number and coordinate at the cursor position are shown in the "info" part to the right. The number of fragments that each pixel column on the screen represents is also indicated. It is also possible to see what percentage of the genome has a biomarker score over a certain threshold. This gives a good overview of how many genomic regions one can expect to find. It is possible to zoom into the graph by selecting a region with the right mouse button held down. If the left mouse button is used, the corresponding region is “selected”. A selection can thereafter be loaded into the tabular view for further data mining by a right click. It is possible to export the graph (as seen on the screen) as an image. It is also possible to export the data to a file that can be explored in Artemis (see section below).

Score overview
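The percentage of the genome scoring over a threshold, as reported in the score overview, can be approximated with a short Python sketch. The function name is hypothetical, and counting by fragment (rather than by base pair) is an assumption made here; the program may compute the figure differently.

```python
def percent_over_threshold(scores, threshold):
    """Percentage of fragments whose biomarker score exceeds `threshold`."""
    if not scores:
        return 0.0
    over = sum(1 for s in scores if s > threshold)
    return 100.0 * over / len(scores)


scores = [0.1, 0.5, 0.85, 0.9]  # hypothetical biomarker scores
print(percent_over_threshold(scores, 0.8))  # 50.0
```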

Viewing a signature in Artemis

It is possible to export an interesting sub-sequence from the genome (or the whole genome if it is completed) into a format that can be viewed in Artemis.

The export will end up in a directory called "export" under the workspace directory. It will be a "*.gbk" file that is essentially the same file as the original "gbk" file (if there are problems or warnings when loading the original file in Artemis, they will remain). The "gene" and "misc feature" tracks are replaced by the biomarker scores. Five files are exported.

 

The Score table tab

In the "score table" tab, there are details about the fragments representing either:

Fragments from a certain sub-sequence can also be selected. After setting the filtering range (or a combination of ranges, e.g. all fragments in the first 200 kb of sub-sequence 2 with a biomarker score over 0.8), press the "show fragments" button to load the fragments. It is possible to sort the table by clicking a column header. The type of biomarker score to be shown can be selected in the info region.

Score table

"Show sequence" displays the actual sequences of the fragments, and it is possible to fuse adjacent and overlapping fragments into continuous sequences. The sequences can be exported to a FASTA file or sent to a web page ready for a BLAST comparison at NCBI.

Show seq
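Fusing adjacent and overlapping fragments into continuous regions is a standard interval merge, sketched below in Python. This is only an illustration of the idea, not the program's own routine: the function name is hypothetical, and fragments are assumed to be given as (start, end) coordinates on the same sub-sequence with the end position exclusive.

```python
def fuse_fragments(fragments):
    """Merge adjacent or overlapping (start, end) fragments.

    `end` is exclusive, so (0, 200) and (200, 400) count as adjacent.
    Returns the merged regions sorted by start position.
    """
    merged = []
    for start, end in sorted(fragments):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


selected = [(0, 200), (100, 300), (300, 500), (800, 1000)]
print(fuse_fragments(selected))  # [(0, 500), (800, 1000)]
```

The first three fragments overlap or touch and become one 500 bp region, while the isolated fragment at position 800 stays separate.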

The detailed scores show how each fragment scores against each genome in the target and background groups. This may help to identify which particular strain is causing a cross-reaction.

Show seq

It is also possible to export the table (as it is shown) or the full data table (without filtering) as a tab-delimited text file for further analysis in e.g. a spreadsheet program.
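As an alternative to a spreadsheet program, a tab-delimited export can also be processed with a few lines of Python. The sketch below uses the standard csv module. The column names "fragment" and "score" and the sample rows are hypothetical, since the header of the exported file is not described here; adjust them to match the actual file.

```python
import csv
import io


def rows_above(handle, score_column, min_score):
    """Return rows of a tab-delimited table whose score exceeds min_score.

    `handle` is an open text file (or any file-like object); the first
    line is expected to hold the column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[score_column]) > min_score]


# Hypothetical export content; a real file would be opened with
# open(path, newline="") and passed in as `handle`.
sample = "fragment\tscore\n1\t0.95\n2\t0.40\n3\t0.81\n"
hits = rows_above(io.StringIO(sample), "score", 0.8)
print([row["fragment"] for row in hits])  # ['1', '3']
```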