Toy_example

milnus edited this page Nov 22, 2022 · 6 revisions

Toy example

In the folder Toy_example you will find two folders. One of them (Input_data in the command below) contains four files. Three are genomes from Streptococcus pyogenes: two are complete genomes and the third is a draft genome with multiple contigs. The fourth file contains seed sequences on either side of known prophage insertion sites, designated as per McShan et al. 2019. These four files can be used for a test run of Magphi with the following command:

$ Magphi -s Input_data/Seed_seqeunces.fasta -g Input_data/H293_genome.gff Input_data/5448_genome.gff Input_data/draft_genome.gff -S -md 40000 -o test_output

The above command runs Magphi with an intentionally low maximum distance and illustrates how Magphi's evidence levels can be used to understand what underlies the output.

Analysing the output files:

master_seed_evidence.csv: evidence levels ranging from 2 to 5C can be found here, indicating what to expect from a specific genome and seed sequence pair. The draft genome is broken between the seed sequences for S, resulting in an evidence level of 2. The seed sequences of O were not able to reach each other in 5448_genome, leading to an evidence level of 5A. All seed sequences of M could be connected, but no annotations were found between them, giving an evidence level of 5B. Finally, all seed sequences of H were connected and annotations were found between them, awarding these an evidence level of 5C.
A more in-depth description of evidence levels can be found in the Quick-start tab.
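As a quick reference while reading master_seed_evidence.csv, the four evidence levels that occur in this toy example can be kept in a small lookup. This is only a sketch in Python: the names TOY_EVIDENCE_LEVELS and describe_evidence are made up for illustration and are not part of Magphi, and the descriptions are paraphrased from the text above. Levels that do not occur in this example are left to the Quick-start tab.

```python
# Evidence levels seen in this toy example, paraphrased from this page.
# Hypothetical helper, not part of Magphi itself.
TOY_EVIDENCE_LEVELS = {
    "2": "genome is broken between the seed sequences",
    "5A": "seed sequences could not reach each other",
    "5B": "seed sequences connected, no annotations found between them",
    "5C": "seed sequences connected, annotations found between them",
}


def describe_evidence(level):
    """Return a short description for an evidence level such as 2 or '5c'."""
    key = str(level).strip().upper()
    return TOY_EVIDENCE_LEVELS.get(key, "level not covered in this toy example")


if __name__ == "__main__":
    for level in ("2", "5A", "5B", "5C"):
        print(level, "->", describe_evidence(level))
```

The lookup accepts both integers and strings, so a value read from the CSV can be passed in without cleaning it up first.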

seed_pairing.tsv illustrates how the input seed sequences were paired. All seed sequences were designated a single letter followed by either _1 or _2, depending on their position in the pair. The seed_pairing.tsv file indicates that all seed sequences were matched correctly: H_1 with H_2, K_1 with K_2, and so on.
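The naming convention described above (a shared letter followed by _1 or _2) can be illustrated with a few lines of Python. This is a sketch of the convention only, not Magphi's own pairing code; the function name pair_seeds is hypothetical.

```python
from collections import defaultdict


def pair_seeds(seed_names):
    """Group seed names such as 'H_1' and 'H_2' by their shared prefix.

    Illustrative only: assumes every name ends in '_1' or '_2'.
    """
    groups = defaultdict(list)
    for name in seed_names:
        prefix, _, suffix = name.rpartition("_")
        if not prefix or suffix not in {"1", "2"}:
            raise ValueError(f"unexpected seed name: {name!r}")
        groups[prefix].append(name)
    # Sort members so that _1 always comes before _2 within a pair.
    return {prefix: sorted(members) for prefix, members in groups.items()}


if __name__ == "__main__":
    print(pair_seeds(["H_1", "K_2", "H_2", "K_1"]))
```

A prefix that ends up with anything other than two members would point to a seed sequence that is missing its partner in the input file.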

inter_seed_distance.csv gives the distance between seed sequences, if they can be connected within the maximum allowed distance. Here it can be useful to look for gaps, as in seed sequence pair O for 5448_genome. It can also be useful to look for anomalies such as H and K for 5448_genome, where the distances are considerably larger than those for the same seed sequence pairs in the remaining two genomes.
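The two checks described above, looking for gaps and for unusually large distances, could be sketched in Python as follows. Everything here is an assumption made for illustration: the function flag_distances, the ratio threshold of 2.0, the use of None to mark a pair that was not connected, and the distances in the usage example, which are invented and are not the values Magphi reports for these genomes. The layout of inter_seed_distance.csv is not described on this page, so loading the file is left out.

```python
from statistics import median


def flag_distances(distances, ratio=2.0):
    """Flag gaps and unusually large distances for one seed sequence pair.

    distances maps a genome name to a distance, or to None when the seed
    sequences were not connected (an assumed encoding, not Magphi's).
    A genome is reported as an outlier when its distance exceeds `ratio`
    times the median distance of the other genomes.
    """
    gaps = [genome for genome, dist in distances.items() if dist is None]
    known = {g: d for g, d in distances.items() if d is not None}
    outliers = []
    for genome, dist in known.items():
        others = [d for g, d in known.items() if g != genome]
        if others and dist > ratio * median(others):
            outliers.append(genome)
    return gaps, outliers


if __name__ == "__main__":
    # Invented distances, for illustration only.
    example = {"H293_genome": 1000, "5448_genome": 30000, "draft_genome": 1200}
    print(flag_distances(example))
```

Comparing each genome against the others, rather than against a median that includes its own value, keeps the check meaningful when only two or three genomes are available, as in this toy example.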

contig_hit_matrix.csv is usually a quick check or a last resort for understanding the results. In this case every seed sequence pair has two hits, one for each seed sequence in the pair.

Seed sequence output folders: each seed sequence pair will have an output folder associated with it, named according to the first column of seed_pairing.tsv. Each folder will contain complete .fasta files for each connected pair of seed sequences, and a .gff file if annotations are found between them.

Outputs from a successful run are available for comparison and to help you get a handle on the output from Magphi.