Inspect BioProspector Results

The predict_promoter_signal.py workflow runs BioProspector N times and generates 3 main output files, plus N raw files (BioProspector's direct program output, one per run). This notebook demonstrates how to inspect these output files and interpret their contents.

  1. Raw BioProspector output file
  2. BioProspector Summary file
  3. BioProspector Margin of Victory file
  4. BioProspector best promoter selection file
In [1]:
import altair as alt
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from textwrap import wrap
import warnings; warnings.simplefilter('ignore')

import sys
sys.path.append('../') # use modules in main directory

import bioprospector_utils as bu
from bioprospector_utils import BioProspectorResult, SeqMotifMatch_2B, Motif_2B
import consensus_viz_utils as cu
In [2]:
# indicate the bioprospector output files and 
# directories created during predict_promoter_signal.py

promoter_f =       '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc.fa'

biop_raw_dir =     '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_BIOP_RAW/'
biop_summary_f =   '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_SUMMARY.tsv'
biop_mov_f =       '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_TOP_3_MOV.tsv'
biop_selection_f = '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_SELECTION.fa'
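The four output paths above share one run prefix and differ only in their suffix. As a minimal sketch, assuming the suffixes seen in this example (`_BIOP_RAW/`, `_SUMMARY.tsv`, `_TOP_3_MOV.tsv`, `_SELECTION.fa`) hold for other runs, a small helper can build them from the prefix. The function name `biop_output_paths` is hypothetical and not part of bioprospector_utils, and the `3` in `TOP_3_MOV` may well depend on a workflow setting.

```python
import os

def biop_output_paths(outdir, run_prefix):
    """Build the expected output paths for one predict_promoter_signal.py run.

    Hypothetical helper: the suffixes are copied from the example paths in
    this notebook and are assumed, not guaranteed, to apply to other runs.
    """
    base = os.path.join(outdir, run_prefix)
    return {
        "raw_dir": base + "_BIOP_RAW/",
        "summary": base + "_SUMMARY.tsv",
        "mov": base + "_TOP_3_MOV.tsv",
        "selection": base + "_SELECTION.fa",
    }

# Example with the run prefix used in this notebook
paths = biop_output_paths(
    "../example_outdir",
    "loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841",
)
# Report which of the expected outputs are actually present on disk
for label, path in paths.items():
    print(label, "found" if os.path.exists(path) else "missing")
```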

1. Examine raw BioProspector output file

Each run of BioProspector produces a raw file, which predict_promoter_signal.py parses to extract information about the sequence matches to the motifs it finds. The raw .txt file is fairly human-readable on its own; this section shows some convenient functions for inspecting it via custom Python data structures.

In [3]:
# pick one of the raw BioProspector files from the raw folder
biop_raw_f = os.path.join(biop_raw_dir, 'biop_run1.txt')

# load it into a BioProspectorResult object
biop_result = BioProspectorResult(biop_raw_f,promoter_f)
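Since the workflow writes one raw file per run, it can help to list what is in the raw folder before picking one. The sketch below assumes every raw file follows the `biop_run<N>.txt` naming seen above (only `biop_run1.txt` is confirmed here). The helper `list_raw_runs` is hypothetical and not part of bioprospector_utils. It sorts by run number so that run 10 comes after run 2.

```python
import os
import re

def list_raw_runs(raw_dir):
    """Return (run_number, path) pairs for files named like 'biop_run<N>.txt'.

    Hypothetical helper; the naming pattern is an assumption based on the
    single file name used in this notebook.
    """
    pattern = re.compile(r"biop_run(\d+)\.txt$")
    runs = []
    for name in os.listdir(raw_dir):
        match = pattern.match(name)
        if match:
            runs.append((int(match.group(1)), os.path.join(raw_dir, name)))
    # Tuples sort by run number first, giving numeric rather than lexical order
    return sorted(runs)
```

For example, `list_raw_runs(biop_raw_dir)[:3]` would give the first three runs, and any of the returned paths can be passed to `BioProspectorResult` together with `promoter_f`.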
In [4]:
print("This BioProspector Result file reports:")
print(f"* {len(biop_result.motifs)} found motifs")

# view motif logos
biop_result.view_motifs()
This BioProspector Result file reports:
* 5 found motifs
In [5]:
# View further details of the consensus and location of 
# where this motif was identified in each of the input sequences
for m in biop_result.motifs:
    m.view_motif()
    m.pprint()
Motif 1
Block 1: GCTTGA (TCAAGC)
Block 2: CCTATA (TATAGG)
Score: 2.224, Sites: 33

Number of Seq Matches: 33


Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 1 match instance #1
Length: 141, Block 1: 43, Block 2: 64
GTCTGG -- (15) -- TTTCTA
[ 93 ] -- ---- -- [ 72 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 1 match instance #1
Length: 300, Block 1: 253, Block 2: 275
GATTGT -- (16) -- CCTTTA
[ 42 ] -- ---- -- [ 20 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 1 match instance #2
Length: 300, Block 1: 147, Block 2: 168
GCTTAA -- (15) -- CCTGTA
[ 148 ] -- ---- -- [ 127 ]


Seq: EQU24_RS19765|rnpB|RNase P RNA component class A
Motif 1 match instance #1
Length: 175, Block 1: 45, Block 2: 66
GCTTGA -- (15) -- TCGCAA
[ 125 ] -- ---- -- [ 104 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 1 match instance #1
Length: 300, Block 1: 34, Block 2: 55
GACTAT -- (15) -- CCTATA
[ 261 ] -- ---- -- [ 240 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 1 match instance #2
Length: 300, Block 1: 262, Block 2: 285
GACTAA -- (17) -- TTTTTA
[ 33 ] -- ---- -- [ 10 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 1 match instance #1
Length: 300, Block 1: 250, Block 2: 272
ACTTGA -- (16) -- CTTATA
[ 45 ] -- ---- -- [ 23 ]


Seq: EQU24_RS15745||cold-shock protein
Motif 1 match instance #1
Length: 136, Block 1: 36, Block 2: 58
GACTGA -- (16) -- CTGATT
[ 95 ] -- ---- -- [ 73 ]


Seq: EQU24_RS21665|trxA|thioredoxin TrxA
Motif 1 match instance #1
Length: 93, Block 1: 35, Block 2: 59
GATTGA -- (18) -- CCTATA
[ 53 ] -- ---- -- [ 29 ]


Seq: EQU24_RS19105|rpsT|30S ribosomal protein S20
Motif 1 match instance #1
Length: 176, Block 1: 85, Block 2: 109
GACTAT -- (18) -- TTGACA
[ 86 ] -- ---- -- [ 62 ]


Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 1 match instance #1
Length: 300, Block 1: 37, Block 2: 59
CCTTGG -- (16) -- TTGTCA
[ 258 ] -- ---- -- [ 236 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 1 match instance #1
Length: 300, Block 1: 23, Block 2: 44
GCTTAT -- (15) -- CTTATA
[ 272 ] -- ---- -- [ 251 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 1 match instance #2
Length: 300, Block 1: 54, Block 2: 77
GTATGT -- (17) -- TTTTCA
[ 241 ] -- ---- -- [ 218 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 1 match instance #1
Length: 300, Block 1: 129, Block 2: 153
CCCTGG -- (18) -- TCTATA
[ 166 ] -- ---- -- [ 142 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 1 match instance #2
Length: 300, Block 1: 185, Block 2: 208
GTCTAG -- (17) -- CTGATC
[ 110 ] -- ---- -- [ 87 ]


Seq: EQU24_RS21565||transaldolase
Motif 1 match instance #1
Length: 300, Block 1: 15, Block 2: 36
GTATAT -- (15) -- TTGATA
[ 280 ] -- ---- -- [ 259 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 1 match instance #1
Length: 300, Block 1: 60, Block 2: 84
GCTTGA -- (18) -- CCGACA
[ 235 ] -- ---- -- [ 211 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 1 match instance #2
Length: 300, Block 1: 169, Block 2: 193
GCTTGT -- (18) -- CCGTTT
[ 126 ] -- ---- -- [ 102 ]


Seq: EQU24_RS15705||cold-shock protein
Motif 1 match instance #1
Length: 300, Block 1: 148, Block 2: 172
CCATGG -- (18) -- TCGATA
[ 147 ] -- ---- -- [ 123 ]


Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 1 match instance #1
Length: 300, Block 1: 148, Block 2: 171
CACTAG -- (17) -- TTGACA
[ 147 ] -- ---- -- [ 124 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 1 match instance #1
Length: 300, Block 1: 92, Block 2: 116
CCCTGT -- (18) -- CCGTCA
[ 203 ] -- ---- -- [ 179 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 1 match instance #2
Length: 300, Block 1: 163, Block 2: 187
GTATGA -- (18) -- CCTCTA
[ 132 ] -- ---- -- [ 108 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 1 match instance #1
Length: 182, Block 1: 64, Block 2: 86
GCTTGA -- (16) -- TTGATA
[ 113 ] -- ---- -- [ 91 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 1 match instance #2
Length: 182, Block 1: 26, Block 2: 49
GCTTGA -- (17) -- TCTGTA
[ 151 ] -- ---- -- [ 128 ]


Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 1 match instance #1
Length: 129, Block 1: 37, Block 2: 59
GCATGA -- (16) -- CCGTAA
[ 87 ] -- ---- -- [ 65 ]


Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 1 match instance #1
Length: 160, Block 1: 34, Block 2: 55
AACTGT -- (15) -- TCTTTA
[ 121 ] -- ---- -- [ 100 ]


Seq: EQU24_RS22110||hypothetical protein
Motif 1 match instance #1
Length: 205, Block 1: 53, Block 2: 77
ACTTGG -- (18) -- TTTTTA
[ 147 ] -- ---- -- [ 123 ]


Seq: EQU24_RS07390|rpmI|50S ribosomal protein L35
Motif 1 match instance #1
Length: 148, Block 1: 52, Block 2: 75
GCTTAG -- (17) -- TCTTCA
[ 91 ] -- ---- -- [ 68 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 1 match instance #1
Length: 300, Block 1: 29, Block 2: 50
CCCTGT -- (15) -- CCGAAC
[ 266 ] -- ---- -- [ 245 ]


Seq: EQU24_RS18060|rplM|50S ribosomal protein L13
Motif 1 match instance #1
Length: 131, Block 1: 19, Block 2: 40
GCATGA -- (15) -- CCTTCA
[ 107 ] -- ---- -- [ 86 ]


Seq: EQU24_RS12095||cytochrome c
Motif 1 match instance #1
Length: 300, Block 1: 222, Block 2: 243
CCCTAT -- (15) -- CTTTTA
[ 73 ] -- ---- -- [ 52 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 1 match instance #1
Length: 286, Block 1: 244, Block 2: 267
CTCTGT -- (17) -- CCGTTA
[ 37 ] -- ---- -- [ 14 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 1 match instance #2
Length: 286, Block 1: 156, Block 2: 177
GCATAG -- (15) -- CTTTCA
[ 125 ] -- ---- -- [ 104 ]

Motif 2
Block 1: TTGTAG (CTACAA)
Block 2: TTATAG (CTATAA)
Score: 2.209, Sites: 33

Number of Seq Matches: 33


Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 2 match instance #1
Length: 141, Block 1: 65, Block 2: 89
TTCTAA -- (18) -- CTATTG
[ 71 ] -- ---- -- [ 47 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 2 match instance #1
Length: 300, Block 1: 29, Block 2: 51
TTGACA -- (16) -- CTATTG
[ 266 ] -- ---- -- [ 244 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 2 match instance #2
Length: 300, Block 1: 237, Block 2: 258
TTATAG -- (15) -- TTATAG
[ 58 ] -- ---- -- [ 37 ]


Seq: EQU24_RS19765|rnpB|RNase P RNA component class A
Motif 2 match instance #1
Length: 175, Block 1: 47, Block 2: 69
TTGACA -- (16) -- CAATAT
[ 123 ] -- ---- -- [ 101 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 2 match instance #1
Length: 300, Block 1: 133, Block 2: 157
TTGAAA -- (18) -- TTTTAG
[ 162 ] -- ---- -- [ 138 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 2 match instance #1
Length: 300, Block 1: 8, Block 2: 29
TTGTCG -- (15) -- TTAGAT
[ 287 ] -- ---- -- [ 266 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 2 match instance #2
Length: 300, Block 1: 59, Block 2: 83
TCAAAG -- (18) -- CTATAG
[ 236 ] -- ---- -- [ 212 ]


Seq: EQU24_RS15745||cold-shock protein
Motif 2 match instance #1
Length: 136, Block 1: 38, Block 2: 60
CTGAAA -- (16) -- GATTAG
[ 93 ] -- ---- -- [ 71 ]


Seq: EQU24_RS21665|trxA|thioredoxin TrxA
Motif 2 match instance #1
Length: 93, Block 1: 37, Block 2: 60
TTGACA -- (17) -- CTATAG
[ 51 ] -- ---- -- [ 28 ]


Seq: EQU24_RS19105|rpsT|30S ribosomal protein S20
Motif 2 match instance #1
Length: 176, Block 1: 91, Block 2: 113
TTCTCA -- (16) -- CAATAG
[ 80 ] -- ---- -- [ 58 ]


Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 2 match instance #1
Length: 300, Block 1: 187, Block 2: 210
TTCTCG -- (17) -- CTTTCG
[ 108 ] -- ---- -- [ 85 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 2 match instance #1
Length: 300, Block 1: 194, Block 2: 215
CTCTAA -- (15) -- CTATAT
[ 101 ] -- ---- -- [ 80 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 2 match instance #2
Length: 300, Block 1: 1, Block 2: 25
TTGTAA -- (18) -- TTATAT
[ 294 ] -- ---- -- [ 270 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 2 match instance #3
Length: 300, Block 1: 159, Block 2: 180
TCCTAA -- (15) -- TTATTG
[ 136 ] -- ---- -- [ 115 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 2 match instance #1
Length: 300, Block 1: 264, Block 2: 286
TTCTAA -- (16) -- CAATTG
[ 31 ] -- ---- -- [ 9 ]


Seq: EQU24_RS21565||transaldolase
Motif 2 match instance #1
Length: 300, Block 1: 186, Block 2: 209
TCGACG -- (17) -- TTTTCG
[ 109 ] -- ---- -- [ 86 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 2 match instance #1
Length: 300, Block 1: 235, Block 2: 257
TTCAAA -- (16) -- TAATCT
[ 60 ] -- ---- -- [ 38 ]


Seq: EQU24_RS15705||cold-shock protein
Motif 2 match instance #1
Length: 300, Block 1: 169, Block 2: 193
TCCTCG -- (18) -- TTTTAT
[ 126 ] -- ---- -- [ 102 ]


Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 2 match instance #1
Length: 300, Block 1: 159, Block 2: 181
TCATAA -- (16) -- TTTTCG
[ 136 ] -- ---- -- [ 114 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 2 match instance #1
Length: 300, Block 1: 231, Block 2: 253
TTGTAA -- (16) -- CAATAG
[ 64 ] -- ---- -- [ 42 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 2 match instance #2
Length: 300, Block 1: 77, Block 2: 99
CTCTCG -- (16) -- TTTTCT
[ 218 ] -- ---- -- [ 196 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 2 match instance #1
Length: 182, Block 1: 50, Block 2: 72
CTGTAG -- (16) -- CTTTAT
[ 127 ] -- ---- -- [ 105 ]


Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 2 match instance #1
Length: 129, Block 1: 89, Block 2: 110
TTGTCG -- (15) -- TTTTTG
[ 35 ] -- ---- -- [ 14 ]


Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 2 match instance #2
Length: 129, Block 1: 59, Block 2: 80
CCGTAA -- (15) -- TTTGTG
[ 65 ] -- ---- -- [ 44 ]


Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 2 match instance #1
Length: 160, Block 1: 36, Block 2: 58
CTGTCG -- (16) -- TTATAT
[ 119 ] -- ---- -- [ 97 ]


Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 2 match instance #2
Length: 160, Block 1: 95, Block 2: 117
TCGAAG -- (16) -- CATTTG
[ 60 ] -- ---- -- [ 38 ]


Seq: EQU24_RS22110||hypothetical protein
Motif 2 match instance #1
Length: 205, Block 1: 166, Block 2: 189
CTGAAT -- (17) -- CATGAG
[ 34 ] -- ---- -- [ 11 ]


Seq: EQU24_RS07390|rpmI|50S ribosomal protein L35
Motif 2 match instance #1
Length: 148, Block 1: 113, Block 2: 134
CTCAAG -- (15) -- TAATCG
[ 30 ] -- ---- -- [ 9 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 2 match instance #1
Length: 300, Block 1: 229, Block 2: 250
CTGACT -- (15) -- TTTGAG
[ 66 ] -- ---- -- [ 45 ]


Seq: EQU24_RS18060|rplM|50S ribosomal protein L13
Motif 2 match instance #1
Length: 131, Block 1: 79, Block 2: 101
CTCTAG -- (16) -- TTATTG
[ 47 ] -- ---- -- [ 25 ]


Seq: EQU24_RS12095||cytochrome c
Motif 2 match instance #1
Length: 300, Block 1: 11, Block 2: 35
CTCACG -- (18) -- CTTGCT
[ 284 ] -- ---- -- [ 260 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 2 match instance #1
Length: 286, Block 1: 205, Block 2: 227
TTCACG -- (16) -- TATTAG
[ 76 ] -- ---- -- [ 54 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 2 match instance #2
Length: 286, Block 1: 10, Block 2: 34
TCCAAG -- (18) -- TATGAG
[ 271 ] -- ---- -- [ 247 ]

Motif 3
Block 1: GCTTGA (TCAAGC)
Block 2: TCTATG (CATAGA)
Score: 2.199, Sites: 32

Number of Seq Matches: 32


Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 3 match instance #1
Length: 141, Block 1: 43, Block 2: 66
GTCTGG -- (17) -- TCTAAG
[ 93 ] -- ---- -- [ 70 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 3 match instance #1
Length: 300, Block 1: 140, Block 2: 161
GTTTGA -- (15) -- GCTTTG
[ 155 ] -- ---- -- [ 134 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 3 match instance #2
Length: 300, Block 1: 27, Block 2: 50
GTTTGA -- (17) -- GCTATT
[ 268 ] -- ---- -- [ 245 ]


Seq: EQU24_RS19765|rnpB|RNase P RNA component class A
Motif 3 match instance #1
Length: 175, Block 1: 45, Block 2: 68
GCTTGA -- (17) -- GCAATA
[ 125 ] -- ---- -- [ 102 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 3 match instance #1
Length: 300, Block 1: 131, Block 2: 153
GCTTGA -- (16) -- GATGTT
[ 164 ] -- ---- -- [ 142 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 3 match instance #1
Length: 300, Block 1: 69, Block 2: 91
CCGTGA -- (16) -- TCTATG
[ 226 ] -- ---- -- [ 204 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 3 match instance #2
Length: 300, Block 1: 250, Block 2: 274
ACTTGA -- (18) -- TATAAA
[ 45 ] -- ---- -- [ 21 ]


Seq: EQU24_RS15745||cold-shock protein
Motif 3 match instance #1
Length: 136, Block 1: 36, Block 2: 60
GACTGA -- (18) -- GATTAG
[ 95 ] -- ---- -- [ 71 ]


Seq: EQU24_RS21665|trxA|thioredoxin TrxA
Motif 3 match instance #1
Length: 93, Block 1: 35, Block 2: 58
GATTGA -- (17) -- TCCTAT
[ 53 ] -- ---- -- [ 30 ]


Seq: EQU24_RS19105|rpsT|30S ribosomal protein S20
Motif 3 match instance #1
Length: 176, Block 1: 30, Block 2: 51
GCATCG -- (15) -- TCATTG
[ 141 ] -- ---- -- [ 120 ]


Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 3 match instance #1
Length: 300, Block 1: 12, Block 2: 36
GCGTCA -- (18) -- GCCTTG
[ 283 ] -- ---- -- [ 259 ]


Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 3 match instance #2
Length: 300, Block 1: 80, Block 2: 101
GTTTCA -- (15) -- TACAAG
[ 215 ] -- ---- -- [ 194 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 3 match instance #1
Length: 300, Block 1: 34, Block 2: 55
ATATGA -- (15) -- TATGTG
[ 261 ] -- ---- -- [ 240 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 3 match instance #1
Length: 300, Block 1: 129, Block 2: 153
CCCTGG -- (18) -- TCTATA
[ 166 ] -- ---- -- [ 142 ]


Seq: EQU24_RS21565||transaldolase
Motif 3 match instance #1
Length: 300, Block 1: 215, Block 2: 237
GCATGA -- (16) -- GCCGTG
[ 80 ] -- ---- -- [ 58 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 3 match instance #1
Length: 300, Block 1: 27, Block 2: 51
GCTTGA -- (18) -- TCCATG
[ 268 ] -- ---- -- [ 244 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 3 match instance #2
Length: 300, Block 1: 153, Block 2: 174
ACTTGG -- (15) -- TCTAAG
[ 142 ] -- ---- -- [ 121 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 3 match instance #3
Length: 300, Block 1: 222, Block 2: 246
GTGTGG -- (18) -- TCTTTT
[ 73 ] -- ---- -- [ 49 ]


Seq: EQU24_RS15705||cold-shock protein
Motif 3 match instance #1
Length: 300, Block 1: 71, Block 2: 94
CCATGG -- (17) -- TCCATG
[ 224 ] -- ---- -- [ 201 ]


Seq: EQU24_RS15705||cold-shock protein
Motif 3 match instance #2
Length: 300, Block 1: 262, Block 2: 283
GCGTCA -- (15) -- TCAGTA
[ 33 ] -- ---- -- [ 12 ]


Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 3 match instance #1
Length: 300, Block 1: 37, Block 2: 61
GAATGA -- (18) -- GCTTTT
[ 258 ] -- ---- -- [ 234 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 3 match instance #1
Length: 300, Block 1: 198, Block 2: 221
GATTCG -- (17) -- GCTAAG
[ 97 ] -- ---- -- [ 74 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 3 match instance #1
Length: 182, Block 1: 26, Block 2: 49
GCTTGA -- (17) -- TCTGTA
[ 151 ] -- ---- -- [ 128 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 3 match instance #2
Length: 182, Block 1: 64, Block 2: 88
GCTTGA -- (18) -- GATATA
[ 113 ] -- ---- -- [ 89 ]


Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 3 match instance #1
Length: 129, Block 1: 37, Block 2: 58
GCATGA -- (15) -- TCCGTA
[ 87 ] -- ---- -- [ 66 ]


Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 3 match instance #1
Length: 160, Block 1: 36, Block 2: 59
CTGTCG -- (17) -- TATATG
[ 119 ] -- ---- -- [ 96 ]


Seq: EQU24_RS22110||hypothetical protein
Motif 3 match instance #1
Length: 205, Block 1: 141, Block 2: 165
ATTTGA -- (18) -- GCTGAA
[ 59 ] -- ---- -- [ 35 ]


Seq: EQU24_RS07390|rpmI|50S ribosomal protein L35
Motif 3 match instance #1
Length: 148, Block 1: 6, Block 2: 28
GCGTGG -- (16) -- GCTAGG
[ 137 ] -- ---- -- [ 115 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 3 match instance #1
Length: 300, Block 1: 200, Block 2: 222
AATTCA -- (16) -- TCTATA
[ 95 ] -- ---- -- [ 73 ]


Seq: EQU24_RS18060|rplM|50S ribosomal protein L13
Motif 3 match instance #1
Length: 131, Block 1: 19, Block 2: 43
GCATGA -- (18) -- TCAGAA
[ 107 ] -- ---- -- [ 83 ]


Seq: EQU24_RS12095||cytochrome c
Motif 3 match instance #1
Length: 300, Block 1: 113, Block 2: 137
GAATGA -- (18) -- GCTTTA
[ 182 ] -- ---- -- [ 158 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 3 match instance #1
Length: 286, Block 1: 203, Block 2: 227
GCTTCA -- (18) -- TATTAG
[ 78 ] -- ---- -- [ 54 ]

Motif 4
Block 1: CTGACA (TGTCAG)
Block 2: TATAGT (ACTATA)
Score: 2.197, Sites: 31

Number of Seq Matches: 31


Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 4 match instance #1
Length: 141, Block 1: 6, Block 2: 28
GTGATA -- (16) -- TATAAT
[ 130 ] -- ---- -- [ 108 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 4 match instance #1
Length: 300, Block 1: 29, Block 2: 52
TTGACA -- (17) -- TATTGT
[ 266 ] -- ---- -- [ 243 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 4 match instance #2
Length: 300, Block 1: 236, Block 2: 259
GTTATA -- (17) -- TATAGT
[ 59 ] -- ---- -- [ 36 ]


Seq: EQU24_RS19765|rnpB|RNase P RNA component class A
Motif 4 match instance #1
Length: 175, Block 1: 143, Block 2: 164
CCGACA -- (15) -- TATAAT
[ 27 ] -- ---- -- [ 6 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 4 match instance #1
Length: 300, Block 1: 23, Block 2: 46
GTGAAT -- (17) -- AATACC
[ 272 ] -- ---- -- [ 249 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 4 match instance #1
Length: 300, Block 1: 17, Block 2: 38
CTTATA -- (15) -- AATTCG
[ 278 ] -- ---- -- [ 257 ]


Seq: EQU24_RS15745||cold-shock protein
Motif 4 match instance #1
Length: 136, Block 1: 25, Block 2: 48
TTTATA -- (17) -- AATTGC
[ 106 ] -- ---- -- [ 83 ]


Seq: EQU24_RS21665|trxA|thioredoxin TrxA
Motif 4 match instance #1
Length: 93, Block 1: 37, Block 2: 61
TTGACA -- (18) -- TATAGT
[ 51 ] -- ---- -- [ 27 ]


Seq: EQU24_RS19105|rpsT|30S ribosomal protein S20
Motif 4 match instance #1
Length: 176, Block 1: 93, Block 2: 114
CTCACA -- (15) -- AATAGT
[ 78 ] -- ---- -- [ 57 ]


Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 4 match instance #1
Length: 300, Block 1: 236, Block 2: 257
CTGACA -- (15) -- AAAAGC
[ 59 ] -- ---- -- [ 38 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 4 match instance #1
Length: 300, Block 1: 24, Block 2: 46
CTTATA -- (16) -- TATAAG
[ 271 ] -- ---- -- [ 249 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 4 match instance #1
Length: 300, Block 1: 194, Block 2: 218
TTTACT -- (18) -- TATAGT
[ 101 ] -- ---- -- [ 77 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 4 match instance #2
Length: 300, Block 1: 266, Block 2: 287
CTAACA -- (15) -- AATTGG
[ 29 ] -- ---- -- [ 8 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 4 match instance #3
Length: 300, Block 1: 60, Block 2: 81
CTAAAA -- (15) -- AATAGT
[ 235 ] -- ---- -- [ 214 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 4 match instance #4
Length: 300, Block 1: 131, Block 2: 155
CTGGAA -- (18) -- TATACT
[ 164 ] -- ---- -- [ 140 ]


Seq: EQU24_RS21565||transaldolase
Motif 4 match instance #1
Length: 300, Block 1: 71, Block 2: 94
CCGATA -- (17) -- AATACT
[ 224 ] -- ---- -- [ 201 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 4 match instance #1
Length: 300, Block 1: 181, Block 2: 203
TTTATA -- (16) -- TATAAT
[ 114 ] -- ---- -- [ 92 ]


Seq: EQU24_RS15705||cold-shock protein
Motif 4 match instance #1
Length: 300, Block 1: 172, Block 2: 196
TCGATA -- (18) -- TATTCT
[ 123 ] -- ---- -- [ 99 ]


Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 4 match instance #1
Length: 300, Block 1: 171, Block 2: 193
TTGACA -- (16) -- TAAACT
[ 124 ] -- ---- -- [ 102 ]


Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 4 match instance #2
Length: 300, Block 1: 217, Block 2: 241
CTGGTA -- (18) -- TAAAGG
[ 78 ] -- ---- -- [ 54 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 4 match instance #1
Length: 300, Block 1: 233, Block 2: 254
GTAACA -- (15) -- AATAGG
[ 62 ] -- ---- -- [ 41 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 4 match instance #1
Length: 182, Block 1: 16, Block 2: 40
GTTAAA -- (18) -- AATACT
[ 161 ] -- ---- -- [ 137 ]


Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 4 match instance #1
Length: 129, Block 1: 94, Block 2: 117
GCGAAA -- (17) -- TATAGG
[ 30 ] -- ---- -- [ 7 ]


Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 4 match instance #1
Length: 160, Block 1: 113, Block 2: 137
TTAACA -- (18) -- AATTGT
[ 42 ] -- ---- -- [ 18 ]


Seq: EQU24_RS22110||hypothetical protein
Motif 4 match instance #1
Length: 205, Block 1: 42, Block 2: 66
CCTACA -- (18) -- TATAAG
[ 158 ] -- ---- -- [ 134 ]


Seq: EQU24_RS07390|rpmI|50S ribosomal protein L35
Motif 4 match instance #1
Length: 148, Block 1: 85, Block 2: 109
CTTGCA -- (18) -- AAATCT
[ 58 ] -- ---- -- [ 34 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 4 match instance #1
Length: 300, Block 1: 124, Block 2: 148
CTTACT -- (18) -- AATAGT
[ 171 ] -- ---- -- [ 147 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 4 match instance #2
Length: 300, Block 1: 156, Block 2: 180
TTGACA -- (18) -- AAAAAT
[ 139 ] -- ---- -- [ 115 ]


Seq: EQU24_RS18060|rplM|50S ribosomal protein L13
Motif 4 match instance #1
Length: 131, Block 1: 81, Block 2: 102
CTAGTA -- (15) -- TATTGG
[ 45 ] -- ---- -- [ 24 ]


Seq: EQU24_RS12095||cytochrome c
Motif 4 match instance #1
Length: 300, Block 1: 201, Block 2: 225
TTTGAA -- (18) -- TATTCT
[ 94 ] -- ---- -- [ 70 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 4 match instance #1
Length: 286, Block 1: 190, Block 2: 214
GTAATA -- (18) -- AATAAC
[ 91 ] -- ---- -- [ 67 ]

Motif 5
Block 1: GATTAA (TTAATC)
Block 2: CTTTAT (ATAAAG)
Score: 2.193, Sites: 34

Number of Seq Matches: 34


Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 5 match instance #1
Length: 141, Block 1: 105, Block 2: 127
GATTTA -- (16) -- GTTGAG
[ 31 ] -- ---- -- [ 9 ]


Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 5 match instance #2
Length: 141, Block 1: 27, Block 2: 49
GTATAA -- (16) -- AATGAG
[ 109 ] -- ---- -- [ 87 ]


Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 5 match instance #1
Length: 300, Block 1: 147, Block 2: 169
GCTTAA -- (16) -- CTGTAT
[ 148 ] -- ---- -- [ 126 ]


Seq: EQU24_RS19765|rnpB|RNase P RNA component class A
Motif 5 match instance #1
Length: 175, Block 1: 45, Block 2: 69
GCTTGA -- (18) -- CAATAT
[ 125 ] -- ---- -- [ 101 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 5 match instance #1
Length: 300, Block 1: 271, Block 2: 295
GATTTA -- (18) -- TTTGAG
[ 24 ] -- ---- -- [ 0 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 5 match instance #2
Length: 300, Block 1: 156, Block 2: 177
GTTTTA -- (15) -- CTGTAG
[ 139 ] -- ---- -- [ 118 ]


Seq: EQU24_RS03495||cold-shock protein
Motif 5 match instance #3
Length: 300, Block 1: 120, Block 2: 143
GAATAA -- (17) -- ATATAT
[ 175 ] -- ---- -- [ 152 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 5 match instance #1
Length: 300, Block 1: 264, Block 2: 287
GCATAA -- (17) -- CTGGAG
[ 31 ] -- ---- -- [ 8 ]


Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 5 match instance #2
Length: 300, Block 1: 230, Block 2: 251
GTATAC -- (15) -- CTTGAT
[ 65 ] -- ---- -- [ 44 ]


Seq: EQU24_RS15745||cold-shock protein
Motif 5 match instance #1
Length: 136, Block 1: 52, Block 2: 73
GCTTTC -- (15) -- TATTAT
[ 79 ] -- ---- -- [ 58 ]


Seq: EQU24_RS21665|trxA|thioredoxin TrxA
Motif 5 match instance #1
Length: 93, Block 1: 35, Block 2: 58
GATTGA -- (17) -- TCCTAT
[ 53 ] -- ---- -- [ 30 ]


Seq: EQU24_RS19105|rpsT|30S ribosomal protein S20
Motif 5 match instance #1
Length: 176, Block 1: 121, Block 2: 144
GCACTA -- (17) -- TCTTAT
[ 50 ] -- ---- -- [ 27 ]


Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 5 match instance #1
Length: 300, Block 1: 51, Block 2: 74
GTTTTA -- (17) -- AATGAG
[ 244 ] -- ---- -- [ 221 ]


Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 5 match instance #1
Length: 300, Block 1: 193, Block 2: 215
GCTCTA -- (16) -- CTATAT
[ 102 ] -- ---- -- [ 80 ]


Seq: EQU24_RS15535||hypothetical protein
Motif 5 match instance #1
Length: 300, Block 1: 210, Block 2: 232
GATCAA -- (16) -- ATTCAG
[ 85 ] -- ---- -- [ 63 ]


Seq: EQU24_RS21565||transaldolase
Motif 5 match instance #1
Length: 300, Block 1: 194, Block 2: 216
GAATTA -- (16) -- CATGAG
[ 101 ] -- ---- -- [ 79 ]


Seq: EQU24_RS21565||transaldolase
Motif 5 match instance #2
Length: 300, Block 1: 89, Block 2: 112
GATTAA -- (17) -- AAACAT
[ 206 ] -- ---- -- [ 183 ]


Seq: EQU24_RS18355||hypothetical protein
Motif 5 match instance #1
Length: 300, Block 1: 179, Block 2: 200
GATTTA -- (15) -- TTTTAT
[ 116 ] -- ---- -- [ 95 ]


Seq: EQU24_RS15705||cold-shock protein
Motif 5 match instance #1
Length: 300, Block 1: 127, Block 2: 150
GTTCGA -- (17) -- ATGGAT
[ 168 ] -- ---- -- [ 145 ]


Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 5 match instance #1
Length: 300, Block 1: 129, Block 2: 150
GAACTA -- (15) -- CTAGAT
[ 166 ] -- ---- -- [ 145 ]


Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 5 match instance #1
Length: 300, Block 1: 167, Block 2: 188
GATTAC -- (15) -- CTCTAT
[ 128 ] -- ---- -- [ 107 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 5 match instance #1
Length: 182, Block 1: 26, Block 2: 50
GCTTGA -- (18) -- CTGTAG
[ 151 ] -- ---- -- [ 127 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 5 match instance #2
Length: 182, Block 1: 64, Block 2: 85
GCTTGA -- (15) -- GTTGAT
[ 113 ] -- ---- -- [ 92 ]


Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 5 match instance #3
Length: 182, Block 1: 101, Block 2: 122
GCTTTC -- (15) -- CTACAT
[ 76 ] -- ---- -- [ 55 ]


Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 5 match instance #1
Length: 129, Block 1: 21, Block 2: 43
GCACAA -- (16) -- AATGAT
[ 103 ] -- ---- -- [ 81 ]


Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 5 match instance #1
Length: 160, Block 1: 93, Block 2: 114
GTTCGA -- (15) -- TAACAT
[ 62 ] -- ---- -- [ 41 ]


Seq: EQU24_RS22110||hypothetical protein
Motif 5 match instance #1
Length: 205, Block 1: 168, Block 2: 189
GAATAA -- (15) -- CATGAG
[ 32 ] -- ---- -- [ 11 ]


Seq: EQU24_RS22110||hypothetical protein
Motif 5 match instance #2
Length: 205, Block 1: 135, Block 2: 158
GCTTTA -- (17) -- CTGCAT
[ 65 ] -- ---- -- [ 42 ]


Seq: EQU24_RS07390|rpmI|50S ribosomal protein L35
Motif 5 match instance #1
Length: 148, Block 1: 33, Block 2: 54
GAATGC -- (15) -- TTAGAG
[ 110 ] -- ---- -- [ 89 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 5 match instance #1
Length: 300, Block 1: 264, Block 2: 287
GATTAA -- (17) -- CAAGAG
[ 31 ] -- ---- -- [ 8 ]


Seq: EQU24_RS16195||hypothetical protein
Motif 5 match instance #2
Length: 300, Block 1: 62, Block 2: 86
GCATTA -- (18) -- AAGTAT
[ 233 ] -- ---- -- [ 209 ]


Seq: EQU24_RS18060|rplM|50S ribosomal protein L13
Motif 5 match instance #1
Length: 131, Block 1: 19, Block 2: 41
GCATGA -- (16) -- CTTCAG
[ 107 ] -- ---- -- [ 85 ]


Seq: EQU24_RS12095||cytochrome c
Motif 5 match instance #1
Length: 300, Block 1: 117, Block 2: 138
GATTAC -- (15) -- CTTTAT
[ 178 ] -- ---- -- [ 157 ]


Seq: EQU24_RS21720||hypothetical protein
Motif 5 match instance #1
Length: 286, Block 1: 213, Block 2: 237
GAATAA -- (18) -- AAATAT
[ 68 ] -- ---- -- [ 44 ]
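
The numbers in each printed record are related by simple arithmetic. Judging from the output above (not from the bioprospector_utils source), the value in parentheses is the gap between the two blocks, the bracketed values are each block's distance from the end of the sequence, and the sequence in parentheses after each block consensus is its reverse complement. The sketch below checks this against records copied from the output. It assumes both blocks are 6 bases wide, as the `W6_w6` part of the file names suggests, and the helpers `parse_match` and `reverse_complement` are hypothetical.

```python
import re

BLOCK_WIDTH = 6  # assumed width of both blocks (W6_w6 in the output file names)

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def parse_match(record):
    """Derive gap and distance-to-end values from one printed match record.

    Hypothetical helper: the formulas are inferred from the printed output.
    """
    m = re.search(r"Length: (\d+), Block 1: (\d+), Block 2: (\d+)", record)
    length, b1, b2 = map(int, m.groups())
    return {
        "length": length,
        "block1_start": b1,
        "block2_start": b2,
        # bases between the end of block 1 and the start of block 2
        "gap": b2 - b1 - BLOCK_WIDTH,
        # bases remaining after each block, up to the end of the sequence
        "block1_to_end": length - b1 - BLOCK_WIDTH + 1,
        "block2_to_end": length - b2 - BLOCK_WIDTH + 1,
    }

# First Motif 1 record from the output above
record = """Length: 141, Block 1: 43, Block 2: 64
GTCTGG -- (15) -- TTTCTA
[ 93 ] -- ---- -- [ 72 ]"""

info = parse_match(record)
print(info["gap"], info["block1_to_end"], info["block2_to_end"])  # 15 93 72
print(reverse_complement("GCTTGA"))  # TCAAGC, as printed for Motif 1, Block 1
```

If the sequences were truncated just upstream of a start codon, the bracketed values would give the distance of each block from that point, but that interpretation is an assumption about how the input FASTA was built.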

In [6]:
# Same information, shown as an Altair visualization of the motif locations along each input sequence
biop_result.view_motifs_and_locs()
In [7]:
# You can also dig into the specific objects storing the 
# motif match to a particular sequence
m1 = biop_result.motifs[1]  # list index 1 is the second motif, printed as "Motif 2"
for sm in m1.seq_matches:
    sm.pprint()
Seq: EQU24_RS10370|acpP|acyl carrier protein
Motif 2 match instance #1
Length: 141, Block 1: 65, Block 2: 89
TTCTAA -- (18) -- CTATTG
[ 71 ] -- ---- -- [ 47 ]

Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 2 match instance #1
Length: 300, Block 1: 29, Block 2: 51
TTGACA -- (16) -- CTATTG
[ 266 ] -- ---- -- [ 244 ]

Seq: EQU24_RS02895||exosortase system-associated protein, TIGR04073 family
Motif 2 match instance #2
Length: 300, Block 1: 237, Block 2: 258
TTATAG -- (15) -- TTATAG
[ 58 ] -- ---- -- [ 37 ]

Seq: EQU24_RS19765|rnpB|RNase P RNA component class A
Motif 2 match instance #1
Length: 175, Block 1: 47, Block 2: 69
TTGACA -- (16) -- CAATAT
[ 123 ] -- ---- -- [ 101 ]

Seq: EQU24_RS03495||cold-shock protein
Motif 2 match instance #1
Length: 300, Block 1: 133, Block 2: 157
TTGAAA -- (18) -- TTTTAG
[ 162 ] -- ---- -- [ 138 ]

Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 2 match instance #1
Length: 300, Block 1: 8, Block 2: 29
TTGTCG -- (15) -- TTAGAT
[ 287 ] -- ---- -- [ 266 ]

Seq: EQU24_RS02970|pqqA|pyrroloquinoline quinone precursor peptide PqqA
Motif 2 match instance #2
Length: 300, Block 1: 59, Block 2: 83
TCAAAG -- (18) -- CTATAG
[ 236 ] -- ---- -- [ 212 ]

Seq: EQU24_RS15745||cold-shock protein
Motif 2 match instance #1
Length: 136, Block 1: 38, Block 2: 60
CTGAAA -- (16) -- GATTAG
[ 93 ] -- ---- -- [ 71 ]

Seq: EQU24_RS21665|trxA|thioredoxin TrxA
Motif 2 match instance #1
Length: 93, Block 1: 37, Block 2: 60
TTGACA -- (17) -- CTATAG
[ 51 ] -- ---- -- [ 28 ]

Seq: EQU24_RS19105|rpsT|30S ribosomal protein S20
Motif 2 match instance #1
Length: 176, Block 1: 91, Block 2: 113
TTCTCA -- (16) -- CAATAG
[ 80 ] -- ---- -- [ 58 ]

Seq: EQU24_RS21040|rpmB|50S ribosomal protein L28
Motif 2 match instance #1
Length: 300, Block 1: 187, Block 2: 210
TTCTCG -- (17) -- CTTTCG
[ 108 ] -- ---- -- [ 85 ]

Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 2 match instance #1
Length: 300, Block 1: 194, Block 2: 215
CTCTAA -- (15) -- CTATAT
[ 101 ] -- ---- -- [ 80 ]

Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 2 match instance #2
Length: 300, Block 1: 1, Block 2: 25
TTGTAA -- (18) -- TTATAT
[ 294 ] -- ---- -- [ 270 ]

Seq: EQU24_RS07185||glutamate--ammonia ligase
Motif 2 match instance #3
Length: 300, Block 1: 159, Block 2: 180
TCCTAA -- (15) -- TTATTG
[ 136 ] -- ---- -- [ 115 ]

Seq: EQU24_RS15535||hypothetical protein
Motif 2 match instance #1
Length: 300, Block 1: 264, Block 2: 286
TTCTAA -- (16) -- CAATTG
[ 31 ] -- ---- -- [ 9 ]

Seq: EQU24_RS21565||transaldolase
Motif 2 match instance #1
Length: 300, Block 1: 186, Block 2: 209
TCGACG -- (17) -- TTTTCG
[ 109 ] -- ---- -- [ 86 ]

Seq: EQU24_RS18355||hypothetical protein
Motif 2 match instance #1
Length: 300, Block 1: 235, Block 2: 257
TTCAAA -- (16) -- TAATCT
[ 60 ] -- ---- -- [ 38 ]

Seq: EQU24_RS15705||cold-shock protein
Motif 2 match instance #1
Length: 300, Block 1: 169, Block 2: 193
TCCTCG -- (18) -- TTTTAT
[ 126 ] -- ---- -- [ 102 ]

Seq: EQU24_RS19315|pmoC|methane monooxygenase/ammonia monooxygenase subunit C
Motif 2 match instance #1
Length: 300, Block 1: 159, Block 2: 181
TCATAA -- (16) -- TTTTCG
[ 136 ] -- ---- -- [ 114 ]

Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 2 match instance #1
Length: 300, Block 1: 231, Block 2: 253
TTGTAA -- (16) -- CAATAG
[ 64 ] -- ---- -- [ 42 ]

Seq: EQU24_RS18140|moxF|PQQ-dependent dehydrogenase, methanol/ethanol family
Motif 2 match instance #2
Length: 300, Block 1: 77, Block 2: 99
CTCTCG -- (16) -- TTTTCT
[ 218 ] -- ---- -- [ 196 ]

Seq: EQU24_RS15100||HU family DNA-binding protein
Motif 2 match instance #1
Length: 182, Block 1: 50, Block 2: 72
CTGTAG -- (16) -- CTTTAT
[ 127 ] -- ---- -- [ 105 ]

Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 2 match instance #1
Length: 129, Block 1: 89, Block 2: 110
TTGTCG -- (15) -- TTTTTG
[ 35 ] -- ---- -- [ 14 ]

Seq: EQU24_RS21560|fbaA|class II fructose-bisphosphate aldolase
Motif 2 match instance #2
Length: 129, Block 1: 59, Block 2: 80
CCGTAA -- (15) -- TTTGTG
[ 65 ] -- ---- -- [ 44 ]

Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 2 match instance #1
Length: 160, Block 1: 36, Block 2: 58
CTGTCG -- (16) -- TTATAT
[ 119 ] -- ---- -- [ 97 ]

Seq: EQU24_RS12525|ssrA|transfer-messenger RNA
Motif 2 match instance #2
Length: 160, Block 1: 95, Block 2: 117
TCGAAG -- (16) -- CATTTG
[ 60 ] -- ---- -- [ 38 ]

Seq: EQU24_RS22110||hypothetical protein
Motif 2 match instance #1
Length: 205, Block 1: 166, Block 2: 189
CTGAAT -- (17) -- CATGAG
[ 34 ] -- ---- -- [ 11 ]

Seq: EQU24_RS07390|rpmI|50S ribosomal protein L35
Motif 2 match instance #1
Length: 148, Block 1: 113, Block 2: 134
CTCAAG -- (15) -- TAATCG
[ 30 ] -- ---- -- [ 9 ]

Seq: EQU24_RS16195||hypothetical protein
Motif 2 match instance #1
Length: 300, Block 1: 229, Block 2: 250
CTGACT -- (15) -- TTTGAG
[ 66 ] -- ---- -- [ 45 ]

Seq: EQU24_RS18060|rplM|50S ribosomal protein L13
Motif 2 match instance #1
Length: 131, Block 1: 79, Block 2: 101
CTCTAG -- (16) -- TTATTG
[ 47 ] -- ---- -- [ 25 ]

Seq: EQU24_RS12095||cytochrome c
Motif 2 match instance #1
Length: 300, Block 1: 11, Block 2: 35
CTCACG -- (18) -- CTTGCT
[ 284 ] -- ---- -- [ 260 ]

Seq: EQU24_RS21720||hypothetical protein
Motif 2 match instance #1
Length: 286, Block 1: 205, Block 2: 227
TTCACG -- (16) -- TATTAG
[ 76 ] -- ---- -- [ 54 ]

Seq: EQU24_RS21720||hypothetical protein
Motif 2 match instance #2
Length: 286, Block 1: 10, Block 2: 34
TCCAAG -- (18) -- TATGAG
[ 271 ] -- ---- -- [ 247 ]
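
The numbers in each printed match appear to be related by simple arithmetic. The sketch below is an inference from the printed values, not taken from bioprospector_utils. The helper names are hypothetical, and it assumes both blocks are hexamers and that block starts are 1-based.

```python
HEX_LEN = 6  # assumption: both motif blocks are 6 bases wide

def spacer_len(block1_start, block2_start, width=HEX_LEN):
    """Number of bases between the end of block 1 and the start of block 2."""
    return block2_start - block1_start - width

def bases_to_end(seq_len, block_start, width=HEX_LEN):
    """Bases left after the block ends (assumes a 1-based block start)."""
    return seq_len - (block_start + width - 1)

# values copied from the acpP match printed above:
# Length: 141, Block 1: 65, Block 2: 89, spacer (18), brackets [ 71 ] and [ 47 ]
print(spacer_len(65, 89))      # 18
print(bases_to_end(141, 65))   # 71
print(bases_to_end(141, 89))   # 47
```

Under these assumptions, the bracketed numbers read as the distance from the end of each block to the end of the input sequence.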

2. BioProspector Summary File

Section 1 shows what a single raw BioProspector file looks like. Usually, predict_promoter_signal.py will produce around 200 of these files. The SUMMARY.tsv treats the motif matches in all the raw files as "votes" for promoter candidates: the more often the exact same region of a sequence is identified as matching a BioProspector motif, the more votes it gets. The summary file counts the votes received by every motif match, grouped by input sequence. The top voted match region for each input sequence is then our ultimate prediction for the probable promoter (-35 and -10 hexamers).

For some sequences, the voting result is very clear: one region of the sequence is called far more often than any other. For others, a couple of regions may be very similar or, for whatever reason, BioProspector identifies both fairly frequently, so the vote counts of the competing promoter candidates are much closer. A close vote indicates that BioProspector was less confident about which region of the input carried the main promoter signal. We can visualize the vote distribution to get a sense of which inputs had more or less confident promoter calls.
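
The voting idea can be sketched in a few lines. This is a minimal illustration, not the code in predict_promoter_signal.py. The `runs` data is made up, and keying a vote on (sequence name, block 1 start, block 2 start) is an assumption about what "the exact same region" means.

```python
from collections import Counter

# Hypothetical stand-in for the parsed matches of three raw runs:
# one (seq_name, block1_start, block2_start) tuple per motif match.
runs = [
    [("geneA", 29, 52), ("geneB", 10, 33)],
    [("geneA", 29, 52), ("geneB", 40, 62)],
    [("geneA", 215, 236), ("geneB", 10, 33)],
]

# every match in every run is one vote for that exact region
votes = Counter(match for run in runs for match in run)

# group the tallies by input sequence, highest vote count first
by_seq = {}
for (seq_name, b1, b2), n_votes in votes.most_common():
    by_seq.setdefault(seq_name, []).append(((b1, b2), n_votes))

# the top voted region per sequence is the promoter call
top_calls = {seq: ranked[0][0] for seq, ranked in by_seq.items()}
print(top_calls)
```

With only three toy runs, each winner leads by a single vote. In the real output, the size of that lead is what separates a clear call from a close one.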

In [8]:
# load the summary file
summ_df = pd.read_csv(biop_summary_f,sep='\t')
summ_df.head()
Out[8]:
seq_name block_count seq_block pos block_summ agreements
0 EQU24_RS02895||exosortase system-associated pr... 205 TTGACAACATTCAACCTTTAGGCTATTGT 29 [29]TTGACA -- (17) -- [52]TATTGT[end-->243] ['example_outdir/loci_in_top_3perc_upstream_re...
1 EQU24_RS02895||exosortase system-associated pr... 148 GTTATAGCAACTTAAATGATTGTTATAGT 236 [236]GTTATA -- (17) -- [259]TATAGT[end-->36] ['example_outdir/loci_in_top_3perc_upstream_re...
2 EQU24_RS02895||exosortase system-associated pr... 71 CCGATACATGTAGGGGGAATTGTTTGAT 119 [119]CCGATA -- (16) -- [141]TTTGAT[end-->154] ['example_outdir/loci_in_top_3perc_upstream_re...
3 EQU24_RS02895||exosortase system-associated pr... 63 TTGACAACATTCAACCTTTAGGCTATT 29 [29]TTGACA -- (15) -- [50]GCTATT[end-->245] ['example_outdir/loci_in_top_3perc_upstream_re...
4 EQU24_RS02895||exosortase system-associated pr... 62 GTGAAAACTCTTTGGGTCGGAGTTATA 215 [215]GTGAAA -- (15) -- [236]GTTATA[end-->59] ['example_outdir/loci_in_top_3perc_upstream_re...
In [9]:
def get_rank_color(rank):
    '''
    Custom colors to mark the first, second, and third ranked promoter predictions
    '''
    if int(rank) == 1:
        return sns.xkcd_rgb['bright blue']
    elif int(rank) == 2:
        return sns.xkcd_rgb['bright pink']
    elif int(rank)== 3: 
        return sns.xkcd_rgb['apple green']
    else:
        return "gray"
    
def vote_summary_plot(df):
    '''
    Given a BioProspector summary file as a dataframe, plot the vote counts.
    Highlight the top 3 ranked promoters for each input sequence as a way to
    convey which votes were close vs clear.
    '''
    fig, axes = plt.subplots(nrows=20, ncols=4, sharex=True, sharey=True, figsize=(15,70))
    axes_list = [item for sublist in axes for item in sublist] 

    for seq_name, sub_df in df.groupby("seq_name"):
        # calculate the rank of each match by vote count
        sub_df['rank'] = sub_df['block_count'].rank(ascending=False)
        color_pal = [get_rank_color(x) for x in sub_df['rank'].values]

        # make the bar chart on the next axis
        ax = axes_list.pop(0)
        sns.barplot(data=sub_df,x='block_summ',y='block_count',palette=color_pal,ax=ax)

        # draw horizontal lines for the top 3 ranks; zip stops early when a
        # sequence has fewer than 3 predictions left after filtering, and the
        # color names match the ones used in get_rank_color
        count_order = sorted(sub_df['block_count'].values,reverse=True)
        line_colors = ['bright blue','bright pink','apple green']
        for line_h, color_name in zip(count_order, line_colors):
            ax.axhline(line_h,color=sns.xkcd_rgb[color_name])

        # axis and title configs
        ax.set_title('\n'.join(wrap(seq_name,30)))#.split('|')[0])
        ax.set_xticks([]) 
        ax.set_xlabel("BioP predicted promoters")
        ax.set_ylabel("BioP votes")

    # Now use the matplotlib .remove() method to 
    # delete anything we didn't use
    for ax in axes_list:
        ax.remove()

    fig.tight_layout()
In [10]:
# keep only predictions with more than 5 votes
summ_df_filt = summ_df[summ_df['block_count'] > 5]
In [11]:
vote_summary_plot(summ_df_filt)

Each bar is a different predicted promoter identified and voted on by motif matches found in BioProspector. Votes are tabulated on the y-axis and promoters are ordered on the x-axis by their votes. The top 3 voted promoters have colored lines (First place = Blue, Second place = Pink, Third place = Green). This small multiples plot gives a quick sense of which input sequences had a very clear winner and which had a tighter race. A blue line well separated from the second place pink line indicates a sequence where BioProspector found one primary promoter region many times over. Blue, pink, and sometimes green lines that sit close together indicate a sequence where BioProspector identified multiple regions as likely promoters and had trouble choosing between them.

The final SELECTION.fa file always reports the top voted promoter prediction; however, inspecting these plots may reveal tight races where a user would like to manually inspect the second or third place predictions.

To manually inspect these results, the user can consult the TOP_K_MOV.tsv file. By default, the top 3 predictions are reported; however, predict_promoter_signal.py can be passed an argument to output more or fewer than 3.

3. Inspect Top K Margin of Victory file

The TOP_K_MOV.tsv file is a shorter version of the SUMMARY.tsv (which reports every prediction that received even 1 vote). It contains just the top K predictions (3 by default). Its primary purpose is to be more convenient for a human to inspect, without scrolling through many poor predictions. The important data in this file are already reflected in the colored horizontal lines in the plots in Section 2 above, so additional visualizations are not included in this section. Besides having far fewer predictions, the TOP_K_MOV.tsv also reports the "Margin of Victory": the difference in votes between a prediction and the next place prediction (first place minus second place, second place minus third place, and so on). Small margins of victory indicate a less stable voting outcome, whereas large margins of victory indicate a more robust ordering of promoter predictions.

In [12]:
mov_df = pd.read_csv(biop_mov_f,sep='\t')
mov_df
Out[12]:
loc sequence pos margin_of_victory motif_summ raw_votes
0 EQU24_RS02895||exosortase system-associated pr... TTGACAACATTCAACCTTTAGGCTATTGT 29 57 [29]TTGACA -- (17) -- [52]TATTGT[end-->243] 205
1 EQU24_RS02895||exosortase system-associated pr... GTTATAGCAACTTAAATGATTGTTATAGT 236 77 [236]GTTATA -- (17) -- [259]TATAGT[end-->36] 148
2 EQU24_RS02895||exosortase system-associated pr... CCGATACATGTAGGGGGAATTGTTTGAT 119 8 [119]CCGATA -- (16) -- [141]TTTGAT[end-->154] 71
3 EQU24_RS02970|pqqA|pyrroloquinoline quinone pr... TTGCTTTGCCTAAATTATCGTCGTATACT 208 57 [208]TTGCTT -- (17) -- [231]TATACT[end-->64] 146
4 EQU24_RS02970|pqqA|pyrroloquinoline quinone pr... CCTATAGGCTCTATGCCGGCTCTATGCT 82 2 [82]CCTATA -- (16) -- [104]TATGCT[end-->191] 89
... ... ... ... ... ... ...
70 EQU24_RS21720||hypothetical protein TTGTCACAATTCCCTAACTTTTAACTTGCT 97 27 [97]TTGTCA -- (18) -- [121]CTTGCT[end-->160] 95
71 EQU24_RS21720||hypothetical protein TTGTAAGCATAGGCTTACACCGGTAAGCT 150 9 [150]TTGTAA -- (17) -- [173]TAAGCT[end-->108] 68
72 EQU24_RS22110||hypothetical protein TTGATATTGCGGCAATCTACGTTAGAAT 92 64 [92]TTGATA -- (16) -- [114]TAGAAT[end-->86] 160
73 EQU24_RS22110||hypothetical protein CCTACAAATATACTTGGTTGAATTTATAAG 42 38 [42]CCTACA -- (18) -- [66]TATAAG[end-->134] 96
74 EQU24_RS22110||hypothetical protein CTGAATAATAGTTACTATGACAACATGAG 166 13 [166]CTGAAT -- (17) -- [189]CATGAG[end-->11] 58

75 rows × 6 columns
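
The margin_of_victory column can be reproduced from vote counts alone. The sketch below is a hypothetical helper, not the function used by predict_promoter_signal.py. It uses the five vote counts shown for EQU24_RS02895 in the SUMMARY.tsv preview in Section 2 and assumes each margin is simply "this place minus the next place".

```python
def margins_of_victory(vote_counts, k=3):
    """Margin between each of the top k vote counts and the next place."""
    ranked = sorted(vote_counts, reverse=True)
    # pair 1st with 2nd, 2nd with 3rd, ... so k margins need k + 1 counts;
    # zip stops early when fewer counts are available
    return [cur - nxt for cur, nxt in zip(ranked[:k], ranked[1:k + 1])]

# block_count values for EQU24_RS02895 from the SUMMARY.tsv preview
rs02895_votes = [205, 148, 71, 63, 62]
print(margins_of_victory(rs02895_votes))  # [57, 77, 8]
```

These values match the first three rows of the table above. Note that the third place margin depends on the fourth place vote count, which only appears in SUMMARY.tsv.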

4. Inspect BioProspector best promoter Selection file

The highest voted prediction for each input sequence is collected in a fasta file (SELECTION.fa). Users may decide they prefer the 2nd or 3rd place prediction after viewing the above charts and inspecting the MOV.tsv or SUMMARY.tsv files. In that case, a user can simply replace the sequence in SELECTION.fa with the sequence from their preferred prediction.

This fasta file contains the exact region of the input sequence corresponding to BioProspector's top promoter prediction (a -35 hexamer, followed by a spacer of 15-18 bp, followed by a -10 hexamer). Therefore, the first 6 bases and the last 6 bases of each sequence are the hexamer calls.
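
Because of this layout, the hexamer calls can be recovered with plain slicing. A minimal sketch follows. `split_promoter` is a hypothetical helper, and the example sequence is the top voted prediction for EQU24_RS02895 from the tables above.

```python
def split_promoter(seq, hexamer_len=6):
    """Split a selected promoter into (-35 hexamer, spacer, -10 hexamer)."""
    if len(seq) < 2 * hexamer_len:
        raise ValueError("sequence is too short to hold two hexamers")
    minus35 = seq[:hexamer_len]            # first 6 bases
    spacer = seq[hexamer_len:-hexamer_len]  # everything in between
    minus10 = seq[-hexamer_len:]           # last 6 bases
    return minus35, spacer, minus10

m35, spacer, m10 = split_promoter("TTGACAACATTCAACCTTTAGGCTATTGT")
print(m35, len(spacer), m10)  # TTGACA 17 TATTGT
```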

The most basic visualization from the SELECTION.fa file is to create a consensus motif from all the predicted promoters.

In [13]:
motif_blocks, m1, m2 = cu.build_2Bmotif_from_selection_file(biop_selection_f)
In [14]:
print(f"Block 1 Consensus: {m1.consensus}")
print(f"Block 1 Anti-Consensus: {m1.anticonsensus}")
print(f"Block 2 Consensus: {m2.consensus}")
print(f"Block 2 Anti-Consensus: {m2.anticonsensus}")
Block 1 Consensus: TTGACA
Block 1 Anti-Consensus: AAAGGC
Block 2 Consensus: TATAAT
Block 2 Anti-Consensus: ACCCTC
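
For intuition, a consensus string can be built by taking the most common base at each position across a set of aligned hexamers. The sketch below illustrates that idea only. It is not the implementation in consensus_viz_utils, the input hexamers are made up, and the anti-consensus is not reproduced because its definition lives in that module.

```python
from collections import Counter

def simple_consensus(hexamers):
    """Most common base at each position (ties go to the first base seen)."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*hexamers)  # iterate position by position
    )

# hypothetical -35 hexamers, chosen so that no position is tied
example_hexamers = ["TTGACA", "TTGATA", "TTGCCT", "CTGACA"]
print(simple_consensus(example_hexamers))  # TTGACA
```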