The predict promoter_signal.py
workflow generates 3 main output files after running BioProspector N
times as well as N
raw files (BioProspector's direct program output). This notebook demonstrates how to inspect the output files and understand and interpret their contents.
import altair as alt
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from textwrap import wrap
import warnings; warnings.simplefilter('ignore')
import sys
sys.path.append('../') # use modules in main directory
import bioprospector_utils as bu
from bioprospector_utils import BioProspectorResult,SeqMotifMatch_2B,Motif_2B
import consensus_viz_utils as cu
# indicate the bioprospector output files and
# directories created during predict_promoter_signal.py
promoter_f = '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc.fa'
biop_raw_dir = '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_BIOP_RAW/'
biop_summary_f = '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_SUMMARY.tsv'
biop_mov_f = '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_TOP_3_MOV.tsv'
biop_selection_f = '../example_outdir/loci_in_top_3perc_upstream_regions_w300_min20_trunc_W6_w6_G18_g15_d1_a1_n200_1604734841_SELECTION.fa'
Each run of BioProspector produces a raw file, which is parsed during predict_promoter_signal.py
to extract the information about the sequence matches to the motifs it finds. Looking at the raw .txt file is fairly human readable. This notebook section just shows some convenient functions for inspecting it via custom Python data structures.
# pick one of the raw BioProspector files from the raw folder
biop_raw_f = os.path.join(biop_raw_dir, 'biop_run1.txt')
# load it into a BioProspectorResult object
biop_result = BioProspectorResult(biop_raw_f,promoter_f)
print("This BioProspector Result file reports:")
print(f"* {len(biop_result.motifs)} found motifs")
# view motif logos
biop_result.view_motifs()
# View further details of the consensus and location of
# where this motif was identified in each of the input sequences
for m in biop_result.motifs:
m.view_motif()
m.pprint()
# Same information but an Altair Visualization of the motif locations along each input sequence
biop_result.view_motifs_and_locs()
# You can also dig into the specific objects storing the
# motif match to a particular sequence
m1 = biop_result.motifs[1]
for sm in m1.seq_matches:
sm.pprint()
Section 1 shows what a single raw BioProspector file looks like. Usually, predict_promoter_signal.py
will produce around 200 of these files. The SUMMARY.tsv uses the motif matches in all the raw files as "votes" for promoter candidates. The more frequently the exact same region of a sequence is identified as matching a BioProspector motif, the more "votes" it gets. This summary file counts the total number of votes received by all motif matches, grouped by the input sequence. Therefore, the top voted match region for each input sequence is our ultimate prediction for the probable promoter (-35 and -10 hexamers).
For some sequences, the voting result is very clear: there is 1 region of the sequence that gets called out way more often than any other subsequence. However, sometimes there are a couple sequence regions that may be very similar or for whatever reason, BioProspector identifies them both fairly frequently. For these, the number of votes may be much tighter between competing promoter candidates. This indicates that BioProspector was less confident about which region of the input carried the main promoter signal. We can visualize the vote distrubtion to get a sense on which inputs had more or less confident promoter calls.
# load the summary file
summ_df = pd.read_csv(biop_summary_f,sep='\t')
summ_df.head()
def get_rank_color(rank):
'''
Custom colors to mark the first, second, and third ranked promoter predictions
'''
if int(rank) == 1:
return sns.xkcd_rgb['bright blue']
elif int(rank) == 2:
return sns.xkcd_rgb['bright pink']
elif int(rank)== 3:
return sns.xkcd_rgb['apple green']
else:
return "gray"
def vote_summary_plot(df):
'''
Given a BioProspector summary file as a dataframe, plot the vote counts.
Highlight the top 3 ranked promoters for each input sequence as a way to
convey which votes were close vs clear.
'''
fig, axes = plt.subplots(nrows=20, ncols=4, sharex=True, sharey=True, figsize=(15,70))
axes_list = [item for sublist in axes for item in sublist]
for seq_name, sub_df in df.groupby("seq_name"):
# calculate the rank of each match by vote count
sub_df['rank'] = sub_df['block_count'].rank(ascending=False)
color_pal = [get_rank_color(x) for x in sub_df['rank'].values]
# make the bar chart on the next axis
ax = axes_list.pop(0)
sns.barplot(data=sub_df,x='block_summ',y='block_count',palette=color_pal,ax=ax)
# draw horizontal lines for the top 3 ranks
count_order = sorted(sub_df['block_count'].values,reverse=True)
first_line_h = count_order[0]
second_line_h = count_order[1]
third_line_h = count_order[2]
ax.axhline(first_line_h,color=sns.xkcd_rgb['bright blue'])
ax.axhline(second_line_h,color=sns.xkcd_rgb['bright pink'])
ax.axhline(third_line_h,color=sns.xkcd_rgb['green apple'])
# axis and title configs
ax.set_title('\n'.join(wrap(seq_name,30)))#.split('|')[0])
ax.set_xticks([])
ax.set_xlabel("BioP predicted promoters")
ax.set_ylabel("BioP votes")
# Now use the matplotlib .remove() method to
# delete anything we didn't use
for ax in axes_list:
ax.remove()
fig.tight_layout()
# filter out predictions with fewer than 5 votes
summ_df_filt = summ_df[summ_df['block_count'] > 5]
vote_summary_plot(summ_df_filt)
Each bar is a different predicted promoter identified and voted on by motif matches found in BioProspector. Votes are tabulated on the y-axis and promoters are ordered on the x-axis by their votes. The top 3 voted promoters have colored lines (First place = Blue, Second place = Pink, Third place = Green). This small multiples plot helps to give a quick sense of which input sequences contained a very clear winner (blue lines with a lot of separation between itself and the second place pink line indicated a sequence where a primary promoter region was found by BioProspector many times over) vs where the race was tighter (where blue and pink and sometimes green lines are quite close represent a sequence where there were multiple regions BioProspector identified as being likely promoters and it had trouble choosing between them).
The final SELECTION.fa file always reports the top voted promoter prediction, however inspecting these plots may reveal tight races where a user would like to manually inspect the second/third place predictions.
To manually inspect these results, the user can consult the TOP_K_MOV.tsv file. By default, the top 3 predictions are reported however predict_promoter_signal.py
can be passed an argument to output more or less than 3.
The TOP_K_MOV.tsv file is a shorter version of the SUMMARY.tsv (which reports every prediction which received even 1 vote). The TOP_K_MOV.tsv is just the Top K predictions (3 by default). The primary purpose of this file is to be more convenient for a human to inspect (and not have to scroll through tons of poor predictions) however the important data in this file are reflected in the colored horizontal lines in the plots in Section 2 above, so additional visualizations are not included in this section. The other small difference in the TOP_K_MOV.tsv (besides having many fewer predictions) is that it also reports the "Margin of Victory", which is the difference in votes between the first place prediction and second place prediction (or difference between second place and third place, current place and next place, etc). Small margins of victory indicate a less stable voting outcome whereas large margins of victories are a more robust ordering of promoter predictions.
mov_df = pd.read_csv(biop_mov_f,sep='\t')
mov_df
The highest voted prediction for each input sequence is collected in a fasta file (SELECTION.fa). Users may decide they prefer the 2nd or 3rd place prediction after viewing the above charts and inspecting the MOV.tsv or SUMMARY.tsv files. In that case, a user can simply replace the sequence in SELECTION.fa with the sequence from their preferred prediction.
This fasta file contains the exact region of the input sequence corresponding to BioProspector's top promoter prediction (a -35 hexamer, followed by a spacer 15-18 bp, followed by a -10 hexamer). Therefore, the first 6 bases and the last 6 bases of each sequence are the hexamer calls.
The most basic visualization from the SELECTION.fa file is to create a consensus motif from all the predicted promoters.
motif_blocks, m1, m2 = cu.build_2Bmotif_from_selection_file(biop_selection_f)
print(f"Block 1 Consensus: {m1.consensus}")
print(f"Block 1 Anti-Consensus: {m1.anticonsensus}")
print(f"Block 2 Consensus: {m2.consensus}")
print(f"Block 2 Anti-Consensus: {m2.anticonsensus}")