Explore and visualize TPM data¶

Given the TPM data matrix a user provides to get_top_gene_set.py, take some time to visually inspect each step of the processing pipeline and visualize some helpful summaries of the data.

import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from Bio import SeqIO

import sys
sys.path.append('../') # use modules in main directory
import genbank_utils as gu
import get_top_gene_set as gtgs

Load expression data¶

Use the same data loading function from get_top_gene_set.py

# load TPM and feature data
data_file = '../data/extract_TPM_counts.tsv'
sample2condition_file = '../data/sample2condition.txt'
sample_file = None #'../config/samples_to_include.txt'
condition_file = '../config/conditions_to_include.txt'
gb_file = '../data/5GB1c_sequence.gb'

df, sample2condition, samples, conditions, pos_feats, neg_feats = gtgs.load_data(data_file, 
                                                                                 sample2condition_file, 
                                                                                 sample_file, 
                                                                                 condition_file,
                                                                                 gb_file)

df.head()

feat2meta = gu.get_feat2meta_dict(gb_file)
feat2meta['EQU24_RS19310']

{'gene_symbol': 'pmoA',
 'product': 'methane monooxygenase/ammonia monooxygenase subunit A',
 'type': 'CDS'}

Determine loci which may be in operons¶

Use the operon flagging function to call out which loci in the genome seem to be too close together on the same strand. The choice of 120 was determined for this organism (Methylotubimicrobium buryatense 5GB1) by exploring known operons inter-gene distances. See operon_distance_exploration.ipynb for more details.

op_min_dist = 120
maybe_operon_loci = gtgs.flag_potential_operon_loci(pos_feats, neg_feats, op_min_dist)

print(f"Number of possible operon loci at distance of {op_min_dist}bp: {len(maybe_operon_loci)}/{len(pos_feats)+len(neg_feats)}")

Number of possible operon loci at distance of 120bp: 2068/4431

Transform the data frame for mean TPM calculations¶

Pivot the data matrix and combine samples that belong to the same experimental condition. Values below are averages across samples.

# for the loaded TPM matrix, which column contains the unique gene ids?
LOCUS_ID_COL = 'locus_tag'
# use this column to get a full list of all genes for which expression was measured
LOCI = list(df[LOCUS_ID_COL].values)

df_means = gtgs.get_average_tpm_by_condition(df,samples,conditions,sample2condition,LOCI)
df_means

Example extraction of the top 3% of loci across all conditions¶

The printed output is of the same format of the output of get_top_gene_sets.py

# get top locus ids in all conditions
n = 3
top_locs = gtgs.get_top_n_perc_by_condition(df_means, LOCI, n)

# filter out loci maybe in operons
top_locs_op_filter_out = [x for x in top_locs if x in maybe_operon_loci]

# print top locus results
print(f'Top {n}%: {len(top_locs)} ({len(top_locs) - len(top_locs_op_filter_out)} filtered)\n')
print("Locus\toperon?")
for loc in top_locs:
    print(f'{loc}\t{loc in top_locs_op_filter_out}')

Top 3%: 37 (25 filtered)

Locus	operon?
EQU24_RS09110	True
EQU24_RS15745	False
EQU24_RS07185	False
EQU24_RS18355	False
EQU24_RS21560	False
EQU24_RS03495	False
EQU24_RS21040	False
EQU24_RS15535	False
EQU24_RS19765	False
EQU24_RS07385	True
EQU24_RS19105	False
EQU24_RS10370	False
EQU24_RS18130	True
EQU24_RS02970	False
EQU24_RS18125	True
EQU24_RS04160	True
EQU24_RS07390	False
EQU24_RS16195	False
EQU24_RS02265	True
EQU24_RS22055	True
EQU24_RS18060	False
EQU24_RS22110	False
EQU24_RS19305	True
EQU24_RS12965	True
EQU24_RS19310	True
EQU24_RS12095	False
EQU24_RS18140	False
EQU24_RS15705	False
EQU24_RS21720	False
EQU24_RS12525	False
EQU24_RS15100	False
EQU24_RS19315	False
EQU24_RS02240	True
EQU24_RS21565	False
EQU24_RS21665	False
EQU24_RS18135	True
EQU24_RS02895	False

Visualize tradeoffs in top N% threshold used¶

The above example uses 3%, but how do you pick that threshold? There's a tradeoff between the number of genes in your top set and the TPM expression of genes in the top. If you are more strict (smaller n), your top set will only consist of the very top genes, however you might not get very many of them. If you are more lenient (choose larger n), you'll get more top genes, but then you start to sample genes which on average reach rather low TPM values...

The following visuals are to help show the trade off and help you decide how the minimum TPM expression you will tolerate in your "top gene set" while showing you how many genes that qualify as being in the top.

We also build a tsv showing which genes qualify as "top" at each n% threshold.

tradeoff_data = []
top_n_loci_qualifers = []

# examine top sets using a range of top 1% to top 10%
for i in range(1,11):
    # get top locus ids in all conditions
    top_locs = gtgs.get_top_n_perc_by_condition(df_means, LOCI, i)
    top_locs_op_filter_out = [x for x in top_locs if x in maybe_operon_loci]
    
    # get the average expression of each locus across conditions
    means = []
    for loc in top_locs:
        mean_exp = np.mean(df_means[loc].values)
        means.append(mean_exp)
        
    # add row to the data frame consisting of:
    # n: top % threshold
    # locus count: number of top loci identified
    # locus count filtered: locus count after filtering operon genes
    # min expression: the minimum avg expression value from the top gene set
    row = [i,
           len(top_locs),
           len(top_locs)-len(top_locs_op_filter_out),
           min(means)
          ]
    
    tradeoff_data.append(row)
    
    # also, add all these loci to the top n gene sets
    for loc in top_locs:
        g = feat2meta[loc]['gene_symbol']
        p = feat2meta[loc]['product']
        t = feat2meta[loc]['type']
        op = loc in top_locs_op_filter_out
        i
        top_n_loci_qualifers.append([loc,g,p,t,op,i])
    
tradeoff_df = pd.DataFrame(tradeoff_data, columns=['n','loc_count','loc_count_filt','min_exp'])
all_top_n_df = pd.DataFrame(top_n_loci_qualifers, columns=['locus_tag','gene_symbol','product','type','op?','top_perc']).groupby('locus_tag').first().reset_index().sort_values('top_perc')

tradeoff_df

all_top_n_df.head(20)

# write out tsv of top loci and which threshold at which they qualify as "top"
all_top_n_df.to_csv('top_n_loci_qualifiers.tsv',sep='\t',index=False)

def tradeoff_plot(df):
    # number of genes in the top set
    num_genes_layer = alt.Chart(df).mark_circle(size=100).encode(
        x=alt.X('loc_count:Q', axis=alt.Axis(title='Number of loci in top n%',grid=False)),
        y=alt.Y('min_exp:Q', axis=alt.Axis(title='Min. mean TPM of loci in top n%',grid=False)),
        color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
        tooltip=[alt.Tooltip('n:Q', title='Top n%'),
                 alt.Tooltip('loc_count:Q', title='Num loci in top n%'),
                 alt.Tooltip('loc_count_filt:Q', title='Num loci in top n% after operon filtering'),
                 alt.Tooltip('min_exp:Q', title='Min. ave. expr. in top n%')
                ]
    )

    # number of genes in the top set after filtering operon genes
    op_ex_layer = alt.Chart(df).mark_point(size=100).encode(
        x='loc_count_filt:Q',
        y='min_exp:Q',
        color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
        tooltip=[alt.Tooltip('n:Q', title='Top N%'),
                 alt.Tooltip('loc_count:Q', title='Num loci in top n%'),
                 alt.Tooltip('loc_count_filt:Q', title='Num loci in top n% after operon filtering'),
                 alt.Tooltip('min_exp:Q', title='Min. ave. expr. in top n%')
                 ]
    )

    # connecting line between top set total and operon-filtered sets
    rule = alt.Chart(df).mark_rule().encode(
        x='loc_count_filt:Q',
        x2='loc_count:Q',
        y='min_exp:Q',
        color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
    )

    # top n% text label
    text = alt.Chart(df).mark_text(
        align='left',
        baseline='middle',
        angle=335,
        dx=5,
        dy=-7,
        size=14,
        color='black'
    ).encode(
        x='loc_count:Q',
        y='min_exp:Q',
        text=alt.Text('n:N'),
        #color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
    ).transform_calculate(n='"Top " + datum.n + "%"')

    # combine layers into chart
    chart = num_genes_layer + op_ex_layer + rule + text

    # final chart configs
    chart = chart.properties(
        title='Tradeoff between count and expression of top n% threshold'
    ).configure_title(
        fontSize=16,
    ).configure_axis(
        labelFontSize=14,
        titleFontSize=14
    ).configure_view(
        strokeOpacity=0
    ).interactive()
    
    #chart.save('tradeoff.html')
    return chart

Interactive plot!¶

Instructions:

* hover over points to see the exact TPM and count values
* pan and zoom with mouse and scroll

tradeoff_plot(tradeoff_df)

This plot shows the tradeoff between number of genes in the top set (on the x-axis) and the mean TPM expression of the minimum gene in the top set (so the TPM value of the weakest gene that still qualified as being in the top n%). The filled circles represent the total number of genes in the top while the open circles are the same set but with loci-possibly-in-operons (because they were too close to another locus on the same strand) filtered out. So it is a subset of the filled circle.

Visualize gene expression trends across conditions¶

The above visuals show a summary of all loci after being grouped into top sets at varying thresholds. The next visuals will instead explore the individual loci TPM profiles across conditions and compare the top set to the "rest" (all loci that did not make the top set).

Add relevant metadata to the expression df¶

# melt df to format for Altair parallel coordinates viz
alt_df = pd.melt(df_means, id_vars=['exp_condition'], value_vars=LOCI,value_name='mean_exp')

# add metadata columns from dict
alt_df['gene_symbol'] = alt_df['locus_tag'].apply(lambda x: feat2meta[x]['gene_symbol'])
alt_df['product'] = alt_df['locus_tag'].apply(lambda x: feat2meta[x]['product'])
alt_df['type'] = alt_df['locus_tag'].apply(lambda x: feat2meta[x]['type'])
alt_df['desc_string']  = alt_df.apply(lambda row: f"{row['locus_tag']}|{row['gene_symbol']}|{row['product']}",axis=1)

# impose a specific x-axis sort order on the df
# choose custom exp_condition order!
list_ordering = ['uMax','lowCH4','NoCu','lowCu','medCu','highCu','NO3_lowO2_slow_growth','highO2_slow_growth','lowO2_fast_growth','MeOH','NoLanthanum','WithLanthanum'] 
alt_df["exp_condition_order"] = pd.Categorical(alt_df["exp_condition"], categories=list_ordering)
alt_df['exp_id'] = alt_df['exp_condition'].apply(lambda x: list_ordering.index(x))
alt_df.head()

def view_df(df):
    '''Small function to display df gene info '''
    print(f"Gene count: {len(df['locus_tag'].unique())}")
    return df[['locus_tag','gene_symbol','product','type']].drop_duplicates().sort_values('locus_tag').reset_index()

view_df(alt_df)

Gene count: 4213

For a given set of top genes, visualize the expression across conditions¶

As an example, use 3 sets of top loci: 1%, 3% and 10%

top_locs_1_perc = gtgs.get_top_n_perc_by_condition(df_means, LOCI, 1)
df_1 = alt_df[alt_df['locus_tag'].isin(top_locs_1_perc)]

top_locs_3_perc = gtgs.get_top_n_perc_by_condition(df_means, LOCI, 3)
df_3 = alt_df[alt_df['locus_tag'].isin(top_locs_3_perc)]

top_locs_10_perc = gtgs.get_top_n_perc_by_condition(df_means, LOCI, 10)
df_10 = alt_df[alt_df['locus_tag'].isin(top_locs_10_perc)]

def altair_pcoords_plot_select_legend_and_highlight(df,n,xorder='exp_condition_order'):
    # If more than 30 entries, make 2 columns
    col_num = 1 if len(df['locus_tag'].unique()) <=50 else 2
    
    # selections
    highlight = alt.selection(type='single', on='mouseover',
                              fields=['desc_string'], nearest=True)
    
    selection = alt.selection_multi(fields=['desc_string'], bind='legend')

    
    # base?
    title=f'Average TPM of top {n}% of loci across conditions'
    base = alt.Chart(
        df.sort_values(xorder), 
        title=title,
        #titleFontSize=20
    ).encode(
        x=alt.X(f'{xorder}:N',
                sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log'),axis=alt.Axis(title='Log TPM expression')),
        size=alt.value(100),
        color=alt.Color('desc_string:N',
                        legend=alt.Legend(title='Gene', 
                                          orient = 'right',
                                          labelLimit=0,
                                          columns=col_num,
                                          symbolLimit=200
                                         )), 
    )
    
    # lines
    lines = base.mark_line().encode(
        size=alt.condition((selection|highlight), alt.value(3), alt.value(1)),
        opacity=alt.condition((selection|highlight), alt.value(1), alt.value(0.5))
    ).add_selection(
        selection,
    ).properties(
         width=600,
         height=400
    ).interactive()
    
    # points
    points = base.mark_circle().encode(
        tooltip=['locus_tag','product','type','gene_symbol','desc_string'],
        opacity=alt.condition((selection|highlight), alt.value(1), alt.value(0.2)),
        size=alt.condition((selection|highlight), alt.value(100), alt.value(3))
    ).add_selection(highlight)

    chart = lines + points 
    
    chart = chart.configure_axis(
        labelFontSize=14,
        titleFontSize=14
    ).properties(
        title=f"Expression strength of loci in top {n}%"
    ).configure_title(
        fontSize=16,
    ).interactive()
    
    #chart.save(f"pcoords_top{n}.html")
    return chart

Interactive Visualization!¶

This plot shows individual gene expression profiles across the 12 conditions used in this analysis.

Instructions:

* hover over lines to show gene info. 
* select colors in the legend to highlight specific genes
* shift click to select multiple genes from the legend
* pan and zoom with mouse and scroll

Top 1%¶

altair_pcoords_plot_select_legend_and_highlight(df_1,"1")

# df of genes in top 1%
view_df(df_1)

Gene count: 14

Top 3%¶

altair_pcoords_plot_select_legend_and_highlight(df_3,"3")

# df view of genes in top 3%
view_df(df_3)

Gene count: 37

Visualize top loci TPM trends relative to the remaining loci not selected in the top¶

Now let's compare the top set to the rest of the loci - did we really pick a super strong set?

To explore this visually, we with still plot each locus with its own line, but now we will add grey bands showing the first and second standard deviations of the remaining genes. The mean of the "rest" is also depicted as a dotted line.

def get_stdev_bounds(df_comp):
    '''
    Given a dataframe of TPM expression for a top set complement (aka
    all loci NOT in the top set), determin the mean and first/second
    standard deviations
    '''
    rows = []
    for cond, cond_df in df_comp.groupby('exp_condition'):
        vals = cond_df['mean_exp'].values
        m = np.mean(vals)
        sd = np.std(vals)
        sd1_high = m + sd
        sd1_low = max((m - sd), 0)
        sd2_high = m + (2*sd)
        sd2_low = max((m - (2*sd)),0)
        #print(cond, m, sd)
        
        rows.append([cond, m, sd, sd1_high, sd1_low, sd2_high, sd2_low])
        
    return pd.DataFrame(rows, columns=['exp_condition','mean_exp','stdev','stdev1_high','stdev1_low','stdev2_high','stdev2_low'])


def stdev_plot(n,short=False,save=False):
    '''
    Given an n% threshold, get the top locs for this gene set,
    get it's complement and produce
    a lineplot of the top genes vs the 1st/2nd standard deviation
    of the remaining genes
    '''
    
    # get loci in top n% and filter altair-formatted df
    top_locs = gtgs.get_top_n_perc_by_condition(df_means, LOCI, n)
    df_n = alt_df[alt_df['locus_tag'].isin(top_locs)]
    # num loci in the top - used for plotting color palette
    num_loci = len(df_n['locus_tag'].unique())
    
    # get complement of top locs (everything NOT in the top set)
    df_n_comp = alt_df[~alt_df['locus_tag'].isin(top_locs)]
    
    # get stdev boundary df of complement
    df_n_comp_stdev = get_stdev_bounds(df_n_comp)
    df_n_comp_stdev["exp_condition_order"] = pd.Categorical(df_n_comp_stdev["exp_condition"], categories=list_ordering)
    df_n_comp_stdev['exp_id'] = df_n_comp_stdev['exp_condition'].apply(lambda x: list_ordering.index(x))
    
    # if more than 40 lines are drawn, make them slightly transparent
    alpha = 0.2 if len(df_n['desc_string'].unique()) > 40 else 1

    # +--------------+
    # | start figure |
    # +--------------+
    fig_dim = (8,4) if short else (8,6)
    fig = plt.figure(figsize=fig_dim)
    
    # top loci lines
    sns.lineplot(data=df_n, x='exp_id', y='mean_exp',hue='desc_string', 
                 palette=sns.color_palette("hls",num_loci),alpha=alpha,legend=None)
    
    # get x-axis order for stdev fill
    x_ax = df_n_comp_stdev.sort_values('exp_id')['exp_id'].values
    
    # mean of not-top loci
    #sns.lineplot(data=df_n_comp, x='exp_condition_order', y='mean_exp',color='black',lw=3)
    comp_means = df_n_comp_stdev.sort_values('exp_condition_order')['mean_exp'].values
    plt.plot(x_ax,comp_means,color='black',linestyle='--',lw=3)

    

    # plot second stdev
    lower_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev2_low'].values
    upper_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev2_high'].values
    plt.fill_between(x_ax,
                     lower_bound,
                     upper_bound,
                     color=sns.xkcd_rgb['light grey']
                    )

    # first stdev
    lower_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev1_low'].values
    upper_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev1_high'].values
    plt.fill_between(x_ax,
                     lower_bound,
                     upper_bound,
                     color=sns.xkcd_rgb['greyish']
                    )


    # chart config 
    plt.yscale('log')
    plt.ylim(1)
    plt.title(f"Expression strength of loci in top {n}% compared to rest", fontsize=18)
    plt.xticks(x_ax,fontsize=14)
    plt.yticks(fontsize=14)
    plt.xlabel("Experimental condition ID",fontsize=14)
    plt.ylabel("Log TPM expression",fontsize=14)
    sns.despine(bottom = False, left = False)
    #plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    
    if save:
        short_str = '_short' if short else ''
        plt.savefig(f'figs/top{n}_loci_vs_rest{short_str}.eps', format='eps')
        plt.savefig(f'figs/top{n}_loci_vs_rest{short_str}.png', format='png')
    
    plt.show()

stdev_plot(1)

The black dotted line is the mean of the "rest" of the genes in each condition, the dark grey is the 1st standard deviation, and the light grey is the 2nd standard deviation

stdev_plot(3)

stdev_plot(10)

Experimenting with other visualization styles¶

I tried a few other versions for visualizing the summary of the "remaining loci." They each have pros and cons but ultimately I thought the standard deviation bands (in the above visualizations) mostly cleanly communicated the message.

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

# get complement of top n% of genes
df_1_comp = alt_df[~alt_df['locus_tag'].isin(top_locs_1_perc)]
#df_3_comp = alt_df[~alt_df['locus_tag'].isin(top_locs_3_perc)]
#df_10_comp = alt_df[~alt_df['locus_tag'].isin(top_locs_10_perc)]

def overlapping_lines(df,comp_df,n,xorder='exp_condition_order'):
    # If more than 30 entries, make 2 columns
    col_num = 1 if len(df['locus_tag'].unique()) <=50 else 2
    title=f'Average TPM of top {n}% of loci across conditions'
    

    # lines
    lines = alt.Chart(
        df.sort_values(xorder)
    ).mark_line().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log')),
        size=alt.value(5),
        opacity=alt.value(0.8),
        color=alt.Color('desc_string:N',
                        legend=alt.Legend(title=f'Loci in top {n}%', 
                                          orient = 'right',
                                          labelLimit=0,
                                          columns=col_num,
                                          symbolLimit=200
                                         )) 
    ).properties(
         width=600,
         height=400
    ).interactive()
    
    
    # complement
    lines_comp = alt.Chart(
        comp_df.sort_values(xorder)
    ).mark_line().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log'),axis=alt.Axis(title='Log TPM Expression')),
        size=alt.value(0.5),
        opacity=alt.value(0.1),
        detail='desc_string:N',
        color=alt.value('gray')
    ).properties(
         width=600,
         height=400
    ).interactive()
    
    
    chart = lines + lines_comp
    
    chart = chart.configure_axis(
        grid=False
    ).configure_view(
        strokeOpacity=0
    ).properties(
        title=f"Expression strength of loci in top {n}% compared to rest"
    ).configure_title(
        fontSize=16,
    )
    
    #chart.save(f"pcoords_overlap_top{n}.html")
    return chart

overlapping_lines(df_1,df_1_comp,'1')

I liked that this visual displays transparent versions of all the other loci's lines, and thus the dark band in the middle is a good indication of where most loci TPM levels are. However is has a slightly spaghetti feel...

def lines_with_box(df,comp_df,n,xorder='exp_condition_order'):
    # If more than 30 entries, make 2 columns
    col_num = 1 if len(df['locus_tag'].unique()) <=50 else 2
    title=f'Average TPM of top {n}% of loci across conditions'
    

    # lines
    lines = alt.Chart(
        df.sort_values(xorder)
    ).mark_line().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log'),axis=alt.Axis(title='Log TPM Expression')),
        size=alt.value(5),
        opacity=alt.value(0.8),
        color=alt.Color('desc_string:N',
                        legend=alt.Legend(title=f'Loci in top {n}%',
                                          orient = 'right',
                                          labelLimit=0,
                                          columns=col_num,
                                          symbolLimit=200
                                         )) 
    ).properties(
         width=600,
         height=400
    )
    
    
    # complement
    box = alt.Chart(
        comp_df.sort_values(xorder)
    ).mark_boxplot().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field="exp_condition_order"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log')),
        color=alt.value('gray')
    ).properties(
         width=600,
         height=400
    )    
    chart = lines + box
    
    chart = chart.configure_axis(
        grid=False
    ).configure_view(
        strokeOpacity=0
    ).properties(
        title=f"Expression strength of loci in top {n}% compared to rest"
    ).configure_title(
        fontSize=16,
    )
    
    #chart.save(f"pcoords_box_top{n}.html")
    return chart

lines_with_box(df_1,df_1_comp,"1")

This is the same data but summarized with box plots. I'm not a huge fan of the Altair rendering

fig = plt.figure(figsize=(10,10))
sns.lineplot(data=df_1, x='exp_condition_order', y='mean_exp',hue='desc_string')
sns.boxplot(data=df_1_comp,x='exp_condition_order',y='mean_exp')

plt.yscale('log')
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Same plot as above but with seaborn. I like the boxes better aesthetically but the "outlier" dots are still a bit confusing. Why aren't those genes in the top? Well, the dot is high in one condition, but it goes way low in other conditions, therefore it does not qualify as being "top" in ALL conditions. But the lines aren't connected here so it's hard to see the drop.

fig = plt.figure(figsize=(10,10))
sns.lineplot(data=df_1, x='exp_condition_order', y='mean_exp',hue='desc_string')
v = sns.violinplot(data=df_1_comp,x='exp_condition_order',y='mean_exp')

#v.yaxis.set_ticks([np.log10(x) for p in range(-6,6) for x in np.linspace(10**p, 10**(p+1), 10)])

plt.yscale('log')
#plt.ylim(-100,600000)
#plt.yticks([np.log10(-6),np.log10(-5),np.log10(-4),np.log10(-3),np.log10(-2),np.log10(-1),np.log10(0),np.log10(1),np.log10(2),np.log10(3),np.log10(4),np.log10(5),np.log10(6)])
#plt.yticks([10**x for x in [-15,-14,-13,-12,11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6]])
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Same basic idea as box plot but different style of showing the distribution of points. There's clearly a weird rendering issue with the bottom of the violins... couldn't figure out how to fix it...

	locus_tag	product	type	gene_symbol	locus	start_coord	end_coord	note	translation	gene_len	...	5GB1_pA9_red_tpm	5GB1_pA9_yellow_tpm	5GB1C-5G-La-BR1_tpm	5GB1C-5G-La-BR2_tpm	5GB1C-5G-N-BR1_tpm	5GB1C-5G-N-BR2_tpm	5GB1C-JG15-La-BR1_tpm	5GB1C-JG15-La-BR2_tpm	5GB1C-JG15-N-BR1_tpm	5GB1C-JG15-N-BR2_tpm
0	EQU24_RS00005	chromosomal replication initiator protein DnaA	CDS	dnaA	NZ_CP035467.1	0	1317	Derived by automated computational analysis us...	MSALWNNCLAKLENEISSSEFSTWIRPLQAIETDGQIKLLAPNRFV...	1318	...	38.557373	38.810668	37.444214	40.246006	40.100118	33.432274	39.880174	38.355431	30.247582	41.248441
1	EQU24_RS00010	DNA polymerase III subunit beta	CDS		NZ_CP035467.1	1502	2603	Derived by automated computational analysis us...	MKYIINREQLLVPLQQIVSVIEKRQTMPILSNVLMVFRENTLVMTG...	1102	...	52.552767	52.461746	42.676553	49.210083	46.798476	48.142385	45.465136	46.498139	37.152951	52.902410
2	EQU24_RS00015	DNA replication/repair protein RecF	CDS	recF	NZ_CP035467.1	3060	4140	Derived by automated computational analysis us...	MSLQKLDIFNVRNIRQASLQPSPGLNLIYGANASGKSSVLEAIFIL...	1081	...	31.350991	34.914128	21.479309	24.204682	22.171104	22.006566	22.658157	22.753325	19.407103	29.834124
3	EQU24_RS00020	DNA topoisomerase (ATP-hydrolyzing) subunit B	CDS	gyrB	NZ_CP035467.1	4185	6600	Derived by automated computational analysis us...	MSENIKQYDSTNIQVLKGLDAVRKRPGMYIGDTDDGTGLHHMVFEV...	2416	...	74.848501	80.850761	54.959319	64.911376	59.653059	64.648318	69.119079	65.643179	57.590223	68.306759
4	EQU24_RS00025	hypothetical protein	CDS		NZ_CP035467.1	6825	7062	Derived by automated computational analysis us...	VKTTKYFLTTRMRPDREIIKDEWIQYVVRFPENEHIQFDGRIRRWA...	238	...	50.324948	49.349547	34.539657	36.521074	37.789611	39.358066	38.992158	35.870964	41.462392	40.227192

locus_tag	exp_condition	EQU24_RS00005	EQU24_RS00010	EQU24_RS00015	EQU24_RS00020	EQU24_RS00025	EQU24_RS00030	EQU24_RS00035	EQU24_RS00040	EQU24_RS00045	...	EQU24_RS22110	EQU24_RS22115	EQU24_RS22120	EQU24_RS22125	EQU24_RS22130	EQU24_RS22135	EQU24_RS22140	EQU24_RS22145	EQU24_RS22150	EQU24_RS22155
0	MeOH	23.333155	18.915775	18.453916	18.267805	16.960643	12.377795	43.815536	9.670950	7.302145	...	1298.257682	15.624619	20.208066	26.004364	20.960234	28.719983	93.616437	161.528124	496.990651	280.344047
1	NO3_lowO2_slow_growth	32.050358	43.656760	21.351623	62.267687	41.684925	31.921455	57.849768	16.885694	14.926147	...	6497.868109	26.273485	28.945133	23.525245	26.432667	35.167264	178.996199	164.083806	433.438735	493.895115
2	NoCu	44.348687	59.629360	28.268717	56.818319	49.839406	38.394652	81.530362	40.501969	36.576500	...	8345.785345	43.065124	34.380565	44.419579	34.601933	65.339879	253.608495	273.284694	731.052190	1087.621126
3	NoLanthanum	33.444023	43.689839	23.172675	57.297047	42.367072	41.941657	102.513601	30.226787	19.462312	...	5085.637409	16.423284	35.588138	44.623117	43.201743	21.927260	109.783330	67.277718	211.575175	328.943746
4	WithLanthanum	35.462185	41.792237	20.644554	57.130166	34.258335	46.201637	110.721781	31.813805	19.438086	...	3942.957792	15.972203	34.318829	49.216725	40.000662	21.220809	98.100610	73.116973	194.389586	319.998959
5	highCu	47.861477	79.109490	33.534043	73.330408	48.662214	33.986359	92.999818	51.950784	50.370579	...	8132.547467	48.894308	35.608730	46.109300	30.125207	89.710150	342.981435	386.493127	1021.453762	1692.401154
6	highO2_slow_growth	64.784508	99.002970	44.856281	78.997757	77.842263	56.626268	97.721756	35.735531	28.808125	...	3468.582202	40.548782	48.532405	37.139500	38.204218	52.033315	220.196691	244.139008	505.427610	561.847119
7	lowCH4	30.829331	33.532522	18.491160	42.963648	31.643505	24.308334	74.948663	18.601900	12.935098	...	7477.339715	18.356915	24.390308	20.231568	19.721043	60.397912	226.909132	297.029289	874.637567	506.211825
8	lowCu	42.973556	61.209155	28.828713	61.573321	50.966799	31.319574	75.047593	40.038670	35.331019	...	7157.344557	43.386082	33.574108	36.872718	31.899782	66.743497	293.599291	313.741841	843.607251	1123.669681
9	lowO2_fast_growth	35.736190	43.159066	27.325800	48.976066	33.463183	26.694205	72.505275	15.430765	12.595934	...	9584.028559	26.951394	26.808972	31.889009	23.882390	72.207950	334.033049	376.513322	1178.068431	1183.180057
10	medCu	44.920897	65.011074	29.419165	65.393162	48.729958	26.317981	80.124744	39.245705	32.128273	...	5934.168113	43.954096	36.396184	34.993692	32.907138	84.626668	340.031564	377.219038	1060.050784	1429.593942
11	uMax	52.045884	59.991257	33.943076	71.052489	48.648085	31.457496	85.968063	21.547665	18.048460	...	8591.278232	43.551349	38.609097	46.058548	31.182164	89.974638	431.225393	449.945881	1302.077251	1392.325117

	n	loc_count	loc_count_filt	min_exp
0	1	14	8	3022.573901
1	2	26	17	1653.480931
2	3	37	25	1131.648318
3	4	56	35	930.223684
4	5	84	43	667.493775
5	6	107	55	469.300182
6	7	127	65	369.788442
7	8	153	78	313.001223
8	9	178	92	313.001223
9	10	197	103	259.999672

	index	locus_tag	gene_symbol	product	type
0	0	EQU24_RS00005	dnaA	chromosomal replication initiator protein DnaA	CDS
1	12	EQU24_RS00010		DNA polymerase III subunit beta	CDS
2	24	EQU24_RS00015	recF	DNA replication/repair protein RecF	CDS
3	36	EQU24_RS00020	gyrB	DNA topoisomerase (ATP-hydrolyzing) subunit B	CDS
4	48	EQU24_RS00025		hypothetical protein	CDS
...	...	...	...	...	...
4208	50496	EQU24_RS22135	mnmE	tRNA uridine-5-carboxymethylaminomethyl(34) sy...	CDS
4209	50508	EQU24_RS22140	yidC	membrane protein insertase YidC	CDS
4210	50520	EQU24_RS22145	yidD	membrane protein insertion efficiency factor YidD	CDS
4211	50532	EQU24_RS22150	rnpA	ribonuclease P protein component	CDS
4212	50544	EQU24_RS22155	rpmH	50S ribosomal protein L34	CDS

	index	locus_tag	gene_symbol	product	type
0	9612	EQU24_RS04160	ssrS	6S RNA	ncRNA
1	27576	EQU24_RS12095		cytochrome c	CDS
2	28536	EQU24_RS12525	ssrA	transfer-messenger RNA	tmRNA
3	29544	EQU24_RS12965		hypothetical protein	CDS
4	36900	EQU24_RS16195		hypothetical protein	CDS
5	41292	EQU24_RS18125	moxI	methanol dehydrogenase	CDS
6	41304	EQU24_RS18130	moxG	cytochrome c(L), periplasmic	CDS
7	41328	EQU24_RS18140	moxF	PQQ-dependent dehydrogenase, methanol/ethanol ...	CDS
8	41844	EQU24_RS18355		hypothetical protein	CDS
9	44028	EQU24_RS19305	pmoB	methane monooxygenase/ammonia monooxygenase su...	CDS
10	44040	EQU24_RS19310	pmoA	methane monooxygenase/ammonia monooxygenase su...	CDS
11	44052	EQU24_RS19315	pmoC	methane monooxygenase/ammonia monooxygenase su...	CDS
12	45120	EQU24_RS19765	rnpB	RNase P RNA component class A	ncRNA
13	50436	EQU24_RS22110		hypothetical protein	CDS

	locus_tag	gene_symbol	product	type	op?	top_perc
153	EQU24_RS16195		hypothetical protein	CDS	False	1
168	EQU24_RS18355		hypothetical protein	CDS	False	1
164	EQU24_RS18130	moxG	cytochrome c(L), periplasmic	CDS	True	1
163	EQU24_RS18125	moxI	methanol dehydrogenase	CDS	True	1
130	EQU24_RS12965		hypothetical protein	CDS	True	1
177	EQU24_RS19305	pmoB	methane monooxygenase/ammonia monooxygenase su...	CDS	True	1
178	EQU24_RS19310	pmoA	methane monooxygenase/ammonia monooxygenase su...	CDS	True	1
166	EQU24_RS18140	moxF	PQQ-dependent dehydrogenase, methanol/ethanol ...	CDS	False	1
127	EQU24_RS12525	ssrA	transfer-messenger RNA	tmRNA	False	1
179	EQU24_RS19315	pmoC	methane monooxygenase/ammonia monooxygenase su...	CDS	False	1
181	EQU24_RS19765	rnpB	RNase P RNA component class A	ncRNA	False	1
124	EQU24_RS12095		cytochrome c	CDS	False	1
195	EQU24_RS22110		hypothetical protein	CDS	False	1
72	EQU24_RS04160	ssrS	6S RNA	ncRNA	True	1
165	EQU24_RS18135	moxJ	methanol oxidation system protein MoxJ	CDS	True	2
189	EQU24_RS21665	trxA	thioredoxin TrxA	CDS	False	2
140	EQU24_RS15100		HU family DNA-binding protein	CDS	False	2
36	EQU24_RS03495		cold-shock protein	CDS	False	2
150	EQU24_RS15745		cold-shock protein	CDS	False	2
116	EQU24_RS10370	acpP	acyl carrier protein	CDS	False	2

	index	locus_tag	gene_symbol	product	type
0	5148	EQU24_RS02240		F0F1 ATP synthase subunit B	CDS
1	5208	EQU24_RS02265		F0F1 ATP synthase subunit epsilon	CDS
2	6696	EQU24_RS02895		exosortase system-associated protein, TIGR0407...	CDS
3	6852	EQU24_RS02970	pqqA	pyrroloquinoline quinone precursor peptide PqqA	CDS
4	8040	EQU24_RS03495		cold-shock protein	CDS
5	9612	EQU24_RS04160	ssrS	6S RNA	ncRNA
6	16536	EQU24_RS07185		glutamate--ammonia ligase	CDS
7	16932	EQU24_RS07385	infC	translation initiation factor IF-3	CDS
8	16944	EQU24_RS07390	rpmI	50S ribosomal protein L35	CDS
9	20796	EQU24_RS09110	fae	formaldehyde-activating enzyme	CDS
10	23604	EQU24_RS10370	acpP	acyl carrier protein	CDS
11	27576	EQU24_RS12095		cytochrome c	CDS
12	28536	EQU24_RS12525	ssrA	transfer-messenger RNA	tmRNA
13	29544	EQU24_RS12965		hypothetical protein	CDS
14	34452	EQU24_RS15100		HU family DNA-binding protein	CDS
15	35448	EQU24_RS15535		hypothetical protein	CDS
16	35784	EQU24_RS15705		cold-shock protein	CDS
17	35880	EQU24_RS15745		cold-shock protein	CDS
18	36900	EQU24_RS16195		hypothetical protein	CDS
19	41136	EQU24_RS18060	rplM	50S ribosomal protein L13	CDS
20	41292	EQU24_RS18125	moxI	methanol dehydrogenase	CDS
21	41304	EQU24_RS18130	moxG	cytochrome c(L), periplasmic	CDS
22	41316	EQU24_RS18135	moxJ	methanol oxidation system protein MoxJ	CDS
23	41328	EQU24_RS18140	moxF	PQQ-dependent dehydrogenase, methanol/ethanol ...	CDS
24	41844	EQU24_RS18355		hypothetical protein	CDS
25	43560	EQU24_RS19105	rpsT	30S ribosomal protein S20	CDS
26	44028	EQU24_RS19305	pmoB	methane monooxygenase/ammonia monooxygenase su...	CDS
27	44040	EQU24_RS19310	pmoA	methane monooxygenase/ammonia monooxygenase su...	CDS
28	44052	EQU24_RS19315	pmoC	methane monooxygenase/ammonia monooxygenase su...	CDS
29	45120	EQU24_RS19765	rnpB	RNase P RNA component class A	ncRNA
30	48000	EQU24_RS21040	rpmB	50S ribosomal protein L28	CDS
31	49200	EQU24_RS21560	fbaA	class II fructose-bisphosphate aldolase	CDS
32	49212	EQU24_RS21565		transaldolase	CDS
33	49404	EQU24_RS21665	trxA	thioredoxin TrxA	CDS
34	49536	EQU24_RS21720		hypothetical protein	CDS
35	50304	EQU24_RS22055		hypothetical protein	CDS
36	50436	EQU24_RS22110		hypothetical protein	CDS