Explore and visualize TPM data

Given the TPM data matrix a user provides to get_top_gene_set.py, take some time to visually inspect each step of the processing pipeline and visualize some helpful summaries of the data.

In [1]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from Bio import SeqIO

import sys
sys.path.append('../') # use modules in main directory
import genbank_utils as gu
import get_top_gene_set as gtgs

Load expression data

Use the same data loading function from get_top_gene_set.py

In [2]:
# load TPM and feature data
data_file = '../data/extract_TPM_counts.tsv'
sample2condition_file = '../data/sample2condition.txt'
sample_file = None #'../config/samples_to_include.txt'
condition_file = '../config/conditions_to_include.txt'
gb_file = '../data/5GB1c_sequence.gb'

df, sample2condition, samples, conditions, pos_feats, neg_feats = gtgs.load_data(data_file, 
                                                                                 sample2condition_file, 
                                                                                 sample_file, 
                                                                                 condition_file,
                                                                                 gb_file)
In [3]:
df.head()
Out[3]:
locus_tag product type gene_symbol locus start_coord end_coord note translation gene_len ... 5GB1_pA9_red_tpm 5GB1_pA9_yellow_tpm 5GB1C-5G-La-BR1_tpm 5GB1C-5G-La-BR2_tpm 5GB1C-5G-N-BR1_tpm 5GB1C-5G-N-BR2_tpm 5GB1C-JG15-La-BR1_tpm 5GB1C-JG15-La-BR2_tpm 5GB1C-JG15-N-BR1_tpm 5GB1C-JG15-N-BR2_tpm
0 EQU24_RS00005 chromosomal replication initiator protein DnaA CDS dnaA NZ_CP035467.1 0 1317 Derived by automated computational analysis us... MSALWNNCLAKLENEISSSEFSTWIRPLQAIETDGQIKLLAPNRFV... 1318 ... 38.557373 38.810668 37.444214 40.246006 40.100118 33.432274 39.880174 38.355431 30.247582 41.248441
1 EQU24_RS00010 DNA polymerase III subunit beta CDS NZ_CP035467.1 1502 2603 Derived by automated computational analysis us... MKYIINREQLLVPLQQIVSVIEKRQTMPILSNVLMVFRENTLVMTG... 1102 ... 52.552767 52.461746 42.676553 49.210083 46.798476 48.142385 45.465136 46.498139 37.152951 52.902410
2 EQU24_RS00015 DNA replication/repair protein RecF CDS recF NZ_CP035467.1 3060 4140 Derived by automated computational analysis us... MSLQKLDIFNVRNIRQASLQPSPGLNLIYGANASGKSSVLEAIFIL... 1081 ... 31.350991 34.914128 21.479309 24.204682 22.171104 22.006566 22.658157 22.753325 19.407103 29.834124
3 EQU24_RS00020 DNA topoisomerase (ATP-hydrolyzing) subunit B CDS gyrB NZ_CP035467.1 4185 6600 Derived by automated computational analysis us... MSENIKQYDSTNIQVLKGLDAVRKRPGMYIGDTDDGTGLHHMVFEV... 2416 ... 74.848501 80.850761 54.959319 64.911376 59.653059 64.648318 69.119079 65.643179 57.590223 68.306759
4 EQU24_RS00025 hypothetical protein CDS NZ_CP035467.1 6825 7062 Derived by automated computational analysis us... VKTTKYFLTTRMRPDREIIKDEWIQYVVRFPENEHIQFDGRIRRWA... 238 ... 50.324948 49.349547 34.539657 36.521074 37.789611 39.358066 38.992158 35.870964 41.462392 40.227192

5 rows × 108 columns

In [4]:
feat2meta = gu.get_feat2meta_dict(gb_file)
feat2meta['EQU24_RS19310']
Out[4]:
{'gene_symbol': 'pmoA',
 'product': 'methane monooxygenase/ammonia monooxygenase subunit A',
 'type': 'CDS'}

Determine loci which may be in operons

Use the operon flagging function to call out which loci in the genome seem to be too close together on the same strand. The choice of 120 was determined for this organism (Methylotubimicrobium buryatense 5GB1) by exploring known operons inter-gene distances. See operon_distance_exploration.ipynb for more details.

In [5]:
op_min_dist = 120
maybe_operon_loci = gtgs.flag_potential_operon_loci(pos_feats, neg_feats, op_min_dist)

print(f"Number of possible operon loci at distance of {op_min_dist}bp: {len(maybe_operon_loci)}/{len(pos_feats)+len(neg_feats)}")
Number of possible operon loci at distance of 120bp: 2068/4431

Transform the data frame for mean TPM calculations

Pivot the data matrix and combine samples that belong to the same experimental condition. Values below are averages across samples.

In [6]:
# for the loaded TPM matrix, which column contains the unique gene ids?
LOCUS_ID_COL = 'locus_tag'
# use this column to get a full list of all genes for which expression was measured
LOCI = list(df[LOCUS_ID_COL].values)
In [7]:
df_means = gtgs.get_average_tpm_by_condition(df,samples,conditions,sample2condition,LOCI)
df_means
Out[7]:
locus_tag exp_condition EQU24_RS00005 EQU24_RS00010 EQU24_RS00015 EQU24_RS00020 EQU24_RS00025 EQU24_RS00030 EQU24_RS00035 EQU24_RS00040 EQU24_RS00045 ... EQU24_RS22110 EQU24_RS22115 EQU24_RS22120 EQU24_RS22125 EQU24_RS22130 EQU24_RS22135 EQU24_RS22140 EQU24_RS22145 EQU24_RS22150 EQU24_RS22155
0 MeOH 23.333155 18.915775 18.453916 18.267805 16.960643 12.377795 43.815536 9.670950 7.302145 ... 1298.257682 15.624619 20.208066 26.004364 20.960234 28.719983 93.616437 161.528124 496.990651 280.344047
1 NO3_lowO2_slow_growth 32.050358 43.656760 21.351623 62.267687 41.684925 31.921455 57.849768 16.885694 14.926147 ... 6497.868109 26.273485 28.945133 23.525245 26.432667 35.167264 178.996199 164.083806 433.438735 493.895115
2 NoCu 44.348687 59.629360 28.268717 56.818319 49.839406 38.394652 81.530362 40.501969 36.576500 ... 8345.785345 43.065124 34.380565 44.419579 34.601933 65.339879 253.608495 273.284694 731.052190 1087.621126
3 NoLanthanum 33.444023 43.689839 23.172675 57.297047 42.367072 41.941657 102.513601 30.226787 19.462312 ... 5085.637409 16.423284 35.588138 44.623117 43.201743 21.927260 109.783330 67.277718 211.575175 328.943746
4 WithLanthanum 35.462185 41.792237 20.644554 57.130166 34.258335 46.201637 110.721781 31.813805 19.438086 ... 3942.957792 15.972203 34.318829 49.216725 40.000662 21.220809 98.100610 73.116973 194.389586 319.998959
5 highCu 47.861477 79.109490 33.534043 73.330408 48.662214 33.986359 92.999818 51.950784 50.370579 ... 8132.547467 48.894308 35.608730 46.109300 30.125207 89.710150 342.981435 386.493127 1021.453762 1692.401154
6 highO2_slow_growth 64.784508 99.002970 44.856281 78.997757 77.842263 56.626268 97.721756 35.735531 28.808125 ... 3468.582202 40.548782 48.532405 37.139500 38.204218 52.033315 220.196691 244.139008 505.427610 561.847119
7 lowCH4 30.829331 33.532522 18.491160 42.963648 31.643505 24.308334 74.948663 18.601900 12.935098 ... 7477.339715 18.356915 24.390308 20.231568 19.721043 60.397912 226.909132 297.029289 874.637567 506.211825
8 lowCu 42.973556 61.209155 28.828713 61.573321 50.966799 31.319574 75.047593 40.038670 35.331019 ... 7157.344557 43.386082 33.574108 36.872718 31.899782 66.743497 293.599291 313.741841 843.607251 1123.669681
9 lowO2_fast_growth 35.736190 43.159066 27.325800 48.976066 33.463183 26.694205 72.505275 15.430765 12.595934 ... 9584.028559 26.951394 26.808972 31.889009 23.882390 72.207950 334.033049 376.513322 1178.068431 1183.180057
10 medCu 44.920897 65.011074 29.419165 65.393162 48.729958 26.317981 80.124744 39.245705 32.128273 ... 5934.168113 43.954096 36.396184 34.993692 32.907138 84.626668 340.031564 377.219038 1060.050784 1429.593942
11 uMax 52.045884 59.991257 33.943076 71.052489 48.648085 31.457496 85.968063 21.547665 18.048460 ... 8591.278232 43.551349 38.609097 46.058548 31.182164 89.974638 431.225393 449.945881 1302.077251 1392.325117

12 rows × 4214 columns

Example extraction of the top 3% of loci across all conditions

The printed output is of the same format of the output of get_top_gene_sets.py

In [8]:
# get top locus ids in all conditions
n = 3
top_locs = gtgs.get_top_n_perc_by_condition(df_means, LOCI, n)

# filter out loci maybe in operons
top_locs_op_filter_out = [x for x in top_locs if x in maybe_operon_loci]

# print top locus results
print(f'Top {n}%: {len(top_locs)} ({len(top_locs) - len(top_locs_op_filter_out)} filtered)\n')
print("Locus\toperon?")
for loc in top_locs:
    print(f'{loc}\t{loc in top_locs_op_filter_out}')
Top 3%: 37 (25 filtered)

Locus	operon?
EQU24_RS09110	True
EQU24_RS15745	False
EQU24_RS07185	False
EQU24_RS18355	False
EQU24_RS21560	False
EQU24_RS03495	False
EQU24_RS21040	False
EQU24_RS15535	False
EQU24_RS19765	False
EQU24_RS07385	True
EQU24_RS19105	False
EQU24_RS10370	False
EQU24_RS18130	True
EQU24_RS02970	False
EQU24_RS18125	True
EQU24_RS04160	True
EQU24_RS07390	False
EQU24_RS16195	False
EQU24_RS02265	True
EQU24_RS22055	True
EQU24_RS18060	False
EQU24_RS22110	False
EQU24_RS19305	True
EQU24_RS12965	True
EQU24_RS19310	True
EQU24_RS12095	False
EQU24_RS18140	False
EQU24_RS15705	False
EQU24_RS21720	False
EQU24_RS12525	False
EQU24_RS15100	False
EQU24_RS19315	False
EQU24_RS02240	True
EQU24_RS21565	False
EQU24_RS21665	False
EQU24_RS18135	True
EQU24_RS02895	False

Visualize tradeoffs in top N% threshold used

The above example uses 3%, but how do you pick that threshold? There's a tradeoff between the number of genes in your top set and the TPM expression of genes in the top. If you are more strict (smaller n), your top set will only consist of the very top genes, however you might not get very many of them. If you are more lenient (choose larger n), you'll get more top genes, but then you start to sample genes which on average reach rather low TPM values...

The following visuals are to help show the trade off and help you decide how the minimum TPM expression you will tolerate in your "top gene set" while showing you how many genes that qualify as being in the top.

We also build a tsv showing which genes qualify as "top" at each n% threshold.

In [9]:
tradeoff_data = []
top_n_loci_qualifers = []

# examine top sets using a range of top 1% to top 10%
for i in range(1,11):
    # get top locus ids in all conditions
    top_locs = gtgs.get_top_n_perc_by_condition(df_means, LOCI, i)
    top_locs_op_filter_out = [x for x in top_locs if x in maybe_operon_loci]
    
    # get the average expression of each locus across conditions
    means = []
    for loc in top_locs:
        mean_exp = np.mean(df_means[loc].values)
        means.append(mean_exp)
        
    # add row to the data frame consisting of:
    # n: top % threshold
    # locus count: number of top loci identified
    # locus count filtered: locus count after filtering operon genes
    # min expression: the minimum avg expression value from the top gene set
    row = [i,
           len(top_locs),
           len(top_locs)-len(top_locs_op_filter_out),
           min(means)
          ]
    
    tradeoff_data.append(row)
    
    # also, add all these loci to the top n gene sets
    for loc in top_locs:
        g = feat2meta[loc]['gene_symbol']
        p = feat2meta[loc]['product']
        t = feat2meta[loc]['type']
        op = loc in top_locs_op_filter_out
        i
        top_n_loci_qualifers.append([loc,g,p,t,op,i])
    
tradeoff_df = pd.DataFrame(tradeoff_data, columns=['n','loc_count','loc_count_filt','min_exp'])
all_top_n_df = pd.DataFrame(top_n_loci_qualifers, columns=['locus_tag','gene_symbol','product','type','op?','top_perc']).groupby('locus_tag').first().reset_index().sort_values('top_perc')
In [10]:
tradeoff_df
Out[10]:
n loc_count loc_count_filt min_exp
0 1 14 8 3022.573901
1 2 26 17 1653.480931
2 3 37 25 1131.648318
3 4 56 35 930.223684
4 5 84 43 667.493775
5 6 107 55 469.300182
6 7 127 65 369.788442
7 8 153 78 313.001223
8 9 178 92 313.001223
9 10 197 103 259.999672
In [11]:
all_top_n_df.head(20)
Out[11]:
locus_tag gene_symbol product type op? top_perc
153 EQU24_RS16195 hypothetical protein CDS False 1
168 EQU24_RS18355 hypothetical protein CDS False 1
164 EQU24_RS18130 moxG cytochrome c(L), periplasmic CDS True 1
163 EQU24_RS18125 moxI methanol dehydrogenase CDS True 1
130 EQU24_RS12965 hypothetical protein CDS True 1
177 EQU24_RS19305 pmoB methane monooxygenase/ammonia monooxygenase su... CDS True 1
178 EQU24_RS19310 pmoA methane monooxygenase/ammonia monooxygenase su... CDS True 1
166 EQU24_RS18140 moxF PQQ-dependent dehydrogenase, methanol/ethanol ... CDS False 1
127 EQU24_RS12525 ssrA transfer-messenger RNA tmRNA False 1
179 EQU24_RS19315 pmoC methane monooxygenase/ammonia monooxygenase su... CDS False 1
181 EQU24_RS19765 rnpB RNase P RNA component class A ncRNA False 1
124 EQU24_RS12095 cytochrome c CDS False 1
195 EQU24_RS22110 hypothetical protein CDS False 1
72 EQU24_RS04160 ssrS 6S RNA ncRNA True 1
165 EQU24_RS18135 moxJ methanol oxidation system protein MoxJ CDS True 2
189 EQU24_RS21665 trxA thioredoxin TrxA CDS False 2
140 EQU24_RS15100 HU family DNA-binding protein CDS False 2
36 EQU24_RS03495 cold-shock protein CDS False 2
150 EQU24_RS15745 cold-shock protein CDS False 2
116 EQU24_RS10370 acpP acyl carrier protein CDS False 2
In [12]:
# write out tsv of top loci and which threshold at which they qualify as "top"
all_top_n_df.to_csv('top_n_loci_qualifiers.tsv',sep='\t',index=False)
In [13]:
def tradeoff_plot(df):
    # number of genes in the top set
    num_genes_layer = alt.Chart(df).mark_circle(size=100).encode(
        x=alt.X('loc_count:Q', axis=alt.Axis(title='Number of loci in top n%',grid=False)),
        y=alt.Y('min_exp:Q', axis=alt.Axis(title='Min. mean TPM of loci in top n%',grid=False)),
        color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
        tooltip=[alt.Tooltip('n:Q', title='Top n%'),
                 alt.Tooltip('loc_count:Q', title='Num loci in top n%'),
                 alt.Tooltip('loc_count_filt:Q', title='Num loci in top n% after operon filtering'),
                 alt.Tooltip('min_exp:Q', title='Min. ave. expr. in top n%')
                ]
    )

    # number of genes in the top set after filtering operon genes
    op_ex_layer = alt.Chart(df).mark_point(size=100).encode(
        x='loc_count_filt:Q',
        y='min_exp:Q',
        color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
        tooltip=[alt.Tooltip('n:Q', title='Top N%'),
                 alt.Tooltip('loc_count:Q', title='Num loci in top n%'),
                 alt.Tooltip('loc_count_filt:Q', title='Num loci in top n% after operon filtering'),
                 alt.Tooltip('min_exp:Q', title='Min. ave. expr. in top n%')
                 ]
    )

    # connecting line between top set total and operon-filtered sets
    rule = alt.Chart(df).mark_rule().encode(
        x='loc_count_filt:Q',
        x2='loc_count:Q',
        y='min_exp:Q',
        color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
    )

    # top n% text label
    text = alt.Chart(df).mark_text(
        align='left',
        baseline='middle',
        angle=335,
        dx=5,
        dy=-7,
        size=14,
        color='black'
    ).encode(
        x='loc_count:Q',
        y='min_exp:Q',
        text=alt.Text('n:N'),
        #color=alt.Color('n:O', scale=alt.Scale(scheme='viridis'),legend=None),
    ).transform_calculate(n='"Top " + datum.n + "%"')

    # combine layers into chart
    chart = num_genes_layer + op_ex_layer + rule + text

    # final chart configs
    chart = chart.properties(
        title='Tradeoff between count and expression of top n% threshold'
    ).configure_title(
        fontSize=16,
    ).configure_axis(
        labelFontSize=14,
        titleFontSize=14
    ).configure_view(
        strokeOpacity=0
    ).interactive()
    
    #chart.save('tradeoff.html')
    return chart

Interactive plot!

Instructions:

* hover over points to see the exact TPM and count values
* pan and zoom with mouse and scroll
In [14]:
tradeoff_plot(tradeoff_df)
Out[14]:

This plot shows the tradeoff between number of genes in the top set (on the x-axis) and the mean TPM expression of the minimum gene in the top set (so the TPM value of the weakest gene that still qualified as being in the top n%). The filled circles represent the total number of genes in the top while the open circles are the same set but with loci-possibly-in-operons (because they were too close to another locus on the same strand) filtered out. So it is a subset of the filled circle.

The above visuals show a summary of all loci after being grouped into top sets at varying thresholds. The next visuals will instead explore the individual loci TPM profiles across conditions and compare the top set to the "rest" (all loci that did not make the top set).

Add relevant metadata to the expression df

In [15]:
# melt df to format for Altair parallel coordinates viz
alt_df = pd.melt(df_means, id_vars=['exp_condition'], value_vars=LOCI,value_name='mean_exp')

# add metadata columns from dict
alt_df['gene_symbol'] = alt_df['locus_tag'].apply(lambda x: feat2meta[x]['gene_symbol'])
alt_df['product'] = alt_df['locus_tag'].apply(lambda x: feat2meta[x]['product'])
alt_df['type'] = alt_df['locus_tag'].apply(lambda x: feat2meta[x]['type'])
alt_df['desc_string']  = alt_df.apply(lambda row: f"{row['locus_tag']}|{row['gene_symbol']}|{row['product']}",axis=1)

# impose a specific x-axis sort order on the df
# choose custom exp_condition order!
list_ordering = ['uMax','lowCH4','NoCu','lowCu','medCu','highCu','NO3_lowO2_slow_growth','highO2_slow_growth','lowO2_fast_growth','MeOH','NoLanthanum','WithLanthanum'] 
alt_df["exp_condition_order"] = pd.Categorical(alt_df["exp_condition"], categories=list_ordering)
alt_df['exp_id'] = alt_df['exp_condition'].apply(lambda x: list_ordering.index(x))
alt_df.head()
Out[15]:
exp_condition locus_tag mean_exp gene_symbol product type desc_string exp_condition_order exp_id
0 MeOH EQU24_RS00005 23.333155 dnaA chromosomal replication initiator protein DnaA CDS EQU24_RS00005|dnaA|chromosomal replication ini... MeOH 9
1 NO3_lowO2_slow_growth EQU24_RS00005 32.050358 dnaA chromosomal replication initiator protein DnaA CDS EQU24_RS00005|dnaA|chromosomal replication ini... NO3_lowO2_slow_growth 6
2 NoCu EQU24_RS00005 44.348687 dnaA chromosomal replication initiator protein DnaA CDS EQU24_RS00005|dnaA|chromosomal replication ini... NoCu 2
3 NoLanthanum EQU24_RS00005 33.444023 dnaA chromosomal replication initiator protein DnaA CDS EQU24_RS00005|dnaA|chromosomal replication ini... NoLanthanum 10
4 WithLanthanum EQU24_RS00005 35.462185 dnaA chromosomal replication initiator protein DnaA CDS EQU24_RS00005|dnaA|chromosomal replication ini... WithLanthanum 11
In [16]:
def view_df(df):
    '''Small function to display df gene info '''
    print(f"Gene count: {len(df['locus_tag'].unique())}")
    return df[['locus_tag','gene_symbol','product','type']].drop_duplicates().sort_values('locus_tag').reset_index()
In [17]:
view_df(alt_df)
Gene count: 4213
Out[17]:
index locus_tag gene_symbol product type
0 0 EQU24_RS00005 dnaA chromosomal replication initiator protein DnaA CDS
1 12 EQU24_RS00010 DNA polymerase III subunit beta CDS
2 24 EQU24_RS00015 recF DNA replication/repair protein RecF CDS
3 36 EQU24_RS00020 gyrB DNA topoisomerase (ATP-hydrolyzing) subunit B CDS
4 48 EQU24_RS00025 hypothetical protein CDS
... ... ... ... ... ...
4208 50496 EQU24_RS22135 mnmE tRNA uridine-5-carboxymethylaminomethyl(34) sy... CDS
4209 50508 EQU24_RS22140 yidC membrane protein insertase YidC CDS
4210 50520 EQU24_RS22145 yidD membrane protein insertion efficiency factor YidD CDS
4211 50532 EQU24_RS22150 rnpA ribonuclease P protein component CDS
4212 50544 EQU24_RS22155 rpmH 50S ribosomal protein L34 CDS

4213 rows × 5 columns

For a given set of top genes, visualize the expression across conditions

As an example, use 3 sets of top loci: 1%, 3% and 10%

In [18]:
top_locs_1_perc = gtgs.get_top_n_perc_by_condition(df_means, LOCI, 1)
df_1 = alt_df[alt_df['locus_tag'].isin(top_locs_1_perc)]

top_locs_3_perc = gtgs.get_top_n_perc_by_condition(df_means, LOCI, 3)
df_3 = alt_df[alt_df['locus_tag'].isin(top_locs_3_perc)]

top_locs_10_perc = gtgs.get_top_n_perc_by_condition(df_means, LOCI, 10)
df_10 = alt_df[alt_df['locus_tag'].isin(top_locs_10_perc)]
In [19]:
def altair_pcoords_plot_select_legend_and_highlight(df,n,xorder='exp_condition_order'):
    # If more than 30 entries, make 2 columns
    col_num = 1 if len(df['locus_tag'].unique()) <=50 else 2
    
    # selections
    highlight = alt.selection(type='single', on='mouseover',
                              fields=['desc_string'], nearest=True)
    
    selection = alt.selection_multi(fields=['desc_string'], bind='legend')

    
    # base?
    title=f'Average TPM of top {n}% of loci across conditions'
    base = alt.Chart(
        df.sort_values(xorder), 
        title=title,
        #titleFontSize=20
    ).encode(
        x=alt.X(f'{xorder}:N',
                sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log'),axis=alt.Axis(title='Log TPM expression')),
        size=alt.value(100),
        color=alt.Color('desc_string:N',
                        legend=alt.Legend(title='Gene', 
                                          orient = 'right',
                                          labelLimit=0,
                                          columns=col_num,
                                          symbolLimit=200
                                         )), 
    )
    
    # lines
    lines = base.mark_line().encode(
        size=alt.condition((selection|highlight), alt.value(3), alt.value(1)),
        opacity=alt.condition((selection|highlight), alt.value(1), alt.value(0.5))
    ).add_selection(
        selection,
    ).properties(
         width=600,
         height=400
    ).interactive()
    
    # points
    points = base.mark_circle().encode(
        tooltip=['locus_tag','product','type','gene_symbol','desc_string'],
        opacity=alt.condition((selection|highlight), alt.value(1), alt.value(0.2)),
        size=alt.condition((selection|highlight), alt.value(100), alt.value(3))
    ).add_selection(highlight)

    chart = lines + points 
    
    chart = chart.configure_axis(
        labelFontSize=14,
        titleFontSize=14
    ).properties(
        title=f"Expression strength of loci in top {n}%"
    ).configure_title(
        fontSize=16,
    ).interactive()
    
    #chart.save(f"pcoords_top{n}.html")
    return chart

Interactive Visualization!

This plot shows individual gene expression profiles across the 12 conditions used in this analysis.

Instructions:

* hover over lines to show gene info. 
* select colors in the legend to highlight specific genes
* shift click to select multiple genes from the legend
* pan and zoom with mouse and scroll

Top 1%

In [20]:
altair_pcoords_plot_select_legend_and_highlight(df_1,"1")
Out[20]:
In [21]:
# df of genes in top 1%
view_df(df_1)
Gene count: 14
Out[21]:
index locus_tag gene_symbol product type
0 9612 EQU24_RS04160 ssrS 6S RNA ncRNA
1 27576 EQU24_RS12095 cytochrome c CDS
2 28536 EQU24_RS12525 ssrA transfer-messenger RNA tmRNA
3 29544 EQU24_RS12965 hypothetical protein CDS
4 36900 EQU24_RS16195 hypothetical protein CDS
5 41292 EQU24_RS18125 moxI methanol dehydrogenase CDS
6 41304 EQU24_RS18130 moxG cytochrome c(L), periplasmic CDS
7 41328 EQU24_RS18140 moxF PQQ-dependent dehydrogenase, methanol/ethanol ... CDS
8 41844 EQU24_RS18355 hypothetical protein CDS
9 44028 EQU24_RS19305 pmoB methane monooxygenase/ammonia monooxygenase su... CDS
10 44040 EQU24_RS19310 pmoA methane monooxygenase/ammonia monooxygenase su... CDS
11 44052 EQU24_RS19315 pmoC methane monooxygenase/ammonia monooxygenase su... CDS
12 45120 EQU24_RS19765 rnpB RNase P RNA component class A ncRNA
13 50436 EQU24_RS22110 hypothetical protein CDS

Top 3%

In [22]:
altair_pcoords_plot_select_legend_and_highlight(df_3,"3")
Out[22]:
In [23]:
# df view of genes in top 3%
view_df(df_3)
Gene count: 37
Out[23]:
index locus_tag gene_symbol product type
0 5148 EQU24_RS02240 F0F1 ATP synthase subunit B CDS
1 5208 EQU24_RS02265 F0F1 ATP synthase subunit epsilon CDS
2 6696 EQU24_RS02895 exosortase system-associated protein, TIGR0407... CDS
3 6852 EQU24_RS02970 pqqA pyrroloquinoline quinone precursor peptide PqqA CDS
4 8040 EQU24_RS03495 cold-shock protein CDS
5 9612 EQU24_RS04160 ssrS 6S RNA ncRNA
6 16536 EQU24_RS07185 glutamate--ammonia ligase CDS
7 16932 EQU24_RS07385 infC translation initiation factor IF-3 CDS
8 16944 EQU24_RS07390 rpmI 50S ribosomal protein L35 CDS
9 20796 EQU24_RS09110 fae formaldehyde-activating enzyme CDS
10 23604 EQU24_RS10370 acpP acyl carrier protein CDS
11 27576 EQU24_RS12095 cytochrome c CDS
12 28536 EQU24_RS12525 ssrA transfer-messenger RNA tmRNA
13 29544 EQU24_RS12965 hypothetical protein CDS
14 34452 EQU24_RS15100 HU family DNA-binding protein CDS
15 35448 EQU24_RS15535 hypothetical protein CDS
16 35784 EQU24_RS15705 cold-shock protein CDS
17 35880 EQU24_RS15745 cold-shock protein CDS
18 36900 EQU24_RS16195 hypothetical protein CDS
19 41136 EQU24_RS18060 rplM 50S ribosomal protein L13 CDS
20 41292 EQU24_RS18125 moxI methanol dehydrogenase CDS
21 41304 EQU24_RS18130 moxG cytochrome c(L), periplasmic CDS
22 41316 EQU24_RS18135 moxJ methanol oxidation system protein MoxJ CDS
23 41328 EQU24_RS18140 moxF PQQ-dependent dehydrogenase, methanol/ethanol ... CDS
24 41844 EQU24_RS18355 hypothetical protein CDS
25 43560 EQU24_RS19105 rpsT 30S ribosomal protein S20 CDS
26 44028 EQU24_RS19305 pmoB methane monooxygenase/ammonia monooxygenase su... CDS
27 44040 EQU24_RS19310 pmoA methane monooxygenase/ammonia monooxygenase su... CDS
28 44052 EQU24_RS19315 pmoC methane monooxygenase/ammonia monooxygenase su... CDS
29 45120 EQU24_RS19765 rnpB RNase P RNA component class A ncRNA
30 48000 EQU24_RS21040 rpmB 50S ribosomal protein L28 CDS
31 49200 EQU24_RS21560 fbaA class II fructose-bisphosphate aldolase CDS
32 49212 EQU24_RS21565 transaldolase CDS
33 49404 EQU24_RS21665 trxA thioredoxin TrxA CDS
34 49536 EQU24_RS21720 hypothetical protein CDS
35 50304 EQU24_RS22055 hypothetical protein CDS
36 50436 EQU24_RS22110 hypothetical protein CDS

Now let's compare the top set to the rest of the loci - did we really pick a super strong set?

To explore this visually, we with still plot each locus with its own line, but now we will add grey bands showing the first and second standard deviations of the remaining genes. The mean of the "rest" is also depicted as a dotted line.

In [24]:
def get_stdev_bounds(df_comp):
    '''
    Given a dataframe of TPM expression for a top set complement (aka
    all loci NOT in the top set), determin the mean and first/second
    standard deviations
    '''
    rows = []
    for cond, cond_df in df_comp.groupby('exp_condition'):
        vals = cond_df['mean_exp'].values
        m = np.mean(vals)
        sd = np.std(vals)
        sd1_high = m + sd
        sd1_low = max((m - sd), 0)
        sd2_high = m + (2*sd)
        sd2_low = max((m - (2*sd)),0)
        #print(cond, m, sd)
        
        rows.append([cond, m, sd, sd1_high, sd1_low, sd2_high, sd2_low])
        
    return pd.DataFrame(rows, columns=['exp_condition','mean_exp','stdev','stdev1_high','stdev1_low','stdev2_high','stdev2_low'])


def stdev_plot(n,short=False,save=False):
    '''
    Given an n% threshold, get the top locs for this gene set,
    get it's complement and produce
    a lineplot of the top genes vs the 1st/2nd standard deviation
    of the remaining genes
    '''
    
    # get loci in top n% and filter altair-formatted df
    top_locs = gtgs.get_top_n_perc_by_condition(df_means, LOCI, n)
    df_n = alt_df[alt_df['locus_tag'].isin(top_locs)]
    # num loci in the top - used for plotting color palette
    num_loci = len(df_n['locus_tag'].unique())
    
    # get complement of top locs (everything NOT in the top set)
    df_n_comp = alt_df[~alt_df['locus_tag'].isin(top_locs)]
    
    # get stdev boundary df of complement
    df_n_comp_stdev = get_stdev_bounds(df_n_comp)
    df_n_comp_stdev["exp_condition_order"] = pd.Categorical(df_n_comp_stdev["exp_condition"], categories=list_ordering)
    df_n_comp_stdev['exp_id'] = df_n_comp_stdev['exp_condition'].apply(lambda x: list_ordering.index(x))
    
    # if more than 40 lines are drawn, make them slightly transparent
    alpha = 0.2 if len(df_n['desc_string'].unique()) > 40 else 1

    # +--------------+
    # | start figure |
    # +--------------+
    fig_dim = (8,4) if short else (8,6)
    fig = plt.figure(figsize=fig_dim)
    
    # top loci lines
    sns.lineplot(data=df_n, x='exp_id', y='mean_exp',hue='desc_string', 
                 palette=sns.color_palette("hls",num_loci),alpha=alpha,legend=None)
    
    # get x-axis order for stdev fill
    x_ax = df_n_comp_stdev.sort_values('exp_id')['exp_id'].values
    
    # mean of not-top loci
    #sns.lineplot(data=df_n_comp, x='exp_condition_order', y='mean_exp',color='black',lw=3)
    comp_means = df_n_comp_stdev.sort_values('exp_condition_order')['mean_exp'].values
    plt.plot(x_ax,comp_means,color='black',linestyle='--',lw=3)

    

    # plot second stdev
    lower_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev2_low'].values
    upper_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev2_high'].values
    plt.fill_between(x_ax,
                     lower_bound,
                     upper_bound,
                     color=sns.xkcd_rgb['light grey']
                    )

    # first stdev
    lower_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev1_low'].values
    upper_bound = df_n_comp_stdev.sort_values('exp_condition_order')['stdev1_high'].values
    plt.fill_between(x_ax,
                     lower_bound,
                     upper_bound,
                     color=sns.xkcd_rgb['greyish']
                    )


    # chart config 
    plt.yscale('log')
    plt.ylim(1)
    plt.title(f"Expression strength of loci in top {n}% compared to rest", fontsize=18)
    plt.xticks(x_ax,fontsize=14)
    plt.yticks(fontsize=14)
    plt.xlabel("Experimental condition ID",fontsize=14)
    plt.ylabel("Log TPM expression",fontsize=14)
    sns.despine(bottom = False, left = False)
    #plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    
    if save:
        short_str = '_short' if short else ''
        plt.savefig(f'figs/top{n}_loci_vs_rest{short_str}.eps', format='eps')
        plt.savefig(f'figs/top{n}_loci_vs_rest{short_str}.png', format='png')
    
    plt.show()
In [25]:
stdev_plot(1)

The black dotted line is the mean of the "rest" of the genes in each condition, the dark grey is the 1st standard deviation, and the light grey is the 2nd standard deviation

In [26]:
stdev_plot(3)
In [27]:
stdev_plot(10)

Experimenting with other visualization styles

I tried a few other versions for visualizing the summary of the "remaining loci." They each have pros and cons but ultimately I thought the standard deviation bands (in the above visualizations) mostly cleanly communicated the message.

In [28]:
alt.data_transformers.disable_max_rows()
Out[28]:
DataTransformerRegistry.enable('default')
In [29]:
# get complement of top n% of genes
df_1_comp = alt_df[~alt_df['locus_tag'].isin(top_locs_1_perc)]
#df_3_comp = alt_df[~alt_df['locus_tag'].isin(top_locs_3_perc)]
#df_10_comp = alt_df[~alt_df['locus_tag'].isin(top_locs_10_perc)]
In [30]:
def overlapping_lines(df,comp_df,n,xorder='exp_condition_order'):
    # If more than 30 entries, make 2 columns
    col_num = 1 if len(df['locus_tag'].unique()) <=50 else 2
    title=f'Average TPM of top {n}% of loci across conditions'
    

    # lines
    lines = alt.Chart(
        df.sort_values(xorder)
    ).mark_line().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log')),
        size=alt.value(5),
        opacity=alt.value(0.8),
        color=alt.Color('desc_string:N',
                        legend=alt.Legend(title=f'Loci in top {n}%', 
                                          orient = 'right',
                                          labelLimit=0,
                                          columns=col_num,
                                          symbolLimit=200
                                         )) 
    ).properties(
         width=600,
         height=400
    ).interactive()
    
    
    # complement
    lines_comp = alt.Chart(
        comp_df.sort_values(xorder)
    ).mark_line().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log'),axis=alt.Axis(title='Log TPM Expression')),
        size=alt.value(0.5),
        opacity=alt.value(0.1),
        detail='desc_string:N',
        color=alt.value('gray')
    ).properties(
         width=600,
         height=400
    ).interactive()
    
    
    chart = lines + lines_comp
    
    chart = chart.configure_axis(
        grid=False
    ).configure_view(
        strokeOpacity=0
    ).properties(
        title=f"Expression strength of loci in top {n}% compared to rest"
    ).configure_title(
        fontSize=16,
    )
    
    #chart.save(f"pcoords_overlap_top{n}.html")
    return chart
In [31]:
overlapping_lines(df_1,df_1_comp,'1')
Out[31]:

I liked that this visual displays transparent versions of all the other loci's lines, and thus the dark band in the middle is a good indication of where most loci TPM levels are. However is has a slightly spaghetti feel...

In [32]:
def lines_with_box(df,comp_df,n,xorder='exp_condition_order'):
    # If more than 30 entries, make 2 columns
    col_num = 1 if len(df['locus_tag'].unique()) <=50 else 2
    title=f'Average TPM of top {n}% of loci across conditions'
    

    # lines
    lines = alt.Chart(
        df.sort_values(xorder)
    ).mark_line().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field=f"{xorder}:N", op="count"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log'),axis=alt.Axis(title='Log TPM Expression')),
        size=alt.value(5),
        opacity=alt.value(0.8),
        color=alt.Color('desc_string:N',
                        legend=alt.Legend(title=f'Loci in top {n}%',
                                          orient = 'right',
                                          labelLimit=0,
                                          columns=col_num,
                                          symbolLimit=200
                                         )) 
    ).properties(
         width=600,
         height=400
    )
    
    
    # complement
    box = alt.Chart(
        comp_df.sort_values(xorder)
    ).mark_boxplot().encode(
        x=alt.X(f'{xorder}:N',
                sort=list_ordering,
                #sort=alt.EncodingSortField(field="exp_condition_order"),
                axis=alt.Axis(title='Experimental Condition')
               ),        
        y=alt.Y('mean_exp:Q',scale=alt.Scale(type='log')),
        color=alt.value('gray')
    ).properties(
         width=600,
         height=400
    )    
    chart = lines + box
    
    chart = chart.configure_axis(
        grid=False
    ).configure_view(
        strokeOpacity=0
    ).properties(
        title=f"Expression strength of loci in top {n}% compared to rest"
    ).configure_title(
        fontSize=16,
    )
    
    #chart.save(f"pcoords_box_top{n}.html")
    return chart
In [33]:
lines_with_box(df_1,df_1_comp,"1")
Out[33]:

This is the same data but summarized with box plots. I'm not a huge fan of the Altair rendering

In [34]:
fig = plt.figure(figsize=(10,10))
sns.lineplot(data=df_1, x='exp_condition_order', y='mean_exp',hue='desc_string')
sns.boxplot(data=df_1_comp,x='exp_condition_order',y='mean_exp')

plt.yscale('log')
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Same plot as above but with seaborn. I like the boxes better aesthetically but the "outlier" dots are still a bit confusing. Why aren't those genes in the top? Well, the dot is high in one condition, but it goes way low in other conditions, therefore it does not qualify as being "top" in ALL conditions. But the lines aren't connected here so it's hard to see the drop.

In [35]:
fig = plt.figure(figsize=(10,10))
sns.lineplot(data=df_1, x='exp_condition_order', y='mean_exp',hue='desc_string')
v = sns.violinplot(data=df_1_comp,x='exp_condition_order',y='mean_exp')

#v.yaxis.set_ticks([np.log10(x) for p in range(-6,6) for x in np.linspace(10**p, 10**(p+1), 10)])

plt.yscale('log')
#plt.ylim(-100,600000)
#plt.yticks([np.log10(-6),np.log10(-5),np.log10(-4),np.log10(-3),np.log10(-2),np.log10(-1),np.log10(0),np.log10(1),np.log10(2),np.log10(3),np.log10(4),np.log10(5),np.log10(6)])
#plt.yticks([10**x for x in [-15,-14,-13,-12,11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6]])
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Same basic idea as box plot but different style of showing the distribution of points. There's clearly a weird rendering issue with the bottom of the violins... couldn't figure out how to fix it...

In [ ]: