bio_analytics.pl -- Computational biology data analytics.

Collects a number of biological data analytics tasks. This library provides tools for the bio_db served data, empowering downstream analyses of user's experimental data.

Installation and loading:

?- pack_install(bio_analytics).  % also installs pack(lib). other dependencies are at load time via lib
?- use_module(library(lib)).
?- lib(bio_analytics).

The library comes with one experimental dataset: data/silac/bt.csv which is used in the example files in directory examples/ .

author
- nicos angelopoulos
version
- 0.1 2019/4/22
- 0.2 2019/5/08
- 0.3 2019/5/11
- 0.4 2020/9/18
license
- MIT
 gene_family(+Family, +Org, -Symbols)
If Family is a known family alias for organism Org, it is expanded to a list of its constituent gene symbols.

Currently 2 organisms are supported: human and mouse.
Family can be a gene ontology term (atom of the form GO:XXXXXX).
If family is a list of numbers or atoms that map to numbers, then they are taken to be Entrez ids which are converted to gene symbols in Symbols.

If family is a list of symbols, it is passed on to Symbols.

Listens to debug(gene_family).

Known families:

autophagy/human
from: http://autophagy.lu/clustering/index.html
 Located family as the bio_analytics gene collection: autophagy
 ?- gene_family( autophagy, human, Auto ), length( Auto, Len ).
 Auto = ['AMBRA1', 'APOL1', 'ARNT', 'ARSA', 'ARSB', 'ATF4', 'ATF6', 'ATG10', 'ATG12'|...],
 Len = 232.

 ?- debug( gene_family ).
 ?- gene_family( 375, hs, Symbs ), length( Symbs, Len ).
 % Located GO term as the family identifier.
 Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
 Len = 26.
 
 ?- gene_family( 'GO:0000375', hs, Symbs ), length( Symbs, Len ).
 % Located GO term as the family identifier.
 Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
 Len = 26.

 ?- gene_family( [55626, 8542, 405], hs, Auto ).
 Converted input from Entrezes to Symbols.
 Auto = ['AMBRA1', 'APOL1', 'ARNT'].

 ?- gene_family( unknown, hs, Auto ).
 ERROR: Unhandled exception: gene_family(cannot_find_input_family_in_the_known_ones(unknown,[autophagy]))

Family datasets are in pack(bio_analytics/data/families).

author
- nicos angelopoulos
version
- 0:1 2019/3/5, from old code
To be done
- error reporting via print_message/2
 go_org_symbol(+Org, +Go, -Symb)
Get symbols of Go id for Organism.

Go can be either a GO: atom or an integer.

[debug]  ?- go_org_symbol( mouse, 2, Symbs ), write(Symbs), nl, fail.
Akt3
Mef2a
Mgme1
Mpv17
Mrpl15
Mrpl17
Mrpl39
Msto1
Opa1
Slc25a33
Slc25a36
Tymp
false.

[debug]  ?- go_org_symbol( hs, 'GO:0000002', Symbs ), write(Symbs), nl, fail.
AKT3
LONP1
MEF2A
MGME1
MPV17
MSTO1
OPA1
PIF1
SLC25A33
SLC25A36
SLC25A4
TYMP
false.
author
- nicos angelopoulos
version
- 0:1 2019/04/07
 go_org_symbols(+GoT, +Org, -Symbols)
Gets the symbols belonging to a GO term.
?- go_org_symbols( 'GO:0000375', human, Symbs ), length( Symbs, Len ).
Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP'|...],
Len = 26.

?- go_org_symbols( 'GO:0000375', mouse, Symbs ), length( Symbs, Len ).
Symbs = ['Dbr1', 'Rbm17', 'Rnu4atac', 'Rnu6atac', 'Scaf11', 'Sf3a2', 'Slu7', 'Srrm1', 'Srsf10'|...],
Len = 10.

?- go_org_symbols( 375, human, Symbs ), length( Symbs, Len ).
Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
Len = 26.
author
- nicos angelopoulos
version
- 0.1 2019/4/7
See also
- go_org_symb/3, just a findall of
- go_symbols_reach/3 for a version that travells the ontology
 go_symbols_reach(+GoT, -Symbols, +Opts)
Gets the symbols belonging to a GO term. Descents to GO child relations, which by default are includes (reverse of is_a) and consists_of (reverse of part_of) to pick up Symbols recursively.

Opts

organism(Org=hs)
also known is: mouse. Note that human, also resolves to hs
descent(Desc=true)
whether to collect symbols from descendant GO terms
as_child_includes(Inc=true)
collect from edge_gont_include/2
as_child_consists_of(Cns=true)
collect from edge_gont_consists_of/2
as_child_regulates(Reg=false)
collect from edge_gont_regulates/2
as_child_negatively_regulates(Reg=false)
collect from edge_gont_negatively_regulates/2
as_child_positively_regulates(Reg=false)
collect from edge_gont_positively_regulates/2
debug(Dbg=false)
see options_append/3

Listens to debug(go_symbols_reach).

?- go_symbols_reach( 'GO:0000375', Symbs, [] ), length( Symbs, Len ).
Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...],
Len = 293.

?- go_symbols_reach( 'GO:0000375', Symbs, organism(mouse) ), length( Symbs, Len ).
Symbs = ['4930595M18Rik', 'Aar2', 'Aqr', 'Bud13', 'Bud31', 'Casc3', 'Cdc40', 'Cdc5l', 'Cdk13'|...].
Len = 190.

?- go_symbols_reach( 375, Symbs, true ), length( Symbs, Len ).
Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...],
Len = 293.
author
- nicos angelopoulos
version
- 0.1 2015/7/26
- 0.2 2019/4/7 added organism, moved to new pack
 go_string_graph(+GoTerm, -Graph, +Opts)
Plot the STRING graph of all the Symbols in a GO term.

The Graph can be saved via options to wgraph_plot/2.

Opts are passed to symbols_string_graph/3, go_term_symbols/3 and wgraph_plot/2.

Opts

organism(Org=hs)
organism for the operation
plot(Plot=false)
whether to plot the graph
save(Save=false)
whether to save the graph (passed to wgraph_plot/2
stem_type(Sty=go_name)
constructs stem for the output filenames: go_id, go_pair, go_pair_ord(I,Len).
?- go_string_graph( 'GO:0016601', G, true ).
?- go_string_graph( 'GO:0016601', G, plot(true) ).

?- go_string_graph( 'GO:0016601', G, [plot(true),save(true)] ).
G = ['ABI2', 'AIF1', 'CDH13', 'HACD3', 'NISCH', 'ALS2'-'RAC1':892, 'BRK1'-'CYFIP1':999, ... - ... : 904, ... : ...|...].
?- ls.
% Rac_protein_signal_transduction.csv                Rac_protein_signal_transduction_graph.csv          Rac_protein_signal_transduction_layout.csv
true.

?- go_string_graph( 'GO:0016601', G, [plot(true),save(true),stem_type(go_pair)] ).
author
- nicos angelopoulos
version
- 0.1 2016/4/12
- 0.2 2020/9/3, expanded go_string_graph_stem/3
To be done
- make sure underlying options are compatible.
 symbols_string_graph(+Symbols, -Graph, +Opts)
Create the string database Graph between Symbols.

Opts

cohese(Coh=max)
method for cohesing multiple edges between two nodes [false,max,min,umax,umin] the max and min versions are more efficient but do sort the edges, whereas umax and umin take much longer but leave edges in found order
include_orphans(Orph=true)
set to false to exclude orphans from Graph
org(Org=hs)
which organism do the gene symbols come from
minw(500)
minimum weight (0 =< x =< 999) - not checked
sort_pairs(Spairs=true)
set to false to leave order of edges dependant on order of Symbols
sort_graph(Sort=true)
set to false for not sorting the results
?- Got = 'GO:0043552', gene_family( Got, hs, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = ['AMBRA1', 'ATG14', 'CCL19', 'CCL21', 'CCR7'|...],
SymbsLen = 31,
Graph = ['TNFAIP8L3', 'AMBRA1'-'ATG14':981, 'AMBRA1'-'PIK3R4':974|...],
GraphLen = 235.

% the following is flawed as it uses Got from mouse and tries to build the graph in human STRING...
?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = Graph, Graph = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8', 'Fgf2', 'Fgfr3', 'Fgr'|...],
SymbsLen = GraphLen, GraphLen = 28.

% this the correct way for running the first query for mouse:
?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, org(mouse) ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8'|...],
SymbsLen = 28,
Graph = ['Ambra1', 'Cdc42', 'Lyn', 'Nod2', 'Pdgfra', 'Sh3glb1', 'Tnfaip8l3', 'Vav3', ... : ...|...],
GraphLen = 117.
author
- nicos angelopoulos
version
- 0.1 2016/1/18
- 0.2 2019/4/8, added organism to incorporate mouse
- 0.3 2020/9/5, option cohese(), and debugs
To be done
- implement cohese() values: min, umax and umin
 exp_diffex(+Exp, -DEs, -NonDEs, +Opts)
Select a sublist of significantly, differentially expressed genes in an experiment.

Exp is in the form of a mtx/1 matrix. Selected (DEs) and rejected (NonDEs) are returned in common format, of either (default) Gene-Ev, where Ev is the expression value for Gene in Exp, or sub-matrices of Exp. The format is controlled by option as_pairs.

We use ev as short for expression value, to avoid confusion with exp which short for experiment.

Opts

as_non(AsNon=pvalue)
what to return as non differentially expressed: either everything in Gcnm (AsNon=all, default when EvLog is true), or only those with numeric pvalue (AsNon=pvalue, default when EvLog is false)
as_pairs(AsPairs=true)
whether to return pairs or matrices
de_max(DexMax=false)
puts a cap on the number of differentially expressed genes returned
de_mtx(DexMtx)
returns the selected matrix. if an atom, its taken to be a filename
exp_pv_cnm(ExpPcnm=adj.pvalue)
the experimental column (found in MsF) on which Pcut is applied as a filter
exp_pv_cut(Pcut=0.05)
p.value cut off for experimental input from MSstats
exp_ev_log(EvLog=true)
are the expression values log values
exp_ev_cnm(EvCnm)
expession value column name (log2FC if EvLog=true, expression otherwise)
exp_ev_cut_let(EvLet = -1)
expression values below (less or equal) which values are selected
exp_ev_cut_get(EvGet)
expression values above (greater or equal) which values are selected (defaults is 2 if EvLog is false, and 1 otherwise)
exp_ev_include_inf(IncInf=false)
include infinity values as diffexs ? (default is false if EvLog is false, and true otherwise)
gene_id_cnm(Gcnm=Symbols)
column name for the key value in the pair lists: DEs and NonDEs.
?- lib(mtx),
   absolute_file_name(pack('bio_analytics/data/silac/bt.csv'), CsvF),
   mtx(CsvF, Mtx, convert(true)),
   assert(mtx_data(Mtx)).

CsvF = '/usr/local/users/nicos/local/git/lib/swipl/pack/bio_analytics/data/silac/bt.csv',
Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', ...), row(..., ..., ..., ...)|...].

?- lib(debug_call),
   debug(testo),
   mtx_data(Mtx),
   debug_call(testo, dims, mtx/Mtx),
   exp_diffex(Mtx, DEPrs, NonDEPrs, []),
   debug_call(testo, length, de/DEPrs),
   debug_call(testo, length, nde/NonDEPrs).

% Dimensions for matrix,  (mtx) nR: 1245, nC: 4.
% Length for list, de: 426.
% Length for list, nde: 818.

DEPrs = ['CLEC1B'- -4.80614893171469, 'LGALS1'- -2.065877096524, ... - ...|...],
NonDEPrs = ['CNN2'- -0.69348026828097, 'CXCR4'-0.73039667221395, ... - ...|...].

?-
    mtx_data(Mtx),
    exp_diffex(Mtx, DEPrs, NonDEPrs, exp_pv_cut(0.01)),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 286.
% Length for list, nde: 958.

?-
    mtx_data(Mtx),
    Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf)],
    exp_diffex(Mtx, DEPrs, NonDEPrs, Opts),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 581.
% Length for list, nde: 663.

?- mtx_data(Mtx),
    Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf),as_pairs(false)],
    exp_diffex(Mtx, DEs, NonDEs, Opts),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 582.
% Length for list, nde: 664.

Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row(..., ..., ..., ...)|...],
Opts = [exp_ev_cut_let(inf), exp_ev_cut_get(-inf), as_pairs(false)],
DEs = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', 'CLEC1B', -4.80614893171469, 2.057e-37), row(..., ..., ..., ...)|...],
NonDEs = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('B4DUT8;Q6FHC3;Q6FHE4;Q99439;B4DDF4;Q53GK7;B4DN57;B4DHU5;K7ES69;H3BVI6;H3BQH0', 'CNN2', -0.69348026828097, 0.0814675), row(..., ..., ..., ...)|...].
author
- nicos angelopoulos
version
- 0.1 2019/5/2
- 0.2 2020/9/3, ability to return sub-matrices (9/14): diffex_mtx()
 exp_gene_family_string_graph(+Exp, +Family, -Graph, +Opts)
 exp_gene_family_string_graph(+Exp, +Family, ?DEPrs, ?NonDEPrs, -Graph, +Opts)
Generate and possibly plot the STRING graph of a known gene family as affected in a biological Exp_eriment. Each family gene is placed of the following states according to wether it was identified in Exp and how it was modified (in relation to background conditions):

Note that de-reguation trumps identification, that is, currently there is no distiction between genes that seen to be both significantly de-regulated and identified versus just those that are simply significantly de-regulated.

In the exp_gene_family_string_graph/6 version DEPrs and NonDEPrs can be provided, which optimises use in loops over families (see exp_go_over_string_graphs/4).

Opts

exp_ev_log(EvLog=true)
are the expression values (ev) log values ? (also passed to exp_diffex/4)
faint_factor(Fctr=1.2)
faint factor for non-significant nodes
include_non_present(IncP=true)
false excludes non-indentified family genes
include_non_significant(IncS=true)
false excluded non-significant family genes
node_colours(Clrs=clrs(red, green, yellow, khaki4, grey))
colours for the different types of nodes
org(Org=hs)
organism in which Family is looked in (and experiment was performed)
plot(Plot=true)
whether to plot the graph via wgraph_plot/2
wgraph_plot_opts(WgOpts=[])
options for wgraph_plot/2 (in preference to any defaults from Self)
wgraph_plot_defs(WgDefs=[])
options for wgraph_plot/2 (added to the end, after any defaults from Self)

These Options are passed to a number of other pack predicates.

See [pack('bio_analytics/examples/bt.pl')].

?- lib(real), lib(mtx),
   absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   Ppts = [ vjust= -1, node_size(3), mode="fruchtermanreingold", format(svg), stem(bt)],
   Opts = [ exp_ev_cut_let(inf), exp_ev_cut_get(-inf),
            include_non_present(false), include_non_significant(false),
            minw(200), wgraph_plot_opts(Ppts) ],
   exp_gene_family_string_graph( CsvF, autophagy, G, Opts ).

Produces file: bt.svg

author
- nicos angelopoulos
version
- 0.1 2019/4/15
See also
- exp_diffex/4
- gene_family/3
- wgraph_plot/2
 exp_go_over_string_graphs(+Exp, ?GoOver, ?Dir, -Opts)
Create string graphs for all the over-represented terms in GoOver.

When GoOver is a variable expr_go_over/3 is called to generate it.

Opts

dir_postfix(Psfx=go_strings)
postfix for outputs directory (when Dir is a variable)
go_id_clm(GoIdClm=1)
column id for over represented GO terms
ov_max(MaxOvs=false)
when a number, it is taken as the maximal integer of terms to plot graphs for
stem_type(Sty=go_pair_ord)
similar to go_string_graph/3, but different default, others: go_name, go_id. Here the length of GO terms (Len) is added to form go_pair_ord(I,Len) for forwarding
viz_de_opts(VizOpts=[])
options for restricting genes to visualise via exp_diff/4. Default does not restrict what genes are visualised. diffex_only = VizOpts restricts genes to significants only.
wgraph_plot_opts(WgOpts=WgOpts)
defaults to [vjust = -1, node_size(3), format(svg)].

Options are passed to exp_gene_family_string_graph/4.

?- debug(real).
?- debug(exp_go_over_string_graphs).
?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), Exp ),
   exp_go_over_string_graphs( Exp, Gov, Dir, [] ).

% Sending to R: pltv <- ggnet2(lp_adj,vjexp_go_over_mtxust = -1,size = 3,label = pl_v_1,color = pl_v_2,edge.size = pl_v_3,edge.color = "#BEAED4")
author
- nicos angelopoulos
version
- 0.1 2019/5/5
- 0.2 2020/9/6, option ov_max()
See also
- exp_go_over/3
- exp_gene_family_string_graph/4
 exp_go_over_string_graphs_multi(+Opts)
Run multiple exp_go_over_string_graphs/4 runs.

Can ran across a number of values for parameters and number of datasets. A directory is created for each combination of input parameters. Subdirectories are created for each input dataset.

Can be used with single parameter values as well, in order to add that parameter to the output directory's name.

Opts

data(Data=data)
which datasets to use, can be a directory
data_ext(DtExt=csv)
extension selector for when Data is a directory
dated(Dated=true)
adds date (ate_two_digit_dotted/1) stamp on output directories
dfx_opts(DfxOpts)
invariant options for exp_go_over_string_graphs/4
dfx_viz_opts(VizOpts)
invariant options for viz_de_opts() of exp_go_over_string_graphs/4
keep_mtx_sel(KpSel=true)
record, to file, the mtx of values significant values selected
keep_mtx_viz(KpViz=true)
record, to file, the mtx of values added for vizualisation (non significants)
mtx_opts(MtxOpts=[])
options to pass when reading the input matrices
multi(Keys, Vals)
parameters and associated possible values to pass to exp_go_over_string_graphs/4 (Vals, can be a single value). If multiple Keys are given, then they synchronise- always taking the
viz_diffex_only_constraint(DoCon=true)
enforce that when de_max() and exp_pv_cut() are the same, use viz_de_opts(diffex_only)
viz_pv_constraint(PvCon=true)
enforce that PvCut for non-significants must be equal or larger than that for significants

multi() options are processed in the order they are given, with later ones varied first in the runs.

?- Opts = [],
   exp_go_over_string_graphs_multi( Opts ).
author
- nicos angelopoulos
version
- 0:1 2020/09/17
To be done
- shorten to exp_multi/1 ?
 bio_analytics_version(-Vers, -Date)
Version and release date.
?- bio_analytics_version(V,D).
V = 0:4:0,
D = date(2020,9,18).

Undocumented predicates

The following predicates are exported, but not or incorrectly documented.

 exp_go_over(Arg1, Arg2, Arg3)