bio_db.pl -- Access, use and manage big, biological datasets.

Bio_db gives access to pre-packed biological databases and simplifies management and translation of biological data to Prolog friendly formats.

There are currently 2 major types of data supported: maps and graphs. Maps define product mappings, translations and memberships, while graphs define interactions which can be visualised as weighted graphs (see bio_db_data_predicate/4 for a statically generated list of all bio_db data predicates).

There are 2 prolog flags (see current_prolog_flag/2) that can control the behaviour of the library: bio_db_qcompile (def: true) and bio_db_interface (def: prolog). When the first one is set to false, it can disable the compilation to

Bio_db itself does not include any of the datasets. You can either download the separate pack(bio_db_repo), which contains all of the Prolog datasets, or let pack(bio_db) download the data files one at a time, as needed. As of version v4.4 there are 144 associated data predicates serving 76398976 records.

pack(bio_db_repo) can be installed as per usual via

?- pack_install( bio_db_repo ).

However, please note this will download all available tables (zipped) with a total download of 477Mb (v4.4). The first time a table is interrogated it is unzipped to the .pl version and the interpreter automatically also creates a .qlf. When all the tables have been accessed at least once, the pack will take around 6.3Gb (v4.4).

If you do not want to install all datasets, you should not install the pack as above. Instead pack(bio_db) will download individual data tables the first time you try to access some of its data. Auto-downloading works transparently to the user, where a data set is downloaded by simply calling the predicate.

For example

?- hgnc_homs_symb_hgnc( 'LMTK3', Hgnc ).
% prolog DB:table hgnc:hgnc_homs_symb_hgnc/2 is not installed, do you want to download (Y/n) ?
% Trying to get: url_file(http://www.stoics.org.uk/bio_db_repo/data/maps/hgnc/hgnc_homs_symb_hgnc.pl,/usr/local/users/nicos/local/git/test_bio_db/data/maps/hgnc/hgnc_homs_symb_hgnc.pl)
% Loading prolog db: /usr/local/users/nicos/local/git/test_bio_db/data/maps/hgnc/hgnc_homs_symb_hgnc.pl
Hgnc = 19295.

?- bio_db_interface( prosqlite ).
% Setting bio_db_interface prolog_flag, to: prosqlite
true.

?- hgnc_homs_prev_symb( Prv, Symb ).
% prosqlite DB:table hgnc:hgnc_homs_prev_symb/2 is not installed, do you want to download (Y/n) ?
% Trying to get: url_file(http://www.stoics.org.uk/bio_db_repo/data/maps/hgnc/hgnc_homs_prev_symb.sqlite,/usr/local/users/nicos/local/git/test_bio_db/data/maps/hgnc/hgnc_homs_prev_symb.sqlite)
false.

?- hgnc_homs_prev_symb( Prv, Symb ).
% prosqlite DB:table hgnc:hgnc_homs_prev_symb/2 is not installed, do you want to download (Y/n) ?
% Trying to get: url_file(http://www.stoics.org.uk/bio_db_repo/data/maps/hgnc/hgnc_homs_prev_symb.sqlite,/usr/local/users/nicos/local/git/test_bio_db/data/maps/hgnc/hgnc_homs_prev_symb.sqlite)
% Loading prosqlite db: /usr/local/users/nicos/local/git/test_bio_db/data/maps/hgnc/hgnc_homs_prev_symb.sqlite
Prv = 'A1BG-AS',
Symb = 'A1BG-AS1' .

See bio_db_data_predicate/4 for a way to enumerate all data predicates. Its source is in src/bio_db_data_predicate.pl, which also documents the cell structure in its comments.

As of version 2.0 bio_db is formed of a number of hierarchically organised cells that can be loaded independently. This is because there are now too many predicates, and it is also a device for better supporting organism specific data. There are currently two main cells, hs (human) and mouse, each sub-celled by data source of origin.

?- use_module(library(bio_db)).

Loads the whole interface (all cells), without the user needing to be aware of anything. The only difference is that the user will not be able to see all the module predicates at the first line of file pack(bio_db/prolog/bio_db.pl).

?- lib(bio_db).

Also loads everything.

?- lib(& bio_db).

Loads the skeleton of the module (cells usually load the module dependencies like this).

?- lib(& bio_db(hs)).

Loads the hs cell (and skeleton). hs comprises a number of sub-cells.

?- lib(& bio_db(hs(hgnc))).

Loads the hs/hgnc primary cell (and the skeleton).

With either of the above two loads the following becomes available; the former additionally loads predicates for human that are not hgnc based.

?- hgnc_homs_hgnc_symb( Hgnc, 'LMTK3' ).
Hgnc = 19295.

The following

?- use_module( pack('bio_db/cell/hs/hgnc') ).

also loads just the HGNC part of the human section of bio_db, but it is not a recommended way to do so.

Organisms

galg: Gallus gallus (red junglefowl), colloquial: chicken
homs: Homo sapiens, colloquial: human
mult: covers multiple organisms, longer form: multi
musm: Mus musculus, colloquial: mouse
suss: Sus scrofa (wild boar or Eurasian boar) colloquial: pig

Databases

Ensembl=ense: Homo sapiens genes and proteins. Gene and transcript mappings along with mapping to genomic location (latter not included in release yet)
HGNC=hgnc: HUGO Gene Nomenclature Committee, http://www.genenames.org/
NCBI=ncbi: NCBI
Uniprot=unip: Protein database.
String=strg: Protein-Protein interactions database
MGI=mgim: Mouse Genome Informatics, mouse specific datasets (last M for marker, their identifier)
Reactome=reac: Pathway database

For each database, a relation token with the same name denotes the field that is the unique identifier of that database.

Other relation tokens

symb: HGNC gene symbol (short, unique name for genes)
name: (HGNC) gene name (long, less standardised version of gene name)
prev: HGNC previous gene symbol
syno: HGNC gene symbol synonym
ensg: ensembl gene
enst: ensembl transcript
ensp: ensembl protein
gonm: GO name of a term
pros: Prosite protein family information
rnuc: RNA nucleic sequence ID to HGNC symbol.
unig: uniprotein gene id
sprt: Swiss-Prot part of Uniprot (high quality, curated)
trem: TrEMBL part of Uniprot (non curated)
mgim: MGI Marker (identifier for Mouse Genome Informatics Markers)
cgnc: Chicken gene nomenclature committee
taxo: taxonomy id (NCBI)
scnm: scientific names for species (NCBI)
gbnm: genbank common name (NCBI)

The name convention for map predicates is

   ?- hgnc_homs_hgnc_symb( Hgnc, Symb ).
   Hgnc = 1,
   Symb = 'A12M1~withdrawn' ;
   Hgnc = 2,
   Symb = 'A12M2~withdrawn' .

   ?- hgnc_homs_hgnc_symb( 19295, Symb ).
   Symb = 'LMTK3'.

   ?- hgnc_homs_symb_hgnc( 'LMTK3', Hgnc ).
   Hgnc = 19295.

Where the first hgnc corresponds to the source database, the second token, homs, identifies the organism, the third and fourth tokens are the fields of the map. Above, the second hgnc

The last part of the predicate name corresponds to the second (or all other) argument(s), which here is the unique Symbol assigned to a gene by HGNC. In the current version of bio_db, all tokens in map filenames are 4 characters long. Map data for predicate Pname from database DB are looked for in DB(Pname.Ext) (see bio_db_paths/0). Extension, Ext, depends on the current bio_db database interface (see bio_db_interface/1), and it is sqlite if the interface is prosqlite and pl otherwise.
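The naming convention can be illustrated with a small sketch. The helpers map_name_tokens/5, iface_ext/2 and map_relative_file/3 are hypothetical and are not part of bio_db. The relative layout maps/DB/Pname.Ext is an assumption taken from the download messages shown earlier, and the extension rule simply follows the description above.

```prolog
% Hypothetical helpers for illustration only; they are not exported by bio_db.

% map_name_tokens( +Pname, -Db, -Org, -Key, -Val )
% Split a map predicate name into its four 4-character tokens.
map_name_tokens( Pname, Db, Org, Key, Val ) :-
    atomic_list_concat( Tokens, '_', Pname ),
    Tokens = [Db,Org,Key,Val],
    forall( member(Token,Tokens), atom_length(Token,4) ).

% iface_ext( +Iface, -Ext )
% sqlite for the prosqlite interface and pl otherwise, as described above.
iface_ext( prosqlite, sqlite ) :- !.
iface_ext( _Iface, pl ).

% map_relative_file( +Pname, +Iface, -Rel )
% Relative location of the data file; the maps/Db/ layout is an assumption.
map_relative_file( Pname, Iface, Rel ) :-
    map_name_tokens( Pname, Db, _Org, _Key, _Val ),
    iface_ext( Iface, Ext ),
    atomic_list_concat( [maps,'/',Db,'/',Pname,'.',Ext], Rel ).
```

For instance

   ?- map_relative_file( hgnc_homs_symb_hgnc, prolog, Rel ).
   Rel = 'maps/hgnc/hgnc_homs_symb_hgnc.pl'.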

The name convention for graphs is

  ?- strg_homs_edge_symb( Symb1, Symb2, W ).
  Symb1 = 'A1BG',
  Symb2 = 'ABAT',
  W = 360 ;
  Symb1 = 'A1BG',
  Symb2 = 'ABCC6',
  W = 158 .

The first part indicates the database and the second one the organism/species. Graph data for predicate Pname from database DB are looked for in bio_db_data(graphs/DB/Pname.Ext) (see bio_db_paths/0). Extension, Ext, depends on the current bio_db database interface (see bio_db_interface/1), and it is sqlite if the interface is prosqlite and pl otherwise.
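As a sketch of how weighted edges might be used, the following filters the neighbours of a symbol by a minimum weight. To keep it runnable without the pack, it uses stand-in facts, sample_edge/3, copied from the example answers above; with bio_db loaded, strg_homs_edge_symb/3 would presumably take their place. The helper strong_neighbours/3 is hypothetical and only follows edges in the direction in which they are stored.

```prolog
% Stand-in facts copied from the example answers above, so that the sketch
% runs without the pack. They are not part of bio_db.
sample_edge( 'A1BG', 'ABAT', 360 ).
sample_edge( 'A1BG', 'ABCC6', 158 ).

% strong_neighbours( +Symb, +MinW, -Pairs )
% Pairs is a list of W-Neighbour pairs, heaviest first, for the stored
% edges that start at Symb and have weight of at least MinW.
strong_neighbours( Symb, MinW, Pairs ) :-
    findall( W-Other, (sample_edge(Symb,Other,W), W >= MinW), Unsorted ),
    sort( 0, @>=, Unsorted, Pairs ).
```

For instance

  ?- strong_neighbours( 'A1BG', 200, Pairs ).
  Pairs = [360-'ABAT'].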

Bio_db supports four db interfaces: prolog, prosqlite, berkeley and rocks. The first one is via Prolog fact bases, which is the default. The second is an interface to SQLite via pack(prosqlite) while the third and fourth work with the SWI-Prolog packs bdb and rocksdb. The underlying mechanisms are entirely transparent to the user. In order to use the sqlite data sources pack(prosqlite) needs to be installed via the pack manager

 ?- pack_install( prosqlite ).

The user can control which interface is in use with the bio_db_interface/1 predicate.

 ?- bio_db_interface( Curr ).
 Curr = prolog.

 ?- bio_db_interface( prosqlite ).

 ?- bio_db_interface( Curr ).
 Curr = prosqlite.

The type of the interface of a bio_db data predicate is determined by the interface at the time of first call.

Once the user has initiated the serving of a predicate by calling a goal to it, it is then possible to access information about the dataset, such as download date and source URL.

?- hgnc_homs_hgnc_symb( Hgnc, Symb ).
Hgnc = 1,
Symb = 'A12M1~withdrawn' .

?- bio_db_info( hgnc_homs_hgnc_symb/2, Key, Value ), write( Key-Value ), nl, fail.
interface-prolog
source_url-ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc_complete_set.txt.gz
datetime-datetime(2018,11,27,12,32,11)
data_types-data_types(integer,atom)
unique_lengths-unique_lengths(46023,46023,46023)
relation_type-relation_type(1,1)
header-row(HGNC ID,Approved Symbol)
false

As of version 2.0 there are two flags that can automate some of the interactions.

:- set_prolog_flag(bio_db_pl_from_zip, user).
:- set_prolog_flag(bio_db_del_zip, user).

In both cases the recognised values for the flags are: [user,true,false]. Value user prompts the user, while true proceeds with an implicit yes answer. The first flag automates conversion from .pl.zip to .pl (which will be the case the first time you access any dataset if you have installed bio_db_repo), and the second controls the deletion of the zip file once the .pl file has been created.
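A minimal sketch of a non-prompting set-up for unzipping is shown below. It assumes the flags are set before bio_db first accesses a dataset, for example from an initialisation file; that placement is an assumption rather than something stated above.

```prolog
% Assumed to run before bio_db first accesses a dataset, e.g. in an init file.

% Unzip .pl.zip files without prompting (implicit yes).
:- set_prolog_flag( bio_db_pl_from_zip, true ).

% Keep prompting the user before a zip file is deleted.
:- set_prolog_flag( bio_db_del_zip, user ).
```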

As of version 4.0 there are 91 associated data predicates serving 55444729 records.

Thanks to Jan Wielemaker for a retractall fix and for code for fast loading of precompiled fact bases (and indeed for the changes in SWI that made this possible).

author: - nicos angelopoulos
version: - 0.5 2016/09/11; - 0.7 2016/10/21, experimenting with distros in github; - 0.9 2017/03/10, small changes for pack(requires) -> pack(lib) v1.1; - 1.0 2017/10/09, to coincide with ppdp paper presentation; - 2.1 2018/11/27, introduces cells and mouse data (and fixed dependency of 2.0); - 2.4 2019/04/02, test: bio_db_stats, new mouse db predicates, iface: bio_db_data_predicate/4; - 2.5 2019/04/22, edge_strg_symb/4; bio_db_organism/1,2; go_id/2,3; - 2.6 2019/05/08, changed to organism alias interface; evidence in gont maps; - 2.7 2019/05/12, edge_strg_symb/4 -> org_edge_strg_symb/4; - 3.0 2019/05/15, paper submission; - 3.1 2020/03/09, fixed lib; no unigene; - 3.2 2020/09/18, include mouse ense + fixes/updates on building scripts; - 3:4 2021/05/10, removed edge_gont_includes/2 (reciprocal of is_a), and edge_gont_consists_of/2 (reciprocal of part_of/2); - 3:6 2021/12/04, fixed pack_errors and map_ense_mouse_enst_chrl/5; bio_db_stats.pl version 0.2; - 4:1 2022/12/29, huge re-config of data predicate names + reactome (maps) + chicken; - 4:2 2023/06/06, support for pig; - 4:3 2023/10/05, mult for multi organisms; vgnc database; ncbi taxonomy db; build-reorganisation; - 4:4 2024/04/05, db(ncbi) preds were completely overhauled, better and more complete db(reactome) support, fixed pig cells; - 4:5 2024/04/05, fixed certificate issue when downloading individual files of bio_db_repo
See also: - doc/Releases.txt for version details; - bio_db_data_predicate/4 for a way to enumerate all data predicates; - cell/ for the definitions of the data predicates

bio_db_organism(?Org)

Colloquial name for organisms supported by bio_db.

Human is considered the default organism and returned first.

?- bio_db_organism(Org).
Org = human ;
Org = chicken ;
Org = mouse ;
Org = multi ;
Org = pig.

author: - nicos angelopoulos
version: - 0:2 2019/4/8; - 0:3 2022/12/29, changed to colloquials and added chicken, were hs and mouse.; - 0:4 2023/6/3, added pig

bio_db_organism(?KnownAs, ?Canon)

bio_db_organism(?KnownAs, ?Token, ?Canon)

Canon is the canonical, colloquial, representation of organism KnownAs and Token is a 4 letter bio_db representation of that organism.

KnownAs is either a known colloquial name tabled in bio_db_organism/1, an alias to an organism or an organism token. Token is the token used in bio_db predicate, file and directory names for this organism.

?- bio_db_organism(KnownAs,Org), write(KnownAs:Org), nl, fail.
hs:human
gallus:chicken
gallus_gallus:chicken
gg6a:chicken
human:human
chicken:chicken
mouse:mouse
galg:chicken
homs:human
musm:mouse
suss:pig
mult:multi

?- bio_db_organism(hs, Org).
Org = human.

?- bio_db_organism(KnownAs, Token, human).
KnownAs = hs,
Token = homs ;
KnownAs = human,
Token = homs ;
KnownAs = Token, Token = homs.

?- hgnc_homs_symb_hgnc( 'LMTK3', Hgnc ).
Hgnc = 19295.

author: - nicos angelopoulos
version: - 0.2 2019/5/2; - 0.3 2022/12/25, added /3 version, and added many aliases

bio_db_organism_alias(?Alias, -Org)

Alias is a known and supported alternative name for the canonical Org name for an organism.

?- bio_db_organism_alias( human, hs ).
true.

Note this used to be bio_db_organism/2 which has now (19.05.02) changed.

author: - nicos angelopoulos
version: - 0:1 2019/5/2; - 0:2 2022/12/20, gallus also known as chicken and gallus_gallus

bio_db_paths

Initialisation call that sets up the path aliases.

There are two main directory repositories the predicate deals with: (a) the bio_db installed databases root (alias bio_db_data), and (b) the root of downloaded databases (alias bio_db_downloads). Optionally a top directory of which both (a) and (b) are subdirs can be defined (alias bio_db). The default value for alias bio_db is a made-up pack directory pack(bio_db_repo). The default for bio_db_data is sub directory data of alias bio_db, while bio_db_downloads defaults to sub directory downloads of the alias bio_db. The canonical subdirectory name for (a) is data and for (b) is downloads.

pack(bio_db_repo) can also be installed as a complete package from SWI's manager.

?- pack_install( bio_db_repo ).

This will install all the Prolog database files. The single tar and gzipped file is 246 Mb in size and the fully expanded version of a Prolog installation can take up to 3.1Gb. The precise size depends on how many tables are accessed at least once (each producing an expanded .pl and a .qlf file).

Directory locations for (a) and (b) above can be given as either prolog flags with key bio_db_root and bio_dn_root respectively or via environment variables BioDbRoot and BioDnRoot.
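As a hedged sketch, the two locations could be set through the flags as follows. Both directory names are hypothetical, and it is assumed (not stated above) that the flags need to be in place before library(bio_db) is loaded, so that the path initialisation can pick them up.

```prolog
% Both directories are hypothetical; replace them with real locations.
% Assumed to run before library(bio_db) is loaded.
:- set_prolog_flag( bio_db_root, '/data/bio_db/data' ).
:- set_prolog_flag( bio_dn_root, '/data/bio_db/downloads' ).

% Alternative via the environment variables, here set from within Prolog:
% :- setenv( 'BioDbRoot', '/data/bio_db/data' ).
% :- setenv( 'BioDnRoot', '/data/bio_db/downloads' ).
```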

Installed root alias(bio_db_data) contains sub-dirs

graphs: for graphs; string and reactome
maps: for all the supported maps

The above are mapped to aliases bio_graphs and bio_maps respectively. Within each of these sub-directories there is further structure based on the database from which the set originated.

Downloaded root alias(bio_db_downloads) may contain sub-dirs

hgnc: data from HGNC database
ncbi: data from NCBI database
reactome: data from Reactome database
string: data from string database
uniprot: protein data from EBI
ense: ensembl database

Alias bio_db_downloads is only useful if you are downloading data files directly from the supported databases.

See

?- absolute_file_name( packs(bio_db(auxil)), Auxil ), ls( Auxil ).

for examples of how these can be used.

For most users these aliases are not needed as the library manages them automatically.

To be done: - transfer datasets and downloads to the new pack location when running on a newly installed SWI-Prolog upgrade.

bio_db_version(-Vers, -Date)

Version Mj:Mn:Fx, and release date date(Y,M,D).

?- bio_db_version( V, D ).
V = 4:5:0,
D = date(2024, 4, 5).
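
Since the version is a Mj:Mn:Fx term, a minimum-version check can be sketched as below. The helper version_at_least/2 is hypothetical and not part of bio_db; it compares the three components in order using the standard order of terms.

```prolog
% Hypothetical helper, not part of bio_db.

% version_at_least( +Version, +Required )
% Both arguments are Mj:Mn:Fx terms. The components are compared
% left to right, major number first.
version_at_least( Mj:Mn:Fx, RMj:RMn:RFx ) :-
    [Mj,Mn,Fx] @>= [RMj,RMn,RFx].
```

With bio_db loaded, it could be used as

?- bio_db_version( V, _ ), version_at_least( V, 4:4:0 ).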

author: - Nicos Angelopoulos
version: - 4:5 2024/4/5, fixed broken download of individual repo data preds
See also: - bio_db_data_predicate/4 (which should be generated for each new version); - doc/Releases.txt for more detail on change log; - module documentation for brief comments on versioning history of this pack

bio_db_citation(-Atom, -Bibterm)

This predicate succeeds once for each publication related to this library. Atom is the atom representation suitable for printing while Bibterm is a bibtex(Type,Key,Pairs) term of the same publication. Produces all related publications on backtracking.

bio_db_source(?Type, ?Db)

True if Db is a source database for bio_db serving predicates of type Type. Type is either maps or graphs.

The databases are

hgnc
gont
ncbi
string
unip

To be done: - fixme: this gets out of synch for new dbs, maybe have it in other location or throw a message if it fails ?

bio_db_interface(?Iface, -Status)

Interrogate the installation status (true or false) of bio_db's known interfaces. Status is true if the interface dependencies are installed and the interface can be used, and false otherwise.

Can be used to enumerate all known or installed interfaces.

 ?- findall( Iface, bio_db_interface(Iface,_), Ifaces ).
 Ifaces = [prolog, berkeley, prosqlite, rocks].

bio_db_interface(?Iface)

Interrogate or set the current interface for bio_db database predicates. By default Iface = prolog. Also supported: prosqlite (needs pack proSQLite), berkeley (needs SWI's own library(bdb)) and rocks (needs pack(rocksdb)).

?- bio_db_interface( Iface ).
Iface = prolog.

?- debug( bio_db ).
true.

?- bio_db_interface( wrong ).
% Could not set bio_db_interface prolog_flag, to: wrong, which in not one of: [prolog,prosqlite,berkeley,rocks]
false.

?- bio_db_interface( Iface ).
Iface = prolog.

?- hgnc_homs_symb_hgnc( 'LMTK3', Hgnc ).
% Loading prolog db: /usr/local/users/nicos/local/git/lib/swipl-7.1.32/pack/bio_db_repo/data/maps/hgnc/hgnc_homs_symb_hgnc.pl
Hgnc = 19295.

?- bio_db_interface( prosqlite ).
% Setting bio_db_interface prolog_flag, to: prosqlite
true.

?- hgnc_homs_prev_symb( Prev, Symb ).
% prosqlite DB:table hgnc:hgnc_homs_prev_symb/2 is not installed, do you want to download (Y/n) ?
% Execution Aborted
?- hgnc_homs_prev_symb( Prev, Symb ).
% Loading prosqlite db: /usr/local/users/nicos/local/git/lib/swipl-7.1.32/pack/bio_db_repo/data/maps/hgnc/hgnc_homs_prev_symb.sqlite
Prev = 'A1BG-AS',
Symb = 'A1BG-AS1' ;

In which case Iface is prosqlite.

bio_db_install(+PidOrPname, +Iface)

bio_db_install(+PidOrPname, +Iface, +Opts)

Install the interface (Iface) for the bio_db database that corresponds to a predicate identifier (Pid) or a predicate name (Pname). Note that it is not necessary to do this in advance, as the library will auto load missing Iface and Pid combinations when first interrogated.

Opts

interactive(Ictive=true): set false to accept default interactions
org(Org=hs): organism

bio_db_info(+Pid, ?Iface)

bio_db_info(+Pid, ?Key, -Value)

bio_db_info(+Iface, +Pid, ?Key, -Value)

Retrieve information about bio_db database predicates.

When Iface is not given, Key and Value are those of the interface under which Pid is currently open for access. The predicate errors if Pid is not open for serving yet.

The bio_db_info/2 version succeeds for every interface under which Pid is installed; it is simply a shortcut to: bio_db_info( Iface, Pid, _, _ ).

The Key-Value information returned are about the particular data predicate as saved in the specific backend.

Key

source_url: an atomic value of the URL
datetime: datetime/6 term
data_types: data_types/n giving the primary type for each argument in the data table
header: row/n term, where n is the number of columns in the data table
unique_lengths: unique_lengths/3 term, lengths for the ordered sets of: Ks, Vs and KVs
relation_type(From, To): where From and To take values in 1 and m

?- bio_db_info( Iface, hgnc_homs_hgnc_symb/2, Key, Value), write( Iface:Key:Value ), nl, fail.
prolog:source_url:ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc_complete_set.txt.gz
prolog:datetime:datetime(2016,9,10,0,2,14)
prolog:data_types:data_types(integer,atom)
prolog:unique_lengths:unique_lengths(44266,44266,44266)
prolog:relation_type:relation_type(1,1)
prolog:header:row(HGNC ID,Approved Symbol)
prosqlite:source_url:ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc_complete_set.txt.gz
prosqlite:datetime:datetime(2016,9,10,0,2,14)
prosqlite:data_types:data_types(integer,atom)
prosqlite:unique_lengths:unique_lengths(44266,44266,44266)
prosqlite:relation_type:relation_type(1,1)
prosqlite:header:row(HGNC ID,Approved Symbol)

bio_db_close(+Pid)

Close the current serving of predicate Pid. Next time a Pid Goal is called the current interface (bio_db_interface/1) will be used to establish a new server and resolve the query.

The predicate throws an error if Pid does not correspond to a db_predicate or if it is not currently served by any of the backends.

?- bio_db_interface(prosqlite).
?- hgnc_homs_hgnc_symb( Hgnc, Symb ).
Hgnc = 506,
Symb = 'ANT3~withdrawn' .

?- bio_db_close( hgnc_homs_hgnc_symb/2 ).
?- bio_db_interface( prolog ).
?- hgnc_homs_hgnc_symb( Hgnc, Symb ).
Hgnc = 1,
Symb = 'A12M1~withdrawn' .
?- bio_db_close(hgnc_homs_hgnc_symb/2).
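
The close-then-switch sequence above can be wrapped in a small helper. This is only a sketch: bio_db_reopen/2 is a hypothetical name, the code assumes library(bio_db) has already been loaded, and it deliberately ignores the error bio_db_close/1 throws when Pid is not being served.

```prolog
% Hypothetical helper; assumes library(bio_db) is already loaded.

% bio_db_reopen( +Pid, +Iface )
% Stop serving Pid (if it is being served) and make Iface the current
% interface, so that the next call to Pid is resolved through Iface.
bio_db_reopen( Pid, Iface ) :-
    catch( bio_db_close(Pid), _Error, true ),   % not served yet: carry on
    bio_db_interface( Iface ).
```

With bio_db loaded, it could be called as

?- bio_db_reopen( hgnc_homs_hgnc_symb/2, prosqlite ).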

bio_db_close_connections

Close all currently open bio_db backend connections.

This is called by bio_db at halt.

bio_db_db_predicate(?Pid)

True if Pid is a predicate identifier which is defined in the current bio_db session, and whose name contains 4 underscore-separated tokens, each of length 4. When Pid is a free variable all such predicate identifiers are returned on backtracking.

For a statically produced list of all data predicates in bio_db see bio_db_data_predicate/4.

  ?- bio_db_db_predicate( hgnc_homs_hgnc_symb/2 ).
  true.

  ?- bio_db_db_predicate( X ).
  X = hgnc_homs_symb_ncbi/2 ;
  X = ense_homs_enst_ensg/2 ;
  ...

Undocumented predicates

The following predicates are exported, but not or incorrectly documented.

bio_db_data_predicate(Arg1, Arg2, Arg3, Arg4)
go_id(Arg1, Arg2, Arg3)
ncbi_symb(Arg1, Arg2, Arg3)
is_symbol(Arg1, Arg2)
go_id(Arg1, Arg2)
bio_db_org_in_opts(Arg1, Arg2)
org_edge_strg_symb(Arg1, Arg2, Arg3, Arg4)