A semantic data integration toolbox for biodiversity sciences.
Clone the project repository
$ git clone https://github.com/nleguillarme/inteGraph.git
Run install.sh
$ cd inteGraph ; sh install.sh
To run inteGraph in Docker you just need to execute the following:
$ make up
This will create and start a set of containers that you can list with the docker ps command:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
21f4c8d28833 integraph-airflow-webserver "/usr/bin/dumb-init …" 4 minutes ago Up 3 minutes (healthy) 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp integraph-airflow-webserver-1
ab066b04f9b8 integraph-airflow-scheduler "/usr/bin/dumb-init …" 4 minutes ago Up 3 minutes (healthy) 8080/tcp integraph-airflow-scheduler-1
3a5a84452e98 gnames/gognparser:latest "gnparser -p 8778" 4 minutes ago Up 4 minutes 0.0.0.0:8778->8778/tcp, :::8778->8778/tcp integraph-gnparser-1
70b2ec620cda postgres:13 "docker-entrypoint.s…" 8 minutes ago Up 8 minutes (healthy) 5432/tcp integraph-postgres-1
In particular, this starts an instance of the Airflow scheduler and webserver. The webserver is available at http://localhost:8080.
To exit inteGraph and properly close all containers, run the following command:
$ make down
The structure of a typical inteGraph project looks like the following:
my-project/
|-- graph.cfg
|-- connections.json
|-- sources.ignore
|
|-- sources/
| |-- source_1/
| | |-- source.cfg
| | |-- mapping.xlsx
| |
| |-- source_2/
| | |-- source.cfg
| | |-- mapping.xlsx
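The layout above can be scaffolded with a few shell commands (directory and file names follow the tree shown here; `source_1` is a placeholder for your own source identifier):

```shell
# Create the top-level project directory and a first data source directory
mkdir -p my-project/sources/source_1

# Project-level configuration files
touch my-project/graph.cfg my-project/connections.json my-project/sources.ignore

# Per-source configuration file and mapping spreadsheet
touch my-project/sources/source_1/source.cfg
touch my-project/sources/source_1/mapping.xlsx
```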
graph.cfg
inteGraph uses an INI-like file to configure the knowledge graph creation process. This configuration file can contain the following sections:
The [graph] section describes the knowledge graph itself:

Property | Description | Values |
---|---|---|
id | The base IRI of the knowledge graph. It will be used to generate a graph label IRI for each data source. | iri, example http://leca.osug.fr/my_kg |
The [sources] section can be empty, in which case inteGraph will use the default property values.

Property | Description | Values |
---|---|---|
dir | The path to the directory containing the configuration of the data sources. It can be absolute or relative to the directory containing graph.cfg. | path, default sources |
The [load] section contains configuration properties for connecting to the triplestore. The properties in this section vary depending on the triplestore implementation.

Property | Description | Values |
---|---|---|
id | The identifier of the triplestore implementation. | {graphdb, rdfox} |
conn_type | The type of connection. | {http} |
host | The URL or IP address of the host. | url or ip, example 0.0.0.0 |
port | The port number. | int, example 7200 |
user | The user login. The user should have read/write permissions to the repository. | str, optional, example integraph |
password | The user password. | str, optional, example p@ssw0rd |
repository | The identifier of the target repository. | str, example my-kg |
The [ontologies] section is optional and is used to declare the ontologies that will be searched during the semantic annotation of your data. Each line is a key-value pair shortname=iri, where shortname will be used as the ontology’s internal identifier and iri is the ontology’s IRI or a valid path to the ontology.
Below is an example graph configuration file:
[graph]
id=http://leca.osug.fr/my_kg
[sources]
dir=sources
[load]
id=graphdb
conn_type=http
host=129.88.204.79
port=7200
repository=my-kg
[ontologies]
sfwo="http://purl.org/sfwo/sfwo.owl"
Each data source to be integrated into the knowledge graph must have its own INI-like configuration file. To add a new data source to your inteGraph project, create a new directory in your sources directory, and create a file source.cfg in this directory. This configuration file can contain the following sections:
The [source] section identifies the data source:

Property | Description | Values |
---|---|---|
id | The internal unique identifier of the data source. | str, example source_1 |
The [source.metadata] subsection can be empty. You can use it to specify metadata about the data source using any of the fifteen terms of the Dublin Core™ Metadata Element Set (also known as “the Dublin Core”).
Below is an example [source.metadata] section specifying metadata for the BETSI database:
[source.metadata]
title=A database for soil invertebrate biological and ecological traits
creator=Hedde et al.
subject=araneae, carabidae, chilopoda, diplopoda, earthworms, isopoda
description=The Biological and Ecological Traits of Soil Invertebrates database (BETSI, https://portail.betsi.cnrs.fr/) is a European database dedicated specifically to soil organisms’ traits.
date=2021
format=csv
identifier=hal-03581637
language=en
The [annotators] section allows you to define any number of semantic annotators. The role of a semantic annotator is to match a piece of data with the concept in the target ontology/taxonomy that best captures its meaning.
To create a new annotator, add a new subsection [annotators.MyAnnotator], where MyAnnotator should be a unique identifier for the annotator. This subsection can contain the following properties:
Property | Description | Values |
---|---|---|
type | The type of the semantic annotator. | {taxonomy, ontology, map} |
source | If type=taxonomy, the identifier of the taxonomy used by the data source. | {NCBI, GBIF, IF}, optional |
targets | If type=taxonomy, an ordered list containing the identifiers of the target taxonomies. | list, optional, default ["NCBI", "GBIF", "IF", "EOL", "OTT"] |
filter_on_ranks | If type=taxonomy, a list of taxa which will be used to restrict the search when trying to match taxa on the basis of their scientific names. | list, optional, example ["Eukaryota", "Protista", "Protozoa"] |
multiple_match | If type=taxonomy, how multiple matches are handled. | {strict, warning, ignore}, optional, default warning |
shortname | If type=ontology, the short name of the target ontology (see Graph configuration). | string, example sfwo |
mapping_file | If type=map, the name of a YAML file containing label-IRI mappings. | path, example mapping.yml |
Below is an example of an [annotators] section containing an example declaration for each type of semantic annotator:
[annotators]
[annotators.TaxonAnnotator]
type=taxonomy
filter_on_ranks=["Metazoa", "Animalia"]
targets=["NCBI", "GBIF", "OTT"]
[annotators.YAMLMap]
type=map
mapping_file=mapping.yml
[annotators.SFWO]
type=ontology
shortname=sfwo
Note on the different types of semantic annotators: [TODO]
The [extract] section allows you to configure the part of the pipeline responsible for extracting or copying raw data from the data sources and storing it in a staging area. A staging area is an intermediate location for the temporary storage of extracted and transformed data.
In the current version of inteGraph, data can be extracted from two types of data sources: files and REST APIs.
The [extract.file] subsection contains configuration properties for extracting data from a file-like data source:
Property | Description | Values |
---|---|---|
file_path | The local path or URL of the data file. It can be absolute or relative to the directory containing source.cfg. | path |
file_name | | string, optional |
file | | string, optional |
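For instance, a file-based extraction pointing at a CSV file stored alongside the source configuration might look like this (the file name is illustrative):

```ini
[extract]
[extract.file]
file_path=data/my_data.csv
```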
The [extract.api] subsection contains configuration properties for extracting data from a REST API data source:
Property | Description | Values |
---|---|---|
conn_id | The connection identifier. | string, example globi |
endpoint | The API endpoint. | string, optional |
query | The query specifying what data is returned from the remote data source. | string |
headers | Headers containing additional information about the request. | dict, optional |
limit | The maximum number of results per page in case of paginated results. | int, optional, default None |
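As an illustration, an API-based extraction might look like the following sketch. The endpoint and query values here are hypothetical placeholders, and the connection identified by conn_id is assumed to be declared in the project's connections.json:

```ini
[extract]
[extract.api]
conn_id=globi
endpoint=/interaction
query=sourceTaxon=Carabidae
limit=1000
```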
The [transform] section allows you to configure the part of the pipeline responsible for transforming the extracted data stored in the staging area into an RDF graph.
Property | Description | Values |
---|---|---|
format | The format of the extracted data. | {csv} |
delimiter | The character to treat as the delimiter/separator. | string, optional, default "," |
chunksize | The size of the data chunks processed in parallel. | int, optional, default 1000 |
Data transformation involves a series of operations, some of which are optional:
1) cleanse
2) ets
3) annotate
4) triplify
The [transform.cleanse] subsection is optional. It is used to specify the path to an external script (Python or R) containing data cleansing operations specific to the data source.
Property | Description | Values |
---|---|---|
script | The path to a Python or R script containing data cleansing operations. It can be absolute or relative to the directory containing source.cfg. | path, optional |
Note on writing a valid data cleansing script : [TODO]
The [transform.ets] subsection is optional. It provides configuration properties for formatting data in accordance with the Ecological Trait-data Standard.
Property | Description | Values |
---|---|---|
na | The string(s) to recognize as NaN. | string or list, optional |
taxon_col | The name of the column containing scientific names for taxa. | string, optional |
measurement_cols | The names of the columns containing the measured trait values. | list, optional |
additional_cols | The names of additional columns to be retained in the input data. | list, optional |
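For example, a [transform.ets] subsection for a table whose scientific names live in a dedicated column might look like this sketch (the column names are hypothetical and would need to match your input data):

```ini
[transform.ets]
na=["NA", "n/a"]
taxon_col=taxon_name
measurement_cols=["body_size", "feeding_guild"]
```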
The [transform.annotate] subsection is used to associate a semantic annotator (or a sequence of annotators) with a subset of your data describing a specific entity, e.g. a taxon, an ecological interaction, a functional trait, etc.
To annotate an entity, add a new subsection [transform.annotate.MyEntity], where MyEntity should be a unique identifier for the entity. This subsection can contain the following properties:
Property | Description | Values |
---|---|---|
label | The name of the column containing the label of the entity. | string, optional |
id | The name of the column containing the identifier of the entity in the source taxonomy/ontology. | string, optional |
annotators | An ordered list of semantic annotators. | list, example ["SFWO", "YAMLMap"] |
The [transform.triplify] subsection lets you specify the path to the spreadsheet containing your RDF mapping rules, i.e. the rules used to transform tabular data into an RDF graph. inteGraph uses Mapeathor to translate mapping rules specified in spreadsheets into RDF Mapping Language (RML) rules, and Morph-KGC to execute the RML rules and construct (materialize) the RDF graph.
Property | Description | Values |
---|---|---|
mapping | The path to the spreadsheet containing mapping rules. It can be absolute or relative to the directory containing source.cfg. | path, example mapping.xlsx |
Below is a complete example of a source configuration file for the BETSI database:
[source]
id=betsi
[source.metadata]
title=A database for soil invertebrate biological and ecological traits
creator=Hedde et al
subject=araneae, carabidae, chilopoda, diplopoda, earthworms, isopoda
description=The Biological and Ecological Traits of Soil Invertebrates database (BETSI, https://portail.betsi.cnrs.fr/) is a European database dedicated specifically to soil organisms’ traits.
date=2021
format=csv
identifier=hal-03581637
language=en
[annotators]
[annotators.TaxonAnnotator]
type=taxonomy
filter_on_ranks=["Metazoa", "Animalia"]
targets=["NCBI", "GBIF", "OTT"]
[annotators.YAMLMap]
type=map
mapping_file=mapping.yml
[annotators.SFWO]
type=ontology
shortname=sfwo
[extract]
[extract.file]
file_path=data/BETSI_220221.csv
[transform]
format=csv
delimiter=","
chunksize=1000
[transform.cleanse]
script="clean.py"
[transform.ets]
na=NA
[transform.annotate]
[transform.annotate.taxon]
label=taxon_name
annotators=["TaxonAnnotator"]
[transform.annotate.trait]
label=attribute_trait
annotators=["YAMLMap", "SFWO"]
[transform.triplify]
mapping=mapping.xlsx
Coming soon.