A semantic data integration toolbox for biodiversity sciences.

User manual

  1. Installation
  2. Running inteGraph
  3. Create a new project
  4. Pipeline execution and monitoring

Installation

Clone the project repository

$ git clone https://github.com/nleguillarme/inteGraph.git

Run install.sh

$ cd inteGraph ; sh install.sh

Running inteGraph

To run inteGraph in Docker, execute the following command:

$ make up

This will create and start a set of containers that you can list with the docker ps command:

$ docker ps
CONTAINER ID   IMAGE                         COMMAND                  CREATED         STATUS                   PORTS                                       NAMES
21f4c8d28833   integraph-airflow-webserver   "/usr/bin/dumb-init …"   4 minutes ago   Up 3 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   integraph-airflow-webserver-1
ab066b04f9b8   integraph-airflow-scheduler   "/usr/bin/dumb-init …"   4 minutes ago   Up 3 minutes (healthy)   8080/tcp                                    integraph-airflow-scheduler-1
3a5a84452e98   gnames/gognparser:latest      "gnparser -p 8778"       4 minutes ago   Up 4 minutes             0.0.0.0:8778->8778/tcp, :::8778->8778/tcp   integraph-gnparser-1
70b2ec620cda   postgres:13                   "docker-entrypoint.s…"   8 minutes ago   Up 8 minutes (healthy)   5432/tcp                                    integraph-postgres-1

In particular, this starts an instance of the Airflow scheduler and webserver. The webserver is available at http://localhost:8080.

To exit inteGraph and properly close all containers, run the following command:

$ make down

Create a new project

The structure of a typical inteGraph project looks like the following:

my-project/
|-- graph.cfg
|-- connections.json
|-- sources.ignore
|
|-- sources/
|   |-- source_1/
|   |   |-- source.cfg
|   |   |-- mapping.xlsx
|   |   
|   |-- source_2/
|   |   |-- source.cfg
|   |   |-- mapping.xlsx

Graph configuration: graph.cfg

inteGraph uses an INI-like file to configure the knowledge graph creation process. This configuration file can contain the following sections:

[graph]

id
    The base IRI of the knowledge graph. It will be used to generate a graph label IRI for each data source.
    Values: iri, example: http://leca.osug.fr/my_kg

[sources]

This section can be empty, in which case inteGraph will use the default property values.

dir
    The path to the directory containing the configuration of the data sources. It can be absolute or relative to the directory containing graph.cfg.
    Values: path, default: sources

[load]

This section contains configuration properties for connecting to the triplestore. The properties in this section vary depending on the triplestore implementation.

id
    The identifier of the triplestore implementation.
    Values: {graphdb, rdfox}
conn_type
    The type of connection.
    Values: {http}
host
    The URL or IP address of the host.
    Values: url or ip, example: 0.0.0.0
port
    The port number.
    Values: int, example: 7200
user
    The user login. The user should have read/write permissions to the repository.
    Values: str, optional, example: integraph
password
    The user password.
    Values: str, optional, example: p@ssw0rd
repository
    The identifier of the target repository.
    Values: str, example: my-kg

[ontologies]

This section is optional and is used to declare the ontologies that will be searched during the semantic annotation of your data. Each line is a key-value pair shortname=iri where shortname will be used as the ontology’s internal identifier and iri is the ontology’s IRI or a valid path to the ontology.

Below is an example graph configuration file:

[graph]
id=http://leca.osug.fr/my_kg

[sources]
dir=sources

[load]
id=graphdb
conn_type=http
host=129.88.204.79
port=7200
repository=my-kg

[ontologies]
sfwo="http://purl.org/sfwo/sfwo.owl"

Data source configuration

Each data source to be integrated into the knowledge graph must have its own INI-like configuration file. To add a new data source to your inteGraph project, create a new directory in your sources directory, and create a file source.cfg in this directory. This configuration file can contain the following sections:

[source]

id
    The internal unique identifier of the data source.
    Values: str, example: source_1

[source.metadata]

This subsection can be empty. You can use it to specify metadata about the data source using any of the fifteen terms in the Dublin Core™ Metadata Element Set (also known as “the Dublin Core”).

Below is an example [source.metadata] section specifying metadata for the BETSI database:

[source.metadata]
title=A database for soil invertebrate biological and ecological traits
creator=Hedde et al.
subject=araneae, carabidae, chilopoda, diplopoda, earthworms, isopoda
description=The Biological and Ecological Traits of Soil Invertebrates database (BETSI, https://portail.betsi.cnrs.fr/) is a European database dedicated specifically to soil organisms’ traits.
date=2021
format=csv
identifier=hal-03581637
language=en

[annotators]

This section allows you to define any number of semantic annotators. The role of a semantic annotator is to match a piece of data with the concept in the target ontology/taxonomy that best captures its meaning.

To create a new annotator, add a new subsection [annotators.MyAnnotator] where MyAnnotator should be a unique identifier for the annotator. This subsection can contain the following properties:

type
    The type of the semantic annotator.
    Values: {taxonomy, ontology, map}
source
    If type=taxonomy, the identifier of the taxonomy used by the data source.
    Values: {NCBI, GBIF, IF}, optional
targets
    If type=taxonomy, an ordered list containing the identifiers of the target taxonomies.
    Values: list, optional, default: ["NCBI", "GBIF", "IF", "EOL", "OTT"]
filter_on_ranks
    If type=taxonomy, a list of taxa used to restrict the search when matching taxa on the basis of their scientific names.
    Values: list, optional, example: ["Eukaryota", "Protista", "Protozoa"]
multiple_match
    If type=taxonomy, how multiple matches are handled.
    Values: {strict, warning, ignore}, optional, default: warning
shortname
    If type=ontology, the short name of the target ontology (see Graph configuration).
    Values: string, example: sfwo
mapping_file
    If type=map, the name of a YAML file containing label-IRI mappings.
    Values: path, example: mapping.yml

Below is an example of an [annotators] section containing an example declaration for each type of semantic annotator:

[annotators]

[annotators.TaxonAnnotator]
type=taxonomy
filter_on_ranks=["Metazoa", "Animalia"]
targets=["NCBI", "GBIF", "OTT"]

[annotators.YAMLMap]
type=map
mapping_file=mapping.yml

[annotators.SFWO]
type=ontology
shortname=sfwo
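
The mapping file referenced by a map annotator contains label-IRI mappings in YAML. The exact layout inteGraph expects is not documented here; below is a minimal sketch, assuming a flat label-to-IRI map, with made-up trait labels and placeholder IRIs:

# mapping.yml (hypothetical content: a flat label-to-IRI map)
body length: http://purl.org/sfwo/SFWO_0000123
locomotion: http://purl.org/sfwo/SFWO_0000456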

Note on the different types of semantic annotators: [TODO]

[extract]

This section allows you to configure the part of the pipeline responsible for extracting or copying raw data from the data sources and storing it in a staging area. A staging area is an intermediate location used to temporarily store extracted and transformed data.

In the current version of inteGraph, data can be extracted from two types of data sources:

[extract.file]

This subsection contains configuration properties for extracting data from a file-like data source:

file_path
    The local path or URL of the data file. It can be absolute or relative to the directory containing source.cfg.
    Values: path
file_name
    Values: string, optional
file
    Values: string, optional

[extract.api]

This subsection contains configuration properties for extracting data from a REST API data source:

conn_id
    The connection identifier.
    Values: string, example: globi
endpoint
    The API endpoint.
    Values: string, optional
query
    The query specifying what data is returned from the remote data source.
    Values: string
headers
    Headers containing additional information about the request.
    Values: dict, optional
limit
    The maximum number of results per page in case of paginated results.
    Values: int, optional, default: None
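
The complete example at the end of this page extracts data from a file; for comparison, here is a hypothetical [extract.api] section. It assumes a connection named globi has been declared for the project (presumably in the connections.json file shown in the project layout above), and the endpoint, query and header values are illustrative placeholders rather than a tested configuration:

[extract.api]
conn_id=globi
endpoint=interaction
query=interactionType=eats&sourceTaxon=Lumbricus
headers={"Accept": "application/json"}
limit=1000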

[transform]

This section allows you to configure the part of the pipeline responsible for transforming the extracted data stored in the staging area into an RDF graph.

format
    The format of the extracted data.
    Values: {csv}
delimiter
    The character to treat as the delimiter/separator.
    Values: string, optional, default: ","
chunksize
    The size of the data chunks processed in parallel.
    Values: int, optional, default: 1000

Data transformation involves a series of operations, some of which are optional:

[transform.cleanse]

This subsection is optional. It is used to specify the path to an external script (Python or R) containing data cleansing operations specific to the data source.

script
    The path to a Python or R script containing data cleansing operations. It can be absolute or relative to the directory containing source.cfg.
    Values: path, optional

Note on writing a valid data cleansing script: [TODO]
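
Pending that note, the sketch below only illustrates the kind of source-specific operations a cleansing script might perform; the column names are hypothetical, and the exact contract between inteGraph and the script (how data is passed in and out) is not described here:

# clean.py : purely illustrative cleansing sketch (hypothetical column names)
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop records that have no scientific name.
    df = df.dropna(subset=["taxon_name"])
    # Normalize trait labels to ease semantic annotation.
    df["attribute_trait"] = df["attribute_trait"].str.strip().str.lower()
    return df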

[transform.ets]

This subsection is optional. It provides configuration properties for formatting data in accordance with the Ecological Trait-data Standard.

na
    The string(s) to recognize as NaN.
    Values: string or list, optional
taxon_col
    The name of the column containing scientific names for taxa.
    Values: string, optional
measurement_cols
    The names of the columns containing the measured trait values.
    Values: list, optional
additional_cols
    The names of additional columns to be retained in the input data.
    Values: list, optional
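
For illustration, a hypothetical [transform.ets] section for a trait table might look as follows (all column names are made up):

[transform.ets]
na=["NA", "n/a"]
taxon_col=taxon_name
measurement_cols=["body_length", "body_mass"]
additional_cols=["life_stage"]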

[transform.annotate]

This subsection is used to associate a semantic annotator (or a sequence of annotators) with a subset of your data describing a specific entity, e.g. a taxon, an ecological interaction, a functional trait, etc.

To annotate an entity, add a new subsection [transform.annotate.MyEntity] where MyEntity should be a unique identifier for the entity. This subsection can contain the following properties:

label
    The name of the column containing the label of the entity.
    Values: string, optional
id
    The name of the column containing the identifier of the entity in the source taxonomy/ontology.
    Values: string, optional
annotators
    An ordered list of semantic annotators.
    Values: list, example: ["SFWO", "YAMLMap"]

[transform.triplify]

This subsection lets you specify the path to the spreadsheet containing your RDF mapping rules, i.e. the rules used to transform tabular data into an RDF graph. inteGraph uses Mapeathor to translate mapping rules specified in spreadsheets into RDF Mapping Language (RML) rules, and Morph-KGC to execute the RML rules and construct (materialize) the RDF graph.

mapping
    The path to the spreadsheet containing mapping rules. It can be absolute or relative to the directory containing source.cfg.
    Values: path, example: mapping.xlsx

Below is a complete example of a source configuration file for the BETSI database:

[source]
id=betsi

[source.metadata]
title=A database for soil invertebrate biological and ecological traits
creator=Hedde et al.
subject=araneae, carabidae, chilopoda, diplopoda, earthworms, isopoda
description=The Biological and Ecological Traits of Soil Invertebrates database (BETSI, https://portail.betsi.cnrs.fr/) is a European database dedicated specifically to soil organisms’ traits.
date=2021
format=csv
identifier=hal-03581637
language=en

[annotators]

[annotators.TaxonAnnotator]
type=taxonomy
filter_on_ranks=["Metazoa", "Animalia"]
targets=["NCBI", "GBIF", "OTT"]

[annotators.YAMLMap]
type=map
mapping_file=mapping.yml

[annotators.SFWO]
type=ontology
shortname=sfwo

[extract]

[extract.file]
file_path=data/BETSI_220221.csv

[transform]
format=csv
delimiter=","
chunksize=1000

[transform.cleanse]
script="clean.py"

[transform.ets]
na=NA

[transform.annotate]

[transform.annotate.taxon]
label=taxon_name
annotators=["TaxonAnnotator"]

[transform.annotate.trait]
label=attribute_trait
annotators=["YAMLMap", "SFWO"]

[transform.triplify]
mapping=mapping.xlsx

Pipeline execution and monitoring

Coming soon.