
Column-Type Annotation (CTA) Challenge

SemTab: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching

NEWS: Please join our discussion group and visit our website

This is a task of the ISWC 2020 challenge “SemTab: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching”. The goal is to annotate an entity column (i.e., a column composed of entity mentions) in a table with types from a knowledge graph (KG) such as DBpedia or Wikidata.

Task Description

The task is to annotate each entity column with KG components that serve as its type. Each column can be annotated with multiple classes: a class that is as fine-grained as possible while still correct for all cells of the column is regarded as a perfect annotation; an ancestor of the perfect annotation is regarded as an okay annotation; all others are regarded as wrong annotations.

The cases for DBpedia and Wikidata are a bit different. Please refer to the corresponding task description and evaluation metrics for the KG used in each dataset (round).

For Wikidata, the annotation can be a normal item such as https://www.wikidata.org/wiki/Q6256. Each column should be annotated with at most one item. A perfect annotation earns the full score, while an okay annotation still earns part of the score. Example: "KIN0LD6C","0","http://www.wikidata.org/entity/Q8425".

For DBpedia, the annotation should come from the DBpedia ontology classes, excluding owl:Thing and owl:Agent. A column can be given multiple annotations; they should be separated by a space, and their order does not matter. Example: "9206866_1_8114610355671172497","0","http://dbpedia.org/ontology/Country http://dbpedia.org/ontology/Place".

In both cases, each annotation should be given as its full URI (matching is NOT case-sensitive). Each submission should be a CSV file. Each line identifies one column by its table ID and column ID and gives its annotation(s), so a line has three fields: “Table ID”, “Column ID” and “Annotation URI”. The submission file should not include a header line.

Notes:

1) Table ID is the filename of the table data, but does NOT include the extension.

2) Column ID is the position of the column in the input, starting from 0, i.e., first column’s ID is 0.

3) One submission file should have NO duplicate lines for each target column.

4) Annotations for columns out of the target columns are ignored.
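The format rules above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function names `table_id_from_filename` and `write_cta_submission` are hypothetical, and silently keeping the first line for a repeated column is one possible way to satisfy note 3.

```python
import csv
import io
from pathlib import Path


def table_id_from_filename(filename):
    """Table ID is the table's filename without its extension (note 1)."""
    return Path(filename).stem


def write_cta_submission(rows, out):
    """Write (table_id, column_id, [uri, ...]) rows as header-less CSV lines."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    seen = set()
    for table_id, column_id, uris in rows:
        key = (table_id, int(column_id))
        if key in seen:
            continue  # assumption: keep only the first line per column (note 3)
        seen.add(key)
        # Multiple annotations (DBpedia) go in one field, separated by a space.
        writer.writerow([table_id, str(column_id), " ".join(uris)])


rows = [
    (table_id_from_filename("9206866_1_8114610355671172497.csv"), 0,
     ["http://dbpedia.org/ontology/Country", "http://dbpedia.org/ontology/Place"]),
    # A repeated target column; this line is dropped by the writer.
    ("9206866_1_8114610355671172497", 0, ["http://dbpedia.org/ontology/Place"]),
]
buf = io.StringIO()
write_cta_submission(rows, buf)
submission_text = buf.getvalue()
```

Here `submission_text` holds a single line in the same quoted, three-field layout as the DBpedia example above.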

Datasets

Table set for Round #1: Tables, Target Columns, KG: Wikidata

Table set for Round #2: Tables, Target Columns

Table set for Round #3: Tables, Target Columns

Table set for Round #4: Tables, Target Columns

Data Description: The tables for Round #1 are generated from Wikidata (Version: March 5, 2020). Each table is stored in its own CSV file, and each line corresponds to a table row. The first row may be either the table header or content. The target columns for annotation are listed in a separate CSV file.
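As a rough sketch of how the cells of one target column might be read from such a table, the snippet below uses Python's standard `csv` module. The helper `column_cells`, its `skip_first_row` flag, and the sample table contents are made up for illustration; because the first row may be a header or content, the caller has to decide whether to skip it.

```python
import csv
import io


def column_cells(table_csv_text, column_id, skip_first_row=False):
    """Return the cells of one column; column IDs start from 0."""
    rows = list(csv.reader(io.StringIO(table_csv_text)))
    if skip_first_row:
        rows = rows[1:]  # caller decided that the first row is a header
    # Rows that are too short to contain the column are skipped.
    return [row[column_id] for row in rows if column_id < len(row)]


# Invented sample table, not taken from the challenge data.
sample = "country,capital\nFrance,Paris\nJapan,Tokyo\n"
cells = column_cells(sample, 0, skip_first_row=True)
```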

Evaluation Criteria

We use different metrics for DBpedia and Wikidata. Please calculate the correct metrics according to the KG given in the dataset for each round.

For Wikidata, we encourage one perfect annotation, and at the same time score one of its ancestors (okay annotation). Thus we calculate Approximate Precision (APrecision), Approximate Recall (ARecall), and Approximate F1 Score (AF1):

$\mathit{APrecision} = \frac{\sum_{a \in \text{all annotations}} g(a)}{\text{all annotations}\ \#}$

$\mathit{ARecall} = \frac{\sum_{col \in \text{all target columns}} \mathit{max\_annotation\_score}(col)}{\text{all target columns}\ \#}$

$\mathit{AF1} = \frac{2 \times \mathit{APrecision} \times \mathit{ARecall}}{\mathit{APrecision} + \mathit{ARecall}}$

Notes:

1) # denotes the number.

2) $g(a)$ returns the full score $1.0$ if $a$ is a perfect annotation, returns $0.8^{d(a)}$ if $a$ is an ancestor of the perfect annotation and its depth to the perfect annotation $d(a)$ is not larger than 5, returns $0.7^{d(a)}$ if $a$ is a descendant of the perfect annotation and its depth to the perfect annotation $d(a)$ is not larger than 3, and returns 0 otherwise. E.g., $d(a) = 1$ if $a$ is a parent of the perfect annotation, and $d(a) = 2$ if $a$ is a grandparent of the perfect annotation.

3) $\mathit{max\_annotation\_score}(col)$ returns $g(a)$ if $col$ has an annotation $a$, and 0 if $col$ has no annotation.

4) $\mathit{AF1}$ is used as the primary score, and $\mathit{APrecision}$ is used as the secondary score.
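The Wikidata metrics can be sketched as follows. This is not the official evaluator: it assumes the relation of each submitted annotation to the perfect annotation (`"perfect"`, `"ancestor"`, `"descendant"`, or anything else for unrelated) and its depth have already been determined from the KG, and the names `wikidata_scores`, `annotations`, and `targets` are hypothetical.

```python
def g(relation, depth=0):
    """Score one annotation relative to the perfect annotation."""
    if relation == "perfect":
        return 1.0
    if relation == "ancestor" and 1 <= depth <= 5:
        return 0.8 ** depth
    if relation == "descendant" and 1 <= depth <= 3:
        return 0.7 ** depth
    return 0.0


def wikidata_scores(annotations, target_columns):
    """annotations maps (table_id, column_id) -> (relation, depth).

    Each column carries at most one annotation, so max_annotation_score(col)
    is g(a) for an annotated column and 0 for a column without annotation.
    """
    targets = set(target_columns)
    # Annotations for columns outside the target columns are ignored.
    scored = {col: g(*rel) for col, rel in annotations.items() if col in targets}
    total = sum(scored.values())
    a_precision = total / len(scored) if scored else 0.0
    a_recall = total / len(targets) if targets else 0.0
    denom = a_precision + a_recall
    af1 = 2 * a_precision * a_recall / denom if denom else 0.0
    return a_precision, a_recall, af1


# Invented example: four target columns, three of them annotated.
targets = [("t1", 0), ("t1", 1), ("t2", 0), ("t2", 2)]
annotations = {
    ("t1", 0): ("perfect", 0),
    ("t1", 1): ("ancestor", 1),   # parent of the perfect annotation: 0.8
    ("t2", 0): ("unrelated", 0),  # wrong annotation: 0
    ("t9", 0): ("perfect", 0),    # not a target column, ignored
}
a_precision, a_recall, af1 = wikidata_scores(annotations, targets)
```

In this example the summed score is 1.8, so APrecision is 1.8/3 and ARecall is 1.8/4; the unannotated column lowers recall but not precision.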

For DBpedia, the following metrics named Average Hierarchical Score (AH_Score) and Average Perfect Score (AP_Score) are calculated for ranking:

$\mathit{AH\_Score} = \frac{1 \times (\text{perfect annotations}\ \#) + 0.5 \times (\text{okay annotations}\ \#) - 1 \times (\text{wrong annotations}\ \#)}{\text{target columns}\ \#}$

$\mathit{AP\_Score} = \frac{\text{perfect annotations}\ \#}{\text{all annotations}\ \#}$

Notes:

1) # denotes the number.

2) AH_Score is used as the primary score to encourage as many correct annotations as possible; AP_Score is used as the secondary score.

3) See more details of the metrics in the resource paper SemTab 2019.
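A minimal sketch of the two DBpedia scores is shown below. It assumes every submitted annotation has already been labelled `"perfect"`, `"okay"`, or `"wrong"` against the ground truth; the function name `dbpedia_scores` and the input layout are hypothetical, not part of the official evaluator.

```python
def dbpedia_scores(labels_per_column, num_target_columns):
    """labels_per_column maps (table_id, column_id) -> list of labels."""
    if num_target_columns <= 0:
        raise ValueError("num_target_columns must be positive")
    labels = [label for col in labels_per_column.values() for label in col]
    perfect = labels.count("perfect")
    okay = labels.count("okay")
    wrong = labels.count("wrong")
    # Wrong annotations are penalised, so AH_Score can be negative.
    ah_score = (1.0 * perfect + 0.5 * okay - 1.0 * wrong) / num_target_columns
    ap_score = perfect / len(labels) if labels else 0.0
    return ah_score, ap_score


# Invented example: two target columns, three annotations in total.
labels_per_column = {
    ("9206866_1_8114610355671172497", 0): ["perfect", "okay"],
    ("t2", 1): ["wrong"],
}
ah_score, ap_score = dbpedia_scores(labels_per_column, num_target_columns=2)
```

With one perfect, one okay, and one wrong annotation over two target columns, AH_Score is (1 + 0.5 - 1)/2 and AP_Score is 1/3.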

Submission

1. Each participant is allowed to make at most 5 submissions per day in Rounds #1 and #2.

Tentative Dates

1. Round #1: 26 May to 20 July

2. Round #2: 25 July to 30 Aug

3. Round #3: 3 September to 17 September

4. Round #4: 20 September to 4 October

Rules

Selected systems with the best results will be invited to present their results during the ISWC conference and the Ontology Matching workshop.
Participants are encouraged to submit a system paper describing their tool and the obtained results. Papers will be published online as a volume of CEUR-WS as well as indexed on DBLP. By submitting a paper, the authors accept the CEUR-WS and DBLP publishing rules.
Please see additional information at our official website.