Paper details

Title: Enhancing toponym identification: Leveraging Topo-BERT and open-source data to differentiate between toponyms and extract spatial relationships

Authors: Joseph Shingleton, Ana Basiri

Abstract: Obtained from CrossRef

Geoparsing, the process of linking locations within text to sets of geographic coordinates, plays an important role in the extraction and analysis of information from unstructured textual data. With the rapid growth in availability of user-generated data from online sources, there is increasing demand for reliable geoparsing methods. Central to many of these methods is the accurate identification of toponyms within text. For some applications, however, simple identification of toponyms is insufficient. Problems which require the association of a piece of text containing multiple toponyms with a singular location require a more nuanced approach. In this paper, we show that a transformer-based deep-learning model is able to identify the subject toponym within a given text, and classify other toponyms in terms of their spatial relationship with the subject. We curate a dataset of text taken from Wikipedia pages representing 5252 locations, and use OpenStreetMap data to classify toponyms within the text in terms of their spatial relationship with the subject of each article. This dataset is then used to train a transformer-based deep-learning model. On a human-labelled test set, our model achieves an F1 score of 0.916 when identifying the subject toponym, and 0.884 and 0.793 when identifying toponyms representing parent and child locations of the subject, respectively. We also consider the more complex adjacent and crossing relationships, with the model achieving F1 scores of 0.548 and 0.704 in these categories, respectively.
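The per-class scores quoted in the abstract are standard F1 values, the harmonic mean of precision and recall. A minimal sketch of that metric (the precision/recall numbers below are illustrative only and are not taken from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: precision 0.9, recall 0.8 for one relationship class.
print(f1_score(0.9, 0.8))
```

In practice, per-class F1 like the parent/child/adjacent/crossing scores above would typically be computed with a library routine such as scikit-learn's `f1_score` with per-class averaging.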

Codecheck details

Certificate identifier: 2024-010

Codechecker name: Rémy Decoupes

Time of codecheck: 2024-05-27 10:26:00

Repository: https://osf.io/nbk57

Codecheck report: https://doi.org/10.17605/osf.io/NBK57

Summary:

As indicated in the Data and Software Availability section, the authors shared their code, data, and trained models through an OSF (Open Science Framework) repository. Through four notebooks, we were able to train two baseline models and then create a new training dataset to train the model proposed by the authors. These models were then compared against human evaluations (provided in the shared data). Evaluating the reproducibility of this article was not an easy task: the processing chain requires substantial computational resources and time to execute. Another difficulty was that the notebooks and Python library developed by the authors and shared via OSF contained some errors. However, the authors accompanied me throughout this process, providing new versions of the code files to correct the errors I encountered. My feeling is that the reproducibility review process was beneficial. The scientific article was almost entirely reproduced.


https://codecheck.org.uk/ | GitHub codecheckers

© Stephen Eglen & Daniel Nüst

Published under CC BY-SA 4.0

DOI of Zenodo Deposit

CODECHECK is a process for independent execution of computations underlying scholarly research articles.