SD ID: TEXTCROWD - Collaborative semantic enrichment of text-based datasets


CONTACT: Franco Niccolucci

Email: franco.niccolucci(at)gmail.com


OVERVIEW: The Social Sciences and Humanities research communities face a fragmented research landscape, which EOSC can help to address. The EOSC would help overcome such fragmentation by building on structuring and integrating initiatives such as the CLARIN, DARIAH and E-RIHS ERICs, and Digital Humanities Organizations (e.g. their association ADHO), to offer advanced text-based services addressing common research needs (see the recent survey by PARTHENOS). One example is enabling the semantic enrichment of text sources through cooperative, supervised crowdsourcing based on shared semantics, and then making this work available to others via EOSC. This would benefit many scientists in the long tail, even though delivering such a service presents real challenges around interoperability and multilingualism.

TEXTCROWD is an advanced cloud-based tool developed within the framework of the EOSCpilot project to process textual archaeological reports. The tool has been extended so that it can browse large online knowledge repositories, train itself on demand, and produce semantic metadata ready to be integrated with information coming from different domains, establishing an advanced machine-learning scenario.

OBJECTIVES: TEXTCROWD is a tool for analyzing openly available archaeological text documents, identifying concepts about space, time and artifacts and the relations between them, in order to index the documents. The main purpose is to produce and accumulate collections of semantically enriched texts based on domain ontologies and thesauri. At present, such texts use global descriptive metadata without a semantic structure. Researchers will generate and revise the enriched documents using the tools provided, and these texts will then be available for direct searching by others, thus increasing the total number of enriched and searchable texts.
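
As a rough illustration of the indexing idea described in the objectives, the following minimal Python sketch tags mentions of space, time and artifact concepts in short texts and builds a searchable index from them. It is not TEXTCROWD's implementation: the thesaurus entries, the sample reports and the function names (annotate, build_index) are all hypothetical, and plain dictionary lookup stands in for the machine-learning techniques and domain ontologies the tool actually relies on. Relations between concepts are not covered.

```python
import re
from collections import defaultdict

# Hypothetical mini-thesaurus: term -> concept category.
# A real system would draw on domain ontologies and thesauri instead.
THESAURUS = {
    "amphora": "artifact",
    "fibula": "artifact",
    "bronze age": "time",
    "roman period": "time",
    "trench": "space",
    "necropolis": "space",
}


def annotate(text):
    """Return (start, end, matched_text, category) tuples sorted by position."""
    annotations = []
    for term, category in THESAURUS.items():
        # Whole-word, case-insensitive match of each thesaurus term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                (match.start(), match.end(), match.group(0), category)
            )
    return sorted(annotations)


def build_index(documents):
    """Map each (category, term) pair to the set of document ids mentioning it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for _, _, matched, category in annotate(text):
            index[(category, matched.lower())].add(doc_id)
    return index


# Hypothetical sample reports, invented for this illustration.
documents = {
    "report-1": "An amphora was found in the necropolis, dated to the Roman period.",
    "report-2": "The trench yielded a Bronze Age fibula.",
}
index = build_index(documents)
```

Looking up a key such as ("time", "bronze age") in the resulting index returns the identifiers of the reports that mention that concept, which is the kind of direct search the enriched texts are meant to enable.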

MAIN ACHIEVEMENTS: Among the main results of TEXTCROWD, the following are worth a mention:

  • Semantic enrichment of text documents thanks to the creation of metadata to improve indexing, discoverability, accessibility and reusability
  • Recognition and automatic annotation of entities and concepts in texts with machine learning techniques
  • Knowledge extraction in a semantic format using renowned standards
  • Cloud architecture based on an easy-to-use virtual research environment
  • Interoperability of extracted knowledge to increase reusability in other projects
  • FAIR principles implementation to improve results accessibility
  • Availability of the tool to the broader scientific community via other projects

RECOMMENDATIONS: Improve the usability of cloud infrastructures with respect to user interfaces and the reusability of components; provide facilities to support a modular approach; build the infrastructure around the needs of the user community; ensure interoperability by using open standards; and guarantee reproducibility through the source documents.

CONTEXT: Cultural heritage and humanities datasets are largely based on texts:

  • Reports
      • Archaeology: excavations, surveys
      • Conservation: diagnosis, restoration – often mixed with numeric results
  • Grey literature
  • Literary/historical sources
  • Research articles
  • Monographs

Download TEXTCROWD Presentation by Kathrin Beck (MPCDF)
Read the TEXTCROWD Success Story by Franco Niccolucci (PIN)