Open Linguistic: Greek Wordnet

Nowadays, the Web is one of the most important means of sharing knowledge. Information and data from scientific and other fields are accessible to every user. One scientific field represented by a large amount of data on the Web is linguistics, the study of human language. However, these data are most useful when they are linked to one another.

In that field, OKFN created the Open Linguistics Working Group in order to promote open linguistic linked data. OKFN Greece actively participates in the activities of this group by publishing the first Greek linguistic dataset in the Web 3.0. In particular, OKFN Greece, in cooperation with the Master's Program in Web Science of the School of Mathematics, Aristotle University of Thessaloniki, published the well-known WordNet data for the Greek language, one of the main representatives of linguistic data, under the principles of Linked Data.

WordNet is a lexical database of English. Words are organized into sets of synonyms (called synsets), each representing one underlying lexical concept. Brief definitions are provided, while different lexical and semantic relations link the synonym sets. It was created at Princeton University in 1985 under the direction of George A. Miller, a psychology professor, and its design was inspired by experiments in artificial intelligence that tried to understand human semantic memory. It aims to combine a dictionary and a thesaurus, and to support automatic text analysis and artificial intelligence applications. Over the years, government bodies that wanted to foster machine translation funded the project, and similar projects were created for many languages, including Greek, whose WordNet was developed under an EC-funded project.

The project was called BalkaNet (September 2001-August 2004) and extended the set of European languages developed through EuroWordNet by adding six languages (namely Bulgarian, Greek, Romanian, Serbian, Turkish and Czech). The Greek WordNet was developed at the Database Systems Laboratory (DBLab), University of Patras, by a team of linguists with the participation of the University of Athens. BalkaNet's major ambition, though, is to semantically correlate words in each Balkan language and link them together in order to create an online multilingual semantic network. The application we developed serves the first part of this vision and can easily be adapted to any language without additional programming skills or hardware requirements.

Specifically, in late August the Database Systems Laboratory (DBLab) gave us an XML file that was produced from the Greek WordNet database and contained 18,461 synsets. Each synset is classified by part of speech as noun, verb, adverb or adjective. Most sets of synonyms are related to other synsets via a number of semantic relations. These relations vary with the type of word, apply to all members of a synset (since they share a meaning) and include relations such as hypernym and hyponym. Words can also be connected to other words through lexical relations, e.g. synonym and antonym. The polysemy count of a word is also given, representing the number of synsets that contain the word.
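The polysemy count described above can be illustrated with a short Python sketch. The synset identifiers and the word lists below are hypothetical toy data chosen for illustration, not values taken from the Greek WordNet file, and this is not the code used to produce the dataset.

```python
from collections import Counter

# Hypothetical toy data: synset identifier -> literals of that synset.
synsets = {
    "synset-A": ["άστρο", "αστέρας", "αστέρι"],
    "synset-B": ["αστέρας", "διασημότητα"],
    "synset-C": ["αστέρι"],
}


def polysemy_counts(synsets):
    """Return, for every word, the number of synsets that contain it."""
    counts = Counter()
    for literals in synsets.values():
        # set() guards against a literal that is listed twice in one synset
        counts.update(set(literals))
    return counts


counts = polysemy_counts(synsets)
print(counts["αστέρας"])  # 2: the word appears in synset-A and synset-B
```

A word with a polysemy count of 1 belongs to a single synset, so it has only one sense in the dataset.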

Below is a sample of the XML file for the synset “άστρο-noun-1”:

<SYNSET>
  <ID>ENG20-08850126-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>άστρο<SENSE>1</SENSE><LNOTE>a’stro</LNOTE></LITERAL>
    <LITERAL>αστέρας<SENSE>1</SENSE><LNOTE>aste’ras</LNOTE></LITERAL>
    <LITERAL>αστέρι<SENSE>1</SENSE><LNOTE>aste’ri</LNOTE></LITERAL>
  </SYNONYM>
  <ILR>ENG20-08664330-n<TYPE>hypernym</TYPE></ILR>
  <ILR>ENG20-07771273-n<TYPE>holo_member</TYPE></ILR>
  <ILR>ENG20-08675663-n<TYPE>holo_member</TYPE></ILR>
  <ILR>ENG20-05731244-n<TYPE>category_domain</TYPE></ILR>
  <DEF>κάθε αυτόφωτο ουράνιο σώμα που ακτινοβολεί χάρη στις εσωτερικές θερμοπυρηνικές πηγές ενέργειας τις οποίες έχει</DEF>
  <BCS>2</BCS>
</SYNSET>

Every synset is described by the following tags:

SYNSET: contains all the data relating to the synset.
ID: identifier of the ILI. The prefix ENG20 means that the synset was created by Princeton WordNet, version 2.0, while the prefix BILI means that the synset is BalkaNet-specific.
POS: part of speech (the possible values are n: noun, v: verb, b: adverb, a: adjective).
SYNONYM: list of the literals of this synset. At least one literal is mandatory.
LITERAL: wording of the literal.
SENSE: number used for sense differentiation.
LNOTE: note about this literal.
DEF: gloss of the synset, that is, a wording that describes the synset. It is not mandatory.
STAMP: gives some additional information about this synset: author, date…
USAGE: gives an example of use of the synset.
BCS: represents the core set of concepts to be encoded within the Greek WordNet (the possible values are 1, 2 or 3).
ILR: Interlingua relation. Gives a relation between this synset and the specified ILI.
TYPE: type of this relation.
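As an illustration of how these tags fit together, the following Python sketch reads one SYNSET element with the standard library's xml.etree.ElementTree. It is only a sketch: the actual converter was written in C++, the function name parse_synset and the dictionary keys are our own choices, and the sample is an abridged copy of the synset shown above.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample synset (LNOTE, DEF and some ILR entries left out).
SAMPLE = (
    "<SYNSET><ID>ENG20-08850126-n</ID><POS>n</POS>"
    "<SYNONYM>"
    "<LITERAL>άστρο<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>αστέρας<SENSE>1</SENSE></LITERAL>"
    "</SYNONYM>"
    "<ILR>ENG20-08664330-n<TYPE>hypernym</TYPE></ILR>"
    "<BCS>2</BCS></SYNSET>"
)


def parse_synset(xml_text):
    """Turn one SYNSET element into a plain dictionary."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.findtext("ID"),
        "pos": node.findtext("POS"),
        # LITERAL has mixed content: the word is the text before the SENSE child
        "literals": [
            ((lit.text or "").strip(), int(lit.findtext("SENSE")))
            for lit in node.iterfind("SYNONYM/LITERAL")
        ],
        # ILR has the same shape: target identifier first, then the TYPE child
        "relations": [
            (ilr.findtext("TYPE"), (ilr.text or "").strip())
            for ilr in node.iterfind("ILR")
        ],
        "gloss": node.findtext("DEF"),  # None when the optional DEF is absent
        "bcs": node.findtext("BCS"),
    }


synset = parse_synset(SAMPLE)
print(synset["literals"])   # [('άστρο', 1), ('αστέρας', 1)]
print(synset["relations"])  # [('hypernym', 'ENG20-08664330-n')]
```

Because LITERAL and ILR mix text with child elements, the word or target identifier is read from the element's leading text rather than from a dedicated child tag.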

The process of converting the data into RDF comprised the following steps. An RDFizer was developed in the C++ programming language to convert the XML file into RDF. The RDFizer reads the XML file, assigns an IRI to each entity, produces the WordSense and Word entities, and connects the synsets to other resources through links both within the dataset (intralinks) and to other datasets (interlinks). In this way the corresponding triples are generated. The converted data should be accessible on the Web; therefore, N3 was chosen for data publishing, and all IRIs in the converted data are dereferenceable via HTTP, following the conversion model used for Princeton's English WordNet 2.0. For example, the IRI for a synset takes the form:

http://wordnet.okfn.gr/resource/synset-first_literal_of_synset-pos-sense
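A minimal Python sketch of this naming pattern follows. The mapping from POS codes to the words used in the IRI (for example n to noun) is an assumption inferred from the example IRI synset-άστρο-noun-1 and from the POS values listed earlier; the function name synset_iri is hypothetical, and the percent-encoding step is our own addition showing how such an IRI can be written as an ASCII-only URI for an HTTP request.

```python
from urllib.parse import quote, unquote

BASE = "http://wordnet.okfn.gr/resource/"

# Assumed mapping from the POS codes of the XML file to the words in the IRI.
POS_NAMES = {"n": "noun", "v": "verb", "b": "adverb", "a": "adjective"}


def synset_iri(first_literal, pos, sense):
    """Build a synset IRI of the form synset-<first literal>-<pos>-<sense>."""
    return f"{BASE}synset-{first_literal}-{POS_NAMES[pos]}-{sense}"


iri = synset_iri("άστρο", "n", 1)
print(iri)  # http://wordnet.okfn.gr/resource/synset-άστρο-noun-1

# Percent-encode the non-ASCII characters; ':' and '/' are kept as they are.
uri = quote(iri, safe=":/")
print(uri)
```

Decoding the percent-encoded form with unquote gives back the original IRI, so both spellings identify the same resource.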

Mappings are defined in a simple configuration file; some of them are shown in the following table.

XML ELEMENT	MAPPED PROPERTY	CLASS
ID	wn20s:synsetId	Synset
DEF	wn20s:gloss	Synset
SENSE	wn20s:sense	WordSense
LITERAL	rdfs:label	Synset,WordSense,Word
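To show how such a mapping table can drive the conversion, here is a hedged Python sketch that turns XML values into N3-style triples. The dictionary layout, the function name literal_triples and the way values are passed in are assumptions made for illustration; the format of the real configuration file is not described here, and prefix declarations are omitted.

```python
# Rows of the table: XML element -> (mapped property, classes it applies to).
MAPPINGS = {
    "ID": ("wn20s:synsetId", ("Synset",)),
    "DEF": ("wn20s:gloss", ("Synset",)),
    "SENSE": ("wn20s:sense", ("WordSense",)),
    "LITERAL": ("rdfs:label", ("Synset", "WordSense", "Word")),
}


def literal_triples(subject, rdf_class, values):
    """Yield N3-style triples for the XML values mapped to rdf_class."""
    for element, value in values.items():
        prop, classes = MAPPINGS.get(element, (None, ()))
        if rdf_class in classes:
            # Escape backslashes and double quotes inside the literal
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            yield f'<{subject}> {prop} "{escaped}" .'


values = {"ID": "ENG20-08850126-n", "LITERAL": "άστρο", "SENSE": "1"}
subject = "http://wordnet.okfn.gr/resource/synset-άστρο-noun-1"
triples = list(literal_triples(subject, "Synset", values))
for triple in triples:
    print(triple)
```

For the Synset class this prints one triple for ID and one for LITERAL; the SENSE value is skipped because the table maps it to WordSense only.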

The ontology used is the WordNet 2.0 RDF/OWL Full ontology (W3C, Mark van Assem, http://www.w3.org/TR/wordnet-rdf/). We also used the RDF Schema ontology and WNGRE, an extension of WordNet Full that includes some additional properties. The converted dataset contributes a total of 172,066 triples involving 106,432 properties and 18,457 sameAs links.

The following illustration shows the representation of the synset “άστρο” after the conversion process.

If you follow the link http://wordnet.okfn.gr/page/synset-άστρο-noun-1 you will see more of this example, and by selecting the available options you can view the synsets linked to it. It is also possible to download a synset as a file in various formats, including CSV, XML and others.

The RDFizing code is available on the OKFN Greece GitHub.

