Automatic techniques to enrich the AGROVOC vocabulary.
AGROVOC. Since the early 1980s, the Food and Agriculture Organization of the United Nations (FAO) has coordinated AGROVOC, a valuable tool that allows data to be classified homogeneously. AGROVOC is the largest Linked Open Data set about agriculture available for public use, and its greatest impact lies in facilitating the access and visibility of data across domains and languages.
PhD proposal aims. This PhD proposal aims to develop automated techniques to enrich the AGROVOC ontology with: 1) new names for existing concepts in the ontology; 2) new relevant concepts that are absent from the ontology. For learning new names of existing concepts, the approach is based mainly on both the lexical properties of the ontology terms and the hierarchical and logical structure of the ontology.
Methodology. [Pipeline figure: starting from AGROVOC synonyms and AGROVOC hierarchical relationships, the method identifies lexical overlaps, generates candidate synonyms, rules out redundant synonyms, rules out nonsensical candidates (using AGRIS), and infers synonym scopes, yielding new synonyms.]
First: an automated synonym-substitution method to enrich AGROVOC with new synonyms.
Then: we can replace the word "food" in the label of each narrower concept with "foodstuffs". The new synonyms that result are shown in the accompanying table.
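As a minimal sketch of this substitution step (the function name and the list of narrower-concept labels are illustrative, not the proposal's actual implementation), assuming the labels are available as plain strings:

```python
import re

def substitute_synonym(labels, term, synonym):
    """For each narrower-concept label containing `term`, generate a
    candidate synonym label by substituting `synonym` for `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
    return {label: pattern.sub(synonym, label)
            for label in labels if pattern.search(label)}

# Hypothetical narrower concepts of "food" in AGROVOC
narrower = ["food security", "food safety", "food production"]
print(substitute_synonym(narrower, "food", "foodstuffs"))
# {'food security': 'foodstuffs security',
#  'food safety': 'foodstuffs safety',
#  'food production': 'foodstuffs production'}
```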
Second: filter out the meaningless synonyms. These candidate terms were filtered using a pre-trained language model for agriculture; I used a transformer language model, a type of machine-learning model. Transformers can be designed to translate text, write poems and op-eds, and even generate computer code. Transformers boil down to three main concepts:
- Positional encodings: move the burden of understanding word order from the structure of the neural network to the data itself.
- Attention: a mechanism that allows a text model to "look at" every single word in the original sentence when making a decision about how to translate a word in the output sentence.
- Self-attention: helps neural networks disambiguate words, do part-of-speech tagging, entity resolution, learn semantic roles, and a lot more.
One of the best-known transformers is Bidirectional Encoder Representations from Transformers (BERT).
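One way such filtering could work (a sketch, not the proposal's actual method: the checkpoint, candidate list, and threshold below are all assumptions, and an agriculture-specific model could be swapped in for the general BERT) is to score each candidate label with a masked language model and discard implausible ones:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: a general-purpose BERT; a model adapted to
# agricultural text (e.g., trained on AGRIS) could replace it.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(text):
    """Score a phrase by masking each token in turn and summing the
    log-probability BERT assigns to the original token."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / (len(ids) - 2)              # length-normalized

candidates = ["foodstuffs security", "foodstuffs chain"]   # hypothetical
scores = {c: pseudo_log_likelihood(c) for c in candidates}
kept = [c for c, s in scores.items() if s > -7.0]          # assumed threshold
print(kept)
```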
BERT. BERT was introduced by Google in 2018 and is used in Google Search to help understand search queries. BERT is a transformer model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labelling of any kind (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. BERT is used to solve language problems such as:
- text summarization
- question answering
- classification
- named entity recognition
- text similarity
- offensive message/profanity detection
- understanding user queries
- and a whole lot more
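For instance, BERT's masked-language-model pretraining objective can be exercised directly with the Hugging Face transformers library (a minimal illustration; the example sentence is an assumption):

```python
from transformers import pipeline

# Fill-mask pipeline backed by a pretrained BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts plausible words for the masked position
for pred in unmasker("Rice is an important [MASK] crop."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```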
How can we capture the similarity between two words?
BERT process. BERT creates a vector for every word, encoding it as a feature vector (word embedding). Then, by comparing words through their feature vectors (word embeddings), for example with cosine similarity, we can determine how similar they are.