Automatic techniques to enrich the Agrovoc vocabulary.
AGROVOC. Since the early 1980s, the Food and Agriculture Organization of the United Nations (FAO) has coordinated AGROVOC, a valuable tool for classifying data homogeneously. AGROVOC is the largest Linked Open Data set about agriculture available for public use, and its highest impact is in facilitating the access and visibility of data across domains and languages.
NER & Ontologies.
PhD proposal aims. This PhD proposal aims to develop automated techniques to enrich the Agrovoc ontology with: 1) new names for existing concepts in the ontology; 2) new relevant concepts (absent from the ontology). For learning new names of existing concepts, the approach will be based mainly on both the lexical properties of the ontology terms and the hierarchical and logical structure of the ontology.
Methodology. Starting from the Agrovoc synonyms, the Agrovoc hierarchical relationships, and the AGRIS corpus, the pipeline produces new synonyms in five steps: 1) ruling out redundant Agrovoc synonyms (yielding non-redundant synonyms); 2) identifying lexical overlaps; 3) generating candidate synonyms; 4) ruling out nonsensical candidates; 5) inferring synonym scopes, which yields the new synonyms.
First: an automated synonym-substitution method to enrich Agrovoc with new synonyms.
Then: we can replace the word "food" in each narrower concept with "foodstuffs". The resulting new synonyms are shown in the table:
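The substitution step above can be sketched as follows. This is a minimal illustration, not the proposal's actual implementation: the helper name and the narrower-concept labels are invented examples, and the known synonym pair "food"/"foodstuffs" is the one from the text.

```python
# Hypothetical sketch of the synonym-substitution step: given a known
# Agrovoc synonym pair ("food" -> "foodstuffs"), build candidate synonyms
# for narrower concepts whose labels contain the term.
import re

def substitute_synonym(labels, term, synonym):
    """Replace `term` with `synonym` as a whole word in each label
    that contains it; return {original label: candidate synonym}."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return {label: pattern.sub(synonym, label)
            for label in labels if pattern.search(label)}

# Example narrower-concept labels (invented for illustration):
narrower = ["food additives", "food chains", "food composition"]
print(substitute_synonym(narrower, "food", "foodstuffs"))
# e.g. "food additives" -> "foodstuffs additives"
```

The whole-word regex avoids spurious matches such as rewriting "seafood" to "seafoodstuffs".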
Second: filter the meaningless synonyms. These terms were filtered using a language model pretrained for the agriculture domain; I used a transformer language model, which is a type of machine learning model. Transformers can be designed to translate text, write poems and op-eds, and even generate computer code. Transformers boil down to three main concepts: Positional encodings: word order is encoded into the data itself rather than into the structure of the neural network. Attention: a mechanism that allows a text model to "look at" every single word in the original sentence when deciding how to translate words in the output sentence. Self-attention: helps neural networks disambiguate words, do part-of-speech tagging, entity resolution, learn semantic roles, and a lot more. One of the best-known transformers is Bidirectional Encoder Representations from Transformers (BERT).
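The filtering step might look like the sketch below. Note the hedge: the proposal scores candidates with a pretrained agricultural language model, whereas this toy version uses a stand-in `score` function based on corpus occurrence; the corpus, candidates, and threshold are all invented for illustration.

```python
# Toy sketch of filtering nonsensical candidate synonyms.
# A real system would replace `score` with language-model probabilities
# from a domain-adapted transformer (e.g. an agriculture BERT).

def score(candidate, corpus):
    """Stand-in plausibility score: fraction of corpus documents
    that contain the candidate term."""
    hits = sum(1 for doc in corpus if candidate in doc)
    return hits / len(corpus)

def filter_candidates(candidates, corpus, threshold=0.1):
    """Keep only candidates whose score reaches the threshold."""
    return [c for c in candidates if score(c, corpus) >= threshold]

# Invented mini-corpus and candidates:
corpus = [
    "foodstuffs additives are regulated",
    "the study covers food chains in rural areas",
]
candidates = ["foodstuffs additives", "foodstuffs chains"]
print(filter_candidates(candidates, corpus))
# keeps only candidates attested in the corpus
```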
BERT. BERT was introduced by Google in 2018 and is used in Google Search to help understand search queries. BERT is a transformer model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. BERT is used to solve language problems such as: - text summarization - question answering - classification - named entity recognition - text similarity - offensive message/profanity detection - understanding user queries - and a whole lot more.
BERT for the Agriculture Domain. A BERT-based language model further pretrained from the checkpoint of SciBERT. The gathered dataset balances scientific and general works in the agriculture domain, encompassing knowledge from different areas of agricultural research and practical knowledge. The corpus contains 1.2 million paragraphs from the National Agricultural Library (NAL) of the US government and 5.3 million paragraphs from books and common literature in the agriculture domain. The self-supervised learning approach of MLM was used to train the model.
BERT. BERT was pretrained with two objectives: Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.
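The MLM masking described above can be illustrated in a few lines. This is a simplification: real BERT masks at the subword level and mixes in random/unchanged replacements, while this sketch masks whole whitespace tokens and the sentence is just a sample.

```python
# Sketch of the MLM input-corruption step: mask ~15% of tokens at random;
# the model's training task is then to recover the originals.
import random

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Mask round(len*ratio) tokens (at least one); return the masked
    sequence and a map {position: original token} to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, {i: tokens[i] for i in positions}

tokens = "agrovoc is the largest linked open data set about agriculture".split()
masked, targets = mask_tokens(tokens)
print(masked)   # 2 of the 10 tokens replaced by [MASK]
print(targets)  # the words the model must predict back
```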
How can we capture the similarity between two words?
BERT Process. BERT creates a vector for each word and encodes it (generating a feature vector, or word embedding, for each word). Then, by comparing words through their feature vectors (word embeddings), we can determine how similar they are.
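The comparison step can be sketched with cosine similarity, a standard way to compare embedding vectors. The 3-dimensional vectors below are invented stand-ins (BERT actually produces 768-dimensional contextualized vectors), chosen so that "fair" and "unbiased" land close together.

```python
# Toy sketch of comparing words via their embedding vectors.
# Vectors are invented; real BERT embeddings are 768-dimensional.
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

embeddings = {
    "fair":     [0.9, 0.1, 0.2],
    "unbiased": [0.8, 0.2, 0.1],
    "carnival": [0.1, 0.9, 0.7],
}
print(cosine(embeddings["fair"], embeddings["unbiased"]))  # close to 1
print(cosine(embeddings["fair"], embeddings["carnival"]))  # much lower
```

With contextualized embeddings, the same surface word "fair" would get different vectors in "a fair decision" and "the county fair", which is what lets BERT place it near "unbiased" in one context and near "carnival" in the other.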
BERT can generate a contextualized embedding.
At the same time, the word "unbiased" is similar to the word "fair": they have equivalent, nearby vectors. Likewise, the word "carnival" is similar to the word "fair": they have equivalent, nearby vectors. BERT is powerful: it can look at the context of a statement and generate meaningful numbers.