Master's Thesis presentation. Aarav is advised by Dr. Felix Dietrich.
Aarav Malik: Question generation and answering in electrical power system components domain
SCCS Colloquium
Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive. While such datasets exist for some domains, none exists for the Electric Power System Components domain. Siemens has use cases in which QA models could extract relevant information from the manuals of such components. In this work, we explore the possibility of synthetically generating question and answer pairs using an unsupervised neural machine translation (UNMT) model in a low-resource setting. We approach this by building a paragraph corpus in the Electric Power System Components domain. We use the UNMT model to generate the context, question, and answer triples that make up our synthetic training dataset. The UNMT model does so by randomly sampling paragraphs and then randomly sampling named entities or noun phrases as answers. It then masks the answers to form "fill-in-the-blank" cloze questions and finally translates each cloze question into a natural question. We then fine-tune three state-of-the-art pre-trained transformer-based models on this synthetic training data for the downstream task of question answering. To evaluate our approach, we also curate a ground truth dataset of manually labeled question and answer pairs. We find that QA models trained on synthetic training data answer human questions well: with this approach, all three models achieved F1 scores between 83.5 and 85.
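The generation steps described in the abstract (sample a paragraph, sample an answer span, mask it to form a cloze question) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation. Everything named here is an assumption made for the example: the [MASK] placeholder, the helper names, the sample sentence, and in particular candidate_answers, which uses a crude regular expression for quantities with a unit as a stand-in for real named entity recognition or noun phrase chunking. Building the cloze from the single sentence that contains the answer is also an assumption. The final step, translating the cloze into a natural question with the UNMT model, is not shown.

```python
import random
import re
from typing import List, NamedTuple, Optional

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the thesis may use another token


class ClozeExample(NamedTuple):
    context: str  # the full sampled paragraph
    cloze: str    # one sentence with the answer masked out
    answer: str   # the masked span


def candidate_answers(text: str) -> List[str]:
    """Crude stand-in for NER / noun-phrase chunking (assumption):
    picks quantities followed by a unit, e.g. '145 kV'."""
    return re.findall(r"\d+(?:\.\d+)?\s?(?:kV|Hz|A)\b", text)


def make_cloze(paragraph: str, rng: random.Random) -> Optional[ClozeExample]:
    """Randomly sample one answer span and mask it in its sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    pairs = [(s, a) for s in sentences for a in candidate_answers(s)]
    if not pairs:
        return None  # nothing to mask in this paragraph
    sentence, answer = rng.choice(pairs)
    cloze = sentence.replace(answer, MASK_TOKEN, 1)
    return ClozeExample(context=paragraph, cloze=cloze, answer=answer)


def generate_triples(corpus: List[str], n: int, seed: int = 0) -> List[ClozeExample]:
    """Randomly sample paragraphs until n cloze examples are collected.
    Retries are bounded so a corpus without candidates cannot loop forever."""
    rng = random.Random(seed)
    triples: List[ClozeExample] = []
    for _ in range(n * 10):
        if len(triples) == n:
            break
        example = make_cloze(rng.choice(corpus), rng)
        if example is not None:
            triples.append(example)
    return triples


if __name__ == "__main__":
    corpus = ["The circuit breaker is rated for 145 kV. It operates at 50 Hz."]
    for ex in generate_triples(corpus, n=2):
        print(ex.cloze, "->", ex.answer)
```

In a full pipeline, each cloze string would be passed to the translation model to obtain a natural question, and the resulting context, question, and answer triples would serve as training data for the QA models.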