Open Topics
We offer multiple Bachelor/Master theses, Guided Research projects, and IDPs in the area of natural language processing.
Below are the primary research directions our supervisors are currently focusing on (listed in alphabetical order); more specific open topics follow further down the page.
- Daryna Dementieva: NLP for Social Good, fake news detection, hate and toxic speech proactive mitigation, multilingual NLP, explainable NLP
- Marion Di Marco: linguistic information in LLMs, translation, morphology, subword segmentation
- Lukas Edman: tokenization, low-resource pretraining of LLMs, character-level models, machine translation
- Faeze Ghorbanpour: efficient transfer learning, harmful content detection, low resource NLP, multilingual NLP, applied NLP, computational social science.
- Kathy Hämmerl: multilingual NLP, cross-lingual alignment, cultural differences, machine translation, evaluation
- Wen (Lavine) Lai: machine translation, multilingual NLP, vision-text alignment, hallucination
- Shu Okabe: NLP for very low-resource languages, parallel sentence mining, linguistic annotation generation, word and morphological segmentation, machine translation
A non-exhaustive list of open topics is given below, together with a potential supervisor for each. Please contact potential supervisors directly.
How to apply:
Please always make sure to include in your application:
- Your CV;
- Your transcript;
- Your motivation to work on the topic(s) of your interest;
- Cross-check any additional requirements from the TA you are contacting.
Translation of Low-Resource Languages and Dialects
Type
Bachelor Thesis / Master Thesis / Guided Research
Prerequisites
- Experience with programming in Python
- Basic understanding of data processing
- Basic understanding of machine learning and machine translation
- (Preferred) Basic understanding of the low-resource language chosen
Description
Machine translation is approaching human-level performance for high-resource languages, but when it comes to low-resource languages and dialects, there is still a lot of room for improvement. The goal of this project is to work on a low-resource language or dialect of your choosing and to establish or improve on the state of the art for translating to and from another language. There are several ways you can go about this, such as gathering data and constructing a new high-quality dataset using alignment methods, or transferring knowledge from a related high-resource language, among many others.
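A common starting point is to evaluate (and later fine-tune) an existing massively multilingual model on your chosen language pair. A minimal sketch, assuming the openly released NLLB-200 checkpoint (see the references) and Galician-to-English purely as an illustrative direction; model size, language codes, and the example sentence are placeholders:

```python
# Minimal translation baseline sketch; the checkpoint, language codes, and the
# example sentence are illustrative placeholders, not a prescribed setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="glg_Latn")  # Galician
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("O can durme no sofá.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),  # target language
    max_new_tokens=30,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
# Establishing or improving the state of the art would mean scoring such outputs
# (e.g. with chrF or COMET) on a held-out test set, then improving the model via
# fine-tuning on newly gathered data or transfer from a related language.
```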
Contact
References
- (Haddow et al. 2022) https://aclanthology.org/2022.cl-3.6.pdf
- (NLLB Team et al. 2022) https://arxiv.org/pdf/2207.04672
- (Zhu et al. 2023) https://arxiv.org/pdf/2304.04675
- (Her and Kruschwitz 2024) https://arxiv.org/pdf/2404.08259
___
Parallel sentence mining for low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Parallel sentence mining aims to find translation pairs across two monolingual corpora (e.g., Wikipedia articles on the same topic but in different languages). This task constitutes a crucial step towards developing machine translation systems, especially for low-resource languages, where parallel corpora are scarce.
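To give a concrete flavor of the task, here is a minimal mining sketch, assuming LaBSE as the multilingual sentence encoder and raw cosine similarity with a threshold; the encoder choice, toy corpora, and threshold are illustrative assumptions (margin-based scoring, as in the references, is usually more robust):

```python
# Minimal bitext-mining sketch; encoder, toy corpora, and threshold are illustrative
# assumptions rather than a recommended configuration.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("sentence-transformers/LaBSE")

src_sents = ["Der Hund schläft.", "Heute regnet es."]          # monolingual corpus A
tgt_sents = ["It is raining today.", "The dog is sleeping."]   # monolingual corpus B

src_emb = encoder.encode(src_sents, normalize_embeddings=True)
tgt_emb = encoder.encode(tgt_sents, normalize_embeddings=True)

sim = src_emb @ tgt_emb.T  # cosine similarities (embeddings are L2-normalized)

THRESHOLD = 0.7  # needs tuning per language pair
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    if row[j] > THRESHOLD:
        print(f"{src_sents[i]}  |||  {tgt_sents[j]}  (cos={row[j]:.2f})")
```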
Contact
Shu Okabe (first.last@tum.de)
References
- Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora (https://aclanthology.org/W17-2512.pdf)
- Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation (https://aclanthology.org/P19-1118.pdf)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (https://aclanthology.org/2022.findings-emnlp.154.pdf)
- Boosting Unsupervised Machine Translation with Pseudo-Parallel Data (https://aclanthology.org/2023.mtsummit-research.12.pdf)
___
Linguistic gloss generation for very low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Linguistic glosses (or interlinear glosses) are linguistic annotations that are created to express the meaning and grammatical phenomena in a source language (e.g., 'Hund-e' in German would be annotated as 'dog-PLURAL' in English). These annotations are costly to obtain and scarce; the aim of gloss generation is hence to automatically predict glosses, especially for low-resource languages. The main challenges come from the amount of training data and the lexical diversity in sentences.
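One possible framing (among several) is to treat glossing as sequence-to-sequence prediction with a pretrained character-level model. A minimal sketch, assuming ByT5 and a single made-up training pair; the model choice and the input/output format are illustrative assumptions, not the required approach:

```python
# Minimal seq2seq glossing sketch; model choice and the single training pair are
# illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/byt5-small"  # character-level model, convenient for morphology
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# One interlinear-glossed-text pair: source line (plus translation) -> gloss line.
source = "gloss: Die Hund-e schlafen | translation: The dogs are sleeping"
target = "DET.PL dog-PL sleep.3PL"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Training loss for this single example; a real setup would fine-tune over a corpus
# (e.g. the SIGMORPHON shared task data) and evaluate morpheme-level accuracy.
loss = model(**inputs, labels=labels).loss
print(float(loss))
```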
Contact
Shu Okabe (first.last@tum.de)
References
- Statistical gloss generation: Automating Gloss Generation in Interlinear Glossed Text (https://aclanthology.org/2020.scil-1.42/)
- Neural gloss generation: Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations (https://aclanthology.org/2020.coling-main.471/)
- Shared task on automatic glossing: Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing (https://aclanthology.org/2023.sigmorphon-1.20/)
- GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text (https://aclanthology.org/2024.emnlp-main.683/)
—
Multilingual (Generative) Language Modelling and Evaluation
Type
Master’s thesis / Bachelor’s thesis / Guided Research
Prerequisites
- Good machine learning knowledge
- Experience with Python and a deep learning framework (e.g., PyTorch)
- Knowledge of generative language models
Description
Multilingual models are one of our core interests in the group. In the past, we worked a lot with encoder models, for instance looking in detail at the shape and similarity of representations. We are now working more and more with decoder models and want to test certain aspects on open LLMs, such as:
(a) How anisotropic² and how cross-lingually aligned¹ are the representations of different languages in decoder models? How does this change with instruction tuning? (A minimal probing sketch follows below.)
If we want to extend generative language models to more, and lower-resource, languages, multiple related questions also crop up:
(b) How do we best evaluate generation outputs for low-resource languages, or across multiple languages? Do automatic metrics still correlate well with human preferences? How much data is needed for certain automated metrics to work?³ ⁴
(c) What training schemes and data mixes are most effective for extending models?⁵ Do we need to focus on one language at a time⁶ or can multiple lower-resource languages be trained in one model?
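As an illustration of sub-topic (a), a minimal probing sketch, assuming BLOOM-560m as the open decoder model, mean pooling over the last hidden layer, and a single sentence pair; all of these are placeholder choices for a real experiment over many pairs, layers, and languages:

```python
# Minimal representation-probing sketch for sub-topic (a); model, pooling, and the
# example sentences are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bigscience/bloom-560m"  # any open multilingual decoder LM could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over all tokens of the sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

en = sentence_embedding("The cat is sleeping on the sofa.")
de = sentence_embedding("Die Katze schläft auf dem Sofa.")
unrelated = sentence_embedding("Stock prices fell sharply yesterday.")

cos = torch.nn.functional.cosine_similarity
print("parallel pair:  ", float(cos(en, de, dim=0)))
print("unrelated pair: ", float(cos(en, unrelated, dim=0)))
# Uniformly high similarities even for unrelated pairs hint at anisotropy; a clear
# gap between parallel and unrelated pairs indicates some cross-lingual alignment.
```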
When applying, please indicate your preferred sub-topic. What interests you about this? In addition, let me know if you have specific languages you’d like to work on.
Contact
Kathy Hämmerl (haemmerl [at] cis.lmu.de)
References
- Understanding Cross-Lingual Alignment -- A Survey (https://aclanthology.org/2024.findings-acl.649/)
- Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity (https://aclanthology.org/2023.findings-acl.439/)
- Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices (https://aclanthology.org/2024.inlg-main.44/)
- COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task (https://aclanthology.org/2022.wmt-1.52/)
- Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation (https://arxiv.org/pdf/2408.12780)
- FinGPT: Large Generative Models for a Small Language (https://arxiv.org/pdf/2311.05640)
—
Multilingual Hallucination: Detection, Evaluation and Mitigation
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites
- Enthusiasm (for publishing results at a conference/workshop)
- Proficiency in speaking and writing English
- Good Python programming background (e.g., knowledge of numpy, pandas, and Sklearn libraries)
- Basic knowledge of ML/NLP (e.g., understanding how a classifier works, knowledge of transformer architecture)
- Basic command of PyTorch and Transformers libraries is recommended
Description
Large language models (LLMs) have made remarkable strides in multilingual natural language processing (NLP); however, they remain prone to hallucinations, i.e., generating factually incorrect or misleading information. This issue is particularly severe in low-resource languages, where limited training data exacerbates factual inconsistencies. We are now working on three key objectives for multilingual hallucination: (1) designing a robust detection pipeline capable of identifying hallucinations across multiple languages, (2) constructing a high-quality benchmark dataset to evaluate hallucination detection methods, and (3) developing novel mitigation strategies that integrate structured multilingual knowledge into model training and inference.
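To make objective (1) concrete, here is a minimal sampling-based consistency check in the spirit of the detection references below; the generator, encoder, prompt, and scoring are all illustrative assumptions rather than the project's prescribed pipeline:

```python
# Minimal consistency-based hallucination check; models, prompt, and scoring are
# illustrative assumptions, not a fixed pipeline.
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

generator = pipeline("text-generation", model="bigscience/bloom-560m")
encoder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

prompt = "The capital of Kazakhstan is"
greedy = generator(prompt, max_new_tokens=10, do_sample=False,
                   return_full_text=False)[0]["generated_text"]
samples = [
    out["generated_text"]
    for out in generator(prompt, max_new_tokens=10, do_sample=True, temperature=1.0,
                         num_return_sequences=5, return_full_text=False)
]

# If independently sampled answers diverge from the greedy answer, the greedy answer
# is more likely to be hallucinated (low self-consistency).
emb = encoder.encode([greedy] + samples, normalize_embeddings=True)
consistency = float(np.mean(emb[1:] @ emb[0]))
print(f"answer: {greedy!r}  consistency: {consistency:.2f}")
```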
Contact
Wen Lai (wen.lai [at] tum.de)
References
- Hallucination Detection: A Probabilistic Framework Using Embeddings Distance Analysis (https://arxiv.org/abs/2502.08663)
- SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models (https://arxiv.org/abs/2502.01812)
- On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation (https://arxiv.org/abs/2410.12222)
- GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework (https://arxiv.org/abs/2407.10793)
- Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation (https://arxiv.org/abs/2502.11306)
- Delta - Contrastive Decoding Mitigates Text Hallucinations in Large Language Models (https://arxiv.org/abs/2502.05825)
—
In-context learning for translating low-resourced languages
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites:
- Some background in NLP
- Experience with programming in Python
- Interest in linguistics
Description
Large Language Models are trained on large amounts of unstructured data and gain remarkable abilities to solve various linguistic tasks. However, many languages are not or only insufficiently covered in LLMs. In the absence of (large amounts of) pre-training data, descriptions of linguistic structure can provide the model with relevant information about a language unknown to an LLM, and thus improve the model's abilities to solve a task like translating from such a language.
The thesis aims at exploring the in-context learning abilities of pre-trained LLMs by making use of different types of linguistic resources, such as morpho-syntactic analysis, dictionary entries and small collections of parallel data. When instructing the LLM to translate, information relevant to the sentence to be translated is identified in a prior step, transformed into a linguistic description and presented as part of the prompt.
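A minimal sketch of the prompt-construction step described above, with a hypothetical mini-dictionary, one parallel example, and naive substring retrieval; all resources and words here are invented placeholders for the actual linguistic materials:

```python
# Minimal prompt-construction sketch; the dictionary, the parallel example, and the
# retrieval heuristic are hypothetical placeholders for real linguistic resources.
dictionary = {          # tiny bilingual dictionary for an (invented) source language
    "ama": "water",
    "puri": "to drink",
}
parallel_examples = [   # small collection of parallel sentences
    ("Ama puri.", "He drinks water."),
]

def build_prompt(source_sentence: str) -> str:
    # Naive retrieval: keep dictionary entries whose headword occurs in the sentence.
    relevant = {w: g for w, g in dictionary.items() if w in source_sentence.lower()}
    lines = ["Translate from the source language into English.", "", "Dictionary:"]
    lines += [f"  {w} = {g}" for w, g in relevant.items()]
    lines += ["", "Examples:"]
    lines += [f"  {src} -> {tgt}" for src, tgt in parallel_examples]
    lines += ["", f"Sentence: {source_sentence}", "Translation:"]
    return "\n".join(lines)

print(build_prompt("Ama puri."))
# The assembled prompt would then be passed to a pretrained LLM; the thesis would
# compare which resource types (dictionary, morphological analysis, examples) help most.
```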
Contact
marion.dimarco@tum.de
References
- Court et al. (2024): Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem. https://aclanthology.org/2024.wmt-1.125.pdf
- Zhang et al. (2024): Teaching Large Language Models an Unseen Language on the Fly. https://aclanthology.org/2024.findings-acl.519/
___
Multilingual and Cross-lingual Proactive Abusive Speech Mitigation
Type
Bachelor Thesis / Master Thesis / Guided Research
Requirements
- Experience with programming in Python
- Experience in Pytorch
- Introductory knowledge of Transformers and HuggingFace
Description
Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. The development of automatic abusive speech mitigation models, especially robust multilingual or cross-lingual ones, is therefore still an open research question. Developing automatic mitigation solutions and making them proactive involves the following steps:
- Robust multilingual and cross-lingual abusive/toxic/hate speech detection, i.e., developing a good multilingual classifier such as https://huggingface.co/textdetox/xlmr-large-toxicity-classifier (see the sketch below);
- Text detoxification: text style transfer from toxic to non-toxic, e.g., https://pan.webis.de/clef24/pan24-web/text-detoxification.html;
- Counter speech generation: generating counter-arguments to hate speech, e.g., https://github.com/marcoguerini/CONAN.
We can explore any ideas for multilingual models, or for creating models/datasets specifically for the language(s) you speak.
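As a small illustration of the detection step, a minimal sketch that loads the multilingual toxicity classifier linked above through the Hugging Face pipeline; the example inputs are made up and the exact label names depend on the model card:

```python
# Minimal detection sketch using the classifier linked above; example texts are
# invented and label names should be checked against the model card.
from transformers import pipeline

clf = pipeline("text-classification",
               model="textdetox/xlmr-large-toxicity-classifier")

for text in ["Have a nice day!", "You are completely useless."]:
    pred = clf(text)[0]
    print(f"{text!r} -> {pred['label']} ({pred['score']:.2f})")
# Detoxification or counter-speech generation would then be applied only to inputs
# the classifier flags as toxic, making the overall mitigation pipeline proactive.
```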
Contact
daryna.dementieva@tum.de
How to apply:
Please include your CV and transcript, together with your motivation for doing research on a specific topic and a short note on any previous experience in ML/DL/NLP.
____
Debiasing Hate Speech Detection Tasks through LLM-Powered Data Augmentation
Type
Master’s Thesis / Guided Research
Requirements
- Experience with prompt engineering
- Experience with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
Description
The rise of hate speech on social media platforms poses significant challenges to creating equitable and effective detection systems. Existing datasets often suffer from class imbalance and biases, limiting the performance of machine learning models. This research proposes leveraging LLMs, such as GPT-4, to generate synthetic data for balancing and debiasing hate speech datasets. By using techniques like controlled generation, context-aware prompting, and fine-tuning, the study aims to develop a robust framework that improves classification accuracy and fairness across datasets and cultural contexts.
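A minimal sketch of the controlled-generation idea, using an open instruction-tuned model as a stand-in for GPT-4; the model name, prompt wording, and target group are illustrative assumptions:

```python
# Minimal controlled-generation sketch for synthetic, non-hateful examples that
# mention a frequently targeted group; model, prompt, and group are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

prompt = (
    "You create training data for a hate speech classifier.\n"
    "Write 3 short social media posts that are clearly NOT hateful but mention the "
    "group 'immigrants', so the classifier does not learn to associate the group "
    "itself with hate. Number them 1-3.\n"
)
out = generator(prompt, max_new_tokens=150, do_sample=True, temperature=0.9,
                return_full_text=False)
print(out[0]["generated_text"])
# Generated posts would be filtered (e.g. with a toxicity classifier and human spot
# checks) before being added to the training data to rebalance and debias it.
```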
Contact
Faeze Ghorbanpour (firstname.lastname@tum.de)
References
- Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection
- A Target-Aware Analysis of Data Augmentation for Hate Speech Detection
- Reducing Target Group Bias in Hate Speech Detectors
- Mitigating Biases in Hate Speech Detection from A Causal Perspective
- Don’t Augment, Rewrite? Assessing Abusive Language Detection with Synthetic Data
For Master’s Thesis / Guided Research: when applying, please select one of the recent and relevant papers (a few mentioned above). Write one or two paragraphs summarizing the paper, including its key findings, limitations, and how your proposed research can address these gaps.
____
Simulating Human Annotation for Bias and Hate Speech Detection Using LLMs
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Requirements
- Experience with prompt engineering
- Experience with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
Descriptions
This research explores whether Large Language Models (LLMs) can replicate human annotation patterns by incorporating annotator-specific characteristics. Given that demographic and ideological backgrounds influence how individuals label subjective tasks like hate speech and media bias, this study examines the extent to which LLMs can generate labels that align with human annotators. The findings will contribute to understanding the feasibility of AI-assisted annotation, potential biases in automated labeling, and the broader implications of using LLMs in subjective classification tasks.
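To make the setup concrete, a minimal sketch of sociodemographic prompting, where the same item is labeled under different simulated annotator profiles; the profiles and prompt wording are illustrative assumptions, not a validated protocol:

```python
# Minimal sociodemographic-prompting sketch; profiles and prompt wording are
# illustrative assumptions.
def annotation_prompt(text: str, profile: dict) -> str:
    return (
        f"You are a {profile['age']}-year-old {profile['gender']} annotator "
        f"with {profile['politics']} political views.\n"
        f"Label the following post as HATE or NOT_HATE.\n"
        f"Post: {text}\n"
        f"Label:"
    )

profiles = [
    {"age": 25, "gender": "female", "politics": "left-leaning"},
    {"age": 60, "gender": "male", "politics": "conservative"},
]

for profile in profiles:
    print(annotation_prompt("Example post to be labeled.", profile))
    print("---")
# Each prompt would be sent to an LLM; agreement with real annotators from the same
# demographic group (e.g. accuracy or Cohen's kappa) measures how well the model
# replicates annotator-specific labeling behavior.
```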
Contact
Faeze Ghorbanpour (firstname.lastname@tum.de)
Reference Papers
- The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
- Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets
- Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
- When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset
- LLMs left, right, and center: Assessing GPT's capabilities to label political bias from web domains
For Master’s Thesis / Guided Research: When applying, please select one of the recent and relevant papers (a few mentioned above). Write one or two paragraphs summarizing the paper, including its key findings, limitations, and how your proposed research can address these gaps.
____
Evaluating Multilingual LLMs on Persian Language Tasks: Literature, Math, or Riddles
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Requirements:
- Familiarity with Persian would be a plus.
- Experience with prompt engineering
- Familiarity with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
- (For math) Basic knowledge of mathematical reasoning
- (For riddles) Basic knowledge of web scraping
Description:
This project investigates how multilingual large language models (LLMs) handle Persian language tasks across three domains: literature (history and poetry), mathematics, and riddles/puzzles (text and image). The goal is to test LLMs' performance in Farsi and English-translated versions (for math and riddles) and evaluate different prompting strategies. Given the underrepresentation of Persian in major NLP benchmarks, this study will reveal potential weaknesses in current models and propose improvements for better multilingual understanding.
You can choose one of the following areas for your research: Persian Literature, Mathematics, Riddles and Puzzles
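A minimal sketch of the evaluation loop, posing the same item in Farsi and in English translation and checking both answers against the gold label; the model and the single arithmetic item are illustrative assumptions, and a real study would use full benchmarks and several prompting strategies:

```python
# Minimal bilingual evaluation sketch; the model and the single test item are
# illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

item = {
    "fa": "حاصل جمع ۱۲ و ۲۳ چقدر است؟",      # "What is the sum of 12 and 23?"
    "en": "What is the sum of 12 and 23?",
    "gold": "35",
}

for lang in ("fa", "en"):
    prompt = item[lang] + "\nAnswer with the number only:"
    out = generator(prompt, max_new_tokens=10, do_sample=False,
                    return_full_text=False)
    answer = out[0]["generated_text"].strip()
    # Note: the model may answer with Persian digits ("۳۵"); a robust checker
    # should normalize digit scripts before comparing with the gold answer.
    print(lang, "->", answer, "| correct:", item["gold"] in answer)
# Aggregating accuracy per language and per prompting strategy over a full benchmark
# shows where Persian performance lags behind the English-translated version.
```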
Contact:
Faeze Ghorbanpour (firstname.lastname@tum.de)
Reference Papers
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
- Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
- Multilingual Prompts in LLM-Based Recommenders: Performance Across Languages
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models