Open Topics
We offer multiple Bachelor/Master theses, Guided Research projects, and IDPs in the area of natural language processing.
Below are the primary research directions our supervisors are currently focusing on (listed in alphabetical order); more specific open topics follow further down the page.
- Daryna Dementieva: NLP for Social Good, fake news detection, hate and toxic speech proactive mitigation, multilingual NLP, explainable NLP
- Marion Di Marco: linguistic information in LLMs, translation, morphology, subword segmentation
- Lukas Edman: tokenization, low-resource pretraining of LLMs, character-level models, machine translation
- Faeze Ghorbanpour: efficient transfer learning, harmful content detection, low resource NLP, multilingual NLP, applied NLP, computational social science.
- Kathy Hämmerl: multilingual NLP, cross-lingual alignment, cultural differences, machine translation, evaluation
- Wen (Lavine) Lai: machine translation, multilingual NLP, vision-text alignment, hallucination
- Shu Okabe: NLP for very low-resource languages, parallel sentence mining, linguistic annotation generation, word and morphological segmentation, machine translation
A non-exhaustive list of open topics is given below, together with a potential supervisor for each. Please contact potential supervisors directly.
How to apply:
Please always make sure to include in your application:
- Your CV;
- Your transcript;
- Your motivation to work on the topic(s) of your interest;
- Cross-check any additional requirements from the TA you are contacting.
Translation of Low-Resource Languages and Dialects
Type
Bachelor Thesis / Master Thesis / Guided Research
Prerequisites
- Experience with programming in Python
- Basic understanding of data processing
- Basic understanding of machine learning and machine translation
- (Preferred) Basic understanding of the low-resource language chosen
Description
Machine translation is approaching human-level performance for high-resource languages, but when it comes to low-resource languages and dialects, there is still a lot of room for improvement. The goal of this project is to work on a low-resource language or dialect of your choosing and to establish or improve on the state of the art for translating to and from another language. There are several ways you can go about this, such as gathering data and constructing a new high-quality dataset using alignment methods, or transferring knowledge from a related high-resource language, among many others.
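A common starting point is to evaluate (and later fine-tune) an existing massively multilingual model on your chosen language pair. A minimal sketch, assuming the openly released NLLB-200 checkpoint (see the references) and Galician-to-English purely as an illustrative direction; model size, language codes, and the example sentence are placeholders:

```python
# Minimal translation baseline sketch; the checkpoint, language codes, and the
# example sentence are illustrative placeholders, not a prescribed setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="glg_Latn")  # Galician
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("O can durme no sofá.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),  # target language
    max_new_tokens=30,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
# Establishing or improving the state of the art would mean scoring such outputs
# (e.g. with chrF or COMET) on a held-out test set, then improving the model via
# fine-tuning on newly gathered data or transfer from a related language.
```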
Contact
References
- (Haddow et al. 2022) https://aclanthology.org/2022.cl-3.6.pdf
- (NLLB Team et al. 2022) https://arxiv.org/pdf/2207.04672
- (Zhu et al. 2023) https://arxiv.org/pdf/2304.04675
- (Her and Kruschwitz 2024) https://arxiv.org/pdf/2404.08259
___
Parallel sentence mining for low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Parallel sentence mining aims to find translation pairs across two monolingual corpora (e.g., Wikipedia articles on the same topic but in different languages). This task constitutes a crucial step towards developing machine translation systems, especially for low-resource languages, where parallel corpora are scarce.
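To give a concrete flavor of the task, here is a minimal mining sketch, assuming LaBSE as the multilingual sentence encoder and raw cosine similarity with a threshold; the encoder choice, toy corpora, and threshold are illustrative assumptions (margin-based scoring, as in the references, is usually more robust):

```python
# Minimal bitext-mining sketch; encoder, toy corpora, and threshold are illustrative
# assumptions rather than a recommended configuration.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("sentence-transformers/LaBSE")

src_sents = ["Der Hund schläft.", "Heute regnet es."]          # monolingual corpus A
tgt_sents = ["It is raining today.", "The dog is sleeping."]   # monolingual corpus B

src_emb = encoder.encode(src_sents, normalize_embeddings=True)
tgt_emb = encoder.encode(tgt_sents, normalize_embeddings=True)

sim = src_emb @ tgt_emb.T  # cosine similarities (embeddings are L2-normalized)

THRESHOLD = 0.7  # needs tuning per language pair
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    if row[j] > THRESHOLD:
        print(f"{src_sents[i]}  |||  {tgt_sents[j]}  (cos={row[j]:.2f})")
```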
Contact
Shu Okabe (first.last@tum.de)
References
- Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora (https://aclanthology.org/W17-2512.pdf)
- Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation (https://aclanthology.org/P19-1118.pdf)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (https://aclanthology.org/2022.findings-emnlp.154.pdf)
- Boosting Unsupervised Machine Translation with Pseudo-Parallel Data (https://aclanthology.org/2023.mtsummit-research.12.pdf)
___
Linguistic gloss generation for very low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Linguistic glosses (or interlinear glosses) are linguistic annotations that are created to express the meaning and grammatical phenomena in a source language (e.g., 'Hund-e' in German would be annotated as 'dog-PLURAL' in English). These annotations are costly to obtain and scarce; the aim of gloss generation is hence to automatically predict glosses, especially for low-resource languages. The main challenges come from the amount of training data and the lexical diversity in sentences.
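One possible framing (among several) is to treat glossing as sequence-to-sequence prediction with a pretrained character-level model. A minimal sketch, assuming ByT5 and a single made-up training pair; the model choice and the input/output format are illustrative assumptions, not the required approach:

```python
# Minimal seq2seq glossing sketch; model choice and the single training pair are
# illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/byt5-small"  # character-level model, convenient for morphology
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# One interlinear-glossed-text pair: source line (plus translation) -> gloss line.
source = "gloss: Die Hund-e schlafen | translation: The dogs are sleeping"
target = "DET.PL dog-PL sleep.3PL"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Training loss for this single example; a real setup would fine-tune over a corpus
# (e.g. the SIGMORPHON shared task data) and evaluate morpheme-level accuracy.
loss = model(**inputs, labels=labels).loss
print(float(loss))
```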
Contact
Shu Okabe (first.last@tum.de)
References
- Statistical gloss generation: Automating Gloss Generation in Interlinear Glossed Text (https://aclanthology.org/2020.scil-1.42/)
- Neural gloss generation: Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations (https://aclanthology.org/2020.coling-main.471/)
- Shared task on automatic glossing: Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing (https://aclanthology.org/2023.sigmorphon-1.20/)
- GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text (https://aclanthology.org/2024.emnlp-main.683/)
—
Multilingual (Generative) Language Modelling and Evaluation
Type
Master’s thesis / Bachelor’s thesis / Guided Research
Prerequisites
- Good machine learning knowledge
- Experience with Python and a deep learning framework (e.g., PyTorch)
- Knowledge of generative language models
Description
Multilingual models are one of our core interests in the group. In the past, we worked a lot with encoder models, for instance looking in detail at the shape and similarity of representations. We are now working more and more with decoder models and want to test certain aspects on open LLMs, such as:
(a) How anisotropic² and how cross-lingually aligned¹ are the representations of different languages in decoder models? How does this change with instruction tuning? (A minimal probing sketch follows below.)
If we want to extend generative language models to more, and lower-resource, languages, multiple related questions also crop up:
(b) How do we best evaluate generation outputs for low-resource languages, or across multiple languages? Do automatic metrics still correlate well with human preferences? How much data is needed for certain automated metrics to work?³ ⁴
(c) What training schemes and data mixes are most effective for extending models?⁵ Do we need to focus on one language at a time⁶ or can multiple lower-resource languages be trained in one model?
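As an illustration of sub-topic (a), a minimal probing sketch, assuming BLOOM-560m as the open decoder model, mean pooling over the last hidden layer, and a single sentence pair; all of these are placeholder choices for a real experiment over many pairs, layers, and languages:

```python
# Minimal representation-probing sketch for sub-topic (a); model, pooling, and the
# example sentences are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bigscience/bloom-560m"  # any open multilingual decoder LM could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over all tokens of the sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

en = sentence_embedding("The cat is sleeping on the sofa.")
de = sentence_embedding("Die Katze schläft auf dem Sofa.")
unrelated = sentence_embedding("Stock prices fell sharply yesterday.")

cos = torch.nn.functional.cosine_similarity
print("parallel pair:  ", float(cos(en, de, dim=0)))
print("unrelated pair: ", float(cos(en, unrelated, dim=0)))
# Uniformly high similarities even for unrelated pairs hint at anisotropy; a clear
# gap between parallel and unrelated pairs indicates some cross-lingual alignment.
```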
When applying, please indicate your preferred sub-topic. What interests you about this? In addition, let me know if you have specific languages you’d like to work on.
Contact
Kathy Hämmerl (haemmerl [at] cis.lmu.de)
References
- Understanding Cross-Lingual Alignment -- A Survey (https://aclanthology.org/2024.findings-acl.649/)
- Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity (https://aclanthology.org/2023.findings-acl.439/)
- Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices (https://aclanthology.org/2024.inlg-main.44/)
- COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task (https://aclanthology.org/2022.wmt-1.52/)
- Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation (https://arxiv.org/pdf/2408.12780)
- FinGPT: Large Generative Models for a Small Language (https://arxiv.org/pdf/2311.05640)
—
Multilingual Hallucination: Detection, Evaluation and Mitigation
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites
- Enthusiasm (for publishing results at a conference/workshop)
- Proficiency in speaking and writing English
- Good Python programming background (e.g., knowledge of numpy, pandas, and Sklearn libraries)
- Basic knowledge of ML/NLP (e.g., understanding how a classifier works, knowledge of transformer architecture)
- Basic command of PyTorch and Transformers libraries is recommended
Description
Large language models (LLMs) have made remarkable strides in multilingual natural language processing (NLP); however, they remain prone to hallucinations, i.e., generating factually incorrect or misleading information. This issue is particularly severe in low-resource languages, where limited training data exacerbates factual inconsistencies. We are now working on three key objectives for multilingual hallucination: (1) designing a robust detection pipeline capable of identifying hallucinations across multiple languages, (2) constructing a high-quality benchmark dataset to evaluate hallucination detection methods, and (3) developing novel mitigation strategies that integrate structured multilingual knowledge into model training and inference.
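To make objective (1) concrete, here is a minimal sampling-based consistency check in the spirit of the detection references below; the generator, encoder, prompt, and scoring are all illustrative assumptions rather than the project's prescribed pipeline:

```python
# Minimal consistency-based hallucination check; models, prompt, and scoring are
# illustrative assumptions, not a fixed pipeline.
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

generator = pipeline("text-generation", model="bigscience/bloom-560m")
encoder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

prompt = "The capital of Kazakhstan is"
greedy = generator(prompt, max_new_tokens=10, do_sample=False,
                   return_full_text=False)[0]["generated_text"]
samples = [
    out["generated_text"]
    for out in generator(prompt, max_new_tokens=10, do_sample=True, temperature=1.0,
                         num_return_sequences=5, return_full_text=False)
]

# If independently sampled answers diverge from the greedy answer, the greedy answer
# is more likely to be hallucinated (low self-consistency).
emb = encoder.encode([greedy] + samples, normalize_embeddings=True)
consistency = float(np.mean(emb[1:] @ emb[0]))
print(f"answer: {greedy!r}  consistency: {consistency:.2f}")
```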
Contact
Wen Lai (wen.lai [at] tum.de)
References
- Hallucination Detection: A Probabilistic Framework Using Embeddings Distance Analysis (https://arxiv.org/abs/2502.08663)
- SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models (https://arxiv.org/abs/2502.01812)
- On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation (https://arxiv.org/abs/2410.12222)
- GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework (https://arxiv.org/abs/2407.10793)
- Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation (https://arxiv.org/abs/2502.11306)
- Delta - Contrastive Decoding Mitigates Text Hallucinations in Large Language Models (https://arxiv.org/abs/2502.05825)
—
In-context learning for translating low-resourced languages
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites:
- Some background in NLP
- Experience with programming in Python
- Interest in linguistics
Description
Large Language Models are trained on large amounts of unstructured data and gain remarkable abilities to solve various linguistic tasks. However, many languages are not or only insufficiently covered in LLMs. In the absence of (large amounts of) pre-training data, descriptions of linguistic structure can provide the model with relevant information about a language unknown to an LLM, and thus improve the model's abilities to solve a task like translating from such a language.
The thesis aims at exploring the in-context learning abilities of pre-trained LLMs by making use of different types of linguistic resources, such as morpho-syntactic analysis, dictionary entries and small collections of parallel data. When instructing the LLM to translate, information relevant to the sentence to be translated is identified in a prior step, transformed into a linguistic description and presented as part of the prompt.
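A minimal sketch of the prompt-construction step described above, with a hypothetical mini-dictionary, one parallel example, and naive substring retrieval; all resources and words here are invented placeholders for the actual linguistic materials:

```python
# Minimal prompt-construction sketch; the dictionary, the parallel example, and the
# retrieval heuristic are hypothetical placeholders for real linguistic resources.
dictionary = {          # tiny bilingual dictionary for an (invented) source language
    "ama": "water",
    "puri": "to drink",
}
parallel_examples = [   # small collection of parallel sentences
    ("Ama puri.", "He drinks water."),
]

def build_prompt(source_sentence: str) -> str:
    # Naive retrieval: keep dictionary entries whose headword occurs in the sentence.
    relevant = {w: g for w, g in dictionary.items() if w in source_sentence.lower()}
    lines = ["Translate from the source language into English.", "", "Dictionary:"]
    lines += [f"  {w} = {g}" for w, g in relevant.items()]
    lines += ["", "Examples:"]
    lines += [f"  {src} -> {tgt}" for src, tgt in parallel_examples]
    lines += ["", f"Sentence: {source_sentence}", "Translation:"]
    return "\n".join(lines)

print(build_prompt("Ama puri."))
# The assembled prompt would then be passed to a pretrained LLM; the thesis would
# compare which resource types (dictionary, morphological analysis, examples) help most.
```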
Contact
marion.dimarco@tum.de
References
- Court et al. (2024): Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem. https://aclanthology.org/2024.wmt-1.125.pdf
- Zhang et al. (2024): Teaching Large Language Models an Unseen Language on the Fly. https://aclanthology.org/2024.findings-acl.519/
___
Multilingual and Cross-lingual Proactive Abusive Speech Mitigation
Type
Bachelor Thesis / Master Thesis / Guided Research
Requirements
- Experience with programming in Python
- Experience in Pytorch
- Introductory knowledge of Transformers and HuggingFace
Description
Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. The development of automatic abusive speech mitigation models, especially robust multilingual or cross-lingual ones, is therefore still an open research question. Developing automatic mitigation solutions and making them proactive involves the following steps:
- Robust multilingual and cross-lingual abusive/toxic/hate speech detection, i.e., developing a good multilingual classifier such as https://huggingface.co/textdetox/xlmr-large-toxicity-classifier (see the sketch below);
- Text detoxification: text style transfer from toxic to non-toxic, e.g., https://pan.webis.de/clef24/pan24-web/text-detoxification.html;
- Counter speech generation: generating counter-arguments to hate speech, e.g., https://github.com/marcoguerini/CONAN.
We can explore any ideas for multilingual models, or for creating models/datasets specifically for the language(s) you speak.
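As a small illustration of the detection step, a minimal sketch that loads the multilingual toxicity classifier linked above through the Hugging Face pipeline; the example inputs are made up and the exact label names depend on the model card:

```python
# Minimal detection sketch using the classifier linked above; example texts are
# invented and label names should be checked against the model card.
from transformers import pipeline

clf = pipeline("text-classification",
               model="textdetox/xlmr-large-toxicity-classifier")

for text in ["Have a nice day!", "You are completely useless."]:
    pred = clf(text)[0]
    print(f"{text!r} -> {pred['label']} ({pred['score']:.2f})")
# Detoxification or counter-speech generation would then be applied only to inputs
# the classifier flags as toxic, making the overall mitigation pipeline proactive.
```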
Contact
daryna.dementieva@tum.de
How to apply:
Please include your CV and transcript, together with your motivation for doing research on a specific topic and a short note on any previous experience in ML/DL/NLP.
____
Debiasing Hate Speech Detection Tasks through LLM-Powered Data Augmentation
Type
Master’s Thesis / Guided Research
Requirements
- Experience with prompt engineering
- Experience with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
Description
The rise of hate speech on social media platforms poses significant challenges to creating equitable and effective detection systems. Existing datasets often suffer from class imbalance and biases, limiting the performance of machine learning models. This research proposes leveraging LLMs, such as GPT-4, to generate synthetic data for balancing and debiasing hate speech datasets. By using techniques like controlled generation, context-aware prompting, and fine-tuning, the study aims to develop a robust framework that improves classification accuracy and fairness across datasets and cultural contexts.
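A minimal sketch of the controlled-generation idea, using an open instruction-tuned model as a stand-in for GPT-4; the model name, prompt wording, and target group are illustrative assumptions:

```python
# Minimal controlled-generation sketch for synthetic, non-hateful examples that
# mention a frequently targeted group; model, prompt, and group are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

prompt = (
    "You create training data for a hate speech classifier.\n"
    "Write 3 short social media posts that are clearly NOT hateful but mention the "
    "group 'immigrants', so the classifier does not learn to associate the group "
    "itself with hate. Number them 1-3.\n"
)
out = generator(prompt, max_new_tokens=150, do_sample=True, temperature=0.9,
                return_full_text=False)
print(out[0]["generated_text"])
# Generated posts would be filtered (e.g. with a toxicity classifier and human spot
# checks) before being added to the training data to rebalance and debias it.
```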
Contact
Faeze Ghorbanpour (firstname.lastname@tum.de)
References
- Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection
- A Target-Aware Analysis of Data Augmentation for Hate Speech Detection
- Reducing Target Group Bias in Hate Speech Detectors
- Mitigating Biases in Hate Speech Detection from A Causal Perspective
- Don’t Augment, Rewrite? Assessing Abusive Language Detection with Synthetic Data
For Master’s Thesis / Guided Research: when applying, please select one of the recent and relevant papers (a few mentioned above). Write one or two paragraphs summarizing the paper, including its key findings, limitations, and how your proposed research can address these gaps.
____
Simulating Human Annotation for Bias and Hate Speech Detection Using LLMs
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Requirements
- Experience with prompt engineering
- Experience with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
Descriptions
This research explores whether Large Language Models (LLMs) can replicate human annotation patterns by incorporating annotator-specific characteristics. Given that demographic and ideological backgrounds influence how individuals label subjective tasks like hate speech and media bias, this study examines the extent to which LLMs can generate labels that align with human annotators. The findings will contribute to understanding the feasibility of AI-assisted annotation, potential biases in automated labeling, and the broader implications of using LLMs in subjective classification tasks.
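To make the setup concrete, a minimal sketch of sociodemographic prompting, where the same item is labeled under different simulated annotator profiles; the profiles and prompt wording are illustrative assumptions, not a validated protocol:

```python
# Minimal sociodemographic-prompting sketch; profiles and prompt wording are
# illustrative assumptions.
def annotation_prompt(text: str, profile: dict) -> str:
    return (
        f"You are a {profile['age']}-year-old {profile['gender']} annotator "
        f"with {profile['politics']} political views.\n"
        f"Label the following post as HATE or NOT_HATE.\n"
        f"Post: {text}\n"
        f"Label:"
    )

profiles = [
    {"age": 25, "gender": "female", "politics": "left-leaning"},
    {"age": 60, "gender": "male", "politics": "conservative"},
]

for profile in profiles:
    print(annotation_prompt("Example post to be labeled.", profile))
    print("---")
# Each prompt would be sent to an LLM; agreement with real annotators from the same
# demographic group (e.g. accuracy or Cohen's kappa) measures how well the model
# replicates annotator-specific labeling behavior.
```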
Contact
Faeze Ghorbanpour (firstname.lastname@tum.de)
Reference Papers
- The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
- Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets
- Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
- When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset
- LLMs left, right, and center: Assessing GPT's capabilities to label political bias from web domains
For Master’s Thesis / Guided Research: When applying, please select one of the recent and relevant papers (a few mentioned above). Write one or two paragraphs summarizing the paper, including its key findings, limitations, and how your proposed research can address these gaps.
____
Evaluating Multilingual LLMs on Persian Language Tasks: Literature, Math, or Riddles
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Requirements:
- Familiarity with Persian would be a plus.
- Experience with prompt engineering
- Familiarity with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
- (For math) Basic knowledge of mathematical reasoning
- (For riddles) Basic knowledge of web scraping
Description:
This project investigates how multilingual large language models (LLMs) handle Persian language tasks across three domains: literature (history and poetry), mathematics, and riddles/puzzles (text and image). The goal is to test LLMs' performance in Farsi and English-translated versions (for math and riddles) and evaluate different prompting strategies. Given the underrepresentation of Persian in major NLP benchmarks, this study will reveal potential weaknesses in current models and propose improvements for better multilingual understanding.
You can choose one of the following areas for your research: Persian Literature, Mathematics, Riddles and Puzzles
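A minimal sketch of the evaluation loop, posing the same item in Farsi and in English translation and checking both answers against the gold label; the model and the single arithmetic item are illustrative assumptions, and a real study would use full benchmarks and several prompting strategies:

```python
# Minimal bilingual evaluation sketch; the model and the single test item are
# illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

item = {
    "fa": "حاصل جمع ۱۲ و ۲۳ چقدر است؟",      # "What is the sum of 12 and 23?"
    "en": "What is the sum of 12 and 23?",
    "gold": "35",
}

for lang in ("fa", "en"):
    prompt = item[lang] + "\nAnswer with the number only:"
    out = generator(prompt, max_new_tokens=10, do_sample=False,
                    return_full_text=False)
    answer = out[0]["generated_text"].strip()
    # Note: the model may answer with Persian digits ("۳۵"); a robust checker
    # should normalize digit scripts before comparing with the gold answer.
    print(lang, "->", answer, "| correct:", item["gold"] in answer)
# Aggregating accuracy per language and per prompting strategy over a full benchmark
# shows where Persian performance lags behind the English-translated version.
```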
Contact:
Faeze Ghorbanpour (firstname.lastname@tum.de)
Reference Papers
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
- Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
- Multilingual Prompts in LLM-Based Recommenders: Performance Across Languages
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models