Open Topics
Last update: 04/03/2026
We offer multiple Bachelor/Master theses, Guided Research projects, and IDPs in the area of natural language processing.
These are the primary research directions our supervisors are currently focusing on. For more specific topics, please refer to the information below (alphabetical order).
- Daryna Dementieva: NLP for Social Good, fake news detection, hate and toxic speech proactive mitigation, multilingual NLP, explainable NLP
- Marion Di Marco: linguistic information in LLMs, translation, morphology, subword segmentation
- Lukas Edman: tokenization, low-resource pretraining of LLMs, character-level models, machine translation
- Faeze Ghorbanpour: efficient transfer learning, low-resource NLP, multilingual NLP, applied NLP, harmful content detection, computational social science
- Shu Okabe: NLP for very low-resource languages, parallel sentence mining, linguistic annotation generation, word and morphological segmentation, machine translation
A non-exhaustive list of open topics appears below, together with a potential supervisor for each. Please contact potential supervisors directly.
NEW: we are gradually moving to https://thesis.aet.cit.tum.de for thesis management. Please have a look at our up-to-date topics there (research group: Data Analytics & Statistics; direct link: https://thesis.aet.cit.tum.de/?groups=a7aa8118-3b10-4b75-ac84-97ed0551fd4a).
How to apply:
Please always make sure to include the following in your application:
- Your CV;
- Your transcript;
- Your motivation for working on the topic(s) of your interest;
- Cross-check any additional requirements from the TA you are contacting.
Low-Resource Pretraining
Type
Master Thesis / Guided Research
Prerequisites
- Experience with programming in Python, using PyTorch and HuggingFace libraries
- Understanding of data processing
- Understanding of BERT-style and GPT-style pretraining
- Creativity (come with ideas)
Description
Humans see about 100 million words between the ages of 0 and 13. Meanwhile, we train large language models on trillions of words. How can we train language models more efficiently? This project revolves around the BabyLM Challenge, a yearly competition to train the best language model on 10 or 100 million words. There are several potential avenues for improvement, including curating a better dataset, changing the training scheme, and changing the tokenization. There is also a multimodal track, which involves encoding both images and text. And while the challenge has been English-only so far, it is also possible to investigate this for other languages (a multilingual track is coming soon!).
Contact
References
- BabyLM Challenge
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805)
- Improving Language Understanding by Generative Pre-Training (https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
- Findings of the Second BabyLM Challenge (https://aclanthology.org/2024.conll-babylm.1/)
- Too Much Information: Keeping Training Simple for BabyLMs (https://aclanthology.org/2023.conll-babylm.8.pdf)
___
Parallel sentence mining for low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Parallel sentence mining aims to find translation pairs within two monolingual corpora (e.g., Wikipedia articles on the same topic but in different languages). This task constitutes a crucial step towards developing machine translation systems, notably for low-resource languages, where parallel corpora are scarce.
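A common mining approach (used, e.g., in the bitext mining reference below) is margin-based scoring over multilingual sentence embeddings: a candidate pair's cosine similarity is normalised by the similarity of each sentence to its nearest neighbours, which penalises "hub" sentences that are close to everything. The following is a minimal pure-Python sketch of only that scoring step; the tiny hand-made vectors and the choice of k are illustrative stand-ins for real multilingual sentence embeddings (e.g., from LASER or LaBSE).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def margin_scores(src_vecs, tgt_vecs, k=2):
    """Ratio-margin scoring: cos(x, y) divided by the average cosine of
    the k nearest neighbours of x and of y. Higher = more likely a
    genuine translation pair."""
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    scores = {}
    for i, row in enumerate(sims):
        knn_i = sorted(row, reverse=True)[:k]
        for j, sim in enumerate(row):
            col = [sims[x][j] for x in range(len(src_vecs))]
            knn_j = sorted(col, reverse=True)[:k]
            denom = sum(knn_i) / (2 * k) + sum(knn_j) / (2 * k)
            scores[(i, j)] = sim / denom if denom else 0.0
    return scores

# Toy "embeddings" standing in for real multilingual sentence vectors:
# src[0] should align with tgt[0], src[1] with tgt[1]; tgt[2] is a distractor.
src = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
tgt = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.0], [0.5, 0.5, 0.5]]
scores = margin_scores(src, tgt)
best = max(scores, key=scores.get)
```

In a real system, the embeddings would come from a pretrained multilingual encoder and the argmax would be replaced by thresholded, deduplicated pair extraction over millions of sentences.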
Contact
Shu Okabe (first.last@tum.de)
References
- Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora (https://aclanthology.org/W17-2512.pdf)
- Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation (https://aclanthology.org/P19-1118.pdf)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (https://aclanthology.org/2022.findings-emnlp.154.pdf)
- Boosting Unsupervised Machine Translation with Pseudo-Parallel Data (https://aclanthology.org/2023.mtsummit-research.12.pdf)
___
Morphological analysis for low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Machine learning and NLP knowledge
- Proficiency with Python
- Interest in statistical models or deep learning frameworks (TensorFlow or PyTorch)
- Optional: interest in working on low-resource languages
Description
Morphology studies how words are built from morphemes, the smallest meaningful units in a language (e.g., for 'walked': the lemma 'walk' and the suffix 'ed'). Several NLP tasks analyse a language's morphology. The most straightforward is morphological tagging: finding the morphological tag of an inflected word (e.g., recovering 'walk' and V;PST, i.e., a verb (V) in the past (PST) tense, from 'walked'). The 'reverse' task, morphological inflection, produces an inflected word from its lemma and morphological tag (e.g., generating 'walked' from 'walk' and V;PST).
Two further tasks are related to such an analysis: morphological segmentation, the prediction of morpheme boundaries (e.g., obtaining 'walk-ed' from 'walked'), and interlinear glossing, the prediction of linguistic annotations that combine the lemma with grammatical features (e.g., 'walk-PST' from 'walked').
For all tasks, we will focus on low-resource languages.
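To make the segmentation and glossing tasks concrete, here is a deliberately naive sketch that handles the 'walked' examples above with a tiny hand-written suffix lexicon. Real systems learn this mapping from annotated data (see the shared tasks below); the lexicon here is a hypothetical stand-in for English only.

```python
# Toy suffix lexicon: surface suffix -> grammatical tag.
# Hypothetical and English-specific; real systems induce this from data.
SUFFIXES = {"ed": "PST", "ing": "PROG", "s": "PL"}

def segment(word):
    """Morphological segmentation: 'walked' -> 'walk-ed'."""
    # Try longer suffixes first so 'walking' matches 'ing', not 's'.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return f"{word[:-len(suffix)]}-{suffix}"
    return word  # no known suffix: leave the word unsegmented

def gloss(word):
    """Interlinear glossing: 'walked' -> 'walk-PST'."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return f"{word[:-len(suffix)]}-{SUFFIXES[suffix]}"
    return word

print(segment("walked"))  # walk-ed
print(gloss("walked"))    # walk-PST
```

A rule list like this breaks down immediately on irregular forms ('ran'), ambiguous suffixes ('bus' vs. 'cats'), and non-concatenative morphology, which is precisely why the thesis would use learned models instead.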
Contact
Shu Okabe (first.last@tum.de)
References
- Morphological analysis: The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection (https://aclanthology.org/W19-4226v3.pdf; task 2)
- Morphological inflection: SIGMORPHON–UniMorph 2023 Shared Task 0: Typologically Diverse Morphological Inflection (https://aclanthology.org/2023.sigmorphon-1.13.pdf)
- Morphological segmentation: The SIGMORPHON 2022 Shared Task on Morpheme Segmentation (https://aclanthology.org/2022.sigmorphon-1.11.pdf)
- Interlinear gloss generation: Shared task on automatic glossing: Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing (https://aclanthology.org/2023.sigmorphon-1.20/)
___
Machine Translation (MT) for (very) low-resource languages
Type
Master's thesis
Prerequisites
- Machine learning and NLP knowledge
- Proficiency with Python
- Interest in deep learning frameworks (TensorFlow or PyTorch)
- Optional: interest in working on low-resource languages, knowledge of neural MT frameworks (e.g., fairseq)
Description
For low-resource languages (and hence language pairs), Machine Translation (MT) remains a challenge due to the scarcity of datasets available for training, in comparison to high-resource pairs such as German–English. The NLP community has already addressed low-resource languages of Spain [1] and Indic languages [3] with state-of-the-art approaches. When available, linguistic annotations can also be leveraged to perform translation [2].
We will focus on the Upper/Lower Sorbian-German pairs [4], unless you have a specific language pair in mind (with a limited size of available data or documented poor MT performance).
Contact
Shu Okabe (first.last@tum.de)
References
- [1] Findings of the WMT 2024 Shared Task Translation into Low-Resource Languages of Spain: Blending Rule-Based and Neural Systems (https://aclanthology.org/2024.wmt-1.57v2.pdf)
- [2] GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning (https://aclanthology.org/2025.acl-long.1447.pdf)
- [3] Findings of WMT 2025 Shared Task on Low-resource Indic Languages Translation (https://aclanthology.org/2025.wmt-1.29.pdf)
- [4] Findings of the WMT 2025 Shared Task LLMs with Limited Resources for Slavic Languages: MT and QA (https://aclanthology.org/2025.wmt-1.27.pdf)
___
In-context learning for translating low-resource languages
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites
- Some background in NLP
- Experience with programming in Python
- Interest in linguistics
Description
Large Language Models are trained on large amounts of unstructured data and gain remarkable abilities to solve various linguistic tasks. However, many languages are not covered, or only insufficiently covered, in LLMs. In the absence of (large amounts of) pre-training data, descriptions of linguistic structure can provide the model with relevant information about a language unknown to it, and thus improve its ability to solve a task like translating from that language.
This thesis explores the in-context learning abilities of pre-trained LLMs by making use of different types of linguistic resources, such as morpho-syntactic analyses, dictionary entries, and small collections of parallel data. When instructing the LLM to translate, information relevant to the sentence to be translated is identified in a prior step, transformed into a linguistic description, and presented as part of the prompt.
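The prompt-assembly step described above can be sketched as follows: given a source sentence, collect the linguistic resources relevant to its words and present them as context before the translation instruction. The dictionary entries, the parallel examples, and the source language here are hypothetical placeholders, not part of any specific thesis setup.

```python
def build_prompt(sentence, dictionary, parallel_examples, src_lang, tgt_lang="English"):
    """Assemble an in-context-learning prompt from linguistic resources.

    dictionary: word -> gloss, used for the words of this sentence only.
    parallel_examples: small list of (source, target) sentence pairs.
    """
    # Retrieve only the dictionary entries relevant to this sentence.
    entries = [f"{w}: {dictionary[w]}" for w in sentence.split() if w in dictionary]

    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    if entries:
        lines.append("Dictionary entries:")
        lines.extend(entries)
    if parallel_examples:
        lines.append("Example translations:")
        lines.extend(f"{s} -> {t}" for s, t in parallel_examples)
    lines.append(f"Sentence: {sentence}")
    lines.append("Translation:")
    return "\n".join(lines)

# Hypothetical toy resources (Toki Pona used purely for illustration).
prompt = build_prompt(
    "mi moku",
    {"mi": "I / me", "moku": "to eat; food"},
    [("mi lape", "I sleep")],
    src_lang="Toki Pona",
)
```

The resulting string would then be sent to the LLM; the research questions lie in which resources to retrieve, how to verbalise them, and how much each type of information helps.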
Contact
marion.dimarco@tum.de
References
- Court et al. (2024): Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem. https://aclanthology.org/2024.wmt-1.125.pdf
- Zhang et al. (2024): Teaching Large Language Models an Unseen Language on the Fly. https://aclanthology.org/2024.findings-acl.519/
___
Multilingual and Cross-lingual Proactive Abusive Speech Mitigation
Type
Bachelor Thesis / Master Thesis / Guided Research
Prerequisites
- Experience with programming in Python
- Experience with PyTorch
- Introductory knowledge of Transformers and HuggingFace
Description
Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. Thus, the development of automatic abusive speech mitigation models, especially in a robust multilingual or cross-lingual setting, is still an open research question. Developing automatic mitigation solutions and making them proactive involves:
- Robust multilingual and cross-lingual abusive/toxic/hate speech detection, i.e., developing a strong multilingual classifier such as https://huggingface.co/textdetox/xlmr-large-toxicity-classifier;
- Text detoxification: text style transfer from toxic to non-toxic, as in https://pan.webis.de/clef24/pan24-web/text-detoxification.html;
- Counter speech generation: generating counter-arguments to hate speech, as in https://github.com/marcoguerini/CONAN.
We are open to any ideas for multilingual models, or for creating models and datasets specifically for the language(s) you speak.
Contact
daryna.dementieva@tum.de
How to apply:
Please include your CV and transcript, together with your motivation for doing research on the specific topic and a short description of any previous experience in ML/DL/NLP.