Open Topics
We offer multiple Bachelor/Master theses, Guided Research projects, and IDPs in the area of natural language processing.
A non-exhaustive list of open topics is given below, together with a potential supervisor for each. Please contact potential supervisors directly.
___
Translation of Low-Resource Languages and Dialects
Type
Bachelor Thesis / Master Thesis / Guided Research
Requirements
- Experience with programming in Python
- Basic understanding of data processing
- Basic understanding of machine learning and machine translation
- (Preferred) Basic understanding of the low-resource language chosen
Description
Machine translation approaches human-level performance for high-resource languages, but for low-resource languages and dialects there is still much room for improvement. The goal of this project is to work on a low-resource language or dialect of your choosing and to establish or improve on the state of the art for translating to and from another language. There are several ways to go about this, such as gathering data and constructing a new high-quality dataset using alignment methods, or transferring knowledge from a related high-resource language, among many others.
Contact
References
- (Haddow et al. 2022) https://aclanthology.org/2022.cl-3.6.pdf
- (NLLB Team et al. 2022) https://arxiv.org/pdf/2207.04672
- (Zhu et al. 2023) https://arxiv.org/pdf/2304.04675
- (Her and Kruschwitz 2024) https://arxiv.org/pdf/2404.08259
___
Linguistic gloss generation for low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Linguistic glosses (or interlinear glosses) are annotations that express the meaning and grammatical phenomena of a source language (e.g., 'Hund-e' in German would be annotated as 'dog-PLURAL' in English). Such annotations are costly to obtain and therefore scarce; the aim of gloss generation is hence to predict glosses automatically, especially for low-resource languages. The main challenges are the small amount of training data and the lexical diversity of sentences. Glosses may also help machine translation (cf. the last reference below), since they can act as a bridge between the source and target languages.
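To make the task concrete, here is a minimal, purely illustrative baseline that memorises morpheme-to-gloss correspondences from segmented training pairs and applies them to new words. It is a toy sketch (the function names and the tiny training set are invented for illustration); the actual thesis would use statistical or neural sequence labelling as in the references.

```python
# Toy gloss-generation baseline: memorise morpheme -> gloss pairs from
# hyphen-segmented training data, then gloss new words morpheme by morpheme.

def train_gloss_lexicon(pairs):
    """pairs: list of (segmented word, gloss), e.g. ('Hund-e', 'dog-PL')."""
    lexicon = {}
    for word, gloss in pairs:
        for morph, tag in zip(word.split("-"), gloss.split("-")):
            lexicon.setdefault(morph, tag)
    return lexicon

def predict_gloss(word, lexicon, unk="???"):
    # Unknown morphemes get a placeholder tag; lexical diversity in
    # low-resource settings makes such gaps the central difficulty.
    return "-".join(lexicon.get(m, unk) for m in word.split("-"))

train = [("Hund-e", "dog-PL"), ("Katze-n", "cat-PL")]
lex = train_gloss_lexicon(train)
print(predict_gloss("Hund-e", lex))  # seen in training
print(predict_gloss("Maus-e", lex))  # unseen stem, known suffix
```

Even this trivial baseline exposes the two challenges named above: unseen stems cannot be glossed at all, and ambiguous morphemes would need sentence context to disambiguate.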
Contact
Shu Okabe (first.last@tum.de)
References
- Statistical gloss generation: Automating Gloss Generation in Interlinear Glossed Text (https://aclanthology.org/2020.scil-1.42/)
- Neural gloss generation: Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations (https://aclanthology.org/2020.coling-main.471/)
- Shared task on automatic glossing: Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing (https://aclanthology.org/2023.sigmorphon-1.20/)
- Glosses as bridge for machine translation: Using Interlinear Glosses as Pivot in Low-Resource Multilingual Machine Translation (https://arxiv.org/abs/1911.02709)
___
Parallel sentence mining for low-resource languages
Type
Master's thesis / Bachelor's thesis
Prerequisites
- Experience with Python and deep learning frameworks (TensorFlow or PyTorch)
- Basic understanding of machine learning and natural language processing
- Optional: interest in working on low-resource languages
Description
Parallel sentence mining aims to find translation pairs across two monolingual corpora (e.g., Wikipedia articles on the same topic but in different languages). This task constitutes a crucial step towards developing machine translation systems, notably for low-resource languages, where parallel corpora are scarce.
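A common approach scores candidate pairs in a shared multilingual embedding space. The sketch below implements margin-based scoring in the spirit of Artetxe and Schwenk (2019): raw cosine similarity is normalised by the average similarity to each sentence's nearest neighbours, which penalises "hub" sentences that are similar to everything. The random vectors stand in for embeddings from a multilingual encoder such as LASER or LaBSE; the function names are illustrative.

```python
import numpy as np

def cosine_matrix(src, tgt):
    """Pairwise cosine similarity between two sets of sentence embeddings."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T

def margin_scores(src, tgt, k=2):
    """Ratio-margin scoring: cosine similarity divided by the mean
    similarity to the k nearest neighbours in both directions."""
    sim = cosine_matrix(src, tgt)
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / (0.5 * (knn_src + knn_tgt))

# Random stand-ins for encoder outputs: 5 source and 4 target sentences.
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(5, 16)), rng.normal(size=(4, 16))
scores = margin_scores(src, tgt)
# Greedy candidate extraction: best target for each source sentence.
pairs = [(i, int(scores[i].argmax())) for i in range(len(src))]
```

In practice one would additionally keep only mutual best matches above a threshold, since most sentences in comparable corpora have no translation at all.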
Contact
Shu Okabe (first.last@tum.de)
References
- Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora (https://aclanthology.org/W17-2512.pdf)
- Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation (https://aclanthology.org/P19-1118.pdf)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (https://aclanthology.org/2022.findings-emnlp.154.pdf)
- Boosting Unsupervised Machine Translation with Pseudo-Parallel Data (https://aclanthology.org/2023.mtsummit-research.12.pdf)
___
Cross-Cultural NLP
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites
- Machine learning knowledge
- Proficiency with Python and deep learning frameworks (PyTorch, TensorFlow)
- Knowledge of generative language models
Description
One of our group's core interests is multilingual language models. However, there is a serious concern that these models are dominated by English-language data and, in particular, American cultural norms.
In this project, you would evaluate open models on CulturalBench [1] or similar datasets, and explore how they can become better at such tasks. Has the model never seen data relevant to a specific question, or is information from another culture simply overriding it? This project can include finding additional useful data and fine-tuning a model, but also quicker modifications such as few-shot learning and prompt engineering.
Contact
Kathy Hämmerl (haemmerl [at] cis.lmu.de)
References
[1] CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge (https://arxiv.org/pdf/2404.06664)
[2] Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art (https://arxiv.org/abs/2406.03930)
[3] Speaking Multiple Languages Affects the Moral Bias of Language Models (https://aclanthology.org/2023.findings-acl.134/)
[4] NLPositionality: Characterizing Design Biases of Datasets and Models (https://aclanthology.org/2023.acl-long.505/)
___
Evaluation in Text Style Transfer
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites
● Enthusiasm (for publishing results at a conference/workshop)
● Proficiency in speaking and writing English
● Good Python programming background (e.g., knowledge of numpy, pandas, sklearn libraries)
● Basic knowledge of ML/NLP (e.g., understanding how a classifier works, knowledge of transformer architecture)
● Basic command of PyTorch and Transformers libraries is recommended
Description
Text style transfer (TST) aims to rewrite text from a source style into a target style while preserving its meaning and fluency. For example, rewriting the informal sentence "Sorry about that" into the formal "I apologize for the inconvenience caused" is challenging: although the two sentences are semantically consistent, their consistency is not conveyed through the exact words used, but rather through an underlying semantic space. Bachelor students are expected to understand the evaluation metrics used in TST and to conduct a detailed analysis of current metrics in a low-resource language scenario; for further details, please refer to Babakov et al. (2022) and Ostheimer et al. (2024). Master students need a deeper understanding of the principles of TST evaluation, along with an attempt to propose an innovative evaluation method similar to BERTScore (Zhang et al., 2019).
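The core idea behind BERTScore-style content preservation can be shown in a few lines: match each token embedding in the candidate to its most similar token in the reference (precision), do the same in the reverse direction (recall), and combine into an F1 score. The random matrices below are stand-ins for contextual token embeddings from a model like BERT, and the sketch omits refinements such as IDF weighting.

```python
import numpy as np

def bertscore_f1(cand, ref):
    """Greedy token-matching F1 over cosine similarities, as in BERTScore.
    cand, ref: arrays of shape (num_tokens, embedding_dim)."""
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                  # token-by-token cosine similarities
    precision = sim.max(axis=1).mean()  # each candidate token -> best ref token
    recall = sim.max(axis=0).mean()     # each ref token -> best candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
cand, ref = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
score = bertscore_f1(cand, ref)
```

Because matching happens in embedding space rather than on surface tokens, semantically equivalent but lexically different sentence pairs, as in the formality example above, can still receive high scores.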
Contact
Wen Lai (wen.lai [at] tum.de)
References
[1] A large-scale computational study of content preservation measures for text style transfer and paraphrase generation (https://aclanthology.org/2022.acl-srw.23)
[2] Text Style Transfer Evaluation Using Large Language Models (https://aclanthology.org/2024.lrec-main.1373)
[3] Bertscore: Evaluating text generation with bert (https://arxiv.org/abs/1904.09675)
[4] Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs (https://arxiv.org/abs/2406.04460)
___
Alignment Between Vision and Text
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Prerequisites
● Enthusiasm (for publishing results at a conference/workshop)
● Proficiency in speaking and writing English
● Good Python programming background (e.g., knowledge of numpy, pandas, sklearn libraries)
● Basic knowledge of ML/NLP (e.g., understanding how a classifier works, knowledge of transformer architecture)
● Basic command of PyTorch and Transformers libraries is recommended
Description
Large Vision-Language Models (LVLMs) have demonstrated proficiency in a variety of visual-language tasks. However, current LVLMs suffer from misalignment between the text and image modalities, which can cause problems such as hallucination. Improving the alignment between vision and text is challenging and has become a hot topic. In general, we focus on three research questions: (i) What causes misalignment between vision and text? (ii) How does this misalignment affect the outputs? (iii) How can the alignment between the two modalities be improved? Bachelor students focus on research questions (i) and (ii), analysing the causes and behaviour of the misalignment. Master students need a deeper understanding of LVLMs, along with an attempt to propose an innovative approach to mitigate the misalignment.
Contact
Wen Lai (wen.lai [at] tum.de)
References
[1] Locality Alignment Improves Vision-Language Models (https://arxiv.org/abs/2410.11087)
[2] Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement (https://arxiv.org/abs/2405.15973)
[3] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models (https://arxiv.org/abs/2406.02915)
[4] Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts (https://arxiv.org/abs/2111.08276)
[5] LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models (https://arxiv.org/abs/2404.03118)
___
Segmentation and Morphological Features in Machine Translation
Type
Bachelor Thesis / Master Thesis / Guided Research
Requirements
- Experience with programming in Python
- Basic understanding of data processing
- Basic understanding of machine learning and machine translation
- Interest in linguistics and ideally some knowledge of the language(s)
Description
Morphologically rich languages like Finnish or Turkish condense much information within a single word. This leads to data sparsity problems, as the high number of inflected forms is only insufficiently covered in the training data. Word segmentation approaches such as BPE do not optimally capture morphological patterns. An alternative is linguistically guided word segmentation.
The thesis consists of exploring segmentation approaches for (low-resource) morphologically rich languages, combined with the integration of relevant morpho-syntactic information in an NMT scenario.
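To illustrate why BPE merges need not respect morphology, here is a minimal BPE learner run on a toy set of Finnish inflections of 'talo' (house). On this tiny corpus the frequency-driven merges happen to recover the stem, but on realistic data merges are chosen purely by pair frequency and often cross morpheme boundaries, which is the motivation for linguistically guided segmentation.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Minimal BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():       # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

# talot 'houses', talossa 'in the house', taloissa 'in the houses'
merges, vocab = learn_bpe(["talot", "talossa", "taloissa"], num_merges=3)
```

A morphologically guided segmenter would instead be constrained to boundaries such as talo+i+ssa, making inflectional suffixes explicit for the NMT model.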
Contact
Marion Di Marco (marion.dimarco [at] tum.de)
References
- Mager et al. (2022) https://aclanthology.org/2022.findings-acl.78
- Banerjee et al. (2018) https://aclanthology.org/W18-1207/
- Sälevä and Lignos (2021) https://aclanthology.org/2021.eacl-srw.22.pdf
___
Multilingual and Cross-lingual Text Detoxification
Type
Bachelor Thesis / Master Thesis / Guided Research
Requirements
- Experience with programming in Python
- Experience with PyTorch
- Introductory knowledge of Transformers and HuggingFace
Description
One application of text style transfer is detoxification: rewriting toxic texts into non-toxic ones. Currently, parallel training data is available for nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. However, cross-lingual transfer of text detoxification knowledge between languages is still little investigated. This project explores how much data, and in which language(s), is needed to obtain a text detoxification model for a target language; which models or modules are best suited to achieve this; and whether LLMs can already solve the task.
Contact
daryna.dementieva@tum.de
References
- Data from https://pan.webis.de/clef24/pan24-web/text-detoxification.html
- ParaDetox Datasets and Models for English https://aclanthology.org/2022.acl-long.469/
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification https://aclanthology.org/2023.ijcnlp-main.70.pdf
____
Nationality Bias in Multilingual LLMs: The Impact of Names on Model Responses
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Contact:
Faeze Ghorbanpour (firstname.lastname@tum.de)
Requirements:
- Experience with prompt engineering
- Experience with transformers, LLMs, and data processing
- Introductory knowledge of machine learning and natural language processing
Description
This project explores how language models may treat prompts differently based on names that suggest specific nationalities. With LLMs widely used for daily tasks, it is crucial to understand if and how these models show bias when processing names from different cultures. The project tests this by prompting models with identical contexts while changing only the names to reflect various nationalities, and then analyzing any bias in the responses. This method helps reveal hidden biases and could lead to strategies for fairer and more balanced AI systems.
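The experimental setup can be sketched as a name-substitution probe: one prompt template, instantiated with names suggesting different nationalities, with responses compared across groups. Everything below is illustrative; `query_model` is a hypothetical stand-in for whatever LLM API the project uses, stubbed here so the sketch is self-contained, and the template and name list are invented examples.

```python
# Name-substitution bias probe (sketch). Only the name varies between
# prompts, so any systematic difference in responses indicates bias.

TEMPLATE = ("{name} applied for a loan of 10,000 euros. "
            "Should the bank approve it? Answer yes or no.")
NAMES = {
    "German": "Hans Müller",
    "Turkish": "Ahmet Yılmaz",
    "Nigerian": "Chinedu Okafor",
}

def query_model(prompt):
    """Stub standing in for a real LLM API call."""
    return "yes"

def probe(template, names):
    """Run the same template once per name and collect responses by group."""
    return {group: query_model(template.format(name=name))
            for group, name in names.items()}

responses = probe(TEMPLATE, NAMES)
```

With many templates and repeated sampling, response distributions per group can then be compared statistically rather than anecdotally.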
References
- Uncovering Name-Based Biases in Large Language Models Through Simulated Trust Game
- A Study of Nationality Bias in Names and Perplexity Using Off-the-Shelf Affect-related Tweet Classifiers
- John vs. Ahmed: Debate-Induced Bias in Multilingual LLMs
- Comparing Biases and the Impact of Multilingual Training across Multiple Languages
- What’s in a Name? Auditing Large Language Models for Race and Gender Bias
____
Evolving Hate Speech: A Temporal Analysis of Offensive Language Across Time
Type
Master’s Thesis / Bachelor’s Thesis / Guided Research
Requirements:
- Experience with data processing and analysis
- Experience with transformers, PyTorch, and LLMs
- Introductory knowledge of machine learning and natural language processing
Contact:
Faeze Ghorbanpour (firstname.lastname@tum.de)
Description
This research explores how hate speech evolves over time. By analyzing temporal shifts in offensive language, the study investigates whether older datasets and models remain effective in detecting newer forms of hate speech. It also examines the linguistic variation of hate speech across time periods, highlighting the need for continual model updates and the challenges of applying static models to dynamic data. The study uses existing hate speech datasets spanning multiple years and platforms, providing insights into the temporal dynamics of online hate.
References
- A Systematic Analysis on the Temporal Generalization of Language Models in Social Media
- Hate begets Hate: A Temporal Study of Hate Speech
- Examining Temporal Bias in Abusive Language Detection
- Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation
- Leveraging time-dependent Lexical Features for Offensive Language Detection