Research ▪ sancl

Linguistic Probing of Language Models

Do language models understand structural and theoretical knowledge of language? Language models are known to encode structural linguistic knowledge and to process such linguistic information in specific parts (layers or neurons) of the model. I’m collaborating with syntacticians and semanticists on linguistic phenomena that language models seem unlikely to “understand.”

Semoon Hoe and Sangah Lee (2024), A Short Note on the Structural Priming in LLM: Focusing on Dative Constructions in Korean, Language and Information, Vol.28, No.3, pp.111-142. (In Korean)

(Human-like) Reasoning Abilities of LLMs

I’m interested in the intermediate reasoning steps that LLMs produce while solving problems, especially linguistic or cognitive ones. We can obtain their rationales in natural language and assess those rationales from various points of view. Ideas from pedagogy or language acquisition may help too.

Seung Joo Yoo and Sangah Lee (2024), Large Language Models Show Human-Like Abstract Thinking Patterns: A Construal-Level Perspective, Proceedings of the Annual Meeting of the Cognitive Science Society.

Dealing with Low-Resource Language(s)

Yes, I like Manchu. Beyond Manchu, there is so much to do in NLP with non-English languages, especially low-resource ones. I’m especially interested in tokenizers for languages that are not written in the Latin alphabet or that are highly agglutinative (yes, Korean!).

Jean Seo, Minha Kang, SungJoo Byun, and Sangah Lee (2024), ManWav: The First Manchu ASR Model, Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (Field Matters 2024).
Sangah Lee, Sungjoo Byun, Jean Seo, and Minha Kang (2024), ManNER & ManPOS: Pioneering NLP for Endangered Manchu Language, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
Jean Seo, Sungjoo Byun, Minha Kang, and Sangah Lee (2023), Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data, 3rd Multilingual Representation Learning (MRL) Workshop.
Sangah Lee and Hyopil Shin (2021), The Korean Morphologically Tight-Fitting Tokenizer for Noisy User-Generated Texts, 2021 The 7th Workshop on Noisy User-Generated Text (W-NUT).

Argument Mining

What do people think about a controversial topic? Their opinions form a kind of argumentative data, the subject of “Argument Mining.” In particular, I’d like to collect and summarize the diverse evidence that people propose when they support or attack a stance on the topic. This requires many steps: analyzing the argumentative structure of given texts, identifying the necessary arguments or evidence, summarizing or clustering that evidence, and so on.

Sangah Lee and Hyopil Shin (2021), Argument Facet Detection in Online Debates Based on Attention Weights and Clustering with Combined Similarity Matrices, Korean Journal of Linguistics, Vol.46, No.1, pp.107-134.
Sangah Lee and Hyopil Shin (2018), An Analysis of Linear Argumentation Structure of Korean Debate Texts Using Sequential Modeling and Linguistic Features, Journal of KIISE, Vol.45, No.12, pp.1292-1301. (In Korean)
Sangah Lee and Hyopil Shin (2016), Stance Classification of Online Debate Texts based on Discourse Relations, Language Research, Vol.52, No.3, pp.511-532. (In Korean)