https://github.com/tddschn/cmn-nan-translation-prompt-builder
This is a tool for building a detailed prompt to aid a Large Language Model (LLM) in translating Mandarin Chinese (cmn, 北平方言) to Hokkien (nan, 閩南語, Bân-lâm-gí). The core function is to take a Mandarin sentence and enrich it with dictionary definitions, creating an informative context for the translation (翻譯器, huan-i̍k-khì) task.
See screenshots and #Example for a demonstration of how it works.
https://gg.teddysc.me/?g=6c0d06999d1a05c0f425d122a2643ca6&a&c=4
This command builds this prompt
The translation has many mistakes as discussed below.
Help native / heritage speakers learn the writing of Hokkien.
Currently the LLM will give romanization in the Taiwanese mainstream accent (台灣優勢腔之音標), which may not match the dialect you speak and will likely confuse learners who have not been exposed to different dialects of Hokkien.
The words used for the same thing can vary by region too.
LLMs with adequate written Hokkien knowledge are still very rare and not accessible to the general public. This tool builds a prompt that helps an LLM learn enough from the prompt itself to translate the sentence at hand with good enough quality. The Sutian examples also help the LLM choose between literary (文讀) and colloquial (白讀) readings when it needs to.
This project is a command-line tool designed to pre-process a sentence from the 北平方言 (Pak-pîng dialect) of Mandarin Chinese (ISO 639-3: cmn) and generate a detailed prompt for an LLM to translate it into Hokkien (ISO 639-3: nan).
The script automates the process of:
- Breaking down the input sentence into meaningful words.
- Looking up each word in an online Hokkien dictionary.
- Handling words not found in the dictionary by looking up their individual characters.
- Assembling all this information into a single, well-structured Markdown document.
The final Markdown output can then be passed to an LLM to perform a high-quality, context-aware translation.
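The lookup, fallback, and assembly steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `build_prompt`, the `lookup` callable, and `toy_entries` are hypothetical names, and a plain dict stands in for the downloaded dictionary pages. The heading strings mirror the ones in the example output.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the real Sutian lookup: maps a word to a
# Markdown snippet, or returns None when the dictionary has no entry.
Lookup = Callable[[str], Optional[str]]


def build_prompt(sentence: str, words: list[str], lookup: Lookup) -> str:
    """Assemble a Markdown prompt with word-level and character-level results."""
    lines = [
        "# Translation Pre-processing Document",
        "## Original Input",
        f"> {sentence}",
        "---",
        "## Dictionary Lookup Results",
    ]
    for word in words:
        lines.append(f"### 詞語查詢:「{word}」")
        entry = lookup(word)
        if entry is not None:
            lines.append(entry)
        else:
            # Fallback: the word is missing, so look up each character instead.
            for char in word:
                lines.append(f"#### └─ 字元查詢:「{char}」")
                lines.append(lookup(char) or "*(no entry found)*")
        lines.append("---")
    return "\n".join(lines)


# Toy data standing in for parsed dictionary pages (assumption for the demo).
toy_entries = {"颱風": "風颱 hong-thai", "汁": "汁 tsiap"}
prompt = build_prompt("颱風 芭樂汁", ["颱風", "芭樂汁"], toy_entries.get)
print(prompt)
```

Here `芭樂汁` has no entry in the toy data, so the sketch falls back to `芭`, `樂`, and `汁` and nests those results under the word's heading.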
The tool follows a multi-stage process to build the prompt:
- Text Conversion & Segmentation: The input text (in Traditional Chinese) is first converted to Simplified Chinese using `OpenCC`. This is because the `jieba` segmentation library is highly optimized for Simplified Chinese, resulting in more accurate word splits for compounds like `颱風` (typhoon) and `習慣` (habit).
- Initial Dictionary Lookup: The segmented words (converted back to Traditional Chinese) are then used to query the 教育部臺灣閩南語常用詞辭典 (Sutian). The `download_preserve_path_to_dir_structure.py` script downloads all dictionary pages in parallel to maximize speed.
- Fallback Character Lookup: If a segmented word (e.g., "北平") does not yield a result from the dictionary, the script automatically triggers a second-stage lookup. It breaks the failed word into individual characters ("北", "平") and runs a new parallel download to fetch their definitions.
- Prompt Assembly: The script parses the downloaded HTML, converts the relevant dictionary entries to Markdown, and assembles the final document. The output is structured with the original input sentence followed by the dictionary results for each word, including any character-level fallback results nested underneath.
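To make the Traditional → Simplified → Traditional round trip in the first stage concrete, here is a toy sketch that runs without third-party packages. It is deliberately not the real pipeline: a tiny character table (`T2S`) stands in for OpenCC, and a forward-maximum-matching segmenter over a three-word vocabulary (`VOCAB`) stands in for jieba. All names in it are hypothetical.

```python
# Toy Traditional -> Simplified table. The real tool uses OpenCC, which is
# phrase-aware and handles one-to-many mappings that this table ignores.
T2S = {"颱": "台", "風": "风", "習": "习", "慣": "惯", "經": "经", "過": "过"}
S2T = {simp: trad for trad, simp in T2S.items()}

# Toy Simplified vocabulary. The real tool relies on jieba's dictionary.
VOCAB = {"台风", "习惯", "经过"}


def convert(text: str, table: dict[str, str]) -> str:
    """Map characters through the table, leaving unknown characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def segment(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def segment_traditional(text: str) -> list[str]:
    """Convert to Simplified, segment, then convert each word back to Traditional."""
    simplified = convert(text, T2S)
    return [convert(word, S2T) for word in segment(simplified, VOCAB)]


print(segment_traditional("有颱風經過"))  # ['有', '颱風', '經過']
```

The point of the shape is that segmentation happens on the Simplified text, while the dictionary lookup later receives Traditional words.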
- Python 3.11+
- uv (for dependency management via script headers)
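For reference, a script that declares its dependencies for `uv run` generally starts like the sketch below. The shebang and the `# /// script` block follow uv's inline script metadata format. The dependency list is an assumption based on the libraries this README names, not a copy of the real script headers, and `meets_python_requirement` is a hypothetical helper added only so the sketch has a runnable body.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "jieba",
#     "opencc",
#     "selectolax",
#     "pyhtml2md",
# ]
# ///
# NOTE: the package names above are assumptions; check the actual scripts.
import sys


def meets_python_requirement(minimum: tuple[int, int] = (3, 11)) -> bool:
    """Return True when the running interpreter satisfies the declared minimum."""
    return sys.version_info[:2] >= minimum


print(meets_python_requirement())
```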
- Clone the repository:

      git clone https://github.com/tddschn/cmn-nan-translation-prompt-builder
      cd cmn-nan-translation-prompt-builder

- Move both scripts into a directory in your system's `$PATH`, such as `/usr/local/bin`:

      sudo mv pak_penn_to_hokkien_split_and_sutian_prompt_builder.py /usr/local/bin/
      sudo mv download_preserve_path_to_dir_structure.py /usr/local/bin/
You can provide input text directly as a command-line argument, from a file with -f, or via stdin.
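One way to support these three input sources is sketched below with `argparse`. Only the `-f` flag comes from this README; the `--file` long form, the `read_input` name, and the precedence (argument, then file, then stdin) are assumptions made for illustration.

```python
import argparse
import io
import sys
from typing import TextIO


def read_input(argv: list[str], stdin: TextIO = sys.stdin) -> str:
    """Return the text from a positional argument, a -f file, or stdin (in that order)."""
    parser = argparse.ArgumentParser(description="Build a cmn -> nan translation prompt.")
    parser.add_argument("text", nargs="?", help="sentence to pre-process")
    parser.add_argument("-f", "--file", help="read the sentence from this file")
    args = parser.parse_args(argv)

    if args.text is not None:
        return args.text.strip()
    if args.file is not None:
        with open(args.file, encoding="utf-8") as handle:
            return handle.read().strip()
    # Nothing on the command line: fall back to whatever is piped in.
    return stdin.read().strip()


print(read_input(["他們已經習慣了"]))
print(read_input([], stdin=io.StringIO("很熱鬧\n")))
```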
The following example uses an artificially constructed sentence to demonstrate a wide range of vocabulary.
Several 倒裝詞 (compounds whose character order is reversed between Mandarin and Hokkien, such as 颱風 → 風颱) are used in this example.
Command:

    pak_penn_to_hokkien_split_and_sutian_prompt_builder.py '你家後面有颱風經過 今天有客人要來一起吃午飯 很熱鬧 他們已經習慣了 他們說口渴想喝芭樂汁'

Markdown Output (sent to stdout):
# Translation Pre-processing Document
## Original Input
> 你家後面有颱風經過 今天有客人要來一起吃午飯 很熱鬧 他們已經習慣了 他們說口渴想喝芭樂汁
---
## Dictionary Lookup Results
### 詞語查詢:「你家」
*(...Dictionary results for 你家...)*
---
### 詞語查詢:「後面」
*(...Dictionary results for 後面...)*
---
### 詞語查詢:「颱風」
*(...Dictionary results for 颱風...)*
---
### 詞語查詢:「習慣」
*(...Dictionary results for 習慣...)*
---
### 詞語查詢:「芭樂汁」
#### └─ 字元查詢:「芭」
*(...Dictionary results for the character 芭...)*
#### └─ 字元查詢:「樂」
*(...Dictionary results for the character 樂...)*
#### └─ 字元查詢:「汁」
*(...Dictionary results for the character 汁...)*
---
### LLM INSTRUCTION
Based on the original text and the provided dictionary lookups for each word, please translate the "Original Input" from Beijing Dialect (Mandarin) into Hokkien. Use the dictionary examples to ensure the translation is natural and accurate.

Feeding the generated Markdown prompt into a capable LLM (I'm using the free Gemini 2.5 Pro model provided by Google AI Studio) could yield a high-quality translation like this:
Hokkien (Hanji): 恁兜後壁有風颱經過。今仔日有儂客欲來做伙食中晝,真鬧熱,𪜶已經習慣矣。𪜶講喙焦想欲啉菝仔汁。
Hokkien (Romanization): Lín tau āu-piah ū hong-thai king-kuè. Kin-á-ji̍t ū lâng-kheh beh lâi tsò-hué tsia̍h-tiong-tàu, tsin lāu-jia̍t, in í-king si̍p-kuàn--ah. In kóng tshuì-ta siūnn-beh lim pá-á-tsiap.
- The LLM output is mostly fine, but `guest` should be `人客`; the LLM got that wrong (it wrote `儂客`). The tool is not much help if you don't already know Hokkien.
- Sutian has an entry for `習慣`, but no one I know in real life or online ever uses it. The LLM doesn't know that, so it made another mistake. `慣習` is the correct one.
- `lunch` -> `中晝` surprised me, and I think it would surprise a lot of people. 晝 is the archaic character / word for `daytime` (when the sun is above you).
- ithuan / 意傳科技's Hokkien TTS for this sentence
- Segmentation Accuracy (`OpenCC` + `Jieba`): To achieve the most accurate word segmentation, the input text is first converted from Traditional to Simplified Chinese. This allows `jieba` to leverage its superior optimization for mainland Chinese vocabulary, correctly identifying multi-character words. The results are then converted back to Traditional for the dictionary lookup.
- Performance (Parallel Downloads): Dictionary lookups are I/O-bound. To avoid a long sequential wait, a helper script `download_preserve_path_to_dir_structure.py` is used to fetch all dictionary pages in parallel, drastically reducing the total execution time.
- Robustness (Fallback Mechanism): Not all words, even when correctly segmented, exist in the dictionary (e.g., new words, slang, or proper nouns). The fallback mechanism that looks up individual characters ensures that the LLM still receives some contextual information, rather than nothing at all.
- Parsing Speed (`selectolax` & `pyhtml2md`): For parsing the downloaded HTML and converting it to Markdown, `selectolax` and `pyhtml2md` were chosen over more common libraries like BeautifulSoup and `markdownify` due to their significantly better performance, which is beneficial when processing many files.
- Dependency Management (`uv`): The scripts use a `uv run` shebang header. This makes them self-contained by declaring their Python dependencies at the top of the file, allowing `uv` to create an ephemeral, cached virtual environment automatically. This ensures reproducibility without manual `pip install` steps.
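The parallel-download idea can be illustrated with a thread pool from the standard library. This is only a sketch of the general technique for I/O-bound lookups; how `download_preserve_path_to_dir_structure.py` actually implements it is not shown here, and `fetch_all` and `fake_fetch` are hypothetical names. `fake_fetch` replaces the real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_all(
    words: list[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,
) -> dict[str, str]:
    """Fetch one page per word concurrently and return {word: page}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each word with its page.
        return dict(zip(words, pool.map(fetch, words)))


def fake_fetch(word: str) -> str:
    """Hypothetical stand-in for an HTTP request to the dictionary site."""
    return f"<html>{word}</html>"


pages = fetch_all(["颱風", "習慣", "芭"], fake_fetch)
print(pages)
```

Because the work is waiting on the network rather than computing, threads are enough here; the same `fetch_all` call could serve both the word-level pass and the character-level fallback pass.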
This project would not be possible without the excellent work of the following open-source projects and data sources:
- 教育部臺灣閩南語常用詞辭典 (Sutian): For providing an invaluable, high-quality, and openly accessible dictionary for Hokkien.
- ithuan / 意傳科技's Hokkien TTS.
- OpenCC (Open Chinese Convert): For the robust and accurate library that makes the crucial Traditional-Simplified-Traditional conversion workflow possible.
- Jieba: For the powerful and fast Chinese segmentation engine that forms the core of the text processing pipeline.



