Comparative Evaluation of Zero-Shot and Few-Shot Performance of Large Language Models in Low-Resource Language Machine Translation
DOI:
https://doi.org/10.66372/JGER.v3i2.5
Keywords:
large language models, low-resource machine translation, few-shot learning, zero-shot translation
Abstract
Large language models (LLMs) have demonstrated remarkable translation capabilities for high-resource languages, yet their effectiveness on low-resource languages under varying prompting conditions remains insufficiently understood. This study presents a comparative evaluation of four LLMs (GPT-4, GPT-3.5-Turbo, LLaMA-2-70B, and BLOOM-176B) alongside NLLB-200-3.3B as a supervised baseline, across ten translation directions spanning four resource levels. Using the FLORES-200 devtest set as the primary benchmark and NTREX-128 for cross-validation, we assess zero-shot, one-shot, five-shot, and eight-shot configurations with the BLEU, chrF++, and COMET-22 metrics. Our results reveal three principal findings. First, the few-shot advantage is most pronounced for low-resource languages, with GPT-4 achieving an average BLEU gain of 5.3 points when moving from zero-shot to five-shot on low-resource pairs. Second, one-shot prompting consistently degrades performance below zero-shot baselines, with an average BLEU reduction of 1.4 points across low-resource directions. Third, the supervised NLLB-200 baseline outperforms all LLMs in the zero-shot setting on eight of ten directions, while five-shot GPT-4 narrows this gap to within 1.0 BLEU on mid-resource pairs. These findings provide empirical guidance for practitioners selecting prompting strategies for LLM-based translation in resource-constrained settings.
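
To make the prompting configurations concrete, the following is a minimal sketch of how a k-shot translation prompt might be assembled, where k = 0 corresponds to zero-shot and k = 1, 5, or 8 to the few-shot settings. The abstract does not specify the prompt template, so the function name `build_prompt`, the instruction wording, and the English-Swahili demonstration pairs are all illustrative assumptions rather than the authors' actual setup.

```python
from typing import List, Tuple


def build_prompt(
    src_lang: str,
    tgt_lang: str,
    source: str,
    examples: List[Tuple[str, str]],
    k: int,
) -> str:
    """Assemble a k-shot translation prompt (hypothetical template).

    k = 0 yields a zero-shot prompt; k > 0 prepends the first k
    (source, target) demonstration pairs before the sentence to translate.
    """
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and len(examples)")

    # Instruction line: the wording is an assumption, not taken from the paper.
    lines = [f"Translate the following text from {src_lang} to {tgt_lang}."]

    # Demonstration pairs, one source line and one target line each.
    for ex_src, ex_tgt in examples[:k]:
        lines.append(f"{src_lang}: {ex_src}")
        lines.append(f"{tgt_lang}: {ex_tgt}")

    # The sentence to translate, with the target label left open for the model.
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Illustrative demonstration pairs (not drawn from the study's data).
demos = [("Thank you.", "Asante."), ("Water.", "Maji.")]

zero_shot = build_prompt("English", "Swahili", "Good morning.", demos, k=0)
two_shot = build_prompt("English", "Swahili", "Good morning.", demos, k=2)
```

Under this template the zero-shot and few-shot prompts differ only in the number of demonstration pairs, which keeps the instruction and the test sentence constant across configurations.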

