Parameter-Efficient Multi-Task Fine-Tuning in Code-Related Tasks

1. The Problem: Multi-Task in a PEFT World

Large code models (LCMs) like CodeLlama, DeepSeek-Coder, and Qwen2.5-Coder have made impressive progress on individual code-related tasks, but adapting them to a specific downstream setting still requires fine-tuning. Full fine-tuning is expensive and storage-heavy — one full copy of the model per task — and parameter-efficient fine-tuning (PEFT) methods like QLoRA have emerged as the practical alternative for the single-task case.

But real software engineering workflows are not single-task. A useful coding assistant generates code from a description, summarizes existing functions, translates code between languages, and switches between these modes inside the same session. The interesting question is whether a single QLoRA-fine-tuned model can handle multiple code tasks simultaneously — and not just on functional correctness, but also on the non-functional qualities (complexity, maintainability, code smells) that determine whether the produced code is actually deployable.

To our knowledge, this is the first study to jointly evaluate functional correctness and non-functional code quality in multi-task QLoRA settings. Most prior PEFT work focused either on a single task at a time, or on functional metrics only.

2. What We Did: 15 Models, 3 Tasks, 3 Scales

We took the Qwen2.5-Coder-Instruct family at three scales — 0.5B, 1.5B, and 3B parameters — and ran a controlled comparison across three training configurations and three code-related tasks.

Tasks

Code Generation — NL → Code
Code Summarization — Code → NL
Code Translation — Java ↔ C# (Code → Code)

Training configurations

MT-QLoRA — Multi-task QLoRA (one model, all tasks)
ST-QLoRA — Single-task QLoRA (one model per task)
MT-FFT — Multi-task Full Fine-Tuning (no PEFT)
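
The difference between the multi-task and single-task configurations comes down to what goes into the training set. The sketch below shows one plausible way to pool examples from the three tasks into a single instruction-style set. The prompt templates, field names, and shuffling strategy are all assumptions made for illustration; this summary does not describe the study's actual prompts or mixing scheme.

```python
import random

# Hypothetical instruction templates; the study's real prompts are not given here.
TEMPLATES = {
    "generation": "Generate {lang} code for the following description:\n{source}",
    "summarization": "Summarize the following {lang} code:\n{source}",
    "translation": "Translate the following code to {lang}:\n{source}",
}


def format_example(task, lang, source, target):
    """Turn one raw (source, target) pair into a prompt/completion record."""
    prompt = TEMPLATES[task].format(lang=lang, source=source)
    return {"task": task, "prompt": prompt, "completion": target}


def build_training_set(per_task_examples, seed=0):
    """Pool the examples of every task present and shuffle them together.

    Passing all three tasks gives a multi-task set; passing a dict with a
    single task gives the corresponding single-task set.
    """
    pooled = [ex for examples in per_task_examples.values() for ex in examples]
    random.Random(seed).shuffle(pooled)
    return pooled
```

Under these assumptions, an MT-QLoRA run and an ST-QLoRA run would share the same code path and differ only in which tasks are passed to `build_training_set`.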

Model scales

Qwen2.5-Coder-0.5B-Instruct
Qwen2.5-Coder-1.5B-Instruct
Qwen2.5-Coder-3B-Instruct

That gives 15 trained models in total. MT-QLoRA and MT-FFT each yield one multi-task model per scale (2 configurations × 3 scales = 6), and ST-QLoRA yields one model per task per scale (3 tasks × 3 scales = 9). Together: 6 + 9 = 15.
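
The model count can be checked by enumerating the training runs directly. This is a small sketch; the tuple layout and the label "all" for multi-task runs are illustrative choices, not names from the study.

```python
SCALES = ["0.5B", "1.5B", "3B"]
TASKS = ["generation", "summarization", "translation"]


def enumerate_models():
    """List every (configuration, scale, task) training run in the design."""
    models = []
    for scale in SCALES:
        # The multi-task configurations train one model per scale on all tasks.
        models.append(("MT-QLoRA", scale, "all"))
        models.append(("MT-FFT", scale, "all"))
        # Single-task QLoRA trains one model per task per scale.
        for task in TASKS:
            models.append(("ST-QLoRA", scale, task))
    return models


print(len(enumerate_models()))  # 15
```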

Datasets

We used the standard public benchmarks: CodeXGLUE for code generation, code summarization (Python and Java), and Java↔C# code translation, plus CoderEval for execution-based evaluation of code generation. Details of the per-task splits are summarized in Table 2.

Table 2 — Dataset and Task Setup

Task	Languages	Source benchmark(s)	Splits
Code Generation	Python, Java	CodeXGLUE (NL→Code), CoderEval	Train / Validation / Test (CodeXGLUE standard splits; CoderEval used for execution-based pass@1)
Code Summarization	Python, Java	CodeXGLUE (Code→NL)	Train / Validation / Test (CodeXGLUE standard splits)
Code Translation	Java ↔ C#	CodeXGLUE (Code→Code)	Train / Validation / Test (CodeXGLUE standard splits)

Table 2. Per-task data setup. The exact training/validation/test row counts follow the public CodeXGLUE and CoderEval releases used by the paper.

Metrics

We evaluated along four axes:

Functional correctness — pass@1 (CoderEval, execution-based) for code generation.
Surface-level quality — CodeBLEU, BLEU, METEOR, ROUGE-L, chrF, BERTScore, and SIDE for translation and summarization.
Non-functional code quality — static analysis with Pylint (Python), PMD (Java), Roslyn (C#), Lizard (cyclomatic and cognitive complexity), and SonarCloud (issues, code smells, maintainability).
LLM-as-a-judge — we additionally used GPT-5 Mini as a judge for summarization quality, providing a second signal beyond reference-based metrics.
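
To make the functional-correctness metric concrete: with one generated solution per problem, pass@1 is simply the fraction of problems whose solution passes its tests. The sketch below is a toy illustration, not the CoderEval harness. The helper names are hypothetical, and running untrusted code with `exec` in the current process is only acceptable for a demonstration; a real evaluation would isolate execution.

```python
def passes(candidate_src, test_src):
    """Return True if the candidate code runs and its tests raise no error.

    Toy check only: it executes code in-process with no sandbox or timeout.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run assertions against them
    except Exception:
        return False
    return True


def pass_at_1(outcomes):
    """Fraction of problems whose single generated solution passed."""
    if not outcomes:
        raise ValueError("need at least one problem")
    return sum(outcomes) / len(outcomes)
```

For example, `pass_at_1([True, False, False, True])` evaluates to `0.5`.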

For statistical reliability, paired comparisons used McNemar's test (binary outcomes such as pass@1) and the Wilcoxon signed-rank test (continuous metrics), with Holm-Bonferroni correction for multiple comparisons.
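
As a rough illustration of the testing procedure, the sketch below implements an exact two-sided McNemar test on discordant pairs and the Holm-Bonferroni step-down adjustment using only the standard library. Whether the study used the exact or the chi-square form of McNemar's test is not stated here, so the exact form is an assumption; the Wilcoxon signed-rank test is omitted for brevity.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: problems solved by model A but not by model B.
    c: problems solved by model B but not by model A.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

A comparison is then called significant only if its adjusted p-value falls below the chosen threshold, which keeps the family-wise error rate under control across the many model, task, and metric comparisons.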

Why this design

By fixing the model family and varying only scale (0.5B/1.5B/3B) and training configuration (MT-QLoRA / ST-QLoRA / MT-FFT), we can attribute differences in performance and code quality to the configuration choice rather than to model architecture. And by combining execution-based metrics, surface metrics, static-analysis metrics, and an LLM judge, we close the gap between “does it pass tests?” and “is it actually good code?”

3. RQ1 — Multi-Task QLoRA vs Single-Task QLoRA

Does training one QLoRA model on three tasks at once cost us anything compared to training a separate QLoRA per task?

Code Generation

Across both Python and Java, MT-QLoRA is competitive with ST-QLoRA, and the benefits of multi-task training increase with model capacity.

Language	Scale	MT-QLoRA pass@1	ST-QLoRA pass@1	Direction
Python	0.5B	MT-QLoRA matches ST-QLoRA		Similar
	1.5B	18.95%	16.84%	MT > ST
	3B	21.05%	20.53%	MT > ST
Java	0.5B	Comparable		Similar
	1.5B	Comparable		Similar
	3B	32.07%	29.89%	MT > ST

RQ1 — pass@1 for code generation: MT-QLoRA vs ST-QLoRA across the three Qwen2.5-Coder scales.

Static analysis adds an interesting wrinkle: larger MT models produce simpler, more maintainable code. At 3B for Python, MT-QLoRA showed 33.8% fewer maintainability issues and 19.3% lower cognitive complexity than ST-QLoRA. McNemar (for pass@1) and Wilcoxon (for continuous metrics) tests with Holm-Bonferroni correction did not find statistically significant differences in raw pass@1, so the safe summary is: MT-QLoRA matches ST-QLoRA on correctness while improving non-functional code quality at scale.
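
For readers unfamiliar with complexity metrics, the sketch below approximates cyclomatic complexity for Python source by counting decision points in the syntax tree. It is a simplified stand-in written for illustration, not Lizard's implementation, and it does not compute cognitive complexity, which additionally penalizes nesting.

```python
import ast

# Node types treated as decision points in this simplified approximation.
_BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity
```

A straight-line function scores 1, and each additional branch, loop, or boolean operator raises the score, which is why lower values are read as simpler code.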

Code Translation

Translation is more sensitive to direction and scale.

Java→C#: MT-QLoRA underperforms ST-QLoRA at 0.5B and 3B but slightly outperforms it at 1.5B. MT-QLoRA also consistently produces more compact Java→C# translations.
C#→Java: similar pattern with 1.5B as the sweet spot. At 3B, C#→Java MT-QLoRA reduces PMD issues by 13.8% and SonarCloud issues by 22.7%.

So translation is the task where MT-QLoRA's quality advantage shows up most clearly in non-functional metrics, even when surface-level translation metrics are mixed.

Code Summarization

Summarization shows a striking language-dependent effect.

Language	Scale	BLEU change (MT-QLoRA vs ST-QLoRA)	Direction
Python	0.5B	+36.0%	MT > ST
	1.5B	+51.5%	MT > ST
	3B	+28.3%	MT > ST
Java	0.5B		ST > MT (advantage strengthens with scale)
	1.5B		ST > MT
	3B		ST > MT

RQ1: BLEU comparison for code summarization. Python shows large BLEU gains from multi-task training; for Java, the direction reverses.

The LLM-as-a-judge evaluation (GPT-5 Mini) tells a more nuanced story: for Java summarization, it finds no significant quality differences between MT-QLoRA and ST-QLoRA, while for Python, MT-QLoRA's advantages remain significant at the larger scales.

RQ1 takeaway

MT-QLoRA is genuinely competitive with ST-QLoRA on functional metrics, and frequently better on non-functional code quality — especially at 1.5B and 3B. The clearest exception is Java summarization, where single-task training retains an advantage that grows with scale.

4. RQ2 — Multi-Task QLoRA vs Multi-Task Full Fine-Tuning

The next question is whether the parameter-efficient route gives anything up relative to multi-task full fine-tuning.

Code Generation

For both Python and Java, MT-QLoRA matches or exceeds MT-FFT, with the strongest gains at the 3B scale.

Table 11 — Python Code Generation: MT-QLoRA vs MT-FFT

Scale	MT-QLoRA pass@1	MT-FFT pass@1	Δ pass@1 (relative)	Lizard / Pylint / SonarCloud
0.5B	MT-QLoRA > MT-FFT		Improved	Lower complexity overall
1.5B	MT-QLoRA > MT-FFT		Improved	Lower complexity overall
3B	21.05%	17.37%	+21.2%	32% lower CyC for MT-QLoRA; fewer Pylint and SonarCloud issues

Table 11. Python code generation, MT-QLoRA vs MT-FFT. MT-QLoRA improves pass@1 at every scale and produces substantially less complex code at 3B.

Table 12 — Java Code Generation: MT-QLoRA vs MT-FFT

Scale	MT-QLoRA pass@1	MT-FFT pass@1	Δ pass@1 (relative)	Lizard / PMD / SonarCloud
0.5B	Comparable		Similar	Comparable static-analysis profile
1.5B	MT-QLoRA ≥ MT-FFT		Improved	Comparable to slightly fewer issues
3B	32.07%	26.63%	+20.4%	Fewer PMD and SonarCloud issues vs MT-FFT

Table 12. Java code generation, MT-QLoRA vs MT-FFT. MT-QLoRA is substantially better at 3B on both functional and non-functional metrics.
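
The Δ pass@1 values in the 3B rows of Tables 11 and 12 are relative improvements over MT-FFT, not percentage-point differences. A short check using the pass@1 figures from the tables (the function and variable names are illustrative):

```python
def relative_change(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (MT-QLoRA pass@1, MT-FFT pass@1) at 3B, taken from Tables 11 and 12.
rows = {"Python 3B": (21.05, 17.37), "Java 3B": (32.07, 26.63)}

for name, (mt_qlora, mt_fft) in rows.items():
    points = mt_qlora - mt_fft
    relative = relative_change(mt_qlora, mt_fft)
    print(f"{name}: +{points:.2f} points, +{relative:.1f}% relative")
```

This gives +3.68 points (+21.2% relative) for Python and +5.44 points (+20.4% relative) for Java, matching the tables.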

Code Translation

Translation results are more mixed:

Java→C#: MT-FFT generally leads on CodeBLEU.
C#→Java: MT-QLoRA is competitive at 1.5B and shows large non-functional quality improvements at 3B — SonarCloud issues drop by 69.6% versus MT-FFT.

Code Summarization

The biggest gap between MT-QLoRA and MT-FFT shows up in Python summarization, where MT-QLoRA dramatically outperforms MT-FFT on BLEU at every scale.

Language	Scale	BLEU change (MT-QLoRA vs MT-FFT)	Direction
Python	0.5B	+103.9%	MT-QLoRA > MT-FFT
	1.5B	+80.7%	MT-QLoRA > MT-FFT
	3B	+32.7%	MT-QLoRA > MT-FFT
Java	0.5B	Mixed — MT-QLoRA better on overlap metrics (METEOR, ROUGE-L, SIDE); MT-FFT better on BERTScore
	1.5B	Mixed
	3B	Mixed — gap on BERTScore narrows at scale

RQ2 — Code summarization. Python heavily favors MT-QLoRA over MT-FFT; Java is metric-dependent.

RQ2 takeaway

Multi-task QLoRA is not just cheaper than multi-task full fine-tuning — it is often better, particularly at 3B and particularly for Python. The places MT-FFT still leads (Java→C# CodeBLEU, Java summarization BERTScore at small scales) are narrow and metric-specific, not a general superiority.

5. Key Takeaways

Multi-task QLoRA is not merely a cost-saving compromise. It can match or beat both single-task QLoRA and multi-task full fine-tuning, particularly at the 3B scale.
Bigger helps. Larger models are better at balancing multiple objectives within a parameter-efficient framework. The 3B Qwen2.5-Coder-Instruct is where MT-QLoRA's advantages are most consistent.
Code quality is not systematically degraded by multi-task PEFT. Across Pylint, PMD, Roslyn, Lizard, and SonarCloud, MT-QLoRA frequently produces lower-complexity, more maintainable code than both ST-QLoRA and MT-FFT, which contradicts the common assumption that multi-task PEFT must trade off code quality.
Language-dependent effects exist. Python benefits more from multi-task training; Java summarization tends to favor single-task training, with the gap widening at scale.
Direction matters in translation. The Java→C# and C#→Java directions behave differently, and 1.5B is the most consistent “sweet spot” for MT-QLoRA on translation.
First joint evaluation of correctness and quality. To our knowledge, this is the first study of multi-task PEFT in code that evaluates functional correctness and non-functional code quality together — a more realistic standard for whether the produced code is actually deployable.

6. Bottom Line

A single QLoRA-optimized model can effectively handle code generation, code translation, and code summarization simultaneously, and the resulting code holds up under both rigorous static analysis and LLM-based evaluation. For practitioners, the practical message is straightforward: train one MT-QLoRA model on top of a Qwen2.5-Coder-class base — preferably at 1.5B or 3B — before reaching for either per-task adapters or full fine-tuning. The exceptions (Java summarization, Java→C# CodeBLEU) are worth knowing, but they don't change the headline.

For researchers, the more interesting message is methodological: evaluating multi-task PEFT on functional metrics alone is leaving most of the story on the table. Static analysis and LLM-as-a-judge change the picture in non-trivial ways — usually in MT-QLoRA's favor.

Read the preprint (arXiv) Replication Package (GitHub)
