1. The Problem: Multi-Task in a PEFT World
Large code models (LCMs) like CodeLlama, DeepSeek-Coder, and Qwen2.5-Coder have made impressive progress on individual code-related tasks, but adapting them to a specific downstream setting still requires fine-tuning. Full fine-tuning is expensive and storage-heavy — one full copy of the model per task — and parameter-efficient fine-tuning (PEFT) methods like QLoRA have emerged as the practical alternative for the single-task case.
But real software engineering workflows are not single-task. A useful coding assistant generates code from a description, summarizes existing functions, translates code between languages, and switches between these modes inside the same session. The interesting question is whether a single QLoRA-fine-tuned model can handle multiple code tasks simultaneously — and not just on functional correctness, but also on the non-functional qualities (complexity, maintainability, code smells) that determine whether the produced code is actually deployable.
To our knowledge, this is the first study to jointly evaluate functional correctness and non-functional code quality in multi-task QLoRA settings. Most prior PEFT work focused either on a single task at a time, or on functional metrics only.
2. What We Did: 15 Models, 3 Tasks, 3 Scales
We took the Qwen2.5-Coder-Instruct family at three scales — 0.5B, 1.5B, and 3B parameters — and ran a controlled comparison across three training configurations and three code-related tasks.
Tasks
- Code Generation — NL → Code
- Code Summarization — Code → NL
- Code Translation — Java ↔ C# (Code → Code)
Training configurations
- MT-QLoRA: multi-task QLoRA (one model, all tasks)
- ST-QLoRA: single-task QLoRA (one model per task)
- MT-FFT: multi-task full fine-tuning (no PEFT)
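To make the difference between the multi-task and single-task configurations concrete, here is a minimal sketch of how the training data might be assembled in each case. The prompt templates, the `Example` type, and the helper names are hypothetical illustrations, not the study's actual preprocessing.

```python
import random
from dataclasses import dataclass

# Hypothetical instruction templates; the study's real prompts are not given here.
TEMPLATES = {
    "generation": "Generate code for the following description:\n{src}",
    "summarization": "Summarize the following code:\n{src}",
    "translation": "Translate the following code:\n{src}",
}


@dataclass(frozen=True)
class Example:
    task: str
    prompt: str
    target: str


def format_examples(task, pairs):
    """Wrap raw (source, target) pairs in the task's instruction template."""
    template = TEMPLATES[task]
    return [Example(task, template.format(src=src), tgt) for src, tgt in pairs]


def build_multi_task_set(per_task_pairs, seed=0):
    """Multi-task setting: pool every task into one shuffled training set."""
    pooled = []
    for task, pairs in per_task_pairs.items():
        pooled.extend(format_examples(task, pairs))
    random.Random(seed).shuffle(pooled)
    return pooled


def build_single_task_sets(per_task_pairs):
    """Single-task setting: one separate training set per task."""
    return {task: format_examples(task, pairs)
            for task, pairs in per_task_pairs.items()}


# Toy data, invented purely for illustration.
toy = {
    "generation": [("add two ints", "def add(a, b): return a + b")],
    "summarization": [("def add(a, b): return a + b", "Adds two numbers.")],
    "translation": [("int x = 1;", "int x = 1;")],
}
mt_set = build_multi_task_set(toy)
st_sets = build_single_task_sets(toy)
print(len(mt_set), sorted(st_sets))
```

In this sketch the multi-task run sees one mixed stream of examples, while the single-task runs each see only their own task, which is the only difference the two QLoRA configurations are meant to isolate.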
Model scales
- Qwen2.5-Coder-0.5B-Instruct
- Qwen2.5-Coder-1.5B-Instruct
- Qwen2.5-Coder-3B-Instruct
That gives 15 trained models in total. MT-QLoRA and MT-FFT each yield one multi-task model per scale (2 configurations × 3 scales = 6), and ST-QLoRA yields one model per task per scale (3 tasks × 3 scales = 9). Together: 6 + 9 = 15.
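The count can be checked by enumerating the experimental grid. This is only a bookkeeping sketch; the tuple layout and task labels are illustrative choices.

```python
from itertools import product

SCALES = ["0.5B", "1.5B", "3B"]
TASKS = ["generation", "summarization", "translation"]


def enumerate_models():
    """List every trained model as (configuration, scale, task coverage)."""
    models = []
    # Multi-task configurations: one model per scale, covering all tasks.
    for config, scale in product(("MT-QLoRA", "MT-FFT"), SCALES):
        models.append((config, scale, "all tasks"))
    # Single-task QLoRA: one model per task per scale.
    for scale, task in product(SCALES, TASKS):
        models.append(("ST-QLoRA", scale, task))
    return models


models = enumerate_models()
print(len(models))  # 15
```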
Datasets
We used the standard public benchmarks: CodeXGLUE for code generation, code summarization (Python and Java), and Java↔C# code translation, plus CoderEval for execution-based evaluation of code generation. Details of the per-task splits are summarized in Table 2.
Table 2 — Dataset and Task Setup
| Task | Languages | Source benchmark(s) | Splits |
|---|---|---|---|
| Code Generation | Python, Java | CodeXGLUE (NL→Code), CoderEval | Train / Validation / Test (CodeXGLUE standard splits; CoderEval used for execution-based pass@1) |
| Code Summarization | Python, Java | CodeXGLUE (Code→NL) | Train / Validation / Test (CodeXGLUE standard splits) |
| Code Translation | Java ↔ C# | CodeXGLUE (Code→Code) | Train / Validation / Test (CodeXGLUE standard splits) |
Table 2. Per-task data setup. The exact training/validation/test row counts follow the public CodeXGLUE and CoderEval releases used by the paper.
Metrics
We evaluated along three axes, plus an LLM-as-a-judge signal:
- Functional correctness: pass@1 (CoderEval, execution-based) for code generation.
- Surface-level quality: CodeBLEU, BLEU, METEOR, ROUGE-L, chrF, BERTScore, and SIDE for translation and summarization.
- Non-functional code quality: static analysis with Pylint (Python), PMD (Java), Roslyn (C#), Lizard (cyclomatic and cognitive complexity), and SonarCloud (issues, code smells, maintainability).
- LLM-as-a-judge: we additionally used GPT-5 Mini as a judge for summarization quality, providing a second signal beyond reference-based metrics.
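As a reminder of what pass@1 measures, the sketch below uses the widely used unbiased pass@k estimator, which reduces to the fraction of passing samples when k = 1. Whether the study draws one sample per problem or several is not stated here, so the input layout (`results`, mapping a problem id to per-sample pass flags) is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """Mean pass@1 over problems; `results` maps problem id -> list of pass flags."""
    scores = [pass_at_k(len(flags), sum(flags), 1) for flags in results.values()]
    return sum(scores) / len(scores)


# Toy outcomes, invented for illustration: one sample per problem.
toy_results = {"p1": [True], "p2": [False], "p3": [True], "p4": [False]}
print(benchmark_pass_at_1(toy_results))  # 0.5
```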
For statistical reliability, paired comparisons used McNemar's test (binary outcomes such as pass@1) and the Wilcoxon signed-rank test (continuous metrics), with Holm-Bonferroni correction for multiple comparisons.
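The paired-testing procedure can be sketched with the standard library alone. The version below implements an exact (binomial) McNemar test and Holm's step-down correction; the choice of the exact variant and the 0.05 threshold are assumptions, since the text does not specify them. The Wilcoxon signed-rank test is omitted for brevity (SciPy provides it as `scipy.stats.wilcoxon`).

```python
from math import comb


def mcnemar_exact(a_pass, b_pass):
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    only_a = sum(1 for x, y in zip(a_pass, b_pass) if x and not y)
    only_b = sum(1 for x, y in zip(a_pass, b_pass) if y and not x)
    n = only_a + only_b  # discordant pairs are the only informative ones
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values, alpha=0.05):
    """Reject decisions (in input order) under Holm's step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject


# Toy example: model A solves five problems that model B misses, and none vice versa.
p = mcnemar_exact([True] * 5, [False] * 5)
print(p)  # 0.0625
print(holm_bonferroni([0.01, 0.04, 0.03]))  # [True, False, False]
```

Note that even a 5-to-0 split of discordant pairs is not significant at 0.05, which illustrates why small pass@1 gaps on a modest benchmark often fail to reach significance.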
By fixing the model family and varying only scale (0.5B/1.5B/3B) and training configuration (MT-QLoRA / ST-QLoRA / MT-FFT), we can attribute differences in performance and code quality to the configuration choice rather than to model architecture. And by combining execution-based metrics, surface metrics, static-analysis metrics, and an LLM judge, we close the gap between “does it pass tests?” and “is it actually good code?”
3. RQ1 — Multi-Task QLoRA vs Single-Task QLoRA
Does training one QLoRA model on three tasks at once cost us anything compared to training a separate QLoRA per task?
Code Generation
Across both Python and Java, MT-QLoRA is competitive with ST-QLoRA, and the benefits of multi-task training increase with model capacity.
| Language | Scale | MT-QLoRA pass@1 | ST-QLoRA pass@1 | Direction |
|---|---|---|---|---|
| Python | 0.5B | not listed | not listed | Similar (MT-QLoRA matches ST-QLoRA) |
| Python | 1.5B | 18.95% | 16.84% | MT > ST |
| Python | 3B | 21.05% | 20.53% | MT > ST |
| Java | 0.5B | not listed | not listed | Similar (comparable) |
| Java | 1.5B | not listed | not listed | Similar (comparable) |
| Java | 3B | 32.07% | 29.89% | MT > ST |
RQ1 — pass@1 for code generation: MT-QLoRA vs ST-QLoRA across the three Qwen2.5-Coder scales.
Static analysis adds an interesting wrinkle: larger MT models produce simpler, more maintainable code. At 3B for Python, MT-QLoRA showed 33.8% fewer maintainability issues and 19.3% lower cognitive complexity than ST-QLoRA. McNemar (for pass@1) and Wilcoxon (for continuous metrics) tests with Holm-Bonferroni correction did not find statistically significant differences in raw pass@1, so the safe summary is: MT-QLoRA matches ST-QLoRA on correctness while improving non-functional code quality at scale.
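For readers unfamiliar with complexity metrics, the sketch below approximates per-function cyclomatic complexity for Python source using the standard `ast` module. It is not Lizard's implementation and does not cover cognitive complexity; the set of node types counted as decision points is a simplifying assumption.

```python
import ast

# Node types treated as decision points in this rough approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def approx_cyclomatic_complexity(source):
    """Return {function name: approximate cyclomatic complexity} for `source`."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            complexity = 1  # one path through a function with no branches
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    complexity += 1
                elif isinstance(child, ast.BoolOp):
                    # each extra operand of and/or adds a short-circuit branch
                    complexity += len(child.values) - 1
            result[node.name] = complexity
    return result


SIMPLE = "def f(x):\n    return x\n"
BRANCHY = (
    "def g(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        if x > 0 and x % 2 == 0:\n"
    "            total += x\n"
    "    return total\n"
)
print(approx_cyclomatic_complexity(SIMPLE))   # {'f': 1}
print(approx_cyclomatic_complexity(BRANCHY))  # {'g': 4}
```

A lower score for functionally equivalent code means fewer independent paths to test and review, which is the sense in which the MT-QLoRA outputs are described as simpler.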
Code Translation
Translation is more sensitive to direction and scale.
- Java→C#: MT-QLoRA underperforms ST-QLoRA at 0.5B and 3B but slightly improves at 1.5B. MT-QLoRA also consistently produces more compact Java→C# translations.
- C#→Java: a similar pattern, with 1.5B as the sweet spot. At 3B, C#→Java MT-QLoRA reduces PMD issues by 13.8% and SonarCloud issues by 22.7% relative to ST-QLoRA.
So translation is the task where MT-QLoRA's quality advantage shows up most clearly in non-functional metrics, even when surface-level translation metrics are mixed.
Code Summarization
Summarization shows a striking language-dependent effect.
| Language | Scale | BLEU change (MT-QLoRA vs ST-QLoRA) | Direction |
|---|---|---|---|
| Python | 0.5B | +36.0% | MT > ST |
| Python | 1.5B | +51.5% | MT > ST |
| Python | 3B | +28.3% | MT > ST |
| Java | 0.5B | not listed | ST > MT (advantage strengthens with scale) |
| Java | 1.5B | not listed | ST > MT |
| Java | 3B | not listed | ST > MT |
RQ1 — BLEU comparison for code summarization. Python benefits massively from multi-task training; Java's pattern reverses.
The LLM-as-a-judge evaluation (GPT-5 Mini) tells a more nuanced story: for Java summarization, the judge finds no significant quality differences between MT-QLoRA and ST-QLoRA, while for Python, MT-QLoRA's advantages remain significant at the larger scales.
MT-QLoRA is genuinely competitive with ST-QLoRA on functional metrics, and frequently better on non-functional code quality — especially at 1.5B and 3B. The clearest exception is Java summarization, where single-task training retains an advantage that grows with scale.
4. RQ2 — Multi-Task QLoRA vs Multi-Task Full Fine-Tuning
The next question is whether the parameter-efficient route gives anything up relative to multi-task full fine-tuning.
Code Generation
For both Python and Java, MT-QLoRA matches or exceeds MT-FFT, with the strongest gains at the 3B scale.
Table 11 — Python Code Generation: MT-QLoRA vs MT-FFT
| Scale | MT-QLoRA pass@1 | MT-FFT pass@1 | Δ pass@1 (relative) | Lizard / Pylint / SonarCloud |
|---|---|---|---|---|
| 0.5B | not listed | not listed | Improved (MT-QLoRA > MT-FFT) | Lower complexity overall |
| 1.5B | not listed | not listed | Improved (MT-QLoRA > MT-FFT) | Lower complexity overall |
| 3B | 21.05% | 17.37% | +21.2% | 32% lower CyC for MT-QLoRA; fewer Pylint and SonarCloud issues |
Table 11. Python code generation, MT-QLoRA vs MT-FFT. MT-QLoRA improves pass@1 at every scale and produces substantially less complex code at 3B.
Table 12 — Java Code Generation: MT-QLoRA vs MT-FFT
| Scale | MT-QLoRA pass@1 | MT-FFT pass@1 | Δ pass@1 (relative) | Lizard / PMD / SonarCloud |
|---|---|---|---|---|
| 0.5B | not listed | not listed | Similar (comparable) | Comparable static-analysis profile |
| 1.5B | not listed | not listed | Improved (MT-QLoRA ≥ MT-FFT) | Comparable to slightly fewer issues |
| 3B | 32.07% | 26.63% | +20.4% | Fewer PMD and SonarCloud issues vs MT-FFT |
Table 12. Java code generation, MT-QLoRA vs MT-FFT. MT-QLoRA is substantially better at 3B on both functional and non-functional metrics.
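The Δ pass@1 figures for the 3B rows in Tables 11 and 12 are relative changes over the MT-FFT baseline, not percentage-point differences. A short check using the reported values (the helper name is just for illustration):

```python
def relative_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# 3B pass@1 values as reported in Tables 11 and 12 (MT-QLoRA vs MT-FFT).
python_3b = relative_change(21.05, 17.37)
java_3b = relative_change(32.07, 26.63)
print(f"{python_3b:+.1f}% {java_3b:+.1f}%")  # +21.2% +20.4%
```

In absolute terms these correspond to gaps of 3.68 and 5.44 percentage points, respectively.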
Code Translation
Translation results are more mixed:
- Java→C#: MT-FFT generally leads on CodeBLEU.
- C#→Java: MT-QLoRA is competitive at 1.5B and shows large non-functional quality improvements at 3B — SonarCloud issues drop by 69.6% versus MT-FFT.
Code Summarization
The biggest gap between MT-QLoRA and MT-FFT shows up in Python summarization, where MT-QLoRA dramatically outperforms MT-FFT on BLEU at every scale.
| Language | Scale | BLEU change (MT-QLoRA vs MT-FFT) | Direction |
|---|---|---|---|
| Python | 0.5B | +103.9% | MT-QLoRA > MT-FFT |
| Python | 1.5B | +80.7% | MT-QLoRA > MT-FFT |
| Python | 3B | +32.7% | MT-QLoRA > MT-FFT |
| Java | 0.5B | not listed | Mixed: MT-QLoRA better on METEOR, ROUGE-L, and SIDE; MT-FFT better on BERTScore |
| Java | 1.5B | not listed | Mixed |
| Java | 3B | not listed | Mixed: gap on BERTScore narrows at scale |
RQ2 — Code summarization. Python heavily favors MT-QLoRA over MT-FFT; Java is metric-dependent.
Multi-task QLoRA is not just cheaper than multi-task full fine-tuning — it is often better, particularly at 3B and particularly for Python. The places MT-FFT still leads (Java→C# CodeBLEU, Java summarization BERTScore at small scales) are narrow and metric-specific, not a general superiority.
5. Key Takeaways
- Multi-task QLoRA is not a cost-saving compromise. It can match or beat both single-task QLoRA and multi-task full fine-tuning, particularly at the 3B scale.
- Bigger helps. Larger models are better at balancing multiple objectives within a parameter-efficient framework. The 3B Qwen2.5-Coder-Instruct is where MT-QLoRA's advantages are most consistent.
- Code quality is not systematically degraded by multi-task PEFT. Across Pylint, PMD, Roslyn, Lizard, and SonarCloud, MT-QLoRA frequently produces lower-complexity, more-maintainable code than both ST-QLoRA and MT-FFT — contradicting the common assumption that multi-task PEFT must trade off code quality.
- Language-dependent effects exist. Python benefits more from multi-task training; Java summarization tends to favor single-task training, with the gap widening at scale.
- Direction matters in translation. The Java→C# and C#→Java directions behave differently, and 1.5B is the most consistent “sweet spot” for MT-QLoRA on translation.
- First joint evaluation of correctness and quality. To our knowledge, this is the first study of multi-task PEFT in code that evaluates functional correctness and non-functional code quality together — a more realistic standard for whether the produced code is actually deployable.
6. Bottom Line
A single QLoRA-optimized model can effectively handle code generation, code translation, and code summarization simultaneously, and the resulting code holds up under both rigorous static analysis and LLM-based evaluation. For practitioners, the practical message is straightforward: train one MT-QLoRA model on top of a Qwen2.5-Coder-class base — preferably at 1.5B or 3B — before reaching for either per-task adapters or full fine-tuning. The exceptions (Java summarization, Java→C# CodeBLEU) are worth knowing, but they don't change the headline.
For researchers, the more interesting message is methodological: evaluating multi-task PEFT on functional metrics alone is leaving most of the story on the table. Static analysis and LLM-as-a-judge change the picture in non-trivial ways — usually in MT-QLoRA's favor.