1. The Problem: Multi-Task in a PEFT World

Large code models (LCMs) like CodeLlama, DeepSeek-Coder, and Qwen2.5-Coder have made impressive progress on individual code-related tasks, but adapting them to a specific downstream setting still requires fine-tuning. Full fine-tuning is expensive and storage-heavy — one full copy of the model per task — and parameter-efficient fine-tuning (PEFT) methods like QLoRA have emerged as the practical alternative for the single-task case.

But real software engineering workflows are not single-task. A useful coding assistant generates code from a description, summarizes existing functions, translates code between languages, and switches between these modes inside the same session. The interesting question is whether a single QLoRA-fine-tuned model can handle multiple code tasks simultaneously — and not just on functional correctness, but also on the non-functional qualities (complexity, maintainability, code smells) that determine whether the produced code is actually deployable.

To our knowledge, this is the first study to jointly evaluate functional correctness and non-functional code quality in multi-task QLoRA settings. Most prior PEFT work focused either on a single task at a time, or on functional metrics only.

2. What We Did: 15 Models, 3 Tasks, 3 Scales

We took the Qwen2.5-Coder-Instruct family at three scales — 0.5B, 1.5B, and 3B parameters — and ran a controlled comparison across three training configurations and three code-related tasks.

Tasks

  • Code Generation — NL → Code
  • Code Summarization — Code → NL
  • Code Translation — Java ↔ C# (Code → Code)

Training configurations

  • MT-QLoRA — Multi-task QLoRA (one model, all tasks)
  • ST-QLoRA — Single-task QLoRA (one model per task)
  • MT-FFT — Multi-task Full Fine-Tuning (no PEFT)
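
The difference between the multi-task and single-task setups is largely a difference in how the training data is assembled. The sketch below illustrates one plausible way to do it; the instruction prefixes, the pooling-and-shuffling strategy, and the helper names are assumptions made for illustration and are not taken from the study.

```python
import random

# Hypothetical instruction prefixes; the study's actual prompts are not given here.
TASK_PROMPTS = {
    "generation": "Generate code for the following description:",
    "summarization": "Summarize the following code:",
    "translation": "Translate the following code from Java to C#:",
}

def format_example(task, source, target):
    """Wrap one (source, target) pair with its task instruction."""
    return {
        "task": task,
        "prompt": f"{TASK_PROMPTS[task]}\n{source}",
        "completion": target,
    }

def build_training_sets(per_task_data, seed=0):
    """Return (multi_task_set, single_task_sets) from {task: [(source, target), ...]}."""
    # Single-task: one separate training set per task (one adapter each).
    single = {
        task: [format_example(task, s, t) for s, t in pairs]
        for task, pairs in per_task_data.items()
    }
    # Multi-task: pool every task's examples and shuffle so batches mix tasks.
    multi = [ex for examples in single.values() for ex in examples]
    random.Random(seed).shuffle(multi)
    return multi, single

# Toy data, invented for this example.
toy = {
    "generation": [("add two ints", "def add(a, b): return a + b")],
    "summarization": [("def add(a, b): return a + b", "Adds two numbers.")],
    "translation": [("int x = 1;", "int x = 1;")],
}
multi_task_set, single_task_sets = build_training_sets(toy)
```

Under this sketch, MT-QLoRA and MT-FFT would both train on `multi_task_set`, while ST-QLoRA would train one adapter on each entry of `single_task_sets`.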

Model scales

  • Qwen2.5-Coder-0.5B-Instruct
  • Qwen2.5-Coder-1.5B-Instruct
  • Qwen2.5-Coder-3B-Instruct

That gives 15 trained models in total. MT-QLoRA and MT-FFT each yield one multi-task model per scale (3 + 3 = 6), and ST-QLoRA yields one model per task per scale (3 tasks × 3 scales = 9), so 6 + 9 = 15.
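
The count can be checked by enumerating the runs. This is only a bookkeeping sketch; the string labels are illustrative names, not identifiers from the study.

```python
from itertools import product

SCALES = ["0.5B", "1.5B", "3B"]
TASKS = ["generation", "summarization", "translation"]

# Multi-task configurations train one model per scale.
multi_task_runs = [
    (config, scale, "all-tasks")
    for config, scale in product(["MT-QLoRA", "MT-FFT"], SCALES)
]
# Single-task QLoRA trains one model per (scale, task) pair.
single_task_runs = [
    ("ST-QLoRA", scale, task) for scale, task in product(SCALES, TASKS)
]

all_runs = multi_task_runs + single_task_runs
print(len(multi_task_runs), len(single_task_runs), len(all_runs))  # 6 9 15
```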

Datasets

We used the standard public benchmarks: CodeXGLUE for code generation, code summarization (Python and Java), and Java↔C# code translation, plus CoderEval for execution-based evaluation of code generation. Details of the per-task splits are summarized in Table 2.

Table 2 — Dataset and Task Setup

| Task | Languages | Source benchmark(s) | Splits |
|---|---|---|---|
| Code Generation | Python, Java | CodeXGLUE (NL→Code), CoderEval | Train / Validation / Test (CodeXGLUE standard splits; CoderEval used for execution-based pass@1) |
| Code Summarization | Python, Java | CodeXGLUE (Code→NL) | Train / Validation / Test (CodeXGLUE standard splits) |
| Code Translation | Java ↔ C# | CodeXGLUE (Code→Code) | Train / Validation / Test (CodeXGLUE standard splits) |

Table 2. Per-task data setup. The exact training/validation/test row counts follow the public CodeXGLUE and CoderEval releases used by the paper.

Metrics

We evaluated along four axes:

  • Functional correctness: pass@1 (CoderEval, execution-based) for code generation.
  • Surface-level quality: CodeBLEU, BLEU, METEOR, ROUGE-L, chrF, BERTScore, and SIDE for translation and summarization.
  • Non-functional code quality: static analysis with Pylint (Python), PMD (Java), Roslyn (C#), Lizard (cyclomatic and cognitive complexity), and SonarCloud (issues, code smells, maintainability).
  • LLM-as-a-judge: GPT-5 Mini as a judge for summarization quality, providing a second signal beyond reference-based metrics.
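
For readers unfamiliar with pass@1, the sketch below shows the commonly used unbiased pass@k estimator, which reduces to the fraction of passing samples when k = 1. How many samples per problem the study draws is not stated here, so the one-sample-per-problem toy data is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them pass the tests."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Toy data (invented): 4 problems, one sample each, 1 problem solved.
toy_results = [(1, 1), (1, 0), (1, 0), (1, 0)]
score = benchmark_pass_at_k(toy_results, k=1)
```

With a single sample per problem, `score` is simply the share of problems whose generated code passes its tests.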

For statistical reliability, paired comparisons used McNemar's test (binary outcomes such as pass@1) and the Wilcoxon signed-rank test (continuous metrics), with Holm-Bonferroni correction for multiple comparisons.
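
To make the testing procedure concrete, here is a minimal standard-library sketch of an exact two-sided McNemar test followed by Holm's step-down correction. It is not the study's analysis code; the function names and the 0.05 threshold are assumptions, and a real analysis would more likely rely on an established statistics package.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: problems solved by model A only; c: problems solved by model B only.
    Problems solved by both or by neither do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail probability under H0 (discordant pairs split 50/50).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject (True) / keep (False) decision per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject

# Toy discordant counts (invented) for three paired comparisons.
p_values = [mcnemar_exact(0, 10), mcnemar_exact(8, 4), mcnemar_exact(5, 5)]
decisions = holm_bonferroni(p_values)
```

Only the first toy comparison, where one model wins all ten discordant problems, survives the correction.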

Why this design

By fixing the model family and varying only scale (0.5B/1.5B/3B) and training configuration (MT-QLoRA / ST-QLoRA / MT-FFT), we can attribute differences in performance and code quality to the configuration choice rather than to model architecture. And by combining execution-based metrics, surface metrics, static-analysis metrics, and an LLM judge, we close the gap between “does it pass tests?” and “is it actually good code?”
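
The gap between passing tests and being good code is easy to illustrate. The sketch below approximates cyclomatic complexity by counting decision points in Python's `ast`; it is a simplified stand-in written for this post, not how Lizard or SonarCloud compute their metrics, and both `sign` snippets are invented examples.

```python
import ast

# Node types that open an extra decision path (a simplification of McCabe's metric).
BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)

def approx_cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            decisions += len(node.values) - 1
    return 1 + decisions

# Two functionally equivalent implementations of the same function.
nested = """
def sign(x):
    if x > 0:
        return 1
    else:
        if x < 0:
            return -1
        else:
            return 0
"""

flat = """
def sign(x):
    return (x > 0) - (x < 0)
"""

nested_score = approx_cyclomatic_complexity(nested)
flat_score = approx_cyclomatic_complexity(flat)
```

Both versions would pass the same unit tests, yet the nested one scores higher on this complexity proxy, which is the kind of difference execution-based metrics alone cannot see.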

3. RQ1 — Multi-Task QLoRA vs Single-Task QLoRA

Does training one QLoRA model on three tasks at once cost us anything compared to training a separate QLoRA per task?

Code Generation

Across both Python and Java, MT-QLoRA is competitive with ST-QLoRA, and the benefits of multi-task training increase with model capacity.

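| Language | Scale | MT-QLoRA pass@1 | ST-QLoRA pass@1 | Direction |
|---|---|---|---|---|
| Python | 0.5B | MT-QLoRA matches ST-QLoRA | | Similar |
| Python | 1.5B | 18.95% | 16.84% | MT > ST |
| Python | 3B | 21.05% | 20.53% | MT > ST |
| Java | 0.5B | Comparable | Comparable | Similar |
| Java | 1.5B | Comparable | Comparable | Similar |
| Java | 3B | 32.07% | 29.89% | MT > ST |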

RQ1 — pass@1 for code generation: MT-QLoRA vs ST-QLoRA across the three Qwen2.5-Coder scales.

Static analysis adds an interesting wrinkle: larger MT models produce simpler, more maintainable code. At 3B for Python, MT-QLoRA showed 33.8% fewer maintainability issues and 19.3% lower cognitive complexity than ST-QLoRA. On pass@1 itself, McNemar's test with Holm-Bonferroni correction found no statistically significant difference between the two configurations, so the safe summary is: MT-QLoRA matches ST-QLoRA on correctness while improving non-functional code quality at scale.

Code Translation

Translation is more sensitive to direction and scale.

  • Java→C#: MT-QLoRA underperforms ST-QLoRA at 0.5B and 3B but slightly outperforms it at 1.5B. MT-QLoRA also consistently produces more compact Java→C# translations.
  • C#→Java: the pattern is similar, with 1.5B as the sweet spot. At 3B, MT-QLoRA reduces PMD issues by 13.8% and SonarCloud issues by 22.7% relative to ST-QLoRA.

So translation is the task where MT-QLoRA's quality advantage shows up most clearly in non-functional metrics, even when surface-level translation metrics are mixed.

Code Summarization

Summarization shows a striking language-dependent effect.

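| Language | Scale | BLEU change (MT-QLoRA vs ST-QLoRA) | Direction |
|---|---|---|---|
| Python | 0.5B | +36.0% | MT > ST |
| Python | 1.5B | +51.5% | MT > ST |
| Python | 3B | +28.3% | MT > ST |
| Java | 0.5B | | ST > MT (advantage strengthens with scale) |
| Java | 1.5B | | ST > MT |
| Java | 3B | | ST > MT |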

RQ1 — BLEU comparison for code summarization. Python benefits massively from multi-task training; Java's pattern reverses.

The LLM-as-a-judge evaluation (GPT-5 Mini) tells a more nuanced story: for Java summarization, the judge finds no significant quality difference between MT-QLoRA and ST-QLoRA, while for Python, MT-QLoRA's advantage remains significant at the larger scales.

RQ1 takeaway

MT-QLoRA is genuinely competitive with ST-QLoRA on functional metrics, and frequently better on non-functional code quality — especially at 1.5B and 3B. The clearest exception is Java summarization, where single-task training retains an advantage that grows with scale.

4. RQ2 — Multi-Task QLoRA vs Multi-Task Full Fine-Tuning

The next question is whether the parameter-efficient route gives anything up relative to multi-task full fine-tuning.

Code Generation

For both Python and Java, MT-QLoRA matches or exceeds MT-FFT, with the strongest gains at the 3B scale.

Table 11 — Python Code Generation: MT-QLoRA vs MT-FFT

| Scale | MT-QLoRA pass@1 | MT-FFT pass@1 | Δ pass@1 | Lizard / Pylint / SonarCloud |
|---|---|---|---|---|
| 0.5B | MT-QLoRA > MT-FFT | | Improved | Lower complexity overall |
| 1.5B | MT-QLoRA > MT-FFT | | Improved | Lower complexity overall |
| 3B | 21.05% | 17.37% | +21.2% | 32% lower CyC for MT-QLoRA; fewer Pylint and SonarCloud issues |

Table 11. Python code generation, MT-QLoRA vs MT-FFT. MT-QLoRA improves pass@1 at every scale and produces substantially less complex code at 3B.

Table 12 — Java Code Generation: MT-QLoRA vs MT-FFT

| Scale | MT-QLoRA pass@1 | MT-FFT pass@1 | Δ pass@1 | Lizard / PMD / SonarCloud |
|---|---|---|---|---|
| 0.5B | Comparable | Comparable | Similar | Comparable static-analysis profile |
| 1.5B | MT-QLoRA ≥ MT-FFT | | Improved | Comparable to slightly fewer issues |
| 3B | 32.07% | 26.63% | +20.4% | Fewer PMD and SonarCloud issues vs MT-FFT |

Table 12. Java code generation, MT-QLoRA vs MT-FFT. MT-QLoRA is substantially better at 3B on both functional and non-functional metrics.
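
The Δ pass@1 figures in Tables 11 and 12 are relative changes over the MT-FFT baseline, not percentage-point differences. The short check below reproduces the two 3B values from the reported pass@1 numbers; the helper name is ours.

```python
def relative_change(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# 3B pass@1 values from Tables 11 and 12: (MT-QLoRA, MT-FFT).
python_3b = relative_change(21.05, 17.37)
java_3b = relative_change(32.07, 26.63)

print(f"Python 3B: {python_3b:+.1f}%  Java 3B: {java_3b:+.1f}%")
# Python 3B: +21.2%  Java 3B: +20.4%
```

In absolute terms these correspond to gains of 3.68 and 5.44 percentage points, respectively.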

Code Translation

Translation results are more mixed:

  • Java→C#: MT-FFT generally leads on CodeBLEU.
  • C#→Java: MT-QLoRA is competitive at 1.5B and shows large non-functional quality improvements at 3B — SonarCloud issues drop by 69.6% versus MT-FFT.

Code Summarization

The biggest gap between MT-QLoRA and MT-FFT shows up in Python summarization, where MT-QLoRA dramatically outperforms MT-FFT on BLEU at every scale.

| Language | Scale | BLEU change (MT-QLoRA vs MT-FFT) | Direction |
|---|---|---|---|
| Python | 0.5B | +103.9% | MT-QLoRA > MT-FFT |
| Python | 1.5B | +80.7% | MT-QLoRA > MT-FFT |
| Python | 3B | +32.7% | MT-QLoRA > MT-FFT |
| Java | 0.5B | | Mixed: MT-QLoRA better on overlap metrics (METEOR, ROUGE-L, SIDE); MT-FFT better on BERTScore |
| Java | 1.5B | | Mixed |
| Java | 3B | | Mixed: gap on BERTScore narrows at scale |

RQ2 — Code summarization. Python heavily favors MT-QLoRA over MT-FFT; Java is metric-dependent.

RQ2 takeaway

Multi-task QLoRA is not just cheaper than multi-task full fine-tuning — it is often better, particularly at 3B and particularly for Python. The places MT-FFT still leads (Java→C# CodeBLEU, Java summarization BERTScore at small scales) are narrow and metric-specific, not a general superiority.

5. Key Takeaways

  • Multi-task QLoRA is not a cost-saving compromise. It can match or beat both single-task QLoRA and multi-task full fine-tuning, particularly at the 3B scale.
  • Bigger helps. Larger models are better at balancing multiple objectives within a parameter-efficient framework. The 3B Qwen2.5-Coder-Instruct is where MT-QLoRA's advantages are most consistent.
  • Code quality is not systematically degraded by multi-task PEFT. Across Pylint, PMD, Roslyn, Lizard, and SonarCloud, MT-QLoRA frequently produces lower-complexity, more-maintainable code than both ST-QLoRA and MT-FFT — contradicting the common assumption that multi-task PEFT must trade off code quality.
  • Language-dependent effects exist. Python benefits more from multi-task training; Java summarization tends to favor single-task training, with the gap widening at scale.
  • Direction matters in translation. The Java→C# and C#→Java directions behave differently, and 1.5B is the most consistent “sweet spot” for MT-QLoRA on translation.
  • First joint evaluation of correctness and quality. To our knowledge, this is the first study of multi-task PEFT in code that evaluates functional correctness and non-functional code quality together — a more realistic standard for whether the produced code is actually deployable.

6. Bottom Line

A single QLoRA-optimized model can effectively handle code generation, code translation, and code summarization simultaneously, and the resulting code holds up under both rigorous static analysis and LLM-based evaluation. For practitioners, the practical message is straightforward: train one MT-QLoRA model on top of a Qwen2.5-Coder-class base — preferably at 1.5B or 3B — before reaching for either per-task adapters or full fine-tuning. The exceptions (Java summarization, Java→C# CodeBLEU) are worth knowing, but they don't change the headline.

For researchers, the more interesting message is methodological: evaluating multi-task PEFT on functional metrics alone leaves much of the story untold. Static analysis and LLM-as-a-judge change the picture in non-trivial ways, usually in MT-QLoRA's favor.