1. The Problem: Full Fine-Tuning Doesn't Scale
Large code models (LCMs) — CodeLlama, DeepSeek-Coder, CodeT5, StarCoder, and the rest of the family — have pushed code intelligence forward across nearly every software engineering task we care about. But getting these models to actually do something useful for your task usually means fine-tuning, and full fine-tuning is brutal. Updating every weight, holding all gradients in memory, and storing one full model copy per task means hundreds of millions to billions of parameters touched per task, on hardware most labs and companies don't have.
Parameter-efficient fine-tuning (PEFT) changes the contract. Instead of updating the whole model, we freeze it and update a small slice — either by injecting tiny new modules (adapters, prompts, prefixes), reparameterizing weight updates as low-rank matrices (LoRA, QLoRA), or selecting a sparse subset of existing parameters (BitFit). The hope: similar (or better) downstream quality at a small fraction of the compute, memory, and storage cost.
The hope is well-supported in NLP. The question we wanted to answer is what the picture actually looks like in software engineering: which tasks have been studied, which models and methods are being used, and — honestly — whether PEFT holds up against full fine-tuning when applied to code.
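To make the low-rank reparameterization idea mentioned above concrete, here is a minimal sketch in plain Python. It assumes the commonly described LoRA formulation, in which a frozen weight matrix W is combined with a trainable update as W + (alpha / r) * B A, with B initialized to zero. The matrix sizes, rank, and alpha are toy values chosen for illustration, not settings from any reviewed study.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    return add(w, scale(matmul(b, a), alpha / r))


random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4  # toy sizes (assumption)

# Frozen pretrained weight: never updated during LoRA training.
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors. B starts at zero, so training begins from W itself.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]

w_eff = lora_effective_weight(w, a, b, alpha, r)

trainable = r * (d_in + d_out)  # entries in A plus entries in B
full = d_out * d_in             # entries a full fine-tune would update
print(f"trainable entries: {trainable} of {full}")
```

At these toy sizes the saving is modest; the gap widens quickly as the matrix dimensions grow while the rank stays small.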
2. What We Did: A Systematic Review
We followed Kitchenham's guidelines for systematic literature reviews. We searched four major scholarly databases — IEEE Xplore, ACM Digital Library, Springer Link, and Google Scholar — using carefully constructed query strings combining PEFT terminology with software engineering and large-language-model keywords. We then applied multi-stage inclusion and exclusion criteria covering peer-review status, venue quality, and topical relevance.
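The exact search strings are not reproduced here. As a rough sketch of how a query of this shape could be assembled, the snippet below ORs the terms within each keyword group and ANDs the groups together. Every keyword in it is a hypothetical placeholder, and `build_query` is an illustrative helper, not the tooling used in the review.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """AND together one OR-group per keyword family."""
    return " AND ".join(or_group(g) for g in groups)


# Hypothetical keyword groups, for illustration only.
peft_terms = ["parameter-efficient fine-tuning", "LoRA", "adapter"]
se_terms = ["software engineering", "code generation"]
llm_terms = ["large language model", "pre-trained model"]

query = build_query(peft_terms, se_terms, llm_terms)
print(query)
```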
Starting from 1,146 candidate papers, the screening process narrowed the corpus down to 28 peer-reviewed studies covering 19 distinct SE tasks. The final set is concentrated in top venues: ICSE, FSE, ASE, TOSEM, TSE, NeurIPS, ACL, EMNLP, and similar.
Three research questions structured the review:
- RQ1: Which SE tasks have been addressed using PEFT?
- RQ2: Which model architectures and PEFT methods have been studied?
- RQ3: How does PEFT compare to full fine-tuning in performance and efficiency?
PEFT in SE is moving fast and the literature is fragmented across NLP, ML, and SE venues. A systematic review forces explicit search strings, explicit filters, and explicit accounting — so the patterns we report aren't cherry-picked from familiar papers, but the actual shape of the field as of submission.
3. RQ1 — Which SE Tasks Use PEFT?
Across the 28 studies, PEFT has been applied to 19 distinct SE tasks, which split cleanly into two groups: 12 generative tasks, where the model produces code or natural language, and 7 non-generative tasks, which are typically classification or matching problems.
The distribution is heavily skewed. Code Summarization is the most-studied target (~46.4% of studies, 13 papers), followed by Code Generation (~35.7%, 10 papers). About 60.7% of the studies applied PEFT to more than one SE task — a healthy sign that researchers are treating PEFT as a general adaptation tool rather than a task-specific trick.
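The percentages above follow directly from the study counts. A quick check, assuming the 60.7% figure corresponds to 17 of the 28 studies (that count is inferred from the percentage, not stated in the text):

```python
TOTAL_STUDIES = 28


def share(count, total=TOTAL_STUDIES):
    """Percentage of studies, rounded to one decimal place."""
    return round(100 * count / total, 1)


summarization = share(13)  # Code Summarization: 13 papers
generation = share(10)     # Code Generation: 10 papers
multi_task = share(17)     # assumed count behind the multi-task figure

print(summarization, generation, multi_task)
```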
Table 1 — Task Taxonomy
| Task | Description | Type | Studies |
|---|---|---|---|
| Code Summarization | Generate a natural-language description of a code snippet | Generative | 13 |
| Code Generation | Produce code from a natural-language specification | Generative | 10 |
| Code Translation | Translate code from one programming language to another | Generative | Multiple |
| Automated Program Repair (APR) | Generate a fix for a buggy program | Generative | Multiple |
| Code Refinement | Improve existing code (style, structure, correctness) | Generative | Multiple |
| Commit Message Generation | Produce a commit message describing a code change | Generative | Multiple |
| Code Review Generation | Generate review comments for proposed changes | Generative | Multiple |
| Code Completion | Predict the next token(s) of a partially written program | Generative | Multiple |
| Unit Test Generation | Generate unit tests for a given function or class | Generative | Multiple |
| Just-In-Time Comment Update (JITCU) | Update inline comments when the surrounding code changes | Generative | Multiple |
| Method Name Recommendation | Suggest a name for a method based on its body | Generative | Multiple |
| Protocol Buffer Transformation | Translate between Protocol Buffer schemas/representations | Generative | Multiple |
| Clone Detection | Determine whether two snippets are semantically equivalent | Non-generative | Multiple |
| Defect Detection | Classify code as buggy or correct | Non-generative | Multiple |
| Code Search | Retrieve code matching a natural-language query | Non-generative | Multiple |
| Cloze Test | Predict masked tokens in a code sequence | Non-generative | Multiple |
| Code Review (classification) | Classify a review comment or accept/reject a change | Non-generative | Multiple |
| Header File Prediction | Predict the correct header file(s) for a piece of code | Non-generative | Multiple |
| Method Name Consistency Check | Decide whether a method name matches its body | Non-generative | Multiple |
Table 1. The 19 SE tasks covered by the 28 reviewed studies, categorized as generative (12) or non-generative (7).
4. RQ2 — Which Architectures and PEFT Methods?
Architectures
The architecture distribution across reviewed studies looks like this:
- Encoder-decoder models are most prevalent: 29 task instances across 12 studies. CodeT5 alone is adapted 19 times across 8 studies, making it the single most frequently fine-tuned model in the corpus.
- Encoder-only models: 22 task instances across 10 studies. Largely CodeBERT, GraphCodeBERT, and similar.
- Decoder-only models: 19 task instances across 12 studies. The CodeLlama family shows particularly fast-growing adoption, and decoder-only models appear exclusively in generative tasks — an architectural fit, but also a real gap when it comes to classification-style SE problems.
PEFT Methods
On the method side, four methods and method groups account for most of the task instances, with one clearly leading:
- Base LoRA is the dominant PEFT technique with 29 SE task instances.
- Adapter family (Base Adapter, L-Adapter, T-Adapter, etc.): 23 task instances.
- Prompt Tuning: 14 task instances.
- Prefix Tuning: 13 task instances.
Figure 3 — PEFT Taxonomy
The reviewed PEFT methods organize into four structural families. Where each method appears, it is mapped to the tasks for which it has been used in the reviewed corpus:
Additive
- Base Adapter: summarization, generation, translation, APR, defect & clone detection
- L-Adapter: cross-lingual summarization, translation, code review
- T-Adapter: task-specific adapters for summarization, generation, search
- Prompt Tuning: generation, summarization, defect detection
- Prefix Tuning: generation, summarization, refinement
- P-Tuning: generation, classification
- Pass-Tuning: structure-aware code understanding tasks
- (IA)³: lightweight adaptation for generative tasks
Reparameterized
- Base LoRA: generation, summarization, translation, APR, completion, refinement, review generation, clone & defect detection
- QLoRA: generation, summarization, APR (with quantized base models)
- FF-LoRA: feed-forward-targeted LoRA for generative tasks
- AdaLoRA: adaptive-rank LoRA, only sparsely evaluated
Selective
- BitFit: bias-only tuning; sparsely evaluated in SE
- Telly-K: layer-selective freezing for code understanding
Hybrid
- MAM: mixed adapter + prefix configurations
- L-Adapter + T-Adapter: cross-lingual + task adapters stacked
- FF-LoRA + Adapter: combined reparameterized + additive
- L-Adapter + NER-Adapter + AdapterFusion: multi-adapter fusion for code review
Figure 3. PEFT method taxonomy with SE tasks mapped to each method, organized into the four structural families.
Reparameterized methods — LoRA and its variants — are the most-used and best-performing family in the reviewed corpus. Additive methods (adapters, prompts, prefixes) are well-represented but spread more thinly across tasks. Selective methods (BitFit, Telly-K) are essentially under-evaluated in SE.
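For readers unfamiliar with the selective family, the idea behind bias-only tuning can be sketched in a few lines: freeze everything, then mark only the bias terms as trainable. The parameter names and layer shapes below are hypothetical, and `bitfit_trainable` is an illustrative helper rather than any library's API.

```python
def bitfit_trainable(param_sizes):
    """Return the parameter names a bias-only scheme would train."""
    return [name for name in param_sizes if name.endswith(".bias")]


hidden = 768  # hypothetical hidden size

# Hypothetical parameter inventory for one transformer-style block.
param_sizes = {
    "attn.query.weight": hidden * hidden,
    "attn.query.bias": hidden,
    "attn.value.weight": hidden * hidden,
    "attn.value.bias": hidden,
    "ffn.dense.weight": hidden * 4 * hidden,
    "ffn.dense.bias": 4 * hidden,
}

trainable_names = bitfit_trainable(param_sizes)
trainable = sum(param_sizes[name] for name in trainable_names)
total = sum(param_sizes.values())
fraction = trainable / total

print(f"{trainable} of {total} parameters trainable ({fraction:.4%})")
```

In this toy inventory the biases make up roughly a tenth of a percent of the parameters, which is why selective methods are attractive on paper even though they remain thinly evaluated in SE.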
5. RQ3 — How Does PEFT Compare to Full Fine-Tuning?
This is the question every practitioner actually cares about: does PEFT match full fine-tuning on quality, and does it actually save anything in practice?
For the generative tasks, the comparisons are encouraging:
- ~72% of comparisons reported performance improvements over full fine-tuning.
- Over 54% reported efficiency gains as well (lower memory, fewer trainable parameters, faster training, or smaller checkpoints).
For the non-generative tasks:
- ~62% reported performance improvements.
- ~50% showed efficiency gains.
Across both categories, LoRA-based methods (Base LoRA and QLoRA) were consistently the strongest performers. They tend to match or beat full fine-tuning while training on the order of one to two percent of the original model's parameters.
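To see where a figure of that order can come from, the sketch below counts LoRA's trainable parameters for one hypothetical configuration. The hidden size, layer count, rank, number of adapted matrices, and base-model size are all assumptions picked for illustration; they are not taken from any reviewed study, and real fractions vary with rank and with which matrices are adapted.

```python
def lora_trainable_params(d_in, d_out, r, n_matrices):
    """Each adapted d_out x d_in matrix adds r * (d_in + d_out) trainable entries."""
    return r * (d_in + d_out) * n_matrices


# Hypothetical configuration (all values are assumptions).
hidden = 4096
layers = 32
matrices_per_layer = 4        # e.g. four square attention projections per layer
rank = 64
base_params = 7_000_000_000   # assumed size of the frozen base model

trainable = lora_trainable_params(hidden, hidden, rank, layers * matrices_per_layer)
fraction = trainable / base_params

print(f"{trainable:,} trainable parameters ({fraction:.2%} of the base model)")
```

Halving the rank or adapting fewer matrices shrinks the fraction proportionally, which is the main knob practitioners have for trading capacity against cost.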
Table 6 — Best-performing PEFT Methods vs. Full Fine-Tuning
| Task category / task | Best-performing PEFT method(s) | Performance | Efficiency |
|---|---|---|---|
| Generative tasks | | | |
| Code Summarization | Base LoRA, Adapter family | Improved | Gain |
| Code Generation | Base LoRA, QLoRA | Improved | Gain |
| Code Translation | Base LoRA, L-Adapter | Improved | Mixed |
| Automated Program Repair | Base LoRA, QLoRA | Improved | Gain |
| Code Refinement | Base LoRA, Prefix Tuning | Similar | Gain |
| Commit Message Generation | Adapter family | Improved | Gain |
| Code Review Generation | Adapter Fusion (hybrid) | Improved | Gain |
| Code Completion | Base LoRA | Improved | Gain |
| Unit Test Generation | Base LoRA | Similar | Gain |
| JITCU / Method Name Rec. / Protocol Buffer | Adapter family, Prompt Tuning | Improved | Mixed |
| Non-generative tasks | | | |
| Clone Detection | Base LoRA, Adapter family | Improved | Gain |
| Defect Detection | Base LoRA, Prompt Tuning | Improved | Gain |
| Code Search | Adapter family, Pass-Tuning | Improved | Mixed |
| Cloze Test | Telly-K, Adapter family | Similar | Gain |
| Code Review (classification) | Adapter Fusion (hybrid) | Improved | Gain |
| Header File Prediction / Method Name Check | Base LoRA, Adapter family | Similar | Gain |
Table 6. Best-performing PEFT methods reported across reviewed studies, with directional indicators relative to full fine-tuning. Improved = at least one reported metric exceeded full FT; Similar = comparable; Mixed = setup-dependent. LoRA-based methods dominate the “best method” column.
About 29% of the reviewed works did not include a direct comparison with full fine-tuning. That is a meaningful blind spot: when nearly a third of the literature skips the head-to-head, the “PEFT matches or beats full FT” story rests on a narrower evidence base than the headline percentages suggest. Future work should make full-FT baselines mandatory rather than optional.
6. What's Still Missing
The same review that shows where PEFT in SE is strong also shows, very clearly, where it isn't yet:
- Method coverage gaps. AdaLoRA, P-Tuning, and BitFit are largely unevaluated in SE despite being well-studied in NLP. We don't yet know whether their NLP wins transfer to code.
- Task coverage gaps. Software testing, software documentation generation, and refactoring are notably underexplored. The corpus skews heavily toward summarization and generation, and away from large parts of the SE lifecycle.
- Architectural asymmetry. Decoder-only LCMs are used only for generative tasks in the reviewed corpus — their potential for classification/matching SE tasks is unexamined.
- Benchmark fragmentation. Studies often use different datasets, metrics, and base models, which makes apples-to-apples comparison genuinely hard. Standardized PEFT-for-SE benchmarks would do a lot of good here.
Section 6 of the paper sketches a six-step roadmap for future work that we think the community should treat as a checklist:
- Cover the under-evaluated PEFT methods (AdaLoRA, P-Tuning, BitFit) on standard SE tasks.
- Extend PEFT studies to under-explored SE tasks: testing, documentation, refactoring, security.
- Evaluate decoder-only LCMs on non-generative SE tasks, not just generative ones.
- Make full-fine-tuning baselines mandatory in PEFT-for-SE evaluations.
- Standardize benchmarks, datasets, and efficiency reporting across studies.
- Investigate hybrid configurations more systematically — the early hybrid results (e.g., AdapterFusion variants) suggest there is real signal there.
7. Bottom Line
PEFT in software engineering is not a workaround for not having enough GPUs. The reviewed evidence makes that clear: across 28 studies, 19 SE tasks, and four method families, PEFT matches or improves on full fine-tuning in most reported comparisons while training a tiny fraction of the parameters. It's a robust optimization paradigm, not a compromise, and as LCMs continue to grow, it is likely to become the default way to adapt them to SE tasks rather than the exception.
If you're starting an SE project on top of an LCM today, the safest, most-evidenced first move based on this review is straightforward: start with Base LoRA or QLoRA on top of a CodeT5 or CodeLlama-family base model, and only deviate when you have a specific reason to. Everything else in the design space — adapters, prompts, prefixes, hybrids — is worth exploring, but is more sparsely evaluated in SE so far.