A Systematic Literature Review of PEFT for Large Code Models

1. The Problem: Full Fine-Tuning Doesn't Scale

Large code models (LCMs) — CodeLlama, DeepSeek-Coder, CodeT5, StarCoder, and the rest of the family — have pushed code intelligence forward across nearly every software engineering task we care about. But getting these models to actually do something useful for your task usually means fine-tuning, and full fine-tuning is brutal. Updating every weight, holding all gradients in memory, and storing one full model copy per task means hundreds of millions to billions of parameters touched per task, on hardware most labs and companies don't have.

Parameter-efficient fine-tuning (PEFT) changes the contract. Instead of updating the whole model, we freeze it and update a small slice — either by injecting tiny new modules (adapters, prompts, prefixes), reparameterizing weight updates as low-rank matrices (LoRA, QLoRA), or selecting a sparse subset of existing parameters (BitFit). The hope: similar (or better) downstream quality at a small fraction of the compute, memory, and storage cost.
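To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer. It is an illustration only, not the implementation used by any reviewed study: the class name `LoRALinear`, the helper `matvec`, and the toy sizes are all assumptions. The frozen weight `W` is left untouched, and the only trainable values are the two small factors `A` and `B`. `B` starts at zero, so the adapted layer initially behaves exactly like the base layer.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical toy class for illustration; not taken from any library.
    """

    def __init__(self, W, r, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen: never updated during fine-tuning
        # Trainable factor A (r x d_in), small random initialisation.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        # Trainable factor B (d_out x r), zero initialisation.
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


# Toy usage: an 8x8 identity matrix stands in for a pretrained weight.
W = [[float(i == j) for j in range(8)] for i in range(8)]
layer = LoRALinear(W, r=2)
x = [1.0] * 8
print(layer.forward(x) == matvec(W, x))  # True: zero-initialised B changes nothing yet
print(layer.trainable_params(), layer.frozen_params())  # 32 64
```

At these toy sizes the saving is modest. The trainable count grows as r * (d_in + d_out) while the frozen count grows as d_in * d_out, so the gap widens quickly as the layer gets larger.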

The hope is well-supported in NLP. The question we wanted to answer is what the picture actually looks like in software engineering: which tasks have been studied, which models and methods are being used, and — honestly — whether PEFT holds up against full fine-tuning when applied to code.

2. What We Did: A Systematic Review

We followed Kitchenham's guidelines for systematic literature reviews. We searched four major scholarly databases — IEEE Xplore, ACM Digital Library, Springer Link, and Google Scholar — using carefully constructed query strings combining PEFT terminology with software engineering and large-language-model keywords. We then applied multi-stage inclusion and exclusion criteria covering peer-review status, venue quality, and topical relevance.

Starting from 1,146 candidate papers, the screening process narrowed the corpus down to 28 peer-reviewed studies covering 19 distinct SE tasks. The final set is concentrated in top venues: ICSE, FSE, ASE, TOSEM, TSE, NeurIPS, ACL, EMNLP, and similar.

Three research questions structured the review:

RQ1: Which SE tasks have been addressed using PEFT?
RQ2: Which model architectures and PEFT methods have been studied?
RQ3: How does PEFT compare to full fine-tuning in performance and efficiency?

Why a systematic review

PEFT in SE is moving fast and the literature is fragmented across NLP, ML, and SE venues. A systematic review forces explicit search strings, explicit filters, and explicit accounting — so the patterns we report aren't cherry-picked from familiar papers, but the actual shape of the field as of submission.

3. RQ1 — Which SE Tasks Use PEFT?

Across the 28 studies, PEFT has been applied to 19 distinct SE tasks, which split cleanly into two groups: 12 generative tasks, where the model produces code or natural language, and 7 non-generative tasks, which are typically classification or matching problems.

The distribution is heavily skewed. Code Summarization is the most-studied target (~46.4% of studies, 13 papers), followed by Code Generation (~35.7%, 10 papers). About 60.7% of the studies applied PEFT to more than one SE task — a healthy sign that researchers are treating PEFT as a general adaptation tool rather than a task-specific trick.
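The percentages above follow directly from the 28-study corpus, and a few lines of Python reproduce them. The counts of 13 and 10 are reported in the text. The count of 17 multi-task studies is not stated; it is an assumption, inferred here as the only whole number of studies that rounds to 60.7%.

```python
TOTAL_STUDIES = 28


def share(count, total=TOTAL_STUDIES):
    """Percentage of the corpus, rounded to one decimal place."""
    return round(100 * count / total, 1)


summarization = share(13)  # reported: 13 papers
generation = share(10)     # reported: 10 papers
multi_task = share(17)     # assumption: 17 inferred from the reported 60.7%

print(summarization, generation, multi_task)  # 46.4 35.7 60.7
```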

Table 1 — Task Taxonomy

Task	Description	Type	Studies
Code Summarization	Generate a natural-language description of a code snippet	Generative	13
Code Generation	Produce code from a natural-language specification	Generative	10
Code Translation	Translate code from one programming language to another	Generative	Multiple
Automated Program Repair (APR)	Generate a fix for a buggy program	Generative	Multiple
Code Refinement	Improve existing code (style, structure, correctness)	Generative	Multiple
Commit Message Generation	Produce a commit message describing a code change	Generative	Multiple
Code Review Generation	Generate review comments for proposed changes	Generative	Multiple
Code Completion	Predict the next token(s) of a partially written program	Generative	Multiple
Unit Test Generation	Generate unit tests for a given function or class	Generative	Multiple
Just-In-Time Comment Update (JITCU)	Update inline comments when the surrounding code changes	Generative	Multiple
Method Name Recommendation	Suggest a name for a method based on its body	Generative	Multiple
Protocol Buffer Transformation	Translate between Protocol Buffer schemas/representations	Generative	Multiple
Clone Detection	Determine whether two snippets are semantically equivalent	Non-generative	Multiple
Defect Detection	Classify code as buggy or correct	Non-generative	Multiple
Code Search	Retrieve code matching a natural-language query	Non-generative	Multiple
Cloze Test	Predict masked tokens in a code sequence	Non-generative	Multiple
Code Review (classification)	Classify a review comment or accept/reject a change	Non-generative	Multiple
Header File Prediction	Predict the correct header file(s) for a piece of code	Non-generative	Multiple
Method Name Consistency Check	Decide whether a method name matches its body	Non-generative	Multiple

Table 1. The 19 SE tasks covered by the 28 reviewed studies, categorized as generative (12) or non-generative (7).

4. RQ2 — Which Architectures and PEFT Methods?

Architectures

The architecture distribution across reviewed studies looks like this:

Encoder-decoder models are most prevalent: 29 task instances across 12 studies. CodeT5 alone is adapted 19 times across 8 studies, making it the single most frequently fine-tuned model in the corpus.
Encoder-only models: 22 task instances across 10 studies. Largely CodeBERT, GraphCodeBERT, and similar.
Decoder-only models: 19 task instances across 12 studies. The CodeLlama family shows particularly fast-growing adoption, and decoder-only models appear exclusively in generative tasks — an architectural fit, but also a real gap when it comes to classification-style SE problems.

PEFT Methods

On the method side, four methods account for most of the task instances, with Base LoRA clearly leading:

Base LoRA is the dominant PEFT technique with 29 SE task instances.
Adapter family (Base Adapter, L-Adapter, T-Adapter, etc.): 23 task instances.
Prompt Tuning: 14 task instances.
Prefix Tuning: 13 task instances.

Figure 3 — PEFT Taxonomy

The reviewed PEFT methods organize into four structural families. Where each method appears, it is mapped to the tasks for which it has been used in the reviewed corpus:

Additive

Base Adapter — summarization, generation, translation, APR, defect & clone detection
L-Adapter — cross-lingual summarization, translation, code review
T-Adapter — task-specific adapters for summarization, generation, search
Prompt Tuning — generation, summarization, defect detection
Prefix Tuning — generation, summarization, refinement
P-Tuning — generation, classification
Pass-Tuning — structure-aware code understanding tasks
(IA)³ — lightweight adaptation for generative tasks

Reparameterized

Base LoRA — generation, summarization, translation, APR, completion, refinement, review generation, clone & defect detection
QLoRA — generation, summarization, APR (with quantized base models)
FF-LoRA — feed-forward-targeted LoRA for generative tasks
AdaLoRA — adaptive-rank LoRA, only sparsely evaluated

Selective

BitFit — bias-only tuning; sparsely evaluated in SE
Telly-K — layer-selective freezing for code understanding

Hybrid

MAM — mixed adapter + prefix configurations
L-Adapter + T-Adapter — cross-lingual + task adapters stacked
FF-LoRA + Adapter — combined reparameterized + additive
L-Adapter + NER-Adapter + AdapterFusion — multi-adapter fusion for code review

Figure 3. PEFT method taxonomy with SE tasks mapped to each method, organized into the four structural families.

Pattern worth noting

Reparameterized methods (LoRA and its variants) are the most-used and best-performing family in the reviewed corpus. Additive methods (adapters, prompts, prefixes) are well represented but spread more thinly across tasks. Selective methods (BitFit, Telly-K) have barely been evaluated in SE.

5. RQ3 — How Does PEFT Compare to Full Fine-Tuning?

This is the question every practitioner actually cares about: does PEFT match full fine-tuning on quality, and does it actually save anything in practice?

For the generative tasks, the comparisons are encouraging:

~72% of comparisons reported performance improvements over full fine-tuning.
Over 54% reported efficiency gains as well (lower memory, fewer trainable parameters, faster training, or smaller checkpoints).

For the non-generative tasks:

~62% reported performance improvements.
~50% showed efficiency gains.

Across both categories, LoRA-based methods (Base LoRA and QLoRA) were consistently the strongest performers. They tend to match or beat full fine-tuning while training on the order of one to two percent of the original model's parameters.
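To see how a trainable fraction in that range can arise, the sketch below counts LoRA parameters for a hypothetical transformer. Every number in it is an assumption chosen for illustration and none comes from a reviewed study: a hidden size of 4096, 32 layers, four adapted square projection matrices per layer, and a 7-billion-parameter base model. The function names are hypothetical as well.

```python
def lora_trainable_params(d_in, d_out, rank, matrices_per_layer, num_layers):
    """Count LoRA parameters: each adapted matrix adds rank * (d_in + d_out)."""
    per_matrix = rank * (d_in + d_out)
    return per_matrix * matrices_per_layer * num_layers


def trainable_fraction(trainable, base_params):
    """Share of the base model's parameter count that is actually trained."""
    return trainable / base_params


# Hypothetical model shape, for illustration only.
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 4
BASE_PARAMS = 7_000_000_000

for rank in (8, 16, 64):
    n = lora_trainable_params(HIDDEN, HIDDEN, rank, ADAPTED_PER_LAYER, LAYERS)
    print(f"rank={rank:>2}: {n:,} trainable, "
          f"{100 * trainable_fraction(n, BASE_PARAMS):.2f}% of base")
```

Under these assumptions the fraction scales linearly with the rank and with the number of adapted matrices, so the exact percentage a study reports depends heavily on its LoRA configuration.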

Table 6 — Best-performing PEFT Methods vs. Full Fine-Tuning

Task category / task	Best-performing PEFT method(s)	Performance	Efficiency
Generative tasks
Code Summarization	Base LoRA, Adapter family	Improved	Gain
Code Generation	Base LoRA, QLoRA	Improved	Gain
Code Translation	Base LoRA, L-Adapter	Improved	Mixed
Automated Program Repair	Base LoRA, QLoRA	Improved	Gain
Code Refinement	Base LoRA, Prefix Tuning	Similar	Gain
Commit Message Generation	Adapter family	Improved	Gain
Code Review Generation	Adapter Fusion (hybrid)	Improved	Gain
Code Completion	Base LoRA	Improved	Gain
Unit Test Generation	Base LoRA	Similar	Gain
JITCU / Method Name Rec. / Protocol Buffer	Adapter family, Prompt Tuning	Improved	Mixed
Non-generative tasks
Clone Detection	Base LoRA, Adapter family	Improved	Gain
Defect Detection	Base LoRA, Prompt Tuning	Improved	Gain
Code Search	Adapter family, Pass-Tuning	Improved	Mixed
Cloze Test	Telly-K, Adapter family	Similar	Gain
Code Review (classification)	Adapter Fusion (hybrid)	Improved	Gain
Header File Prediction / Method Name Check	Base LoRA, Adapter family	Similar	Gain

Table 6. Best-performing PEFT methods reported across reviewed studies, with directional indicators relative to full fine-tuning. Improved = at least one reported metric exceeded full FT; Similar = comparable; Mixed = setup-dependent. LoRA-based methods dominate the “best method” column.
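The claim about the "best method" column can be checked by tallying the rows of Table 6. The sketch below transcribes that column into a Python dictionary (the dictionary and variable names are assumptions for illustration) and counts the rows that list Base LoRA or QLoRA.

```python
# Best-performing method(s) per row, transcribed from Table 6.
BEST_METHODS = {
    "Code Summarization": ["Base LoRA", "Adapter family"],
    "Code Generation": ["Base LoRA", "QLoRA"],
    "Code Translation": ["Base LoRA", "L-Adapter"],
    "Automated Program Repair": ["Base LoRA", "QLoRA"],
    "Code Refinement": ["Base LoRA", "Prefix Tuning"],
    "Commit Message Generation": ["Adapter family"],
    "Code Review Generation": ["Adapter Fusion (hybrid)"],
    "Code Completion": ["Base LoRA"],
    "Unit Test Generation": ["Base LoRA"],
    "JITCU / Method Name Rec. / Protocol Buffer": ["Adapter family", "Prompt Tuning"],
    "Clone Detection": ["Base LoRA", "Adapter family"],
    "Defect Detection": ["Base LoRA", "Prompt Tuning"],
    "Code Search": ["Adapter family", "Pass-Tuning"],
    "Cloze Test": ["Telly-K", "Adapter family"],
    "Code Review (classification)": ["Adapter Fusion (hybrid)"],
    "Header File Prediction / Method Name Check": ["Base LoRA", "Adapter family"],
}

LORA_BASED = {"Base LoRA", "QLoRA"}

# Rows where at least one listed best method is LoRA-based.
rows_with_lora = [
    task for task, methods in BEST_METHODS.items() if LORA_BASED & set(methods)
]

print(f"{len(rows_with_lora)} of {len(BEST_METHODS)} rows list a LoRA-based method")
```

Because the table merges several tasks into single rows, this counts table rows rather than the 19 individual tasks.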

Important caveat

About 29% of the reviewed works did not include a direct comparison with full fine-tuning. That is a meaningful blind spot: when nearly a third of the literature skips the head-to-head, the “PEFT matches or beats full FT” story rests only on the studies that did run the comparison, and its evidence base is thinner than it should be. Future work should make full-FT baselines mandatory rather than optional.

6. What's Still Missing

The same review that shows where PEFT in SE is strong also shows, very clearly, where it isn't yet:

Method coverage gaps. AdaLoRA, P-Tuning, and BitFit are largely unevaluated in SE despite being well-studied in NLP. We don't yet know whether their NLP wins transfer to code.
Task coverage gaps. Software testing, software documentation generation, and refactoring are notably underexplored. The corpus skews heavily toward summarization and generation, and away from large parts of the SE lifecycle.
Architectural asymmetry. Decoder-only LCMs are used only for generative tasks in the reviewed corpus — their potential for classification/matching SE tasks is unexamined.
Benchmark fragmentation. Studies often use different datasets, metrics, and base models, which makes apples-to-apples comparison genuinely hard. Standardized PEFT-for-SE benchmarks would do a lot of good here.

Section 6 of the paper sketches a six-step roadmap for future work that we think the community should treat as a checklist:

Cover the under-evaluated PEFT methods (AdaLoRA, P-Tuning, BitFit) on standard SE tasks.
Extend PEFT studies to under-explored SE tasks: testing, documentation, refactoring, security.
Evaluate decoder-only LCMs on non-generative SE tasks, not just generative ones.
Make full-fine-tuning baselines mandatory in PEFT-for-SE evaluations.
Standardize benchmarks, datasets, and efficiency reporting across studies.
Investigate hybrid configurations more systematically — the early hybrid results (e.g., AdapterFusion variants) suggest there is real signal there.

7. Bottom Line

PEFT in software engineering is not a workaround for not having enough GPUs. The reviewed evidence makes that clear: across 28 studies, 19 SE tasks, and four method families, PEFT matches or improves on full fine-tuning in most reported comparisons while training a small fraction of the parameters. It is a robust optimization paradigm, not a compromise, and as LCMs continue to grow it is likely to become the default way to adapt them to SE tasks rather than the exception.

If you're starting an SE project on top of an LCM today, the safest, most-evidenced first move based on this review is straightforward: start with Base LoRA or QLoRA on top of a CodeT5 or CodeLlama-family base model, and only deviate when you have a specific reason to. Everything else in the design space — adapters, prompts, prefixes, hybrids — is worth exploring, but is more sparsely evaluated in SE so far.

Read the full paper (ACM DL) Replication Package (GitHub)
