Diagnosing CI/CD Failures and Recommending Repair Diff Footprints with Lightweight LLM-Assisted Models: Full Experimental Evaluation on BugSwarm CI Fail–Pass Pairs
Keywords:
CI/CD, DevOps, build failure diagnosis, build repair, Top-k recommendation, diff footprint, BugSwarm, log mining, automated program repair, large language models

Abstract
CI/CD pipelines execute builds and tests on every change, generating high-volume failure signals and log artifacts that demand rapid diagnosis and repair. This paper studies two operational tasks that arise in modern DevOps automation: (i) failure diagnosis and (ii) repair recommendation in a Top-k setting. We conducted a full experimental evaluation on a public BugSwarm-derived artifact list containing 325 Java fail–pass CI pairs (SHA-256: 267fdfc1ee603af3613db96ec79230c7c2e856fa5b4594ffda6ec51f38809df6). We defined three failure types from CI outcome metadata (TEST_FAIL, BUILD_OR_SETUP_FAIL, and NONTEST_FAIL) and defined repair targets as diff-footprint patterns A{a}C{c}D{d} derived from the numbers of added, changed, and deleted files, mapped to 24 classes (the 23 frequent patterns with support ≥ 3, plus OTHER). Using only information available at failure time (commit message, repository metadata, build system, test framework, and test counters), we compared six baselines against a proposed probability-averaging ensemble of text-only and numeric-only multinomial logistic regression. On failure diagnosis (5-fold stratified CV), the proposed ensemble achieved macro-F1 = 0.881 and accuracy = 0.926. On repair recommendation (3-fold stratified CV), it achieved Hit@1 = 0.511, Hit@3 = 0.754, Hit@5 = 0.852, Hit@10 = 0.920, and MRR = 0.657. The manuscript specifies all hyperparameters and reports detailed tables and figures so that the empirical findings are reproducible.
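The two core ideas of the abstract (the A{a}C{c}D{d} diff-footprint labeling and the probability-averaging ensemble of a text-only and a numeric-only multinomial logistic regression) can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the paper's implementation: the toy commit messages, the (tests_run, tests_failed) counters, and the equal 0.5/0.5 averaging weights are assumptions for the sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def footprint(added: int, changed: int, deleted: int) -> str:
    """Map per-pair file counts to an A{a}C{c}D{d} diff-footprint pattern."""
    return f"A{added}C{changed}D{deleted}"

# Hypothetical toy failure records (values illustrative, not from the dataset).
texts = [
    "fix failing unit test in parser",
    "update maven dependency version",
    "repair flaky integration test",
    "bump gradle wrapper",
    "correct assertion in test case",
    "change build plugin configuration",
]
# Numeric features, e.g. (tests_run, tests_failed) counters at failure time.
nums = np.array([[120, 3], [0, 0], [45, 1], [0, 0], [88, 2], [0, 0]], dtype=float)
# Labels: 0 = TEST_FAIL, 1 = BUILD_OR_SETUP_FAIL.
y = np.array([0, 1, 0, 1, 0, 1])

# Text-only and numeric-only multinomial logistic regression models.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
num_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
text_clf.fit(texts, y)
num_clf.fit(nums, y)

# Probability-averaging ensemble: mean of the two predicted distributions.
proba = 0.5 * (text_clf.predict_proba(texts) + num_clf.predict_proba(nums))
preds = proba.argmax(axis=1)
```

Ranking the classes of `proba` per row (rather than taking only the argmax) yields the Top-k lists from which Hit@k and MRR are computed.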

