AI/ML

PenQWEN

Domain-adapted LLM reducing security assessment setup from 4 hours to 12 minutes

4h to 12min
Setup Time
Reconnaissance automation
12GB
Training Data
Curated cybersecurity corpus
3.6GB
Adapter Size
LoRA adapters on Qwen2.5-7B
60%
Automation Rate
Routine enumeration tasks

The Challenge

Penetration testing teams spend the first 4+ hours of every engagement on boilerplate reconnaissance: port scanning, service enumeration, vulnerability identification, and report scaffolding. Senior pentesters doing $200/hour work were wasting time on tasks that should be automated. General-purpose LLMs (GPT-4, Claude) produce plausible-looking but technically dangerous output—recommending tools that don't exist, generating commands with wrong flags, or suggesting techniques that violate scope agreements. The security domain requires extreme precision: a hallucinated Nmap flag could scan out-of-scope networks, and a fabricated CVE reference wastes hours of investigation time. No existing LLM solution understood OPSEC constraints, tool-specific syntax, or the structured methodology (PTES) that professional assessments follow.
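The scope risk described above can be illustrated with a pre-flight guard: before any generated command runs, its target is checked against the engagement's agreed CIDR ranges. This is a minimal stdlib sketch (the function and range names are illustrative, not the project's actual code):

```python
import ipaddress

# Hypothetical engagement scope: only these CIDR ranges may be touched.
IN_SCOPE = [ipaddress.ip_network(c) for c in ("10.10.0.0/16", "192.168.50.0/24")]

def in_scope(target: str) -> bool:
    """Return True only if the target IP falls inside an approved range."""
    addr = ipaddress.ip_address(target)
    return any(addr in net for net in IN_SCOPE)

print(in_scope("10.10.3.7"))  # True: inside 10.10.0.0/16
print(in_scope("8.8.8.8"))    # False: out of scope, command is refused
```

A hallucinated flag that widened a scan past these ranges would be caught here rather than on the wire.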

The Approach

Built a two-stage fine-tuning pipeline on Qwen2.5-7B. Stage one: cybersecurity corpus adaptation using 12GB of curated data—MITRE ATT&CK techniques, CVE databases, tool documentation (Nmap, Burp Suite, Metasploit, BloodHound), and penetration testing methodology guides. This gives the model domain vocabulary and factual grounding. Stage two: agentic fine-tuning for structured tool calling with OPSEC awareness. Trained on real engagement workflows to output properly formatted commands, respect scope constraints, and flag when a requested action might violate rules of engagement. Used LoRA (Low-Rank Adaptation) to keep adapter size at 3.6GB—practical for deployment on consumer GPUs.
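The two stages consume differently shaped data. As a sketch (field names are assumptions, not the actual training schema), stage one uses plain-text corpus records for continued pretraining, while stage two uses structured tool-calling turns with an explicit scope annotation the model learns to emit:

```python
import json

# Stage one: plain-text domain corpus record (continued pretraining).
stage_one = {"text": "T1046 Network Service Discovery: adversaries may scan ..."}

# Stage two: structured tool-calling turn with a scope check the model
# is trained to produce before any command (hypothetical schema).
stage_two = {
    "messages": [
        {"role": "user", "content": "Enumerate SMB shares on 10.10.4.0/24"},
        {"role": "assistant",
         "tool_call": {"tool": "nmap",
                       "args": ["-p445", "--script=smb-enum-shares", "10.10.4.0/24"]},
         "scope_check": "10.10.4.0/24 is within engagement CIDR 10.10.0.0/16"},
    ]
}
print(json.dumps(stage_two, indent=2))
```

Keeping the schemas separate is what lets stage one optimize purely for factual grounding and stage two purely for well-formed, scope-aware tool calls.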

Tech Decisions

Base Model
Qwen2.5-7B

Best balance of capability and deployment size for domain-specific tasks. Larger models (70B) offer marginal accuracy gains but require multi-GPU setups. Qwen2.5's strong instruction-following and code generation capabilities provide a solid foundation for tool-calling fine-tuning.

Fine-Tuning
LoRA / PEFT

Fully fine-tuning a 7B model requires 4x A100s and risks catastrophic forgetting. LoRA trains only ~0.1% of parameters, produces a 3.6GB adapter instead of a 14GB full model, and runs on a single RTX 3080. Training completes in 8 hours versus 3+ days for full fine-tuning.
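The parameter fraction can be sanity-checked with back-of-envelope arithmetic. Each adapted weight matrix gains two low-rank factors, A (d x r) and B (r x d), so roughly 2·r·d parameters per matrix. The dimensions and targeting below are assumptions for illustration (the actual Qwen2.5-7B config uses grouped-query attention, so k/v projections are smaller):

```python
# Illustrative dimensions, not the exact training configuration.
hidden = 3584   # approximate Qwen2.5-7B hidden size
layers = 28     # transformer layers
rank = 16       # LoRA rank (assumed)
targets = 4     # q/k/v/o projections adapted per layer (assumed)

# Each adapted square matrix adds A (hidden x rank) + B (rank x hidden).
lora_params = layers * targets * 2 * rank * hidden
total_params = 7_000_000_000

print(f"{lora_params:,} trainable ({lora_params / total_params:.3%} of base)")
# → roughly 0.18% of base parameters under these assumptions
```

That order of magnitude, tens of millions of trainable parameters instead of seven billion, is what makes single-GPU training feasible.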

Training Pipeline
Two-Stage Curriculum

Single-stage training conflates domain knowledge with behavioral patterns. Stage one (corpus adaptation) builds factual grounding; stage two (agentic fine-tuning) teaches structured tool-calling and OPSEC constraints. This separation produces more reliable outputs than mixing both objectives.

The Solution

PenQWEN deploys as a 3.6GB LoRA adapter on top of Qwen2.5-7B, runnable on any GPU with 12GB+ VRAM. The model handles reconnaissance automation, vulnerability prioritization, and report generation following PTES methodology. It generates syntactically correct tool commands with proper flags, understands scope constraints, and refuses to suggest techniques outside the defined engagement rules. The two-stage training approach means the model has both factual knowledge (CVEs, techniques, tool syntax) and procedural understanding (when to use which tool, how to chain findings, OPSEC considerations). Currently automating 60% of routine enumeration tasks with zero hallucinated commands in production use.
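One way the "zero hallucinated commands" property could be enforced at inference time is post-generation validation: every emitted command is checked against a vetted allowlist of tools and flags before it reaches the operator. This is a minimal sketch under that assumption, not the project's actual code, and the allowlist is deliberately tiny:

```python
# Hypothetical allowlist of vetted tools and their permitted flags.
ALLOWED = {
    "nmap": {"-sV", "-sC", "-p", "-oA", "--top-ports"},
    "gobuster": {"dir", "-u", "-w", "-t"},
}

def validate(command: list[str]) -> bool:
    """Reject commands using unknown tools or unrecognized flags."""
    tool, *args = command
    if tool not in ALLOWED:
        return False  # unknown tool: likely hallucinated
    flags = {a.split("=")[0] for a in args if a.startswith("-")}
    return flags <= ALLOWED[tool]

print(validate(["nmap", "-sV", "--top-ports", "100", "10.10.3.7"]))  # True
print(validate(["nmap", "--deep-scan", "10.10.3.7"]))  # False: no such flag
```

Fine-tuning reduces hallucination at the source; a guard like this turns any residual slip into a hard failure instead of an out-of-scope scan.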

Key Takeaways

Reusable Insights
  • Domain fine-tuning consistently beats prompt engineering for specialized tasks—a 7B fine-tuned model outperforms a 70B general model in its specific domain.
  • LoRA makes fine-tuning practical on consumer GPUs, democratizing domain adaptation that previously required cloud compute budgets.
  • Dataset quality matters more than quantity for domain-specific LLMs—12GB of curated, verified data outperforms 100GB of scraped, unverified content.
  • Two-stage curriculum learning (knowledge then behavior) produces more reliable outputs than single-stage training that conflates both objectives.
  • In safety-critical domains, hallucination prevention must be an explicit training objective, not an afterthought.

Related Projects

2025 / Developer Tools

TraceForge

Cut vector conversion time from 45 minutes to 8 seconds per asset—a 337x speedup. Design teams were hemorrhaging billable hours manually tracing logos and icons in Illustrator. Built a GPU-accelerated pipeline combining neural upscaling with dual vectorization engines (Potrace + VTracer), plus an SVGO optimization stage that reduces file sizes by 40-60%. Now processing 2,000+ conversions monthly with zero manual intervention.

Python · FastAPI · Potrace · VTracer · CUDA
Case Study
2025 / Developer Tools

Claude Pilot

Recovered 2+ hours daily lost to context-switching between terminal, database clients, and config files. Claude Code power users were drowning in fragmented tooling—no unified view of sessions, memory state, or MCP server health. Architected a native Electron control center with 25 tRPC endpoints managing PostgreSQL, Memgraph, and Qdrant memory systems. 80% test coverage, zero production incidents since launch.

Electron · React · TypeScript · tRPC · Zod
Case Study
2024 / AI/ML

PhotoKeep Pro

Slashed cloud GPU costs by 73% while boosting restoration quality by 4dB over commercial alternatives. A restoration startup was burning $12k/month on fragmented API calls with inconsistent results. Engineered a unified orchestration layer managing 14+ deep learning models (SUPIR, HAT, CodeFormer) with thread-safe VRAM allocation and LRU eviction across 49GB. Now delivering 28.5dB PSNR quality at 99.95% uptime—outperforming Magnific AI and Topaz on blind tests.

Python · FastAPI · PyTorch · React · TypeScript
Case Study
2025 / AI/ML

Voice Cloner

Built a production AI voice platform handling single-voice TTS, multi-speaker conversations, and full audiobook production from manuscript uploads — all on a single RTX 3080. The platform runs Qwen3-TTS 1.7B with 12-second P50 latency, 41+ curated voices, and zero-shot cloning from short reference audio. The Audiobook Studio parses DOCX/PDF/TXT manuscripts into chapters with dialogue detection, assigns character voices, applies pronunciation dictionaries, and exports distribution-ready M4B with chapter markers. Multi-voice conversations support drag-and-drop line ordering, per-line effects (speed, volume, gap), stage directions, multiple takes, ambient audio, and a waveform timeline editor. 99.95% uptime, 0.03% error rate, Stripe subscription billing.

Python · FastAPI · PyTorch · Qwen3-TTS · Redis
Case Study

Have a similar challenge?

I help teams solve complex technical problems. Let's discuss your project.

START_CONVERSATION()