Required Skills

bash, docker, git, llm, python

Job Description

We are building a large\-scale benchmark for evaluating the cybersecurity capabilities of frontier large language models (LLMs). To grow this benchmark, we need hands\-on security engineers who can craft real\-world vulnerability tasks that are genuinely difficult for state\-of\-the\-art LLMs and agentic systems.

Your core output will be carefully designed benchmark instances: real software vulnerabilities paired with well\-formed task specifications and validated evaluation oracles that expose the limits of current AI systems and drive progress in AI safety research.

Key Responsibilities

**Create Benchmark Tasks:** Design cybersecurity benchmark tasks engineered to challenge frontier LLMs and expose where they fail.
**Environment Maintenance:** Build and maintain containerized benchmark environments using Docker, libFuzzer, and sanitizers (ASan/MSan).
**Develop Difficulty Tiers:** Produce multi\-level difficulty variants, ranging from Level 0 (no description provided) through Level 3 (patch diff supplied).
**Collaborate with Research:** Partner with researchers to analyze and document the specific failure patterns of AI agents.
**Technical Documentation:** Write clear, reproducible vulnerability descriptions (at most 200 words) to be used directly as task prompts.
**Agent Stress\-Testing:** Stress\-test developed tasks against frontier LLM agents (e.g., OpenHands, Codex CLI) and document their failure modes.
**Quality Assurance:** Ensure strict benchmark quality, including zero data duplication, sufficient locating information, and a 96%\+ precision target.
**Responsible Disclosure:** Follow standard responsible disclosure protocols for any zero\-day vulnerabilities discovered during benchmark development.

Required Skills \& Qualifications

1\. Vulnerability Research Expertise

3\+ years of hands\-on experience identifying and analyzing memory safety vulnerabilities in C/C\+\+ codebases (e.g., heap/stack overflows, use\-after\-free, null dereferences, uninitialized memory).
Demonstrated ability to reproduce known CVEs and write proof\-of\-concept (PoC) inputs that reliably trigger sanitizer crashes (ASan, MSan, UBSan).
Comfort navigating large, unfamiliar codebases (ranging from 100k to 7M\+ lines of code) to locate vulnerable code paths.

2\. Fuzzing \& Toolchain

Working knowledge of coverage\-guided fuzzers such as libFuzzer, AFL\+\+, or OSS\-Fuzz workflows.
Experience compiling projects with sanitizer flags (AddressSanitizer, MemorySanitizer) using Clang or GCC (note that MemorySanitizer is available only in Clang).
Familiarity with Docker for building and distributing reproducible execution environments.

3\. Patch \& Exploit Analysis

Ability to read unified diffs and extract semantic meaning about the specific vulnerability being patched.
Solid understanding of 1\-day / N\-day attack workflows, moving successfully from a patch diff to a working PoC.
Experience with binary search over commit history (e.g., git bisect or equivalent) to pinpoint exact patch commits.

4\. Communication \& Automation Rigor

Ability to write concise, technically precise vulnerability descriptions (target: at most 200 words) containing sufficient localization info for reproduction without leaking the fix.
Comfortable scripting in Python or Bash to automate build, evaluation, and filtering pipelines.

Nice to Have (Preferred Skills)

Prior experience contributing to or evaluating AI coding agents (e.g., OpenHands, Codex CLI, SWE\-agent).
Familiarity with LLM APIs and prompt engineering for automated quality\-judgment pipelines.
A research background, including publications or detailed write\-ups on vulnerability discovery, fuzzing, or program analysis.
Direct experience with CVE reporting and coordinated vulnerability disclosure processes.
Knowledge of broader vulnerability classes beyond memory safety (e.g., logic flaws, cryptographic weaknesses, web/mobile vulnerabilities).
Hands\-on Capture the Flag (CTF) competition experience, particularly in *pwn* or *reverse\-engineering* categories.
Familiarity with symbolic execution or static analysis tools (e.g., angr, CodeQL, Infer).

Core Values We Look For

Strong curiosity and a research\-oriented mindset.
The ability to seamlessly translate theory into practical, functional systems.
High ownership, a bias toward execution, and comfort with ambiguity in evolving problem spaces.
Clear, highly structured technical communication.
Ability to thrive and maintain high autonomy in a fast\-paced environment.

Why Join This Project

Work on cutting\-edge problems at the intersection of AI evaluation, safety, and reliability.
Help bridge the gap between security research and real\-world AI systems.
Enjoy high ownership and autonomy in a fast\-moving team environment.
Opportunity to actively shape how AI agents are evaluated at scale while gaining exposure to both research\-driven innovation and production systems.

Pay: ₹552,138\.32 \- ₹830,854\.57 per year

Work Location: In person

Job Overview

Job type: Full-time
Work mode: On-site
Location: Bengaluru
Posted: 3d ago
Source: Indeed

Cybersecurity Benchmark Engineer 3+ years