We are building a large\-scale benchmark for evaluating the cybersecurity capabilities of frontier Large Language Models (LLMs). To grow this benchmark, we need hands\-on security engineers who can craft real\-world vulnerability tasks that are genuinely difficult for state\-of\-the\-art LLMs and agentic systems.
Your core output will be carefully designed benchmark instances: real software vulnerabilities paired with well\-formed task specifications and validated evaluation oracles that expose the limits of current AI systems and drive progress in AI safety research.
Key Responsibilities
- **Create Benchmark Tasks:** Design cybersecurity benchmark tasks that challenge frontier LLMs and expose where they fail.
- **Environment Maintenance:** Build and maintain containerized benchmark environments using Docker, libFuzzer, and sanitizers (ASan/MSan).
- **Develop Difficulty Tiers:** Produce multi\-level difficulty variants, ranging from Level 0 (no description provided) through Level 3 (patch diff supplied).
- **Collaborate with Research:** Partner with researchers to analyze and document the specific failure patterns of AI agents.
- **Technical Documentation:** Write clear, reproducible vulnerability descriptions ($\le 200$ words) to be used directly as task prompts.
- **Agent Stress\-Testing:** Stress\-test developed tasks against frontier LLM agents (e.g., OpenHands, Codex CLI) and document their failure modes.
- **Quality Assurance:** Ensure strict benchmark quality, including zero data duplication, sufficient localization information, and a 96%\+ precision target.
- **Responsible Disclosure:** Follow standard responsible disclosure protocols for any zero\-day vulnerabilities discovered during benchmark development.
Required Skills \& Qualifications
1\. Vulnerability Research Expertise
- 3\+ years of hands\-on experience identifying and analyzing memory safety vulnerabilities in C/C\+\+ codebases (e.g., heap/stack overflows, use\-after\-free, null dereferences, uninitialized memory).
- Demonstrated ability to reproduce known CVEs and write proof\-of\-concept (PoC) inputs that reliably trigger sanitizer crashes (ASan, MSan, UBSan).
- Comfort navigating large, unfamiliar codebases (ranging from 100k to 7M\+ lines of code) to locate vulnerable code paths.
2\. Fuzzing \& Toolchain
- Working knowledge of coverage\-guided fuzzers such as libFuzzer, AFL\+\+, or OSS\-Fuzz workflows.
- Experience compiling projects with sanitizer flags (AddressSanitizer, MemorySanitizer) using GCC or Clang.
- Familiarity with Docker for building and distributing reproducible execution environments.
3\. Patch \& Exploit Analysis
- Ability to read unified diffs and extract semantic meaning about the specific vulnerability being patched.
- Solid understanding of 1\-day / N\-day attack workflows, from reading a patch diff to producing a working PoC.
- Experience with binary search over commit history (e.g., git bisect or equivalent) to pinpoint exact patch commits.
4\. Communication \& Automation Rigor
- Ability to write concise, technically precise vulnerability descriptions (target: $\le 200$ words) containing sufficient localization info for reproduction without leaking the fix.
- Comfortable scripting in Python or Bash to automate build, evaluation, and filtering pipelines.
Nice to Have (Preferred Skills)
- Prior experience contributing to or evaluating AI coding agents (e.g., OpenHands, Codex CLI, SWE\-agent).
- Familiarity with LLM APIs and prompt engineering for automated quality\-judgment pipelines.
- A research background, including publications or detailed write\-ups on vulnerability discovery, fuzzing, or program analysis.
- Direct experience with CVE reporting and coordinated vulnerability disclosure processes.
- Knowledge of broader vulnerability classes beyond memory safety (e.g., logic flaws, cryptographic weaknesses, web/mobile vulnerabilities).
- Hands\-on Capture the Flag (CTF) competition experience, particularly in *pwn* or *reverse\-engineering* categories.
- Familiarity with symbolic execution or static analysis tools (e.g., angr, CodeQL, Infer).
Core Values We Look For
- Strong curiosity and a research\-oriented mindset.
- The ability to seamlessly translate theory into practical, functional systems.
- High ownership, a bias toward execution, and comfort with ambiguity in evolving problem spaces.
- Clear, highly structured technical communication.
- Ability to thrive and maintain high autonomy in a fast\-paced environment.
Why Join This Project
- Work on cutting\-edge problems at the intersection of AI evaluation, safety, and reliability.
- Help bridge the gap between security research and real\-world AI systems.
- Enjoy high ownership and autonomy in a fast\-moving team environment.
- Opportunity to actively shape how AI agents are evaluated at scale while gaining exposure to both research\-driven innovation and production systems.
Pay: ₹552,138\.32 \- ₹830,854\.57 per year
Work Location: In person