Posted on: 26/11/2025
Description :
Job Title : ML Engineer (RL Environments)
About the Role :
We are looking for a highly autonomous Machine Learning Engineer who can design and implement SWE-Bench-style RL environments and generate a continuous stream of evaluation tasks for LLMs and agentic systems.
This role is heavily engineering-focused and requires strong experience building custom environments, workflows, and code-based tasks.
You will work directly with a senior researcher but execute independently.
Responsibilities :
- Build custom RL environments inspired by SWE-Bench, code-debugging tasks, unit-test-driven workflows, and agent evaluation tasks.
- Create large volumes of structured tasks, including :
  - Code reasoning tasks
  - Multi-step workflows
  - Debugging challenges
  - Reward-driven evaluation episodes
- Define state/action/reward formats for each environment.
- Implement task infrastructure in Python.
- Produce JSON schemas, templates, and reproducible task scripts (see the task-spec sketch after this list).
- Build testing harnesses to validate correctness of tasks.
- Work closely with a researcher to align on quality, difficulty, and output structure.
- Stay current with LLM evaluation and agentic frameworks.
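To give candidates a concrete sense of the work, here is a minimal sketch of a task spec and a validation check in Python. The field names, schema, and helper shown here are hypothetical assumptions chosen for illustration, not an existing internal format.

# Illustrative sketch only: the TaskSpec fields and validate_task checks
# are assumptions, not an agreed schema.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class TaskSpec:
    task_id: str
    repo: str                      # repository the task is drawn from
    prompt: str                    # instruction shown to the LLM/agent
    entry_point: str               # file or function the agent must modify
    unit_tests: list[str] = field(default_factory=list)  # tests that define success
    max_steps: int = 20            # episode length cap
    reward_on_pass: float = 1.0    # sparse reward when all tests pass

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

def validate_task(spec: TaskSpec) -> list[str]:
    # Cheap structural checks before a task enters the evaluation pool.
    errors = []
    if not spec.unit_tests:
        errors.append("task has no unit tests, so success is undefined")
    if spec.max_steps <= 0:
        errors.append("max_steps must be positive")
    return errors

if __name__ == "__main__":
    spec = TaskSpec(
        task_id="demo-001",
        repo="example/repo",
        prompt="Fix the off-by-one error in pagination.",
        entry_point="src/pagination.py",
        unit_tests=["tests/test_pagination.py::test_last_page"],
    )
    assert not validate_task(spec)
    print(spec.to_json())

In practice, specs along these lines would be serialized to JSON and validated by a testing harness before entering the evaluation pool.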
Required Skills :
- 4 to 5+ years of ML engineering experience, with a prior title of ML Engineer, RL Engineer, or ML Research Engineer.
- Strong Python engineering background building production-ready code and modular libraries.
- Experience with RL environment creation (Gym, Gymnasium, custom RL tasks); a minimal environment sketch follows this list.
- Experience with SWE-Bench, code evaluation, repo-based tasks, or similar systems is a major advantage.
- Strong understanding of reward shaping, episode design, and environment logic.
- Hands-on ML experience (PyTorch, TensorFlow, Hugging Face).
- Ability to generate new tasks independently, with minimal supervision.
- Strong familiarity with LLMs and evaluation frameworks.
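For reference, below is a minimal Gymnasium-style environment sketch for a unit-test-driven debugging task, illustrating sparse reward and episode design. The run_tests helper and the text-in/text-out observation and action layout are placeholder assumptions, not a reference implementation.

# Sketch of a custom Gymnasium environment for a debugging task; run_tests
# is a stub standing in for a real sandboxed test harness.
import gymnasium as gym
from gymnasium import spaces

def run_tests(task_spec: dict, patch: str) -> tuple[bool, str]:
    # Stub: a real harness would apply the patch in a sandbox and run the
    # task's unit tests; here we only check that a non-empty patch was sent.
    return bool(patch.strip()), "stub test log"

class DebuggingEnv(gym.Env):
    # One episode: the agent submits candidate patches until the test suite
    # passes or the step budget runs out. Reward is sparse (1.0 on success).

    def __init__(self, task_spec: dict, max_steps: int = 20):
        super().__init__()
        self.task_spec = task_spec
        self.max_steps = max_steps
        # Text in (prompt / failing-test output), text out (candidate patch).
        self.observation_space = spaces.Text(max_length=20_000)
        self.action_space = spaces.Text(max_length=20_000)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.steps = 0
        return self.task_spec["prompt"], {}

    def step(self, action: str):
        self.steps += 1
        passed, log = run_tests(self.task_spec, patch=action)
        reward = 1.0 if passed else 0.0
        terminated = passed
        truncated = self.steps >= self.max_steps
        return log, reward, terminated, truncated, {}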
Nice to Have :
- Prior work with LLM agent frameworks.
- Experience building debugging/patching tasks.
- Research engineering experience.
What Success Looks Like :
- You can independently produce new RL tasks daily.
- You write clean, reusable environment code.
- You understand how LLMs fail and design tasks to measure that.
- You need minimal oversight.