Job Title : AI Agent Evaluation Engineer

Experience Level : 5- 7 Years (Minimum 6+ years in Software QA required)

Work Distribution : ~70% Automation Testing / ~30% Manual Testing

Primary Focus Areas : Responsible AI, Safety Evaluations, and Google ADK

Role Summary :

We are seeking a seasoned QA professional to lead evaluation and testing efforts for AI and LLM-based systems.

The ideal candidate will have deep experience in validating conversational agents, conducting safety and adversarial assessments, and working with Googles Agent Development Kit (ADK) and Vertex AI ecosystem.

Mandatory Requirements :

AI / LLM Testing Expertise :

- At least 2 years dedicated experience testing or evaluating AI systems, conversational agents, or large language models (LLMs).

- Proven track record of designing, executing, and reviewing AI evaluation frameworks and test suites, including both automated and exploratory evaluations.

Safety & Red Teaming :

- Hands-on experience with safety evaluations is mandatory.

- Demonstrated practice in red teaming, adversarial testing, jailbreaking, toxicity/bias measurement, and other responsible AI safety assessments.

Google ADK Knowledge :

- Must have direct experience with or strong conceptual understanding of the Google Agent Development Kit (ADK) and its role in building and evaluating AI agents.

- Familiarity with Vertex AI (including Agent Builder and related evaluation services) is required.

Technical Requirements :

Core Programming & Automation :

- Strong proficiency in Python for test scripting, automation frameworks, and data manipulation.

- Practical experience with PyTest for creating robust automated test suites.

AI Safety & Prompt Challenges :

- Experience identifying and implementing tests for prompt injections, adversarial inputs, jailbreak scenarios, and robustness checks.

Evaluation & Tooling Familiarity :

- Candidates should be familiar with at least some of the following AI evaluation tools, libraries, or frameworks :

1. Langsmith

2. DeepEval

3. Ragas

4. Giskard

- Hugging Face evaluation tools

- These tools support structured testing and performance evaluation for LLMs and agent systems