We are seeking a highly skilled AI Quality Engineer to join our National Large Language Model (LLM) Project. This key role will focus on establishing and implementing robust data quality frameworks, evaluation methodologies, and quality gates throughout the LLM development lifecycle. The ideal candidate will ensure our Arabic LLM meets the highest standards of performance, reliability, and cultural appropriateness before being deployed to 20,000 government employees.
Key Responsibilities:
- Design and implement comprehensive data quality frameworks specific to Arabic language datasets for LLM training and evaluation
- Establish and enforce quality gates at each project phase (data preparation, model training, evaluation, and RAG implementation)
- Develop detailed acceptance criteria for each phase gate requiring formal sign-off from key stakeholders
- Create and implement quality metrics for data annotation, achieving >90% inter-annotator agreement and >95% cultural/contextual accuracy (see the measurement sketch after this list)
- Design and maintain data pipeline quality assurance processes for Arabic text normalization, diacritics standardization, and dialect variation mapping
- Implement Arabic-specific tokenization optimization with >98% vocabulary coverage and >95% morphological accuracy
- Develop comprehensive RAG quality measurement frameworks covering both retrieval and generation metrics
- Establish automated monitoring systems for continuous quality assessment with real-time dashboards
- Create and enforce testing protocols for model evaluation across various Arabic language tasks
- Implement robust regression testing frameworks to ensure model updates maintain or improve quality metrics
- Develop protocols for bias detection and mitigation in both training data and model outputs
- Support the implementation of benchmarking against global standards
- Design human evaluation frameworks to assess model outputs qualitatively
- Collaborate with data annotation teams to ensure high-quality ground truth data
- Participate in weekly quality committee meetings and bi-weekly RAG performance reviews
- Create and maintain quality documentation including processes, guidelines, and acceptance criteria
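For illustration, the inter-annotator agreement target above is commonly measured either as raw percent agreement or with a chance-corrected statistic such as Cohen's kappa. A minimal sketch using scikit-learn follows; the label lists are hypothetical stand-ins for an annotation-platform export.

```python
# Minimal sketch: chance-corrected inter-annotator agreement via Cohen's kappa.
# The labels below are hypothetical; in practice they would be exported from an
# annotation platform, one label per item, aligned across the two annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["formal", "dialect", "formal", "dialect", "formal", "formal"]
annotator_b = ["formal", "dialect", "formal", "formal", "formal", "formal"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # ~0.8+ is usually read as strong agreement
```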
Requirements:
- Bachelor's or Master's degree in Computer Science, AI, Machine Learning, or a related field
- 4+ years of experience in AI/ML quality assurance, with a specific focus on natural language processing
- Strong understanding of LLM evaluation methodologies and benchmarking techniques
- Experience establishing quality gates and acceptance criteria for AI systems
- Hands-on experience with data quality frameworks and validation techniques
- Experience implementing multi-level annotation review processes with clear metrics
- Proficiency in designing data pipeline quality assurance systems for Arabic language processing
- Experience with RAG quality assessment covering both retrieval and generation components
- Ability to establish and track performance metrics against benchmarks
- Experience implementing automated testing frameworks and continuous integration for ML systems
- Strong knowledge of bias detection and fairness assessment in AI systems
- Familiarity with the Arabic language and NLP challenges specific to Semitic languages
- Experience with human evaluation protocols and annotation quality assessment
- Proficiency in Python and relevant testing/quality assurance libraries
- Understanding of statistical analysis techniques for model evaluation (see the sketch after this list)
- Experience with data annotation platforms and quality control mechanisms
- Knowledge of responsible AI practices and ethical considerations
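As a sketch of the statistical analysis mentioned above, one common technique is a bootstrap confidence interval around an evaluation score. The example below assumes NumPy; the 0/1 correctness array is a hypothetical stand-in for real per-example evaluation results.

```python
# Minimal sketch: bootstrap 95% confidence interval for per-example accuracy.
# `correct` is a hypothetical 0/1 array marking whether each model output
# matched its reference answer.
import numpy as np

rng = np.random.default_rng(seed=0)
correct = rng.integers(0, 2, size=500)  # stand-in for real evaluation results

boot_means = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(10_000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy = {correct.mean():.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```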
Preferred Qualifications:
- Experience with LLM evaluation specifically for government or enterprise applications
- Knowledge of Arabic-specific LLM benchmarks
- Experience with RAG system evaluation and quality assurance
- Familiarity with platforms like Scale AI, Humanloop, or other annotation/evaluation systems
- Experience with hallucination detection and factual consistency verification
- Knowledge of prompt engineering and prompt quality assessment
- Experience with MLOps and quality gates in CI/CD pipelines for ML
- Proficiency with data lineage tracking and documentation
- Experience implementing A/B testing frameworks for model comparison
- Familiarity with user experience testing for AI applications
- Experience with security and privacy testing for AI systems
- Knowledge of ROUGE, BLEU, BERTScore, and other NLP evaluation metrics (see the sketch below)
- Experience creating custom metrics for domain-specific tasks
- Experience participating in quality governance committees
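For reference, ROUGE and BLEU are n-gram overlap metrics between generated and reference text (BERTScore, which requires downloading a pretrained model, is omitted here). A minimal sketch assuming the `nltk` and `rouge-score` packages are installed; the sentence pair is hypothetical.

```python
# Minimal sketch: reference-based overlap metrics for generated text.
# Assumes `pip install nltk rouge-score`; the strings are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the committee approved the annual budget on monday"
candidate = "the committee approved the budget on monday"

# BLEU: geometric mean of n-gram precisions against tokenized references.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: n-gram and longest-common-subsequence overlap; F-measure reported here.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```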