AI-Powered Recruitment: Reducing CV Screening Time by 97% with Reliable LLM Evaluation

Why this matters
In today's competitive hiring landscape, HR teams face an overwhelming challenge: properly reviewing a CV takes 5-30 minutes, yet organizations receive 100-400 applications daily. This creates an impossible workload of 10-50 hours just for initial screening. Pre-AI solutions like keyword filtering reduced review time to mere seconds but sacrificed quality by missing crucial factors like learnability and cultural fit. At FlytBase, we built an AI-native recruitment solution that maintains human-quality assessment while dramatically reducing manual effort, transforming what seemed like an insurmountable daily task into a streamlined, reliable process.
The Problem
Traditional CV screening presented multiple bottlenecks:
- Manual review required 5-30 minutes per CV, creating an impossible workload of 10-50 hours daily
- Keyword-based filtering systems reduced this to 6-7 seconds but led to significant problems:
- Candidates "keyword stuffing" their CVs
- Focus shifted to keyword matching rather than holistic evaluation
- Numerous "cloak CVs" with listed skills but no projects or experience to back up claims
- Critical factors like learnability and cultural fit were completely missed
- Traditional ATS systems couldn't evaluate nuanced qualities or provide proper context
The AI-native Solution
We created an LLM-powered CV evaluation pipeline that not only automates screening but also matches human evaluators' accuracy after calibration. The system:
- Analyzes CVs against detailed criteria including project complexity, skill alignment, and certifications
- Uses separate "parser" and "decider" agents to ensure consistent evaluation
- Implements four specialized decider agents based on experience level (entry, junior, mid, senior)
- Applies temperature control, multiple evaluation runs, and human review flags for edge cases
- Achieves 97% reduction in manual review needs with accuracy matching human evaluators after iterations
Step-by-Step Breakdown
Step 1
Discover LLM Limitations and Optimize Parameters
- Initial AI implementation showed high score variations (up to 60 points) for the same CV
- Tested different sampling temperature settings to balance determinism and analysis capabilities
- Found optimal temperature range (0.2-0.3) to maintain reasoning while limiting hallucination
- Identified that temperature alone wasn't sufficient to solve inconsistency issues (see the tuning sketch below)
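To make the tuning experiment concrete, here is a minimal sketch of the kind of variance measurement this step describes, using the Anthropic Python SDK. The model alias, scoring prompt, and helper names are illustrative assumptions rather than the production setup.

```python
import json
import statistics
import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()

SCORING_PROMPT = (
    "Score this CV from 0 to 100 against the evaluation criteria. "
    'Reply with JSON only: {"score": <integer>}'
)

def score_cv(cv_text: str, temperature: float) -> int:
    """Ask the model for a single 0-100 score at a given sampling temperature."""
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=256,
        temperature=temperature,
        system=SCORING_PROMPT,
        messages=[{"role": "user", "content": cv_text}],
    )
    return int(json.loads(response.content[0].text)["score"])

def measure_spread(cv_text: str, temperature: float, runs: int = 5) -> dict:
    """Re-score the same CV several times and report how far the scores drift."""
    scores = [score_cv(cv_text, temperature) for _ in range(runs)]
    return {
        "temperature": temperature,
        "scores": scores,
        "range": max(scores) - min(scores),
        "stdev": round(statistics.pstdev(scores), 1),
    }

# Sweep a few temperatures to see where scores stabilise without flattening the reasoning.
cv_text = open("sample_cv.txt").read()  # any CV text
for temp in (0.0, 0.2, 0.3, 0.7, 1.0):
    print(measure_spread(cv_text, temp))
```

The column to watch is `range`: it narrows around 0.2-0.3, but as the next step shows, temperature alone does not drive it to zero.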
Step 2
Divide and Conquer with Specialized Agents
- Created a two-agent system in an N8N workflow: parser and decider
- Parser agent (temperature = 0) extracts structured data from unstructured CVs with complete determinism
- Verified parser output consistency with nearly identical results across multiple runs
- Decider agent evaluates the structured data against detailed criteria
- Eliminated inconsistencies by ensuring the same structured input for every evaluation (see the sketch below)
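The production version runs as an N8N workflow; the sketch below shows the equivalent parser/decider split in plain Python with the Anthropic SDK. The extraction schema, prompts, and model alias are illustrative assumptions, not our production prompts.

```python
import json
import anthropic

client = anthropic.Anthropic()

# Illustrative schema -- the fields actually extracted are not published here.
PARSER_SYSTEM_PROMPT = (
    "You are a CV parser. Extract the CV into JSON with exactly these keys: "
    "years_of_experience (number), skills (list of strings), "
    "projects (list of {name, description, tech}), certifications (list of strings), "
    "education (list of strings). Output JSON only, no commentary."
)

def parse_cv(cv_text: str) -> dict:
    """Parser agent: temperature 0 so the same CV always yields the same structured output."""
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=2048,
        temperature=0,
        system=PARSER_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": cv_text}],
    )
    return json.loads(response.content[0].text)

def decide(structured_cv: dict, criteria_prompt: str) -> dict:
    """Decider agent: evaluates the already-structured data against detailed criteria."""
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=1024,
        temperature=0.2,  # a little freedom for nuanced judgement
        system=criteria_prompt
        + ' Reply with JSON only: {"score": <0-100 integer>, "reasons": [<strings>]}',
        messages=[{"role": "user", "content": json.dumps(structured_cv)}],
    )
    return json.loads(response.content[0].text)
```

Because the parser runs at temperature 0, identical CVs yield identical structured JSON, so every decider call starts from exactly the same input.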
Step 3
Design Experience-Specific Evaluation Criteria
- Created four distinct decider agents for different experience levels:
  - Entry-level agent: Focuses on internships, academic projects, and learning potential
  - Junior analysis agent: Evaluates early professional experience and skill application
  - Mid-level analysis agent: Assesses deeper domain expertise and project complexity
  - Senior analysis agent: Looks for leadership evidence, domain mastery, and "war stories"
- Each agent uses specialized criteria matching FlytBase's expectations for that level
- Implemented detailed prompts that mimic how human recruiters evaluate different experience levels (a routing sketch follows below)
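One possible shape for the experience-level routing, continuing the Python sketch above. The year cut-offs and one-line criteria are placeholders; the real decider prompts are far more detailed.

```python
# Placeholder criteria -- stand-ins for the full decider prompts.
DECIDER_PROMPTS = {
    "entry":  "Score 0-100. Weight internships, academic projects, and evidence of learning potential.",
    "junior": "Score 0-100. Weight early professional experience and how skills were applied in practice.",
    "mid":    "Score 0-100. Weight depth of domain expertise and the complexity of delivered projects.",
    "senior": "Score 0-100. Weight leadership evidence, domain mastery, and concrete 'war stories'.",
}

def pick_level(structured_cv: dict) -> str:
    """Choose a decider agent from years of experience (the cut-offs here are assumptions)."""
    years = structured_cv.get("years_of_experience", 0)
    if years < 1:
        return "entry"
    if years < 3:
        return "junior"
    if years < 6:
        return "mid"
    return "senior"

def evaluate(structured_cv: dict) -> dict:
    """Route the structured CV to the matching decider agent."""
    level = pick_level(structured_cv)
    return decide(structured_cv, DECIDER_PROMPTS[level])  # decide() from the previous sketch
```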
Step 4
Implement Reliability Safeguards
- Run each CV through the system three times (optimal balance between cost and reliability)
- Calculate median score across multiple runs to eliminate outliers
- Track score range across runs to identify inconsistency
- Flag CVs with a score variation greater than 4 points for human review
- Randomly sample results from different score bands for verification by human hiring managers
- Iterate on prompts based on human feedback until alignment is achieved (the scoring safeguards are sketched below)
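Putting the safeguards together: a compact aggregation sketch that reuses parse_cv() and evaluate() from the earlier snippets. The three-run count and the 4-point flag threshold are the values described above; the rest is illustrative.

```python
import statistics

RUNS = 3             # three runs balanced cost against reliability
FLAG_THRESHOLD = 4   # score spread above this triggers human review

def evaluate_with_safeguards(cv_text: str) -> dict:
    """Parse once (deterministic), score several times, and aggregate the results."""
    structured = parse_cv(cv_text)  # parser runs at temperature 0, so one parse is enough
    scores = [evaluate(structured)["score"] for _ in range(RUNS)]
    spread = max(scores) - min(scores)
    return {
        "score": statistics.median(scores),             # the median suppresses one-off outliers
        "scores": scores,
        "spread": spread,
        "needs_human_review": spread > FLAG_THRESHOLD,  # flag inconsistent evaluations
    }
```

CVs that come back with needs_human_review set to True are the ones that land in the small manual-review pile described in the results below.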
What Changed
Our AI-native recruitment system achieved remarkable results:
- Evaluated over 800 CVs with consistent, reliable scoring
- Reduced maximum score variation from 60 points to just 10 points (83% improvement)
- Required human review for only 22 out of 800 CVs (97% reduction in manual effort)
- After iterations and tweaks, achieved alignment with human hiring manager evaluations
- Enabled evaluation of longer, more detailed CVs without increasing workload
- Shifted focus from "reduce CV length" to "include more context" for better evaluation
What We Learned
1. LLM parameter understanding is crucial - Learning about temperature controls and how they influence outputs is essential for building reliable systems.
2. Ambiguity is the enemy of consistency - The less ambiguity in prompts, the more consistent the results. Treat LLMs like junior team members who need explicit instructions.
3. Prompt engineering is product development - Creating effective prompts is not a side task but a primary engineering challenge requiring iteration and testing.
4. Specialized agents outperform generalists - Breaking tasks into specialized agents with narrow focus areas dramatically improves reliability.
5. Testing revealed Claude 3.7 outperformed GPT-4 - After testing the same prompts with multiple models, Claude 3.7 provided the most consistent results for our use case (a small comparison harness is sketched below).
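For point 5, a tiny harness like the one below is enough to compare score consistency across models. The lambdas are stand-ins for real provider calls (each wrapping the identical decider prompt) so the example runs on its own.

```python
import random
import statistics
from typing import Callable

def consistency(score_fn: Callable[[str], int], cv_text: str, runs: int = 5) -> dict:
    """Score the same CV repeatedly with one model and summarise how much the score drifts."""
    scores = [score_fn(cv_text) for _ in range(runs)]
    return {
        "scores": scores,
        "spread": max(scores) - min(scores),
        "stdev": round(statistics.pstdev(scores), 1),
    }

# Stand-ins so the harness runs end to end; in practice each entry wraps a provider SDK call.
models = {
    "model-a": lambda cv: 70 + random.randint(-5, 5),
    "model-b": lambda cv: 70 + random.randint(-1, 1),
}

cv_text = "..."  # the same CV text for every model
for name, score_fn in models.items():
    print(name, consistency(score_fn, cv_text))
```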
What You Can Steal
1. Divide complex AI tasks into specialized agents – Don't try to solve everything with one prompt. Our two-agent approach (parser + decider) dramatically improved consistency.
2. Control LLM determinism with temperature settings – Use near-zero temperatures for parsing/extraction tasks and 0.2-0.3 for nuanced evaluations requiring some flexibility.
3. Design prompts like you're training a junior colleague – Be explicit about steps, criteria, and thought processes. Leave no room for ambiguity in what you're asking the LLM to evaluate.
4. Run multiple evaluations and take the median – Running the same input multiple times (we found three to be optimal) and taking the median score drastically reduces outliers.
5. Implement human review flags for edge cases – Set clear thresholds (like score variation > 4 points) to trigger human review only when necessary.
Tools Used
- N8N - For orchestrating the workflow between parser and decider agents
- Claude 3.7 - Primary LLM after testing showed superior consistency vs. GPT-4
- Custom prompt framework - Structured with specific evaluation criteria for different experience levels
Final Thought
Working with AI isn't about building a perfect system on the first try. It's about systematic identification and elimination of inconsistency sources. While we haven't yet implemented RAG (Retrieval-Augmented Generation) for organizational context, our current approach demonstrates that even without it, well-engineered prompts can achieve human-level evaluation accuracy. Don't abandon AI when you encounter inconsistency – that's precisely when you should lean in deeper to understand the underlying mechanics and refine your approach.