When researchers at the University of Maryland, the National University of Singapore, and Ohio State University asked seven leading commercial and open-source LLMs to pick the better of two resumes, one human-written and one the model itself had generated, every model preferred its own writing. The paper's headline finding: across the major commercial and open-source models tested, the self-preference rate ran between 67% and 82%.
The human-written resumes were not worse. The researchers controlled for content quality by having human judges grade the resumes for clarity and coherence, then re-ran the analysis. The bias persisted across the major models even with quality controlled. Every model kept picking its own version even when human reviewers rated the human-written one as clearer, more coherent, and more effective.
This is from a peer-reviewed working paper presented at the 5th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO '25) in November 2025, currently in revision and under review at Manufacturing & Service Operations Management. The authors call it self-preference bias. The mechanism is simple: large language models learn to recognize their own stylistic fingerprints (word choice, sentence rhythm, and structural patterns) and reward content that matches them.
Then the researchers ran the simulation that should change how every Head of Talent thinks about screening. Same job. Twenty-four occupations. Identical qualifications. The only variable was whether the candidate used the same AI as the screener. Candidates who matched the screener's model were 23% to 60% more likely to be shortlisted than equally qualified applicants who wrote their own resumes.
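To make "23% to 60% more likely" concrete, here is a minimal sketch of how a relative shortlisting lift can be computed from screening outcomes. The counts are invented for illustration and are not taken from the paper; `shortlist_rate` and `relative_lift` are hypothetical helper names.

```python
def shortlist_rate(shortlisted: int, total: int) -> float:
    """Fraction of applicants in a group who were shortlisted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return shortlisted / total


def relative_lift(matched_rate: float, baseline_rate: float) -> float:
    """How much more likely the matched group is to be shortlisted,
    relative to the baseline group (0.25 means 25% more likely)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return matched_rate / baseline_rate - 1.0


# Illustrative counts only (assumed, not from the study).
# "matched": resumes written with the same model the screener uses.
# "baseline": human-written resumes with identical qualifications.
matched = shortlist_rate(shortlisted=30, total=100)
baseline = shortlist_rate(shortlisted=20, total=100)

lift = relative_lift(matched, baseline)
print(f"{lift:.0%} more likely to be shortlisted")
```

Note that this is a relative figure: a lift of 50% here means the shortlist rate moved from 20% to 30%, not that it rose by 50 percentage points.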
The worst gaps showed up in sales, accounting, and finance. The exact roles mid-market companies hire most.
If your reaction is "that was August 2025, the models have moved on," the follow-up research is the part that should worry you. In September 2025, a separate research team from Harvard and Cangrade (Lehr, Cipperman, and Banaji) published Extreme Self-Preference in Language Models, testing GPT-4o, Gemini-2.5-Flash, and Claude Sonnet 4. Across roughly 20,000 queries, the same self-preference pattern held. The authors describe several of the observed effects as "exceptionally large, at levels rarely seen in human data."
The follow-up study went further. By manipulating what each model believed about its own identity, the researchers found that self-preference followed the assigned identity rather than the true one. The bias was not just about stylistic recognition. It was about model identity shaping judgment, including in consequential evaluations of job candidates, security software proposals, and medical chatbots.
A separate study (Saraf et al., August 2025) tested ChatGPT-4o, Gemini 2.5 Flash, and Claude Sonnet 4 by attributing each model's blog posts to different authors. False attribution produced "shifts of up to 50 percentage points in voting outcomes," in the paper's words. The "Claude" label consistently elevated scores regardless of actual authorship. The "Gemini" label systematically depressed them. The bias does not disappear with each new model generation. It shifts shape.
The category problem
If you run AI resume screening on AI-written resumes, you are not screening candidates. You are scoring whose subscription matches your vendor's.
This is not a bias in the demographic-fairness sense. It is not something a better prompt or a bias audit fixes. It is a structural property of any architecture that uses an LLM to evaluate text the candidate generated with an LLM. The signal itself is corrupted.
Greenhouse's 2025 candidate research found that 74% of US job seekers now use generative AI somewhere in their job search. The company's own 2025 AI in Hiring Report (Greenhouse is an ATS vendor that itself sells AI hiring tools) found that 41% of US job seekers have used prompt injection, hidden text designed to manipulate AI screens, to game filters. Another 52% who have not used it are considering it. Greenhouse detects prompt injection in 1 in 100 of the 300 million resumes it processes annually.
On the employer side, somewhere between 48% and 83% of companies (depending on whose survey) now use AI in resume screening. The Mobley v. Workday collective action, certified in May 2025, alleges that Workday's AI screening tools systematically rejected applicants over 40. In her certification ruling, Judge Rita Lin noted that Workday itself speculated the collective could be "in the hundreds of millions," which she treated not as a reason to deny notice but as a measure of the breadth of the alleged conduct. Roughly 14,000 plaintiffs had opted in by the time the window closed on March 7, 2026. A separate January 2026 class action against Eightfold AI alleges that the platform scraped personal data on over one billion workers and ranked applicants on a zero-to-five scale, filtering out low-ranked candidates before any human reviewed their applications.
Two sides of the same conversation are now mediated by language models that systematically favor themselves. The phrase being used for this in the recruiting community is the AI doom loop. The Greenhouse CEO gave it to Fortune. The Greenhouse data underneath it is more damning than the slogan: only 8% of job seekers believe AI screening is fair. Seventy percent of candidates were never told ahead of time that AI would evaluate them. Thirty-eight percent of US candidates have walked away from a hiring process specifically because it included an AI interview.
Why scoring better will not fix it
The instinct in talent acquisition is to assume that if a tool produces bad outputs, you need a better tool. A better prompt. A better model. A better bias audit.
That instinct is wrong here, and the research explains why. The University of Washington tested AI resume screeners across nine occupations in 2024 and found that they preferred white-associated names 85% of the time and never preferred Black male-associated names over white male-associated names. A 2025 study across GPT, Gemini, Claude, and LLaMA found persistent intersectional bias that flipped direction depending on the role. The "Fairness Is Not Enough" paper from July 2025 coined the Illusion of Neutrality: models that pass surface-level fairness audits often turn out to be doing keyword matching rather than substantive evaluation.
The self-preferencing finding adds a different kind of failure on top of demographic bias, and it has now been replicated across multiple frontier models that have shipped since the original paper. Even when the demographic distribution is controlled, even when content quality is controlled, even when human raters disagree with the model's judgment, the model still picks its own writing. Upgrading the underlying model does not solve this. It just changes which candidates get preferred.
You can run a bias audit on a self-preferencing screener and the audit will pass on race and gender. The screener will still be rewarding candidates whose ChatGPT subscription matches your vendor's API key. There is no fix in the scoring layer. The signal is the problem.
The regulatory environment is moving in the same direction the research is pointing. California's FEHA AI regulations took effect October 1, 2025. They require four-year recordkeeping for automated decision systems, treat the absence of anti-bias testing as evidence in discrimination claims, and adopt the agent-liability theory at the heart of Mobley v. Workday. Illinois HB 3773 took effect January 1, 2026. The EU AI Act classifies recruitment AI as high-risk; the operative obligations apply from August 2, 2026, with penalties up to €15M or 3% of global turnover. The NIST AI Risk Management Framework, which every serious enterprise procurement now references, is built around impact assessment, transparency, and human oversight, not better scoring.
Every framework that survives penalizes ranking and auto-rejection. Every framework rewards verification, human oversight, source attribution, and recordkeeping.
A different architecture
The way out is not a better LLM scoring the same broken signal. It is moving the screening signal out of candidate-controlled text entirely.
Self-preference bias requires that the candidate's submission be text the LLM can stylistically recognize. A real-time voice conversation breaks this assumption at the source. The candidate cannot have GPT-4o speak for them in a live, requirement-anchored conversation about their actual experience. A reference cannot be ghostwritten by Claude when the reference is having a voice conversation about a specific claim the candidate made about themselves. The screening signal sits outside the specific attack surface that the authors of the original paper (Xu, Li, and Jiang) documented.
This is exactly the architecture we built Virvell on, and we built it before we knew this paper would exist. We do not score resumes. We do not rank candidates. We do not auto-reject anyone. We bundle three independent screening signals: an AI pre-screen voice interview with the candidate, voice AI reference checks with people who actually worked with them, and verified background checks. Every claim gets anchored to a specific job requirement. We call the cross-verification layer Evidence Intelligence. Every requirement on the job description gets mapped to the sources that confirm or contradict it, with attribution.
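A minimal sketch of what requirement-level attribution could look like as a data structure. It is illustrative only and does not describe Virvell's actual implementation: every class, field, and string below is hypothetical, and the example evidence notes are invented. The point it shows is structural: each requirement carries attributed evidence and a description for a human reviewer, with no score and no ranking.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    PRE_SCREEN = "candidate voice pre-screen"
    REFERENCE = "voice reference check"
    BACKGROUND = "background check"


class Stance(Enum):
    CONFIRMS = "confirmed"
    CONTRADICTS = "contradicted"


@dataclass(frozen=True)
class Evidence:
    source: Source   # where the statement came from (attribution)
    stance: Stance   # whether it supports or contradicts the requirement
    note: str        # short summary of what the source said


@dataclass
class RequirementReport:
    requirement: str
    evidence: list[Evidence] = field(default_factory=list)

    def summary(self) -> str:
        """Describe the evidence for a human reviewer. No score, no ranking."""
        if not self.evidence:
            return "no evidence yet: needs human follow-up"
        stances = {e.stance for e in self.evidence}
        if len(stances) > 1:
            return "sources disagree: needs human review"
        sources = ", ".join(e.source.value for e in self.evidence)
        return f"{next(iter(stances)).value} by: {sources}"


# Invented example data, for illustration only.
report = RequirementReport("GMP compliance")
report.evidence.append(
    Evidence(Source.PRE_SCREEN, Stance.CONFIRMS, "Described working under GMP procedures.")
)
report.evidence.append(
    Evidence(Source.REFERENCE, Stance.CONTRADICTS, "Could not recall GMP work in that role.")
)
print(report.summary())
```

When sources disagree, the sketch deliberately returns a request for human review rather than resolving the conflict itself.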
Here is what that looks like in practice. In March 2026, Vena Medical, a 17-person Y Combinator-backed medtech company in Kitchener, posted a niche Catheter Assembler role. 123 applications arrived in days. The last time they tried to hire this role on Indeed, 500 applications had landed in 72 hours and sat unreviewed because the founders had no bandwidth to filter them. CEO Michael Phillips's description of that earlier round was direct: "We tried posting this job on Indeed, and we got, within three days, 500 applications. And I couldn't filter through them. It was just impossible." This time, our platform ran structured voice pre-screens on all 123 candidates. 78 candidates completed the screen. 44 finished in the first three hours. Reports were generated in an average of 26 seconds per candidate. Three candidates were shortlisted. One was hired. Total client time spent on screening: zero hours.
The interesting number was not the throughput. It was the completion rate split. On outbound calls, 37% of candidates completed the screen. On inbound callbacks, where the candidate chose the moment, 98% completed. A 2.6× lift on the same screen, same questions, same AI. Eight pre-screens landed after 6pm Toronto. One wrapped just past midnight. Those are candidates the 9-to-5 phone screen never reaches. We call this Inclusion Arbitrage, and we wrote about it at length in a separate post.
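For readers who want to check the arithmetic, the figures quoted above work out as follows. This is a small sketch that uses only the numbers stated in this post; the variable names are arbitrary.

```python
# Completion rates quoted in the post.
outbound_completion = 0.37  # outbound calls
inbound_completion = 0.98   # inbound callbacks, candidate chooses the moment

lift = inbound_completion / outbound_completion
print(f"inbound vs outbound: {lift:.1f}x")  # -> inbound vs outbound: 2.6x

# Overall completion: 78 of 123 candidates finished the screen.
overall = 78 / 123
print(f"overall completion: {overall:.0%}")  # -> overall completion: 63%
```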
For the two candidates the AI flagged with a "proceed" recommendation, the pre-screen analysis surfaced their verified skills against the role requirements (cleanroom experience, GMP compliance, microscopic assembly, catheter assembly), noted any gaps, and recommended human review on specific items. The founders reviewed the structured reports, drew finalists from across confidence tiers, and made the call themselves. Background verification through our Certn integration completed the picture. That is the system working as designed: AI gathers the structured information, humans decide. Always.
(Voice has its own legitimate bias risks. Word-error rates differ across accents and dialects, and disability accommodation has to be a first-class concern, not an afterthought. We test for ASR accuracy across accent groups, offer human-alternative paths on request, and keep humans in the decision loop rather than treating any one signal as definitive. The architecture is structurally outside the self-preferencing attack surface. It is not magically immune to every other bias problem in machine learning, and we do not pretend it is.)
This architecture is also closer to where every regulator is pointing. No scoring means no proxy for protected attributes to leak into. No auto-rejection means no automated adverse employment decision under FEHA. Source attribution means every conclusion in the report is traceable, which is what the EU AI Act calls transparency and what the NIST AI RMF calls explainability. Human-in-the-loop is not a policy posture. It is the only way the system works.
What this changes for talent acquisition
A few things follow from taking the self-preferencing research seriously.
First, the resume is no longer the screening artifact. It is the candidate's opening claim, which now needs to be tested against independent evidence before any decision gets made. HireRight's 2025 Global Benchmark Report, surveying more than 1,000 HR and talent acquisition leaders, found that 75% of organizations uncovered candidate discrepancies during screening in the past 12 months, with 13% finding at least one discrepancy in every five candidates screened. HireRight's own analysis cited the rise of generative AI tools as a key driver, noting that they make it "easier for candidates to misrepresent themselves." The resume is best understood as a marketing document, not an information source.
Second, vendor selection has to start with a different question. Not "how good is the AI at scoring resumes." The right question is "what is the screening signal, and can the candidate fake it with an LLM." If the answer is text-on-text, the architecture has the self-preferencing problem regardless of how good the model is. And it has a trust problem regardless of how good the marketing is: 8% of job seekers believe AI screening is fair, and 38% walk away from AI interviews entirely. Mid-market companies competing for talent cannot afford to lose more than a third of their pipeline to a process design decision.
Third, the compliance posture has flipped. The riskiest hiring system you can run in 2026 is one that uses an LLM to score resumes and auto-rejects below a threshold. The most defensible system is one with no scoring, no auto-rejection, requirement-level source attribution, and a documented human decision step. This is the difference between being on the cover of the next Mobley-style class action and being the system your employment counsel points to as the model.
The bottom line
The screening signal you choose determines whether the next two years of AI hiring research and regulatory action work for you or against you.
If your screening signal is candidate-written text, scored by an LLM, the self-preferencing research cited in this post describes your system. The bias does not fade with each model generation; it shifts shape. The regulatory exposure gets worse with each new state law and each new collective certification. The candidate trust gap is already 92 percentage points wide and not closing.
If your screening signal is a real-time voice conversation, cross-verified against references and background checks, anchored to job requirements with source attribution and a documented human decision step, you are operating outside the specific defect the research documents. You are also operating inside the posture every serious regulatory framework is now defining as defensible.
Virvell was built for the second architecture. Our AI Acceptable Use Policy is published at virvell.ai/ai-acceptable-use. The full Vena Medical case study sits at virvell.ai/case-studies/vena-medical.
If you run talent acquisition at a company where applicant volume per role has gotten unmanageable, where your team is leaning harder on AI to screen, and where you have started to wonder whether the AI screening stack you are running is part of the doom loop or the way out of it, this is a conversation worth having now, not after it has become a deposition. Book a 30-minute walkthrough at calendly.com/julien-virvell-ai/meeting-with-julien. We will show you exactly how Evidence Intelligence handles the failure mode this post describes, using your own job requirements as the test case.
We would rather not be having this conversation with you for the first time in a deposition.