An experiment quantifying the importance of talking things through
Hiring is hard for everyone involved, and unfortunately, the way things are going, AI seems to only be making it harder. This report looks at how AI is influencing the current state of recruitment technology and then presents a fundamentally different approach to this technology – and to solving group planning problems in general – that we’ve developed at TenFive AI. One goal of this report is to frame situations where groups of humans need to collectively rank a set of options in terms of a quantifiable metric, and then to use this metric to measure the difference in outcomes across different types of AI frameworks. We’ll demonstrate the advantage gained with the TenFive approach, which deploys LLMs as communicative – not reasoning! – agents tasked with using natural language conversation to explore solutions to a shared objective. Our approach exhibits significantly greater consistency in outcomes across variations of a candidate selection task, a highly desirable feature in a situation where eliminating arbitrariness in decision making is crucial.
AI has been presented by its proponents as an informational panacea, a solution to the deluge of data that piles up in our day-to-day lives. The idea is that the tremendous capacity that generative models have for processing data means that they can absorb and curate the flood of content that besets us anytime we expose ourselves to the digital world. AI can act as our companion, our filter, our buffer against what amounts to always too much information coming from everywhere all the time. It’s a nice idea, but the reality that is emerging around this technology is that it is a source of informational overload, not a solution to it: AI seems to simply supercharge the flow of content and blather, meaning that there is more of everything, and at the same time everything is more similar.
One area where this double shock of narrowing scope and increasing volume has become particularly evident is recruitment technology, where a kind of arms race is developing between employers equipped with automated applicant tracking and selection systems on the one hand and, on the other, job seekers supported by AI tools for automatically generating their applications. The net result is the potential for a mode of algorithmic negotiation in which at least a significant part of the process of filling a role and getting a job reduces to, effectively, computer-against-computer gameplay. In this scenario, any kind of qualitative evaluation by and of the humans involved on either side of the process falls away.
Part of the problem with automating the recruitment process – but really only one part of a complex problem – is that the human version of the process is already prone to certain shortcomings. Ideally, humans faced with a large set of ostensibly qualified job candidates that need to be differentiated will approach their task with thoughtfulness, giving each candidate a fair review using the same criteria for all the candidates. Practically speaking, though, there is a reasonable chance that the relative and arbitrary position of a job application at the top, middle, or bottom of a stack of candidates being considered could result in different outcomes for the applicant, for any number of reasons.
Mimicry of imperfect human behaviour may be one part of the problem with AI solutions to informational overload, but it is not an excuse for developing processes that push humans further away from decisions about their own lives. At TenFive AI we’ve been working on an application of AI that pivots away from the claim that models trained on large amounts of digital text somehow learn to “reason” about their outputs, and towards one that treats models as reasonable simulations of the way that humans use linguistic communication to explore problems. In what follows, we’ll look briefly at how and why attempts to use AI to reason through communication problems like job applicant processing tend to lead to bad outcomes, and why the TenFive approach can resolve these issues. We’ll back this up with quantitative experimental results showing that TenFive AI committees are well suited to providing hiring recommendations that are consistently representative of underlying human stakeholder intentions.
What could go wrong?
There are two big problems with the way that AI mimics human assessment of a job application. They might seem almost contradictory, but in fact they both stem from the same underlying cause. The first problem is that AI tends to make very generic choices. This is a result of the way that AI always tends towards the average of the things it has encountered in its training data: its output in an ostensibly new situation is an extrapolation of whatever counts as “typical” in its previously observed data. This can lead to results that are frustratingly predictable in the best of circumstances, but, much more problematically, it can and very often does lead to the reinforcement of underlying biases that lurk in training data. We’ve written elsewhere about how this remains a pernicious problem with AI involved in making human-oriented judgements, in particular around recruitment.
The second issue with AI assessing job applications is that outputs can also be arbitrary. This might seem like the opposite of an outcome being generic, but in fact it’s just a different manifestation of the same basic problem: AI performing surface-level mimicry of a reasoning process rather than any deeper thought that aligns with the values humans would like to bring to decisions around things like employment. Because the AI is simply outputting language that corresponds to a judgement that sounds like what a pretty reasonable person would say, we can get a fairly random assortment of results within the range of what counts as “pretty reasonable”. The net outcome of these two issues with data-driven modelling is that generative AI consistently hands over facile decisions that we could describe as generically arbitrary. Given a stack of job applicants, it will always pick an applicant who has evidently gone through the motions of looking superficially good; it just won’t pick the same one of these applicants every time, even when prompted with almost identical selection criteria.
At TenFive AI, we’ve developed a technology that simultaneously addresses both the genericness and the arbitrariness of outputs that emerge from the black boxes of data-driven AI models. Our tech doesn’t throw away these models, but it does use them for the thing that they’re actually conditioned to do, which is to simulate human communication, not human reasoning. To understand the difference between a communication problem and a reasoning problem, it’s helpful to think of the ways in which humans solve problems that involve collaboratively exploring options in order to understand the requirements of multiple stakeholders: we use language to do this. In fact, a great deal of the data that LLMs are trained on reflects precisely this: humans getting together and using words to push a problem out into the open, find out what everyone thinks, and hold this up against what’s possible.
In order to take advantage of this feature of generative AI, our platform builds up committees of AI agents in which different stakeholders involved in a communication problem get together to talk through a joint understanding of the problem and then work towards a solution. By turning the process of problem-and-solution discovery into a conversation, our tech pulls the solution space for the problem out of the impenetrable tangle of model parameters and puts it in the open in the form of a transparent and interpretable dialogue. Rather than having a model try to concatenate its own imitation of reasoning into a regressively compounding context, our committees emulate dynamic contributions to a discussion, where different agents can react to one another without losing their grounding in the basic ideas they’ve been assigned to represent. Furthermore, because a conversation is a process of discovery, the ideas that ground each agent manifest themselves in combinations of expressions that can’t be fully anticipated until the committee discussion is actually under way.
For the candidate selection phase of a recruitment process, an AI agent committee can be given access to batches of applications. Different agents are instantiated based on the selection criteria of the various human stakeholders involved in the candidate selection process, and these agents are given the chance to express opinions on behalf of the humans they represent. By exploring the different ways the underlying requirements of stakeholders can be expressed as sentences, in the context of a conversation also made up of other sentences associated with the requirements of other stakeholders, the committee is able to collectively discover solutions to the candidate selection problem that were obscure when considered only from the perspective of each individual stakeholder, or from a generic amalgamation of all the perspectives.
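To make this more tangible, here is a minimal sketch of what one round of such a committee negotiation might look like in code. Everything in it – the `Agent` structure, the prompt wording, the `call_llm` placeholder, and the stopping rule – is a simplified illustration of the general idea rather than our production implementation.

```python
# Hypothetical sketch of a committee negotiation loop; `call_llm` stands in
# for whatever chat-completion client is being used.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str                      # e.g. "finance lead"
    brief: str                     # the human stakeholder's requirements, in plain language
    vote: Optional[str] = None     # ID of the candidate this agent currently backs


def committee_round(agents, candidates_text, transcript, call_llm):
    """One pass of the negotiation: each agent reads the conversation so far,
    responds in natural language, and (re)casts its vote."""
    for agent in agents:
        prompt = (
            f"You represent a colleague with these requirements:\n{agent.brief}\n\n"
            f"Candidates under discussion:\n{candidates_text}\n\n"
            "Conversation so far:\n" + "\n".join(transcript) + "\n\n"
            "Respond to the other committee members, then state the single "
            "candidate ID you currently support, prefixed with 'VOTE:'."
        )
        reply = call_llm(prompt)   # placeholder for an actual model call
        transcript.append(f"{agent.name}: {reply}")
        if "VOTE:" in reply:
            tail = reply.rsplit("VOTE:", 1)[1].strip()
            if tail:
                agent.vote = tail.split()[0]
    return transcript


def run_committee(agents, candidates_text, call_llm, max_rounds=3):
    """Run negotiation rounds until the agents converge on one candidate,
    or a fixed round limit is reached; return the dialogue and final votes."""
    transcript = []
    for _ in range(max_rounds):
        committee_round(agents, candidates_text, transcript, call_llm)
        if len({a.vote for a in agents}) == 1:   # everyone backs the same candidate
            break
    return transcript, [a.vote for a in agents]
```

The important point is that each agent’s vote stays grounded in its own brief but gets revised in response to what the other agents actually say, and the full transcript of the exchange is kept as a readable record of how the outcome was reached.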
An experiment
Our claim is that TenFive AI committees can overcome the generic arbitrariness that afflicts generative AI models tasked with performing “reasoning” tasks, by recasting these tasks as communication problems. In order to test this claim in the context of hiring decisions, we ran an experiment:
- We took a sample of 77 English-language job applicant profiles with titles including the word “accountant” from the Djinni Recruitment Dataset, a compilation of such profiles released for research and development purposes.
- We sketched out statements from imaginary colleagues in 4 hypothetical roles at a company seeking to hire an accountant, outlining each of the colleagues’ individual requirements and expectations for the hire.
- We divided the data into 8 different batches of candidates, with 9 or 10 candidates in each batch. We performed this slicing of the data 10 times, reshuffling the data each time, resulting in a total of 80 batches of 9 or 10 candidates, with each candidate represented a total of 10 times in the data, in 10 different randomly assorted batches (see the sketch after this list).
- We ran a baseline version of the experiment in which we fed each of the batches of 9 or 10 candidates to an LLM as part of a prompt, along with all of the requirements briefs for each of the 4 hypothetical colleagues and instructions to select the top candidate from each batch. Because each candidate is observed 10 times, the maximum number of votes a candidate can receive is 10. We use the tally of votes received by each candidate to rank all of the candidates comparatively. For this version of the experiment we used Google’s Gemini-Pro-1.5 model, described by Google as handling “complex reasoning tasks requiring more intelligence”. In the table below, we label this as the “all at once” set-up.
- We staged an additional baseline version of the experiment in which we ran an LLM four separate times, once for each hypothetical colleague’s requirements brief, with instructions to pick a single top candidate from each of the batches of 9 or 10 candidates. In this setting the maximum number of votes a candidate could receive is 40: four votes per appearance, one from each colleague, multiplied by the 10 different times each candidate appears for consideration. Here again we used the Gemini-Pro-1.5 model. We call this the “separate models” set-up in the table below.
- We ran a version of the experiment using a TenFive AI committee, in which each of the four colleagues involved in the hiring decision is represented by an LLM instructed to negotiate over each of the batches of 9 or 10 candidates based on the requirements specified by that colleague. In each negotiation, each LLM is tasked with casting a vote for a single candidate on behalf of the colleague being represented, but, crucially, this vote gets updated over the course of the negotiation, based on the conversational interaction with the models representing the other colleagues. We did this experiment using the Gemini-Flash-1.5 model, a smaller model described by Google as offering “fast and versatile performance across a diverse variety of tasks” – at less than a tenth of the cost of the Pro model.
- We ran each of the above versions of the experiment twice, randomly shuffling the order of candidates within each batch for each of the runs. This means that for each run of the experiment, the sequence of information presented to an LLM for any given batch of candidates is different, but the underlying information, which is to say the set of candidates, is the same.
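For concreteness, the snippet below sketches how the repeated slicing and within-batch shuffling described in the list above might be generated. The pool size and batch count follow the description in the bullets; the variable names and the representation of a profile are illustrative assumptions rather than details of our actual pipeline.

```python
import random


def make_batches(profiles, n_batches=8, n_reslices=10, seed=0):
    """Slice the candidate pool into batches of 9 or 10 profiles, n_reslices times,
    reshuffling the pool before each slicing so every candidate appears once per slicing."""
    rng = random.Random(seed)
    all_batches = []
    for _ in range(n_reslices):
        pool = profiles[:]
        rng.shuffle(pool)
        # Split the shuffled pool into n_batches near-equal batches
        # (9 or 10 candidates each for a pool of 77 profiles).
        for i in range(n_batches):
            all_batches.append(pool[i::n_batches])
    return all_batches   # 80 batches in total; each candidate appears in exactly 10 of them


def shuffle_within_batches(batches, seed=1):
    """Produce a second presentation of the same batches with candidate order permuted,
    so the content of each batch is identical but the order in the prompt differs."""
    rng = random.Random(seed)
    return [rng.sample(batch, len(batch)) for batch in batches]
```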
With the last point in that list in mind, here is our research question for the overall experiment: what is the impact, for each of the modelling set-ups, of arbitrary changes in the order in which information is presented in a prompt – but not in the underlying content of each presentation of data? Given that models are receiving the same range of candidates in each batch they consider, just in different orders, we would want there to be a strong correlation between the overall rankings assigned to each candidate based on all the votes they receive across all the batches. (This would not necessarily hold if we randomly swapped candidates between batches on different runs of the experiment, since strong candidates might then end up competing against one another, and splitting the vote, in the same batch on one run but not on another.)
The statistic we use to assess correlation in ranking between two different runs of the experiment is Spearman’s rank correlation, a coefficient reflecting the degree to which two rankings of the same underlying entities exhibit the same order as one another. A Spearman’s score of 1 indicates exact agreement in ranking between two lists, while a score of 0 indicates the lists are in no more agreement than we would expect from random ordering. The closer the score is to 1, the more similar the orders of the items in the two lists are. (Spearman’s coefficient can also take a negative value, with a score of -1 indicating lists in exactly opposite order.) Here are the results for our experiment:
| set-up | Spearman’s | top 3 | top 5 |
| --- | --- | --- | --- |
| all at once | 0.537 | 1 out of 3 | 3 out of 5 |
| separate models | 0.673 | 1 out of 3 | 3 out of 5 |
| TenFive committee | 0.867 | 2 out of 3 | 4 out of 5 |
It is difficult to interpret values of Spearman’s coefficient in absolute terms: there is no obvious threshold for what counts as a “good” correlation, since the degree to which two versions of an ordered list are expected to correlate is highly dependent on the nature of the things being listed. As a comparative measure, though, the difference between the TenFive committee and the better of the two baselines – 0.867 versus 0.673 – indicates significantly stronger consistency in ranking candidates with the TenFive approach. The results for the all at once set-up in particular suggest a concerning level of apparent randomness in outcomes for data that, from a human perspective, should be considered very similar.
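For readers who want to reproduce this kind of check on their own data, the coefficient itself is straightforward to compute with a standard statistics library. The sketch below assumes the votes from two runs have already been tallied into dictionaries keyed by candidate ID; the function name and data layout are illustrative.

```python
from scipy.stats import spearmanr


def ranking_consistency(votes_run1, votes_run2):
    """Spearman's rank correlation between two runs' vote tallies.
    votes_run1 / votes_run2: dicts mapping candidate ID -> total votes received."""
    candidates = sorted(votes_run1)                # same candidate set in both runs
    scores1 = [votes_run1[c] for c in candidates]
    scores2 = [votes_run2[c] for c in candidates]
    rho, p_value = spearmanr(scores1, scores2)     # tied scores get average ranks
    return rho
```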
To get a more grounded sense of what this means, let’s think about a candidate selection process from a human perspective. The objective of filtering a stack of job applications is to come up with a small set of top-notch candidates to shortlist for more in-depth consideration. If the process is being done well, we would expect the same candidates to at least mainly come up regardless of variations in the order in which applications are presented. This is important to ensure that the employer criteria being used to make the selection are adequately represented, and the candidates themselves also have a reasonable expectation that they will be consistently judged based on the same criteria, not randomly picked out of a hat.
To get a more concrete sense of the consistency with which each of our experimental set-ups would make the same shortlist recommendations, we consider how many of the top 3 and top 5 candidates are shared between the two lists generated for each shuffle of the data, for each of our experimental settings. As can be seen in the table, here too the TenFive committee approach is somewhat more consistent – though also not perfect. For the other two, non-communicative approaches to automating candidate selection, the fact that only 1 candidate makes the top-3 shortlist in both runs is especially concerning.
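The top-3 and top-5 figures in the table amount to a simple set intersection over the two runs’ shortlists. A minimal sketch, assuming the same per-candidate vote tallies as above:

```python
def topk_overlap(votes_run1, votes_run2, k=3):
    """Count how many candidates appear in the top-k shortlist of both runs.
    votes_run1 / votes_run2: dicts mapping candidate ID -> total votes received."""
    top1 = set(sorted(votes_run1, key=votes_run1.get, reverse=True)[:k])
    top2 = set(sorted(votes_run2, key=votes_run2.get, reverse=True)[:k])
    # Note: ties in vote counts make the cut-off ambiguous and would need a tie-breaking rule.
    return len(top1 & top2)   # e.g. 2 for the TenFive committee at k=3
```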
Summing up
To sum up, hiring technology is at the front of a wave of AI-produced and AI-mitigated information overload, in which AI-based systems will be processing AI-generated output. If we allow these processes to take on the algorithmically structured characteristics of the systems which are increasingly a part of them, there is a risk that the essential human aspect of things like employment decisions will recede and eventually vanish. A feature of this era of automation is a situation in which outcomes that arise from data-driven models and the chains of processes in which they participate are increasingly generic and at the same time more and more arbitrary. In the case of the recruitment process in particular, this takes the form of more banal applications being met with more random selection processes.
None of this is new – it has all been said elsewhere – but it is happening anyway. At TenFive AI, we have developed a technology which uses AI, but in a way that focuses on the thing that AI built on large-scale natural language data has actually been conditioned to do: communicate. We show that by framing a candidate selection task as a communication problem, rather than as a reasoning problem, we can achieve much more consistent outcomes in the face of random alterations that change the order but not the underlying content of target data.
A positive feature of our approach that we haven’t spent time on here is that the conversational nature of the engagement within TenFive AI committees means that, in addition to finding various pathways towards the same outcomes, the steps taken to those outcomes are transparent and auditable. The ways in which the agents participating in one of our committees express the requirements of the humans they represent are laid out in plain, readable language. This means we not only achieve results that are defensibly consistent; we can also defend those results by pointing to an explanation supported by sentences that track directly back to the requirements of the humans involved in the decision-making process. We offer this up as an antidote to the evident randomness, boringness, and bias currently coming from some approaches to AI.