Why is AI hallucinating?

Recently I started trying to wrap my head around this question. As it turns out, my background in face detection and hand gesture detection might have given me a clue as to why current LLMs have a tendency to hallucinate and project false confidence.

Warning: This might get more mathematical than you’re used to or prepared for. But don’t be afraid. I’ll try my best to make this understandable.

Back in university we worked on improving the capabilities of a free-wheeling robot with a camera, an 800 MHz laptop, a 180 degree laser scanner and two microphones mounted to it. The subgroup I worked with used the Viola-Jones AdaBoost learning algorithm to improve the detection of faces, and we were pretty happy to achieve 30 frames per second with our face detector on that 800 MHz laptop. (Merely ten years later, I realized that my telephone could do the same.) Later on, I joined the VAMPIRE project to extend our face detection algorithms to general objects in an office setting, before applying the same approach to the topic of hand gesture detection in my diploma thesis.

The learning algorithm relied on training material: hundreds of faces as greyscale pictures of 20×20 pixels each, and lots of non-faces. We split the overall training set into two different sets: one for training the detector, and another for evaluating the results before moving on to the next stage of the detector. (The speed of that detector was based on rejecting 60% of non-faces in the first stage, while later stages dealt with the remainder.) The central numbers for the learning algorithm, and for how well it could identify faces in image data, were the false positive and false negative rates.
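To give a feel for how such a cascade works, here is a minimal sketch in Python. The scores, thresholds and stages below are entirely made up for illustration; the real Viola-Jones stages are trained AdaBoost classifiers over image features, not simple thresholds.

```python
# Minimal sketch of a detection cascade, not the real Viola-Jones code:
# each stage either rejects a candidate window right away or hands it on
# to the next, stricter (and more expensive) stage. The "windows" are just
# numbers and the stages simple thresholds, purely for illustration.

def cascade_detect(window_score, stages):
    """Return True only if every stage accepts the window."""
    for stage in stages:
        if not stage(window_score):
            return False  # early rejection is what makes the cascade fast
    return True

# Hypothetical stages: the first, cheap one already rejects most non-faces,
# the later ones are stricter.
stages = [
    lambda score: score > 0.2,
    lambda score: score > 0.5,
    lambda score: score > 0.8,
]

print(cascade_detect(0.9, stages))   # True  -> reported as a face
print(cascade_detect(0.3, stages))   # False -> rejected by the second stage
```

The point is only the control flow: most candidate windows never reach the expensive later stages.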

What’s that? Basically, I think most people who lived through the Corona virus pandemic will recognize the terms. Once you have trained a detector, you use the evaluation data set to check how good your results are. The evaluation data set is labelled, so for the face detectors we had a subset of face pictures and a subset of non-face pictures. We fed the whole evaluation set to the trained detector stage and checked how many of the faces the detector actually detected as faces. Each of those is a true positive (TP). If a face was rejected (i.e. not detected), it yielded a false negative (FN). If the detector detected a face in one of the non-face pictures, that yielded a false positive (FP), whereas if it rejected the non-face, we had a true negative (TN).
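In code, counting the four cases over a labelled evaluation set boils down to something like this sketch (the labels and predictions are made up, not data from our project):

```python
# Count TP, FN, FP and TN over a labelled evaluation set.
# `labels` is what each picture really shows, `predictions` is what the
# detector said; both lists are invented for illustration.

labels      = ["face", "face", "face", "non-face", "non-face", "non-face"]
predictions = ["face", "face", "non-face", "face", "non-face", "non-face"]

tp = fn = fp = tn = 0
for label, predicted in zip(labels, predictions):
    if label == "face" and predicted == "face":
        tp += 1   # true positive: a face, detected as a face
    elif label == "face":
        fn += 1   # false negative: a face, but rejected
    elif predicted == "face":
        fp += 1   # false positive: no face there, but "detected" anyway
    else:
        tn += 1   # true negative: no face, correctly rejected

print(tp, fn, fp, tn)   # 2 1 1 2
```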

After evaluating the whole evaluation set, we could then calculate the false negative rate and the false positive rate. The number of rejected faces (FN) divided by the overall number of faces yielded the false negative rate. The number of detected faces in non-face pictures (FP) divided by the number of non-face pictures yielded the false positive rate. Overall, the face detectors were pretty much useless (i.e. they detected too many faces where a human could clearly see there was none) if the false positive rate was above 5% or so. When we managed to train a detector with a false positive rate below 0.5%, results started to get interesting.
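Continuing the made-up counts from the sketch above, the two rates are then simply the two error counts divided by the size of their respective sub-set:

```python
# False negative rate: rejected faces divided by all faces.
# False positive rate: detections in non-face pictures divided by all
# non-face pictures. Counts continue the toy example above.

tp, fn, fp, tn = 2, 1, 1, 2

false_negative_rate = fn / (tp + fn)   # 1 / 3
false_positive_rate = fp / (fp + tn)   # 1 / 3

print(f"FNR: {false_negative_rate:.1%}")   # FNR: 33.3%
print(f"FPR: {false_positive_rate:.1%}")   # FPR: 33.3%
```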

Interestingly, the two rates have to be looked at together. You can achieve a false positive rate of 0.000% by detecting no faces at all. But then the false negative rate will be 100%. So it makes sense to balance the training towards keeping the false positive rate and the false negative rate low at the same time.
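Here is a small sketch of that balancing act, again with made-up scores and labels: instead of tuning for one rate alone, you pick the operating point (a simple threshold here) that keeps the combined error low.

```python
# Pick a detection threshold that keeps the combined error low instead of
# optimizing one rate in isolation. Scores and labels are invented; a real
# detector would produce such scores itself.

scores  = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
is_face = [False, False, False, True, False, True, True, True]

def error_rates(threshold):
    tp = sum(s >= threshold and f for s, f in zip(scores, is_face))
    fn = sum(s < threshold and f for s, f in zip(scores, is_face))
    fp = sum(s >= threshold and not f for s, f in zip(scores, is_face))
    tn = sum(s < threshold and not f for s, f in zip(scores, is_face))
    return fn / (tp + fn), fp / (fp + tn)

best = min(scores, key=lambda t: sum(error_rates(t)))
print(best, error_rates(best))   # 0.4 (0.0, 0.25): FNR 0%, FPR 25%
```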

Back to AI. Or more precisely: LLMs at the current point in time. Based on my work back in university, I think that the currently available models still have a pretty high false positive rate. I’m not sure about the false negative rate, though. For LLMs to be perceived as a better version of Eliza by the general public, they must not reject unsure answers too soon. That’s why AI companies let them hallucinate. Based on that experience, I would claim that today’s LLM models probably still have a 30-40% false positive rate. That’s what a critical human being perceives as hallucination. Compared to the image detection case I worked on at university, that’s still pretty high, though in a language setting it might be more acceptable than in the image detection use case.

After realizing this, I tried to back up my hunch with data. Surely there is someone out there independently evaluating the performance of LLM models. Just then, a news item popped up in my feed reader (yeah, I still use these). As it turns out, there are a few such evaluations mentioned there:

I was surprised by the numbers listed for the evaluated models. They look worse than my hunch suggested. That’s bad.

Why? Tomorrow’s models will be trained on the AI slop that marketing agencies are posting on the internets right now. If the new training data consists mostly of hallucinated texts, tomorrow’s results cannot get any better. In other words, with today’s AI hype usage we are creating our own problems of tomorrow.

In a nutshell, that is why I think AI is a hype bubble that’s going to burst. I am not sure what will happen then…
