This specific failure might be a kind of averaging problem, where common answers around the general theme are preferred over more specific (and correct) ones. LLMs can also fail completely at trivial concepts such as negation, or distinguishing between "Y above X" and "Y below X".
I've seen it with statistics as well, when asking it to implement things in code. You'll get code that runs fine but is mathematically wrong.
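A concrete (hypothetical) example of the kind of bug I mean, not taken from any particular model output: code that runs without error but silently divides by n where the sample variance needs n - 1, biasing the estimate downward on small samples.

```python
def sample_variance_wrong(xs):
    # Runs fine, but divides by n: this is the *population* variance
    # formula, which underestimates the variance of a sample.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def sample_variance_right(xs):
    # Bessel's correction: divide by n - 1 for an unbiased estimator.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_variance_wrong(xs))  # 4.0
print(sample_variance_right(xs))  # ~4.571
```

Both functions "work" in the sense that they return a number; only a reader who knows the statistics will spot that one of them is wrong.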
It might help, but you have to be the backstop when it comes to the final call. Measuring the false positive/false negative rate could be tedious, but having a good estimate of it is important if you want to use the tool wisely.
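The tedious part can at least be made mechanical: keep a hand-labelled sample of the model's yes/no answers and compute the rates from it. A minimal sketch (the function and variable names here are my own, not from any library):

```python
def error_rates(predictions, truths):
    """Estimate false positive and false negative rates from a
    hand-labelled sample of a model's boolean answers."""
    fp = sum(1 for p, t in zip(predictions, truths) if p and not t)
    fn = sum(1 for p, t in zip(predictions, truths) if not p and t)
    negatives = sum(1 for t in truths if not t)
    positives = sum(1 for t in truths if t)
    fpr = fp / negatives if negatives else float("nan")
    fnr = fn / positives if positives else float("nan")
    return fpr, fnr

# Example: model said yes/no on 5 queries; you checked them by hand.
preds = [True, True, False, False, True]
truth = [True, False, False, True, True]
fpr, fnr = error_rates(preds, truth)
print(fpr, fnr)  # 0.5 and ~0.333
```

Even a small labelled sample gives you a rough sense of which direction the tool fails in, which is what you need before trusting it on the next batch.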
Is typing your requirements that much easier than going through traditional search filters at Digikey?
For this kind of task, you probably want a model that has specifically been trained on every product datasheet ever, and not ten million reddit threads and forum posts about how a 555 or 328p can solve any problem.
I doubt that ChatGPT has been fed every datasheet for every part made in the last decade or two. Even if it had, that's likely far outweighed by the amount of noise coming from people talking about the most common parts.
But fundamentally I'm not sure that LLMs are great for this type of work. No two datasheets are the same, and I've never seen one that wasn't missing some kind of information. What you very much do not want is an LLM hallucinating a value that does not actually exist in the datasheet, or conflating two parts and mixing up their values. These models just don't seem to be up to the task of returning real information from abstract queries. They're built to generate probabilistic text sequences.
Talk to it about something you don't know about, and you'll think it's really good technology ;)