Give LLMs Another Shot

Imagine you have a bright new intern at work assigned to help you. You ask her to create a presentation making the case for an investment in a company you’ve been looking at. You provide no background materials, no context about the firm’s investment criteria, no guidance on format or audience, and no other supporting information. Would you be surprised if she came back with something generic, unconvincing, and unusable?

Now give the same task to an LLM, with no supporting information or guidance. It will produce something equally off the mark, and you will likely dismiss it, close the tab, and conclude that LLMs aren’t useful.

In both cases, the failure lies not with the intern or the LLM but with the request.

The Ways People Fail With AI

There are two ways I see smart people fail with these tools, and both are completely understandable.

The first: they ask an LLM something it genuinely can’t know. A narrow question in a domain where they’re already the expert. Something that requires proprietary data, firm-specific context, or specialized knowledge the model has no access to. The answer comes back confident and wrong. They flag the errors, conclude the tool is unreliable, and stop there. That’s a reasonable response to what happened — it just isn’t the full picture.

The second failure mode is subtler. They give the model a vague, open-ended task with no supporting materials, no format guidance, and no context about what “good” looks like. They ask it to produce something without telling it what they already know, what constraints apply, or what they’ve already tried. The output is generic because the input was generic. That feels like a tool problem. It’s actually an interaction problem.

Both lead to the same conclusion: LLMs aren’t worth the trouble. And in both cases, that conclusion isn’t wrong so much as it’s incomplete. LLMs are tools, and like any tool, they work well only when used on the right problems in the right way.

A Harvard/BCG study of 758 consultants illustrates why. On tasks inside AI’s capability zone, consultants given AI tools finished about 25% faster and produced work rated roughly 40% higher in quality than those working without them. On a task outside that zone, consultants using AI were 19 percentage points less likely to reach the correct answer. Same tool. Same pool of consultants. The difference was whether the task was a good match for what the tool can actually do.

That gap between a clear boost and a clear drag is explained largely by task selection, not skill or effort. Which means before you can use these tools well, you need a clearer picture of what you’re actually dealing with.

What You’re Actually Dealing With

You’ve probably heard the “brilliant intern” analogy — AI as a capable but inexperienced assistant who needs direction. It’s not wrong, but it undersells a few things that matter.

A more accurate version: “an extremely well-read colleague who has processed more text than any human alive, produces fluent and confident-sounding output in any situation, has no memory of your firm or your clients, no street smarts, and no ability to recognize when they’re out of their depth.” It doesn’t roll off the tongue like “brilliant intern,” but it gets closer to the heart of the problem.

That last part is the one that bites people. A junior employee, when asked something they don’t know, usually signals uncertainty — they slow down, hedge, ask a clarifying question. An LLM doesn’t do that. It keeps going. The fluency is the same whether the answer is correct or completely fabricated. In 2023, a lawyer submitted a legal brief that cited several court cases ChatGPT had confidently provided. The judge discovered that none of the cases existed. The model hadn’t said “I’m not sure” — it had produced authoritative-sounding citations for cases it had invented.

This doesn’t mean the tool isn’t useful. It means the tool is not a replacement for your judgment. It’s a force multiplier on your judgment. If you can evaluate what it gives you, you can use it well. If you can’t — if you’re asking it to be the expert you aren’t, in a domain where you can’t spot a wrong answer — you’re not getting a second opinion. You’re just getting a confident one.

The boundary between where these tools excel and where they fail isn’t intuitive. It doesn’t map neatly to “simple tasks” versus “complex tasks.” It maps to what’s well-represented in the model’s training data versus what isn’t: your firm’s proprietary data, real-time information, specialized domain knowledge, anything that requires actual judgment about the world as it exists right now.

Researchers call this the “jagged frontier.” The capability edge is uneven in ways you wouldn’t predict. An LLM might draft a strong investment memo framework and then confidently fabricate a company’s revenue figure in the same session. The tool isn’t consistently brilliant or consistently unreliable — it’s context-dependent in ways that aren’t always visible until you know what to look for.

Once you internalize that shape, the practical guidance becomes obvious.

Getting Better Results

The people I’ve seen get consistent value from these tools aren’t doing anything exotic. They’ve figured out three things, usually through trial and error. You don’t have to wait for the trial and error.

Choose the right task. LLMs are genuinely good at synthesis, drafting, summarizing, explaining, structuring, brainstorming, and stress-testing arguments. They’re not good at knowing things they have no way of knowing — your firm’s history, a client’s preferences, proprietary data, recent events past their training cutoff. The investment presentation request failed partly because it was asking the model to do analysis it had no raw material for. Ask it to help structure the argument for an investment thesis you’ve already formed, and you’ll get something much more useful.

Give it real context. The quality of what you get back is almost entirely determined by the quality of what you put in. Think about what you’d hand to a smart new hire on their first day — the relevant background, the materials they’d need, the format that works for your audience, the constraints that matter. Give the AI the same package.

Iterate, and verify what matters. This is not a search engine where you type a query and get an answer. It’s a conversation. A mediocre first response is a starting point, not a verdict. Push back. Add specifics. Tell it what you liked and what you didn’t. Ask it to try again differently.

And for anything consequential — figures, facts, specific claims you’re going to act on — verify independently. The model’s confidence is not a signal of accuracy. The fluency that makes these tools useful is the same quality that makes their errors hard to spot. Build the habit of checking factual claims the same way you’d check a junior analyst’s work before it goes to a client.

A recent example brings all three of these points together. My wife was using Claude to produce a summary document from an offsite meeting. She’d photographed eight whiteboards’ worth of notes, votes, and brainstorming. The first attempt, along the lines of “extract the text from these photos and produce a summary,” came back unfocused. On the second try, she asked specifically for a five-slide PowerPoint with key themes and a raw-notes appendix. This time the result was better, but generically formatted. On the third try, she attached the company’s existing PowerPoint template. That version needed only minor tweaks before it was ready to share. Three iterations. Each one added more guidance and context to address a shortcoming. Each one materially improved the output.

That’s not a workaround for a flawed tool. That’s how the tool works. If you were doing this with an intern, you might get a little annoyed at having to ask them to try it again repeatedly, just as they might get frustrated, tired, and bored with each iteration. But LLMs don’t get tired. They don’t complain about having to repeat a task. And they can iterate quickly. Easy, cheap iteration is a feature of the LLM, not a bug.

What to Try

None of this requires a firm-wide initiative or a new software budget. It requires about an hour and a willingness to try something that might not work the first time.

Here are three starting points that work well and don’t require putting sensitive data into a system you’re not sure about yet.

Summarize something long. Drop a long document you’re comfortable sharing, such as a public company update or a published research report, into ChatGPT or Claude and ask for a plain-English summary of the key points and anything worth flagging. Be specific about the audience and what you care about. This is a task well inside the capability zone, since synthesizing and distilling text is something these models do well, and the output is easy to evaluate because you already know the source material.

Stress-test a decision. If you’re working through a problem and want a sanity check, give the model the context and ask it to argue the other side. Not “what are the risks of X” — that produces a generic list. Give it your actual reasoning and ask it to find the weaknesses. A good LLM will push back in ways that are genuinely useful, and even when it’s wrong, it often surfaces assumptions worth examining.

Draft something you’ve been putting off. An update for your manager, a policy memo, a response to a difficult email. Give it the context, the audience, the tone you want, and a rough sense of the key points. Then edit the draft rather than starting from scratch. You’ll spend less time on the part you find tedious and more time on the part that actually requires your judgment.

One more thing worth knowing: Ethan Mollick, a Wharton professor who has studied these tools seriously, makes a specific claim — that if you spend ten hours actually using AI for real work (not experimenting, not playing around, but working), you’ll develop a much clearer picture of where it earns its place. Ten hours is the threshold where the tool stops feeling unpredictable and starts feeling like something you understand. That’s the actual commitment required.

The Tab You Closed

Go back to the person who closed the tab after the generic investment presentation. That tool is roughly the same one that’s helping other firms draft memos, summarize research, and work through decisions faster than they could on their own. The difference isn’t the tool. It’s what they brought to it, and what they asked it to do.

The people getting real value from these tools aren’t more technical. Most of them aren’t even particularly systematic about it. They just put in enough time to figure out where the tool fits and where it doesn’t — and they stopped expecting it to work like something it isn’t.

Your first bad experience wasn’t a verdict. It was a data point from one interaction, using an approach that didn’t match what the tool is actually good at.

Pick one of the three tasks above and try it this week. Not to be impressed — these tools will disappoint you if that’s the bar. Try it to see what happens when you give it something it can work with, the context it needs, and a clear sense of what you’re looking for. That’s a different experience than the one that caused you to close the tab. It might be enough to change your mind.