Choosing a Model
A practical framework for picking the cheapest model that does your agent's job reliably — plus our current tier-by-tier recommendations.
Pickaxe is model-agnostic. You can pick from 40+ models across OpenAI, Google, Anthropic, xAI (Grok), Mistral, and Perplexity, and switch any time without rebuilding your agent. That freedom is great, but it raises an obvious question: which one should you actually use?
The short answer: the cheapest model that does your job reliably. This guide gives you a framework for finding that model, one that holds up even as the lineup changes, plus our current picks for the four jobs most agents need.
The one rule that won't change
Start low, and upgrade only when quality isn't good enough.
It's tempting to default to the smartest, most expensive model "just to be safe." Resist that. Most agent tasks (answering FAQs, qualifying leads, drafting content, routing questions) run great on mid-tier or cheap models. Premium models cost more per message and often respond more slowly, so paying for one you don't need just burns credits and adds latency.
Pick a sensible starting model, test it on your real prompts in the Preview tab, and only move up a tier if the output genuinely falls short.
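The start-low-and-upgrade loop can be sketched in a few lines of Python. This only illustrates the logic and is not a Pickaxe API: the tier names come from this guide, while `cheapest_passing_tier`, the `passes` callback, and the 90% pass bar are hypothetical stand-ins for your own judgment when you test in Preview.

```python
from typing import Callable, Optional, Sequence

# Tiers from this guide, ordered cheapest first.
TIERS = ["Cheap & Fast", "Everyday", "Deep Thinking"]


def cheapest_passing_tier(
    prompts: Sequence[str],
    passes: Callable[[str, str], bool],
    tiers: Sequence[str] = TIERS,
    min_pass_rate: float = 0.9,  # assumed quality bar; choose your own
) -> Optional[str]:
    """Return the first (cheapest) tier that handles enough test prompts.

    `passes(tier, prompt)` is a hypothetical callback that records whether
    the output for `prompt` on `tier` was good enough, for example your own
    verdict after trying it in Preview.
    """
    if not prompts:
        raise ValueError("need at least one real test prompt")
    for tier in tiers:
        passed = sum(1 for prompt in prompts if passes(tier, prompt))
        if passed / len(prompts) >= min_pass_rate:
            return tier  # stop at the first tier that is good enough
    return None  # nothing passed: revisit the prompt and sources first
```

Returning `None` instead of silently choosing the top tier is deliberate: if no tier passes, the problem is more likely in the Instructions or the knowledge base than in the model.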
What you're trading off
Every model choice balances three things:
- Intelligence — how well it reasons, follows instructions, and handles nuance
- Speed — how fast it responds (matters a lot for live chat)
- Cost — how much each message consumes
You can't max all three. A frontier "deep thinking" model is smart but slower and pricier; a lite model is fast and cheap but less capable on hard tasks. A fourth factor matters when you upload a lot of source material: context window, or how much text the model can consider at once. Big knowledge bases need a model with a large context window.
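For a rough sense of whether your source material is large relative to a model's context window, a back-of-the-envelope estimate like the one below can help. It rests on loud assumptions: about 4 characters per token is only a rule of thumb for English text, real tokenizers vary by model, and how much of a knowledge base is actually sent to the model depends on the platform. The function names and the 4,000-token reserve are hypothetical.

```python
import math
from typing import Sequence

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; real tokenizers will differ."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(
    sources: Sequence[str],
    context_window_tokens: int,
    reserve_tokens: int = 4000,  # assumed headroom for instructions, chat, and the reply
) -> bool:
    """True if all sources plus the reserve fit inside the context window."""
    needed = sum(estimate_tokens(source) for source in sources)
    return needed + reserve_tokens <= context_window_tokens
```

Take the context window figure for each model from the comparison page rather than hard-coding it, since those numbers change with each release.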
Timeless best practices
These hold no matter which models are current:
- Match the model to the task, not the hype. Simple, structured, high-volume work → cheap model. Nuanced judgment, multi-step reasoning, or high-stakes accuracy → premium model.
- Your prompt and knowledge base usually matter more than the model. If results are weak, sharpen your Instructions and clean up your sources before you pay for a bigger model. A great prompt on a mid-tier model beats a lazy prompt on a frontier one.
- Reasoning models are for accuracy, not speed. "Deep thinking" models pause to reason, so they're slower and cost more. Use them where being right matters more than being fast, not for snappy live chat.
- If your agent uses Actions, don't pick the very cheapest model. Multi-step tool use trips up lite models. Use the Everyday tier or higher when Actions are involved, and test the full chain.
- Test on your own inputs, not benchmarks. Leaderboards don't reflect your use case. Use Preview, impersonate a real user, and try your trickiest real questions.
- Newer usually beats older at the same price. When a provider ships a new model, the same tier often gets smarter, faster, or cheaper. Revisit your choice every few months.
- You're never locked in. Switching is one dropdown. A/B two models on the same prompt in Preview and keep the winner.
A 30-second way to choose
Ask yourself:
- How hard is the task? Repetitive/structured → cheap. Genuine reasoning or analysis → premium.
- How high is the volume / how tight is the budget? High volume or thin margins → lean cheaper.
- Does it need to feel instant? Live chat → favor fast/lite models. Background or "thinking" tasks → speed matters less.
- Does it use Actions? Yes → Everyday tier or above for reliable tool use.
When in doubt, start in the Everyday tier below.
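The four questions can be read as a small decision function. The sketch below is one possible reading, not an official rule: the `AgentNeeds` fields, the `starting_tier` name, and the order in which the answers are combined are assumptions. Only the tier names and the individual rules of thumb come from this guide.

```python
from dataclasses import dataclass

# Tiers from this guide, ordered cheapest first.
TIERS = ["Cheap & Fast", "Everyday", "Deep Thinking"]


@dataclass
class AgentNeeds:
    """Hypothetical answers to the four questions above."""
    simple_and_structured: bool = False   # repetitive or structured task
    needs_deep_reasoning: bool = False    # genuine reasoning or analysis
    high_volume_or_tight_budget: bool = False
    needs_instant_replies: bool = False   # e.g. live chat
    uses_actions: bool = False


def starting_tier(needs: AgentNeeds) -> str:
    """Suggest a tier to start testing with; the result is a starting point only."""
    level = 1  # when in doubt, start in the Everyday tier
    if needs.needs_deep_reasoning:
        level = 2
    elif needs.simple_and_structured:
        level = 0
    # Cost or latency pressure nudges the choice one tier cheaper (assumed rule).
    if needs.high_volume_or_tight_budget or needs.needs_instant_replies:
        level = max(level - 1, 0)
    # Agents that use Actions should not drop below the Everyday tier.
    if needs.uses_actions:
        level = max(level, 1)
    return TIERS[level]


# Example: a simple FAQ bot that also calls Actions stays in the Everyday tier.
print(starting_tier(AgentNeeds(simple_and_structured=True, uses_actions=True)))
```

Whatever tier this suggests, the real test is still running your own prompts in Preview and moving up only if the output falls short.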
Our recommended models
Current picks as of June 2026. Models move fast, so treat the tiers as permanent and the specific names as a snapshot. The live list and pricing are always at pickaxe.co/models.
Cost key: $ = lowest cost · $$ = mid · $$$ = premium
| Job | Best for | Our pick | Cost |
|---|---|---|---|
| Cheap & Fast | High-volume, simple, latency-sensitive tasks: FAQs, routing, tagging, simple chat | Grok 4.1 Fast | $ |
| Everyday (start here) | Most agents: support, content, lead-gen, coaching, general assistants | ChatGPT 5.4 | $$ |
| Deep Thinking | Complex reasoning, multi-step analysis, high-stakes accuracy, heavy Action chains | Claude 4.8 Opus | $$$ |
Want the hard numbers? Our model comparison tool shows the real stats (cost, speed, context window, and more) for every model we offer and lets you compare them side by side before you commit.
Image generation
Image generation is a separate capability. Turn it on under Capabilities in the Agent Builder, then choose your image model from the dropdown.
| Job | Best for | Our pick | Cost |
|---|---|---|---|
| Image | Generating images inside your agent | GPT Image 2 | $$ |
What different models are good at
The picks above are solid starting points, but Pickaxe gives you 40+ models, and the best one depends on what your agent needs most. Here's how the current lineup sorts by strength. Full specs and live pricing for each are on pickaxe.co/models.
Advanced reasoning — for complex, multi-step problems where being right matters more than being fast. The strongest are Claude 4.8 Opus and Claude Fable 5 (Anthropic's frontier reasoning models), ChatGPT 5.5 and ChatGPT 5.4 Pro, Gemini 3 Pro, and Grok 4.3. They "think" before answering, so expect higher cost and slower replies.
Long context — for agents with large Knowledge Bases or long conversations that need to stay coherent. Gemini 3 Pro and the Gemini 3.1 Pro Preview offer frontier-scale context, with Grok 4.3 (1M-token window) and Anthropic's Opus line (Claude 4.8 Opus, Claude Fable 5) close behind.
Actions and tool use — for agents that call Actions or chain to other agents. Gemini 3.5 Flash is built for agentic, tool-using workflows and stays fast; ChatGPT 5.5 and Claude 4.8 Opus handle long, high-autonomy tool chains; Grok 4.3 is strong on structured outputs; Mistral Medium 3.5 and ChatGPT 5.4 mini are leaner options that still hold up. Avoid nano/lite models here, because they're less reliable across multi-step Actions.
Really fast — for live chat, high volume, and anything latency-sensitive. Grok 4.1 Fast, Gemini 3.5 Flash and Gemini 3 Flash, Claude 4.5 Haiku, and ChatGPT 5.4 mini / nano return answers almost instantly at the lowest cost, trading some depth for speed.
Also worth knowing:
- Coding: ChatGPT 5.3 Codex, Gemini 3 Pro, and Mistral Medium 3.5 are tuned for software tasks.
- Live web research: Perplexity's Sonar Pro, Sonar Reasoning Pro, and Sonar Deep Research are search-native for current information.
- Lowest hallucination: Grok 4.20 offers the strictest instruction-following and most consistent factual accuracy.
How to set or switch your model in Pickaxe
- Open your agent in the Agent Builder and go to the Editor.
- Find the Model setting and choose from the dropdown. Switch any time, no rebuild required.
- For image generation, open Capabilities, toggle on image generation, and select an image model.
- Test in Preview with real prompts (impersonate a user to see exactly what they'd get).
- If your agent uses Actions, run the full workflow in Preview to confirm the model handles the tool calls reliably.
Tip: If results are inconsistent, try strengthening your Instructions or using a Model Reminder for must-follow rules before jumping to a more expensive model. The fix is often in the prompt, not the model.
Keep it current
The model landscape changes almost monthly: new releases, price drops, and tier shake-ups are constant. Use this guide's framework to decide, then confirm the current lineup at pickaxe.co/models and compare real stats side by side with our comparison tool before committing. For a tour of what the comparison page shows, see the Models overview. Revisit your agents every few months: a model that was premium last quarter is often this quarter's everyday default, at a lower price.
