AI Weekly: Smarter Models, Riskier Agents — February 16-22, 2026
Gemini 3.1 Pro and Grok 4.20 pushed reasoning and multi-agent design forward this week, while OpenClaw's skill marketplace showed the real cost of agent sprawl.
Two frontier labs shipped genuine reasoning jumps this week, and the "personal AI agent" category that's been generating headlines since January hit its first real security reckoning. If you're running a small business and either using or being pitched a self-hosted AI agent, the stories below are worth ten minutes of your attention.
Gemini 3.1 Pro's reasoning leap
Google DeepMind released Gemini 3.1 Pro on February 19, and the benchmark jump is hard to ignore. It scored 77.1% on ARC-AGI-2, more than double the previous Gemini 3 Pro's 31.1%, and hit 94.3% on GPQA Diamond, the highest score reported on that benchmark to date.
Why this matters for SMEs: ARC-AGI-2 tests novel problem-solving, not memorised patterns. A model that reasons better on problems it hasn't seen before makes fewer of the confident-but-wrong mistakes that turn AI-assisted work into a liability. If you're using AI for anything beyond drafting (contract review, financial modelling, custom workflow logic), model-level reasoning gains like this compound quickly, because every error the model avoids early is one that later steps don't inherit.
Grok 4.20 bakes a team into one model
xAI shipped Grok 4.20 in beta on February 17 with a production multi-agent system running by default on complex queries: a coordinator agent breaks down the task, then hands pieces to specialised sub-agents for research, code/logic, and synthesis before returning one answer.
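For readers who want to see the shape of that pattern, here is a minimal Python sketch of a coordinator handing work to specialists and merging the results. It is an illustration only, not xAI's implementation: the function names, the `SubTask` type and the stubbed "agents" (which just return labelled strings instead of calling a model) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    role: str    # which specialist should handle this piece
    prompt: str  # the piece of work itself


# Hypothetical stand-ins for specialised sub-agents. A real system would
# call a model here; these stubs only label their input.
def research_agent(prompt: str) -> str:
    return f"[research] notes on: {prompt}"


def logic_agent(prompt: str) -> str:
    return f"[logic] checked: {prompt}"


def synthesis_agent(parts: List[str]) -> str:
    # Merge the specialists' outputs into a single answer.
    return " | ".join(parts)


SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "logic": logic_agent,
}


def coordinate(query: str) -> str:
    # 1. Break the query into pieces (hard-coded here for illustration).
    subtasks = [SubTask("research", query), SubTask("logic", query)]
    # 2. Route each piece to the matching specialist.
    results = [SPECIALISTS[task.role](task.prompt) for task in subtasks]
    # 3. Return one combined answer.
    return synthesis_agent(results)


print(coordinate("Q3 cash-flow forecast"))
```

The point of the sketch is the division of labour: the caller sees one function and one answer, while the routing decision lives inside `coordinate` rather than with the user.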
This is the same idea we covered in the Feb 3-10 post about Claude Code's custom agents, where developers manually route tasks to Opus, Sonnet or Haiku depending on the job. Grok 4.20 bakes that specialisation into the model itself rather than leaving it to the user.
The takeaway: agent specialisation is moving from "power-user technique" to "default product behaviour." Within a year or two, asking "which model should handle this?" may stop being a question humans need to answer at all.
OpenClaw's skill marketplace becomes a warning
OpenClaw, the open-source personal AI agent that went from zero to 60,000+ GitHub stars in days after its January rename, ran into a real security problem this week. Researchers auditing OpenClaw's community skill marketplace traced a coordinated attack ("ClawHavoc") that had planted malicious entries across the registry. By February 16, the count of confirmed malicious skills had passed 824 of the more than 10,700 available. That comes on top of a critical privilege-escalation flaw (CVSS 8.8) patched earlier in the month.
Why this matters for SMEs: a personal AI agent that can read your email, run shell commands and browse the web on your behalf is only as safe as every plugin you let it install. Treat agent "skills" exactly like npm packages from an unknown publisher:
- Check who published it and how long it's existed
- Read the permissions it requests before installing, not after
- Don't self-host agent frameworks with broad system access on a machine that also holds business-critical data
- If a vendor pitches you an "autonomous AI agent," ask directly how they vet the marketplace it pulls from
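The first two checks in that list can be partly automated. The Python sketch below shows one way to do it, under loud assumptions: the `SkillManifest` fields, the permission names in `RISKY_PERMISSIONS` and the 90-day age threshold are hypothetical and do not reflect OpenClaw's actual manifest format or any real marketplace API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set

# Hypothetical permission names, chosen for illustration only.
RISKY_PERMISSIONS = {"shell", "email:read", "filesystem:write"}


@dataclass
class SkillManifest:
    # Assumed fields; a real marketplace may expose different metadata.
    name: str
    publisher: str
    first_published: date
    permissions: Set[str] = field(default_factory=set)


def review_skill(
    skill: SkillManifest,
    trusted_publishers: Set[str],
    today: date,
    min_age_days: int = 90,  # arbitrary threshold for this sketch
) -> List[str]:
    """Return a list of concerns; an empty list means nothing was flagged."""
    concerns: List[str] = []
    # Check who published it.
    if skill.publisher not in trusted_publishers:
        concerns.append(f"unknown publisher: {skill.publisher}")
    # Check how long it has existed.
    age_days = (today - skill.first_published).days
    if age_days < min_age_days:
        concerns.append(f"only {age_days} days old")
    # Read the requested permissions before installing.
    risky = sorted(skill.permissions & RISKY_PERMISSIONS)
    if risky:
        concerns.append("requests risky permissions: " + ", ".join(risky))
    return concerns


skill = SkillManifest(
    name="inbox-helper",
    publisher="unknown-dev",
    first_published=date(2026, 2, 1),
    permissions={"shell", "web:browse"},
)
for concern in review_skill(skill, {"acme-tools"}, today=date(2026, 2, 16)):
    print(concern)
```

A script like this only flags candidates for a human to look at. An empty list does not prove a skill is safe, since a malicious skill can come from an established publisher or request modest-looking permissions.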
Vetting and permission scoping are exactly the gap we help clients close: we scope Claude Code automation builds with proper permission boundaries from day one, rather than bolting security on after something goes wrong.
The pattern this week: models keep getting smarter and more agentic by default, but the infrastructure around agents (marketplaces, plugins, permissions) is where the real risk now sits. Worth remembering before you hand any AI system the keys to your business.
Want a second opinion on an AI agent setup before you deploy it? Get in touch and we'll walk through it with you.
