Generative AI: a 1.51TB open model just took the crown

An MIT-licensed Chinese model now leads the open-weights board — while the week's quietest paper argued your prompts don't need to be readable at all.


The frontier-lab drama this week was about access being taken away, but the open-weights world spent the week shipping. GLM-5.2 landed as the strongest open model on the public leaderboards under a permissive license, and the research feed filled with work on squeezing more out of every token. For builders, the takeaway is concrete: the gap between "the model you can call" and "the model you can host" keeps shrinking.


Key Takeaways

The leading open-weights model is now Chinese and MIT-licensed. GLM-5.2 tops the Artificial Analysis Intelligence Index among open-weights models and ranks #2 on Code Arena WebDev, at roughly a third of frontier API pricing.

Prompt engineering is becoming a compiler problem. Cisco's FAPO classifies why a pipeline step failed, and it beat GEPA, the prior best automated optimizer, in 15 of 18 model-benchmark comparisons.

Retrieval is going on-device. Liquid AI's 350M retrievers run on a laptop CPU and cover 11 languages — small enough that RAG no longer needs a GPU server.

Your prompts may not need to be human-readable. New research compressed prompts to 27.9% of their original length while an LLM could still recover 99.5% of the semantic content.

Treat "model personality" benchmarks with suspicion. A 56-model study found 81–90% of apparent LLM "psychological" differences are response bias, not real traits.

The Big Story

GLM-5.2 is probably the most powerful text-only open weights LLM · June 17 · simonwillison.net
Z.ai's GLM-5.2 is a 753B-parameter, 1.51TB Mixture-of-Experts model with a 1-million-token context window (up from GLM-5.1's 200,000), released under an MIT license and now the leading open-weights model on the Artificial Analysis Intelligence Index v4.1 at a score of 51. The catch is economics, not capability: it burns ~43k output tokens per Intelligence Index task (up from GLM-5.1's 26k), so even at the ~$1.40/$4.40 per-million input/output rates Willison notes most providers charge, the "cheap open model" can run hot on token-heavy agentic work. For builders, the strategic shift is that the #2 model on the Code Arena WebDev leaderboard — behind only Claude Fable 5 — now has downloadable weights you can self-host instead of rent.
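To put the token-burn point in dollars, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above (43k vs. 26k output tokens per task, $4.40 per million output tokens) and makes two simplifying assumptions: input tokens are ignored, and GLM-5.1 is priced at the same output rate purely for comparison. The function name is illustrative.

```python
# Figures quoted in the item above; everything else is an assumption.
OUTPUT_PRICE_PER_MILLION = 4.40  # USD per 1M output tokens


def output_cost_per_task(output_tokens: int,
                         price_per_million: float = OUTPUT_PRICE_PER_MILLION) -> float:
    """Output-token cost of one task in USD (input tokens ignored)."""
    return output_tokens / 1_000_000 * price_per_million


glm_52_cost = output_cost_per_task(43_000)
# Assumption: GLM-5.1 billed at the same rate, to isolate the token-count effect.
glm_51_cost = output_cost_per_task(26_000)

print(f"GLM-5.2: ${glm_52_cost:.4f} per task")
print(f"GLM-5.1 (same rate): ${glm_51_cost:.4f} per task")
print(f"Ratio: {glm_52_cost / glm_51_cost:.2f}x")
```

Under those assumptions the per-task output bill rises in direct proportion to the token count (43k / 26k), which is why a low per-token price does not by itself guarantee a low per-task price.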


Also This Week

Cisco AI introduces FAPO, a prompt optimizer with step-level failure attribution · June 20 · MarkTechPost
FAPO classifies each pipeline failure by root cause (retrieval, cascade, format, or reasoning), then chooses prompt edits or structural changes accordingly. It won 15 of 18 model-benchmark comparisons against the GEPA optimizer, with a mean gain of 14.1 percentage points. The so-what: hand-tuning prompts is finally becoming an automatable, attributable engineering step rather than vibes.

Liquid AI Introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M for Fast Multilingual Search Across 11 Languages · June 19 · MarkTechPost
Two 350M-parameter open retrievers — a dense bi-encoder and a token-level ColBERT late-interaction model scoring 0.605 NDCG@10 on NanoBEIR — ship with GGUF builds that run on CPUs and laptops, meaning production RAG no longer requires a hosted embedding service or a GPU.
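For readers who have not met the two retriever styles, the difference can be sketched in a few lines. This is the standard textbook scoring rule for each family, not Liquid AI's code: a dense bi-encoder compares one vector per text, while ColBERT-style late interaction keeps one vector per token and sums, for each query token, its best match among the document tokens (often called MaxSim). The toy 2-D embeddings below are made up for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder style: one vector per text, a single dot product."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: sum over query tokens of the best doc-token match.

    Assumes both lists are non-empty; no normalisation is applied here.
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Made-up token embeddings (real models use hundreds of dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here doc_a outranks doc_b because every query token finds a strong match in it. The trade-off is storage: late interaction keeps a vector per token rather than one per document.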

Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation · June 16 · MarkTechPost
Alibaba's Qwen team extended its LLM stack into robotics with a manipulation model built on Qwen3.5-4B (trained on ~38,100 hours of data), a 60-layer video world model, and a navigation model in 2B/4B/8B sizes — the same open Qwen weights powering chat are now the substrate for action models, with two of the three shipping public code.


From the Lab

Large Language Models Do Not Always Need Readable Language · June 18 · arXiv
The "BabelTele" work shows you can encode a prompt into a compact, deliberately non-human-readable form — condensed to 27.9% of the original text — while an instruction-tuned model still recovers 99.5% of the semantic content. If it holds up, that decouples human readability from model comprehension and points at real token-cost savings for high-volume pipelines, at the price of prompts no engineer can eyeball.
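As a rough sense of scale, the sketch below estimates input-token savings for a high-volume pipeline. It assumes, loudly, that token count shrinks in proportion to text length (the 27.9% figure above is a text-length ratio, and tokenizers may not treat compressed text so kindly), and it ignores the cost of producing the compressed prompts. The monthly volume and the per-million price are hypothetical placeholders, not figures from the paper.

```python
COMPRESSION_RATIO = 0.279  # compressed length / original length, from the item above


def monthly_input_cost(tokens: float, price_per_million: float) -> float:
    """Input-token spend in USD for a given monthly token volume."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 1B input tokens a month at $1.40 per million.
original_tokens = 1_000_000_000
price = 1.40

# Assumption: tokens scale linearly with text length.
compressed_tokens = original_tokens * COMPRESSION_RATIO

baseline = monthly_input_cost(original_tokens, price)
compressed = monthly_input_cost(compressed_tokens, price)
savings = baseline - compressed

print(f"Baseline:   ${baseline:,.2f}")
print(f"Compressed: ${compressed:,.2f}")
print(f"Savings:    ${savings:,.2f} ({savings / baseline:.1%})")
```

If the linear-scaling assumption fails, the real savings could be noticeably smaller, so measuring actual token counts on a sample of compressed prompts is the safer first step.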

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact · June 18 · arXiv
Administering personality and risk instruments to 56 instruction-tuned models, researchers found that 81–90% of the apparent "trait" differences between models come from a directional response bias — a tendency to lean to one end of a scale — not from genuine personality, and that profiles can be manipulated just by choosing which items to score. The practical warning: anyone benchmarking a model's "alignment persona" with human psychometric tests is mostly measuring an artifact.


The week governments fought over who can call which frontier model, the open-weights labs answered by handing builders ones they can run themselves — and the most interesting research argued the prompts feeding them don't have to make sense to us at all.