Who's Who of AI

Ramon Astudillo

400 trust researcher @ramon-astudillo.bsky.social · 6,489 followers

NLP & language · AI research · Models & releases

Why they matter

Researcher with public evidence across NLP & language, AI research, Models & releases.

AI signals: 8
Sources: 8
Discussions: 5
Latest signal: 10h ago


Principal Research Scientist at IBM Research AI in New York. Speech, Formal/Natural Language Processing. Currently LLM post-training, structured SDG and RL. Opinions my own and non stationary. ramon.astudillo.com

What they're sharing

Articles & links

↻ Ramon Astudillo reposted

@philpax.me

well, fuck www.anthropic.com/news/fable-m...

Statement on the US government directive to suspend access to Fable 5 and Mythos 5 anthropic.com

It's cool that they released the multi-agent prompt for Sol 5.6 Ultra's proof of the Cycle Double Cover Conjecture. A lot of focus on diversity and keeping the agents trying, some adversarial review. Otherwise, not really that structured of a workflow! cdn.openai.com/pdf/04d1d1e…

cdn.openai.com

AI Weekly's analysis →

OpenAI published the system prompt used to get GPT-5.6 Sol Ultra to produce a claimed proof of the Cycle Double Cover Conjecture.
The prompt tells the model to use 'multiagent v2' with up to 64 concurrent agents and to compute for at least eight hours before giving up.
The three-page proof has not been peer-reviewed or formalized in Lean or Coq, and the math community has not confirmed it.


♥ 5 ↻ 0 ↩ 0 · 4 from the directory shared this · 16d ago

👆 A paranoid LLM is ofc worse. This is just tuning a prior belief up or down. I guess you could self distill additional context for the train data e.g. "you know arxiv.org is such and such" or "this is an unknown source" with the hope it generalises (and also injecting some ba…

arXiv.org e-Print archive arxiv.org

AI Weekly's analysis →

arXiv is a free, open-access archive holding nearly 2.4 million scholarly articles across nine major fields.
None of the materials on arXiv are peer-reviewed by the archive itself, a structural fact affecting how preprints should be read.
The archive runs as a nonprofit through Cornell University, backed by the Simons Foundation, member institutions, and contributors.


♥ 0 ↻ 0 ↩ 0 · 5 from the directory shared this · 42d ago

Claude Tag is not a minor change. It looks like the main attempt at redefining the UI to allow it to expand to other verticals like finance, HR, and in general PC work far away from the command line and across multiple heterogeneous apps. It still depends on API access www.anthropi…

Introducing Claude Tag anthropic.com

♥ 3 ↻ 2 ↩ 2 · 5 from the directory shared this · 34d ago

↻ Ramon Astudillo reposted

@alec-s.bsky.social

Big day and a real API! If anyone has issues/bugs with the model or api functionality let me know and I can look into it :) ai.meta.com/blog/introdu...

Introducing Muse Spark 1.1 ai.meta.com

AI Weekly's analysis →

The Meta Model API natively supports both OpenAI Chat Completions and Anthropic Messages formats, removing migration cost for developers already on rival APIs.
Muse Spark 1.1 leads MCP Atlas tool-use (88.1) but trails GPT-5.5 on DeepSWE 1.1 (53.3 vs 67.0), placing it as an orchestration model.
Zuckerberg broke a three-year X silence to announce the launch, a move multiple outlets flagged as a deliberate platform-level strategic signal.



↻ Ramon Astudillo reposted

@isolyth.dev

12B Gemma 4!!! It's neat on its own, image and audio in, but architecturally it's super cool: instead of visual or audio encoders, they directly train the model on vision and sound, with images using a "lightweight embedding module" and audio 'projected into the same space a…

Introducing Gemma 4 12B: a unified, encoder-free multimodal model blog.google

simonwillison.net/2026/Jul/16/... >The new model is notable for the pricing: $3/million input tokens and $15/million output tokens, putting it at the same level as Anthropic’s Claude Sonnet series and making it the most expensive model released by a Chinese AI lab to date. 🤔 i…

Kimi K3, and what we can still learn from the pelican benchmark simonwillison.net

♥ 3 ↻ 0 ↩ 0 · 3 from the directory shared this · 12d ago

↻ Ramon Astudillo reposted

Nathan Lambert @natolambert.bsky.social

If recent events with Kimi K3 have finally convinced you that you need to try and understand how the Chinese labs approach AI - and how it differs from the SF center of power - you should read my post from a few months ago: www.interconnects.ai/p/notes-from...

Notes from inside China's AI labs interconnects.ai

AI Weekly's analysis →

Nathan Lambert visited Moonshot AI, Zhipu, Meituan, Xiaomi, 01.ai and Tsinghua in Beijing and surrounding regions to compare Chinese and US lab culture.
Every Chinese lab he visited described Nvidia compute access as the primary bottleneck limiting their progress, not talent or data.
Chinese developers reportedly use Claude widely despite restrictions, and DeepSeek is credited internally with the best research taste in execution.



↻ Ramon Astudillo reposted

Simon Willison @simonwillison.net

My notes on Gemini 3.5 Flash - 3x the price of Gemini 3 Flash but Google are planning to use it for many of their own products simonwillison.net/2026/May/19/...

Gemini 3.5 Flash: more expensive, but Google plan to use it for everything simonwillison.net

Some quotes about Sol cheating metr.org/blog/2026-06... 👇

Summary of METR's predeployment evaluation of GPT-5.6 Sol metr.org

♥ 1 ↻ 0 ↩ 1 · 5 from the directory shared this · 13d ago

↻ Ramon Astudillo reposted

@pekka.bsky.social

I saw an X post by @hossenfelder.bsky.social about this comment that noted a foundational math error in a recent paper that had already passed peer review in the prestigious Proceedings of the Royal Society A. It seemed like a good opportunity to test if Gemini would find the …

Comment on `On computing quantum waves exactly from classical action' arxiv.org

↻ Ramon Astudillo reposted

Sung Kim @sungkim.bsky.social

🔹 Built for long-horizon agentic coding and self-evolving workflows Tech blog: kimi.com/blog/kimi-k3

Kimi K3 Tech Blog: Open Frontier Intelligence kimi.com

Their own posts

Recent commentary

Competing against a local gpt-oss-120b 10 sample ensemble at paper understanding and, man, it's not looking great for humans

♥ 11 ↻ 0 ↩ 0 · 60d ago

You can see how LLMs still lack a lot of implicit context. For example, when reading a document, they are bad at guessing if the document can be trustworthy. They read an arxiv paper with grandiose unsupported claims and they repeat them to you as if it were its own judgment. 👇

♥ 2 ↻ 0 ↩ 3 · 42d ago

If you think energy models are the future, think that any RLHF or RLVR scheme implicitly distills one, in all its non-factorizable glory, into a boring, label-biased, left-to-right LLM. Now tell me about the horrible voodoo you had to do to the partition function to get that energy model going.

♥ 3 ↻ 0 ↩ 1 · 14d ago

I always experience this strong feeling of rejection every time I hear an economist make a model-based claim. This has been the case since my first (and only) macro class 25y back. I am sure this is a mix of ignorance and ML bias, but I would really want to understand what's going on 👇

♥ 1 ↻ 0 ↩ 2 · 28d ago

There is this new meme out there that is something like "AI costs more than human employees". Seems like totally the wrong take. It costs much less for the things they can do, but you can't run an org w/o human employees (for now). 👇

♥ 1 ↻ 0 ↩ 2 · 46d ago

What's up with OpenAI mixing Spanish and Portuguese for 5.6 names. Man, not a single naming goes right 🤣

♥ 1 ↻ 0 ↩ 1 · 32d ago

Now there are three levels of alerts in generative code: errors, warnings, and the errors and warnings that you pass to the LLM agent and don't bother about.

♥ 3 ↻ 0 ↩ 0 · 47d ago

Got reminded about OpenAI 5 and now I see many more timelines with decent probability mass that are pretty far from where we are now. We could call them the "no Radford" timelines.

♥ 1 ↻ 0 ↩ 0 · 60d ago

An LLM being bad at an underspecified problem or consuming lots of tokens seems like a signal of benchmaxing

♥ 1 ↻ 0 ↩ 0 · 69d ago

5y ago Demerzel would have felt like a completely wrong portrayal of an AI. Now it somehow feels pretty realistic.

♥ 1 ↻ 0 ↩ 0 · 73d ago

Their network

In Ramon Astudillo's orbit

Network graph: Ramon Astudillo at the center; members they follow on the left (green edges); members who follow them on the right (blue edges); mutual follows at the top (orange edges, slightly larger).