Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="4cba27f4af5841bfc12a8437568cc7ed0ade2cfc2f6251aac61282e2319c"
Subject: The Sequence Radar #526: The OpenAI Blitz: From GPT-4.1 to Windsurf
From: TheSequence
To: Hidden Recipient
Date: Sun, 20 Apr 2025 11:00:37 +0000
X-Hiring: We are hiring, reach out at header-hacker@emailshot.io
X-EmailShot-Signature: urxjWSYBSO2YW_NH0QOwuDBjboRV1zhjPS0iuN1O-bW2lIEy0D84DGKVkeik_dz7QTXztwC2BTqBQjlNfT22dA==

--4cba27f4af5841bfc12a8437568cc7ed0ade2cfc2f6251aac61282e2319c
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

View this post on the web at https://thesequence.substack.com/p/the-sequence-radar-526-the-openai

Next Week in The Sequence:

Our eval series continues with an examination of math benchmarks. In engineering, we dive into Google’s new agentic toolkit. The research section dives into GPT-4.1, and our opinion edition will dive into the world of synthetic data.

📝 Editorial: What a Week for OpenAI

Nobody should be surprised by the speed of progress at OpenAI, but this week was something. OpenAI had one of the most impressive weeks in its history. Recently, it was reported that Sam Altman was going to spend more time focused on product and strategy, and I think we are seeing the results of that. This week’s developments highlight a key trend: OpenAI is rapidly transforming from a model provider into a full-stack AI platform, setting the pace in reasoning, coding, and agentic infrastructure.

The headline release was GPT-4.1, a substantial upgrade to the GPT-4 series. GPT-4.1 introduces a staggering 1 million-token context window, opening the door to entirely new workflows in long-context reasoning, large document processing, and advanced instruction following. Alongside the flagship model, OpenAI released 4.1-mini and 4.1-nano variants, optimizing for different tradeoffs in latency, cost, and performance. These models are integrated directly into ChatGPT and API endpoints, signaling OpenAI's intent to unify its offerings around the 4.1 generation.

Complementing the 4.1 release was the quiet debut of o3 and o4-mini, two new reasoning-optimized models. Internally referred to as the "o-series," these models are built to excel at multi-step reasoning, web browsing, visual tasks, and planning. OpenAI positions o3 as its most advanced reasoning model to date, capable of handling sophisticated instructions with higher reliability. Both o3 and o4-mini are tightly integrated into ChatGPT, and early evaluations suggest improvements across reasoning, search, and multimodal comprehension.

On the developer tooling front, OpenAI launched Codex CLI, an open-source command-line coding assistant that runs locally but connects to OpenAI models. Codex CLI is designed to work with existing codebases and terminal workflows, providing real-time coding suggestions, file navigation, and project-wide refactoring. This marks a strategic shift towards offering tools that embed AI into the operating system layer, giving developers agentic-level augmentation directly in their shell environments.

Perhaps the most strategic news of the week is the rumored acquisition of Windsurf (formerly Codeium) for a reported $3 billion. Windsurf operates in the coding assistant space, directly overlapping with GitHub Copilot and other AI pair programming tools.
If finalized, the acquisition would give OpenAI a more robust foothold in the IDE-level development experience and further solidify its vertical integration across model, platform, and interface.

Taken together, these moves signal an acceleration in OpenAI’s ambitions. From model innovation to agentic reasoning, from developer tooling to strategic acquisitions, OpenAI is positioning itself as the central platform for AI-native computing. The convergence of long-context models, reasoning agents, and local developer tools suggests a future in which OpenAI doesn’t just power apps; it becomes the operating layer for intelligent systems.

🔎 AI Research

MineWorld

In the paper "MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft" [ https://substack.com/redirect/3798418d-2728-4f06-9a51-f1406f917a7a?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], researchers from various institutions explore the potential of world models for simulating and interacting with diverse environments and human/agent actions, highlighting their use in game engines and reinforcement learning systems. This paper designs a Transformer decoder-based model that can function as both a policy model and a world model by jointly capturing the relationships between game states and actions.

AgentRewardBench

In the paper "AGENTREWARDBENCH: Evaluating Automatic Evaluations of Web Agent Trajectories" [ https://substack.com/redirect/55a7de7a-4d01-4a74-96b5-9e4a702f692a?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], researchers from McGill University, Mila Quebec AI Institute, Google DeepMind, Polytechnique Montréal, and ServiceNow Research introduce AGENTREWARDBENCH, a benchmark to assess the effectiveness of LLM judges in evaluating web agent trajectories.
This benchmark, comprising expert-annotated trajectories from various web environments and LLM agents, reveals that rule-based evaluations commonly used in the field tend to underreport the success rate of web agents.

S1-Bench

In the paper "S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models" [ https://substack.com/redirect/6cd89280-b494-41d5-b6c0-43e7504cdea0?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], researchers from the Institute of Information Engineering, Chinese Academy of Sciences, and the School of Cyber Security, University of Chinese Academy of Sciences present S1-Bench, a novel benchmark for evaluating the "system 1" thinking capability of large reasoning models (LRMs) on simple, intuitive tasks. Their evaluation of 22 LRMs demonstrates that these models often exhibit lower efficiency and a tendency to overthink on simple questions compared to traditional smaller LLMs.

Seedream

In the "Seedream 3.0 Technical Report" [ https://substack.com/redirect/e52fea06-4aff-4e34-9ec1-aeea31123d64?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], researchers from ByteDance introduce Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. The report details several technical improvements across the entire pipeline, resulting in enhanced alignment with complex prompts, better typography generation, improved visual aesthetics, higher fidelity, and the capability for native high-resolution output.

TEXTARENA

In the paper "TEXTARENA" [ https://substack.com/redirect/5b902da0-c502-476c-8ad0-c7bed63a6513?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], researchers from the Centre for Frontier AI Research (CFAR), A*STAR, Northeastern University, National University of Singapore, and MIT introduce TextArena, a platform for evaluating the soft skills of language models through competitive text-based games. The platform uses a TrueSkill™ rating system to rank models based on their game-playing abilities and provides insights into skills like strategic planning, logical reasoning, and adaptability.

BitNet vNext

The paper "BitNet b1.58 2B4T Technical Report" [ https://substack.com/redirect/4520b256-b43c-4b95-9c15-148a1d796a28?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] from Microsoft Research introduces BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model with 2 billion parameters, demonstrating performance comparable to full-precision models of similar size while offering significantly reduced memory footprint, energy consumption, and decoding latency. Key contributions include the model architecture based on 1.58-bit weights and 8-bit activations, a comprehensive evaluation of its capabilities, and the public release of model weights and optimized inference code for both GPU and CPU.

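To make the "1.58-bit weights and 8-bit activations" description concrete, here is a minimal Python sketch of how such a layer could quantize its inputs. It assumes absmean scaling for the ternary weights and absmax scaling for the activations, which is how the BitNet b1.58 line of work is usually described. The function names are hypothetical, and this is an illustration, not the released inference code.

```python
import math


def quantize_weights_ternary(weights, eps=1e-8):
    """Absmean quantization (assumed scheme): divide by the mean absolute
    weight, round, then clip so every entry is -1, 0, or +1."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    quantized = [[max(-1, min(1, round(w / scale))) for w in row]
                 for row in weights]
    return quantized, scale


def quantize_activations_int8(x, eps=1e-8):
    """Absmax quantization (assumed scheme) of one activation vector to
    signed 8-bit integers. Returns the integers and the dequantization step."""
    scale = max(abs(v) for v in x) + eps
    quantized = [max(-128, min(127, round(v / scale * 127))) for v in x]
    return quantized, scale / 127


def ternary_matvec(q_weights, w_scale, q_x, x_scale):
    """Matrix-vector product with ternary weights: only additions and
    subtractions of integers, followed by one rescale per output."""
    out = []
    for row in q_weights:
        acc = 0
        for w, a in zip(row, q_x):
            if w == 1:
                acc += a
            elif w == -1:
                acc -= a
        out.append(acc * w_scale * x_scale)
    return out


# A weight with three possible values carries log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

if __name__ == "__main__":
    w = [[0.9, -0.05, -1.2], [0.4, 0.0, 2.0]]
    x = [0.5, -1.0, 0.25]
    q_w, w_scale = quantize_weights_ternary(w)
    q_x, x_scale = quantize_activations_int8(x)
    print(q_w, q_x, ternary_matvec(q_w, w_scale, q_x, x_scale))
```

The "1.58" in the name corresponds to log2(3), roughly 1.585 bits per three-valued weight. Because the weights are only -1, 0, or +1, the inner loop needs no multiplications, which is one plausible source of the memory and latency savings the report claims.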
🤖 AI Tech Releases

Codex CLI

OpenAI open-sourced Codex CLI [ https://substack.com/redirect/c9717352-cd76-4f8e-90a6-f5ac76fcc816?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], a lightweight coding agent that can run in a terminal.

Gemini 2.5 Flash

Google released a preview version of Gemini 2.5 Flash.

Embed 4

Cohere launched Embed 4 [ https://substack.com/redirect/c166dd68-937f-439a-ac4e-ee9d1b521abf?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ], its new embedding model to power retrieval applications.

DeepCoder

DeepCoder released a new 14-billion-parameter code generation model [ https://substack.com/redirect/9a7a72af-efb7-48ce-828c-177f9bbd4227?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].

Classifier Factory

Mistral released a suite of models for different classification tasks [ https://substack.com/redirect/64fb5194-a2ba-4445-8a04-71f593344184?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].

🛠 AI in Production

AI at Salesforce Marketing

Salesforce shares some insights [ https://substack.com/redirect/7c10ca34-3f44-44ce-9ee0-421251eef59f?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] about the AI infrastructure behind its Marketing Intelligence platform.

PayPal Agentic Toolkit

PayPal released a new toolkit for agentic commerce [ https://substack.com/redirect/acda0bb6-2d62-4c60-9d0a-d3e665253d2e?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].

📡 AI Radar

OpenAI is in talks to buy AI coding startup Windsurf [ https://substack.com/redirect/facbe7da-1b3c-483f-93b0-b13abaef1954?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
It was also reported that OpenAI attempted to buy Cursor [ https://substack.com/redirect/f6dcdce8-3cb0-4a8a-ae56-7d77818d51e9?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
Ilya Sutskever’s Safe Superintelligence (SSI) is now valued at $32 billion [ https://substack.com/redirect/18f0c64b-1346-4944-9f9b-067ff5e75ad6?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
Infinite Reality acquired AI firm Touchcast Inc for $500 million [ https://substack.com/redirect/901f311f-a21a-41ef-91d3-8598f8ed8b3d?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
Former Y Combinator president Geoff Ralston raised a new VC fund focused on AI safety [ https://substack.com/redirect/29bf4eb1-5adf-4e43-81a8-78fe57fd0c42?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
Robotics startup Xaba raised $6 million [ https://substack.com/redirect/071f9b5f-25fd-4c5c-a97b-8a05329a8d0e?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] to implement industrial robots.
Popular platform LMArena is now a company [ https://substack.com/redirect/d3367f19-add0-4c49-ac72-dd342d45decc?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
1Fort raised $7.5 million [ https://substack.com/redirect/4c2a98b3-5dd0-4e55-9b8f-a22e2b5dbeab?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] to improve small business insurance with AI.
Kore.ai and G42 announced a strategic partnership [ https://substack.com/redirect/d083ea15-4cd7-427e-a0e9-f9adfc8c9f12?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] to expand AI capabilities in the enterprise.
AI data platform Deck raised $12 million in a new round [ https://substack.com/redirect/2713347c-ba11-4aec-a04c-9eccae48ecef?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
Capsule raised $12 million [ https://substack.com/redirect/6b26b365-b2da-46cb-914a-2f85224e236a?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] for its AI video platform.
The team from eval platform Context.ai is joining OpenAI [ https://substack.com/redirect/82335b26-a130-4271-a507-fe05b8feba21?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
AI voice platform Telli raised $3.6 million in a new round [ https://substack.com/redirect/74b9a08e-fda2-4f74-8057-0f287e3f5848?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ].
Atomic raised $3 million [ https://substack.com/redirect/7158ebce-14c7-4276-a476-c1c1c67dad9b?j=eyJ1IjoiM2dmeXZtIn0.xu76uFObqArDfP822j-jnN48_jCfgM3m0rbAsF0l24U ] to build an AI-first inventory planning system.

--4cba27f4af5841bfc12a8437568cc7ed0ade2cfc2f6251aac61282e2319c
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The Sequence Radar #526: The OpenAI Blitz: From GPT-4.1 to Windsurf
New models, acquisitions and tools signal a rapid expansion plan.
--4cba27f4af5841bfc12a8437568cc7ed0ade2cfc2f6251aac61282e2319c--