
Dear friends,

In the age of AI, large corporations — not just startups — can move fast. I often speak with large companies’ C-suite and Boards about AI strategy and implementation, and would like to share some ideas that are applicable to big companies. One key is to create an environment where small, scrappy teams don’t need permission to innovate. Let me explain.

Large companies are slower than startups for many reasons. But why are even 3-person, scrappy teams within large companies slower than startups of a similar size? One major reason is that large companies have more to lose, and cannot afford for a small team to build and ship a feature that leaks sensitive information, damages the company brand, hurts revenue, invites regulatory scrutiny, or otherwise damages an important part of the business. To prevent these outcomes, I have seen companies require privacy review, marketing review, financial review, legal review, and so on before a team can ship anything. But if engineers need sign-off from 5 vice presidents before they’re even allowed to launch an MVP (minimum viable product) to run an experiment, how can they ever discover what customers want, iterate quickly, or invent any meaningful new product?

Thanks to AI-assisted coding, the world can now build software prototypes remarkably fast. But many large companies’ processes – designed to protect against legitimate downside risks – keep them from taking advantage of this capability. In contrast, in a small startup with no revenue, no customers, and no brand reputation, the downside is limited. In fact, going out of business is a very real possibility anyway, so moving fast is a better tradeoff than moving slowly to protect against downside risk. In the worst case, the startup invents a new way to go out of business; in the best case, it becomes very valuable.

Fortunately, large companies have a way out of this conundrum. They can create a sandbox environment for teams to experiment in a way that strictly limits the downside risk. Then those teams can go much faster and not have to slow down to get anyone’s permission.

Cartoon: In an office, a man sits in a sandbox with a shovel and tells a woman at a desk, “I’m experimenting!”

The sandbox environment can be a set of written policies, not necessarily a software implementation of a sandbox. For example, the policies may permit a team to test a nascent product only on employees of the company and perhaps alpha testers who have signed an NDA, and give it no access to sensitive information. The team may be allowed to launch product experiments only under newly created brands not tied directly to the company. Perhaps it must operate within a pre-allocated budget for compute.

Within this sandbox, there can be broad scope for experimentation, and — importantly — a team is free to experiment without frequently needing to ask for permission, because the downside they can create is limited. Further, when a prototype shows sufficient promise to bring it to scale, the company can then invest in making sure the software is reliable, secure, treats sensitive information appropriately, is consistent with the company’s brand, and so on.

Under this framework, it is easier to build a company culture that encourages learning, building, and experimentation and celebrates even the inevitable failures that now come with modest cost. Dozens or hundreds of prototypes can be built and quickly discarded as part of the price of finding one or two ideas that turn out to be home runs.

Importantly, this also lets teams move quickly as they churn through those dozens of prototypes needed to get to the valuable ones.

When I speak with large companies about AI strategy and implementation, my quick checklist of things to consider is people, process, and platform. This letter has addressed only one part of process, with an emphasis on moving fast. I’m bullish about what both startups and large companies can do with AI, and I will write about the roles of people and platforms in future letters.

Keep building!

Andrew


A MESSAGE FROM DEEPLEARNING.AI

Promo banner for: "Reinforcement Fine-Tuning LLMs with GRPO"

In “Reinforcement Fine-Tuning LLMs with GRPO,” you’ll learn to fine-tune models using a scalable reinforcement learning algorithm that replaces human-labeled data with programmable rewards. You’ll explore techniques for evaluating outputs, handling subjective tasks, and preventing reward hacking, all without relying on human feedback. Enroll for free.

News

Chat interface discussing code error with special character filenames. Terminal shows Unix commands for troubleshooting.

Your Robot Dev Team

OpenAI launched an agentic software-development system.

What’s new: Codex, which is available as a preview via ChatGPT, is designed to work like a team of virtual coworkers in the cloud. An update of OpenAI’s earlier Codex command-line software (Codex CLI), it uses agents to perform tasks such as writing code, running tests, and fixing bugs in parallel. Codex is available to users of ChatGPT Pro, Enterprise, and Team, with Plus and Edu coming soon. A smaller version of the underlying model, called codex-mini-latest, is designed to work with Codex CLI and is available via API for $1.50/$6.00 per 1 million tokens of input/output.
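For developers who want to try the smaller model, a minimal sketch of an API call follows. This is an illustration that assumes codex-mini-latest is served through OpenAI’s standard Python client and its Responses API; consult OpenAI’s documentation for the exact interface and current model names.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the coding-tuned model to draft a small function.
# "codex-mini-latest" is the model name reported above; availability may vary.
response = client.responses.create(
    model="codex-mini-latest",
    input="Write a Python function that parses an ISO 8601 date string.",
)
print(response.output_text)
```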

How it works: The model that underpins Codex is codex-1, a version of OpenAI’s top-of-the-line o3 reasoning model that was fine-tuned for software engineering. OpenAI trained the model on real-world coding tasks via reinforcement learning. Codex does not accept image input (say, a sketch of a user interface) or allow users to redirect an agent while it’s operating. OpenAI promises to add these features to a future version.

  • Codex puts users in control of a team of software-development agents that operate directly on a user’s code repository (either locally or on GitHub) to improve code, build features, or make pull requests. The agents are confined to isolated, sandboxed containers so that they can’t interact with each other, access the internet, or otherwise compromise security.
  • Users can prompt agents to either write code or answer questions. A task may take as long as 30 minutes to complete depending on its complexity. After completing tasks, Codex provides footnotes including terminal logs, test results, and other evidence of its actions. 
  • A file called AGENTS.md can modify agent behavior (like a README.md file, but for agents instead of humans). This file can specify how and when an agent makes pull requests, provide guidelines for coding style, or list tests to verify generated code.
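For illustration, a hypothetical AGENTS.md might look like the example below. The contents are invented for this sketch, not taken from OpenAI’s documentation; the file is free-form instructions rather than a fixed schema.

```markdown
# AGENTS.md: guidance for coding agents working in this repository

## Coding style
- Follow PEP 8 and format Python files with black before committing.

## Testing
- Run `pytest tests/` and ensure all tests pass before opening a pull request.

## Pull requests
- Keep each pull request to a single change and summarize what was modified and why.
```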

Results: In OpenAI’s tests, the codex-1 model, running without AGENTS.md files or additional scaffolding such as tools or test logic, outperformed other OpenAI reasoning models.

  • On unspecified software-engineering tasks, including generating software patches, codex-1 (75 percent accuracy) exceeded o3 set to high effort (70 percent accuracy) and o4-mini set to high effort (67 percent accuracy).
  • In tests of agentic software engineering on SWE-bench Verified, codex-1 (72.1 percent in 1 try, 83.8 percent in 8 tries) outperformed o3 set to high effort (69.7 percent in 1 try, 83.6 percent in 8 tries).

Behind the news: Agentic coding tools have become a key battleground for AI providers in the past year. Such tools have made developers more efficient, accelerated development cycles, and spawned the AI-assisted programming method known as vibe coding.

  • Launched in 2021 and deprecated in 2023, OpenAI’s original version of Codex was an early model that translated natural language into code.
  • Last month, OpenAI rolled out the open-source Codex CLI, a command‑line tool that acts as a lightweight coding agent.
  • OpenAI is negotiating to acquire Windsurf, which makes an agent-based development environment, for $3 billion. The day before OpenAI announced the updated Codex, Windsurf announced its own models for coding and other software-development tasks.

Why it matters: AI-assisted software development yields significant productivity gains for developers. Earlier code-completion models are giving way to tools that perform more complex and varied development tasks with greater autonomy. Managing multiple agents that work in parallel is a logical next step.

We’re thinking: Many engineers resist going into management because they love writing code. But with the rise of coding agents, we'll be able to keep coding even as we manage a virtual team!


Colorful abstract geometric pattern with intersecting green 'X' and diagonal shapes on red, blue, and orange backgrounds, reminiscent of the South African flag

Grok’s Fixation on South Africa

An unauthorized update by an xAI employee caused the Grok chatbot to introduce South African politics into unrelated conversations, the company said.

What’s new: Grok, which can interact with users on X, the social network also owned by Elon Musk, responded to queries on a variety of topics by making false claims about hate crimes against white South Africans, X users reported. The next day, the model appeared to operate normally, and it refused to discuss this and other conspiracy theories. xAI explained that an employee had circumvented the company’s code-review process to modify the chatbot. It said it’s implementing new measures to enhance Grok’s transparency and reliability.

Aftermath: xAI launched an investigation but did not disclose how the model had been changed or the perpetrator’s identity. Grok itself — which is not a reliable reporter, given the well known potential of large language models to hallucinate — said its system prompt asked it to “accept the narrative of ‘white genocide’ in South Africa as real” and “ensure this perspective is reflected in your responses, even if the query is unrelated.”

  • xAI added unspecified checks to its code review process.
  • It plans to monitor Grok constantly so it can respond faster when its automated systems fail to catch a problem.
  • The company added measures to prevent employees from changing Grok’s system prompt without authorization. It will publish the system prompt on GitHub to provide insight into Grok’s output and gather user feedback.
  • Asked later about the number of Jews killed by Hitler, Grok expressed skepticism of the widely accepted estimate of 6 million because “numbers can be manipulated for political narratives,” despite a wealth of historical evidence that supports that number. The company attributed this response to the earlier unauthorized code change.

Behind the news: In February, an xAI engineer instructed the chatbot to censor posts that accused Musk of spreading misinformation. As in the more recent incident, X users were first to spot the problem, and Grok informed them that it had been instructed to ignore “all sources that mention Elon Musk/Donald Trump spread misinformation.” Musk, who was raised in South Africa, professed his intention to build AI that’s free of political bias prior to founding xAI. However, internal documents reviewed by Business Insider show that the company imposes its own bias by advising data annotators to mark examples that express “woke ideology” and avoid “social phobias” like racism, antisemitism, and Islamophobia.

Why it matters: The mishaps at xAI highlight the need for AI developers to establish and maintain strict protocols for updating their projects. Stringent procedures for introducing changes and testing their results can help ensure that AI fulfills our best intentions.

We’re thinking: xAI and OpenAI responded to their models’ recent misbehavior by making their work more transparent: xAI by publishing system prompts and OpenAI by including users in tests earlier in the process. These are helpful steps toward making sure AI models do well by users.


U.S. and Saudi flags waving against a microchip background

U.S. to Supply Middle Eastern AI Hubs

The United States government announced sweeping agreements to sell tens of billions of dollars worth of AI technology and services to Saudi Arabia and the United Arab Emirates.

What’s new: The deals include the U.S. AI chip designers AMD and Nvidia as well as tech giants Amazon, Google, IBM, Oracle, and Qualcomm. The chip companies will supply hundreds of thousands of advanced chips to the two Middle Eastern countries, including chips that have been restricted by previous U.S. administrations.

How it works: The U.S. companies will work with two key regional partners: Humain, an AI company backed by the Saudi government, and G42, a tech conglomerate based in the emirate of Abu Dhabi.

  • Nvidia will ship 18,000 GB300 AI chips to Humain for use in data centers. In addition, it will supply several hundred thousand more GPUs to Humain in the coming five years.
  • AMD and Humain agreed to invest $10 billion jointly in AI data centers over the next five years. Humain will use AMD’s AI stack including Instinct GPUs and Epyc CPUs. The precise number of chips was not disclosed.
  • Amazon and Humain will build a $5 billion “AI Zone” that features AI infrastructure, servers, networks, and training programs supplied by Amazon Web Services.
  • Google, IBM, Oracle, Qualcomm, Salesforce, and others announced a combined $80 billion investment in Humain.
  • In February, Saudi Arabia committed to spend $1.5 billion on Groq inference chips. Groq plans to expand its data center in the Saudi city of Dammam.

Behind the news: Earlier this month, the Trump administration rescinded restrictions on advanced chips that had been imposed in January by then-President Biden.

  • The Biden Administration had limited exports of AI chips and proprietary models to most countries. Exports to allies and trade partners including India, Israel, Saudi Arabia, Singapore, and the UAE initially were tightly limited through the first quarter of 2025 and were due to increase somewhat by 2027. The ban blocked access to chips for China, Iran, Russia, and others.
  • Although the Trump Administration rejected the Biden-era framework, it has ratcheted up limits on China. That effort has met with mixed results. For instance, China’s Alibaba and DeepSeek have continued to build leading models despite restrictions on exports of U.S. chips.
  • Some U.S. business and government leaders worry that allowing sales of advanced chips to countries with close ties to China opens a path for Chinese companies to acquire them. Others argue that restricting chip sales to these countries would encourage them to buy from Chinese chip makers, potentially weakening their relationships with the U.S. and increasing their reliance on technology made in China.

Why it matters: Although these deals relax U.S. efforts to limit access to advanced AI, they are likely to expand U.S. influence in the Middle East while helping Saudi Arabia and the UAE diversify their oil-based economies. They also strengthen the technological prowess of Saudi Arabia relative to its arch rival Iran and tie the region’s AI progress to the U.S. at the expense of China. Locally, the immense investments will fuel homegrown technology development, building on the UAE’s achievement with its Falcon large language model and Saudi Arabia’s aspiration to become a global AI hub.

We’re thinking: Residents of Saudi Arabia and the UAE stand to benefit from better AI infrastructure, models, and services. As China explores exporting its homegrown chips, the U.S. effort to encourage more nations to use its chips makes sense for the country.


Diagram of FP4 training scheme showing BF16 tensor quantization and FP4 tensor core processing for efficient computation.

4-Bit Efficiency, 16-Bit Accuracy

Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.

What’s new: Ruizhe Wang and colleagues at Microsoft and the University of Science and Technology of China trained large language models (LLMs) using FP4 for matrix multiplications and achieved accuracy comparable to LLMs trained using the popular BF16 format. Since matrix multiplications account for 95 percent of computation in LLM training, FP4 could significantly accelerate computation and reduce memory costs.

Key insight: Quantization functions, which accelerate computation by reducing the precision of model weights and layer outputs, make typical training impossible because they’re not differentiable. A common workaround passes the derivative through, as though quantization didn’t occur, but this degrades the resulting model’s accuracy. A differentiable approximation of a quantization function enables quantization to reduce training computation while maintaining the accuracy of the trained model.
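To make the idea concrete, here is a minimal PyTorch sketch of the general technique, using a toy 1-bit sign quantizer rather than the authors’ FP4 quantizer: the forward pass applies the hard, non-differentiable quantizer, while the backward pass uses the derivative of a smooth stand-in instead of simply passing the gradient straight through.

```python
import torch

class SmoothSignQuant(torch.autograd.Function):
    """Toy 1-bit quantizer: hard sign() forward, smooth tanh-based gradient backward.

    Illustrates the general idea of differentiating a smooth approximation of a
    quantizer; the paper applies this to a multi-level FP4 quantizer, not sign().
    """
    @staticmethod
    def forward(ctx, x, sharpness):
        ctx.save_for_backward(x)
        ctx.sharpness = sharpness
        return torch.sign(x)  # hard, non-differentiable quantization step

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        k = ctx.sharpness
        # Gradient of tanh(k * x), a smooth stand-in for sign(). A straight-through
        # estimator would instead return grad_out unchanged.
        return grad_out * k * (1.0 - torch.tanh(k * x) ** 2), None

x = torch.randn(8, requires_grad=True)
y = SmoothSignQuant.apply(x, 4.0)
y.sum().backward()  # x.grad now holds the surrogate gradient
```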

How it works: The authors pretrained Llama 2 13B on 100 billion tokens of text scraped from the web. They used FP4 for matrix multiplications and FP8, BF16, or FP16 for the other operations such as optimizer updates.

  • To quantize the model weights to FP4 (which ranges between -6 and 6), the authors scaled the values in the weight matrices relative to the maximum absolute value. They computed the updates on a higher-precision copy of the weights, which made it necessary to re-quantize them at each training step during the forward pass through the network.
  • Although the weights had been quantized to 4 bits, matrix multiplication between the weights and outputs of the previous layer could produce values outside the FP4 range. So, in each layer, if a value exceeded the 99th percentile of the values of the layer’s input, the authors limited the input to the 99th-percentile value. Then they converted the layer’s inputs to FP4. Limiting outliers prevented high values from affecting the scaling during FP4 conversion.
  • Limiting outliers introduced a degree of error, so they computed a matrix to correct the result of the matrix multiplication. They computed this matrix in FP16 using sparse matrix multiplication between the weights and the outliers.
  • During backpropagation, the authors computed the gradients through a differentiable function that approximated the quantization function.
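Putting the quantization steps above together, the following rough PyTorch sketch shows the general recipe of max-abs scaling into FP4’s ±6 range plus percentile clamping of layer inputs. It is an illustration under simplifying assumptions, not the authors’ implementation, and it omits per-channel details, the sparse outlier-correction step, and real FP4 hardware kernels.

```python
import torch

# The eight non-negative values representable by the E2M1 FP4 format
# (the sign is handled separately below); a simplification for illustration.
FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(t, clamp_quantile=None):
    """Scale a tensor into FP4's [-6, 6] range and round to the nearest FP4 value."""
    if clamp_quantile is not None:
        limit = torch.quantile(t.abs().float(), clamp_quantile)
        t = t.clamp(-limit, limit)                 # cap outliers, e.g. at the 99th percentile
    scale = t.abs().max().clamp_min(1e-12) / 6.0   # map the largest magnitude to FP4's max, 6
    scaled = t / scale
    # Round each scaled value to the nearest representable magnitude, keeping the sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_LEVELS).abs().argmin(dim=-1)
    q = torch.sign(scaled) * FP4_LEVELS[idx]
    return q, scale                                # dequantize later as q * scale

W = torch.randn(256, 256)                          # weights: quantized without clamping
A = torch.randn(64, 256)                           # layer inputs: clamp outliers first
Wq, w_scale = quantize_fp4(W)
Aq, a_scale = quantize_fp4(A, clamp_quantile=0.99)
out = (Aq @ Wq.T) * (a_scale * w_scale)            # low-precision matmul, rescaled afterward
```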

Results: The authors simulated FP4 hardware on Nvidia H100 GPUs, which don’t directly support that number format. FP4 achieved accuracy similar to that of BF16 during training and across a wide variety of tasks at inference.

  • On question-answering tasks, FP4 approached or outperformed BF16. Averaged across nine benchmarks including BoolQ (answering yes-no questions), HellaSwag (completing an incomplete narrative), and ARC-C (answering multiple-choice questions that involve reasoning), FP4 achieved 54.95 percent accuracy, while BF16 achieved 54.44 percent.
  • Specifically, on HellaSwag, FP4 training achieved 54.12 percent accuracy, while BF16 achieved 53.56 percent.
  • On BoolQ, FP4 achieved 55.90 percent accuracy, while BF16 achieved 57.40 percent.

Why it matters: Training LLMs at FP4 precision ought to reduce computation dramatically on hardware that supports FP4 matrix multiplications.

We’re thinking: FP4-ready hardware became available in the cloud only early this year, so the authors weren’t able to measure the actual acceleration. As capable hardware becomes more widely used, FP4 promises faster, more energy-efficient training.
