GPT-5.4 Crosses the Human Baseline — What That Means for How We Work

OpenAI’s newest model doesn’t just chat. It executes multi-step workflows across software, and it scored above the human baseline on a benchmark of real-world computer tasks.

OpenAI just dropped GPT-5.4, and the numbers tell a story that goes beyond benchmark bragging rights. On OSWorld-Verified, a test that measures whether AI can actually navigate real software environments, click buttons, fill out forms, and chain together multi-step tasks, it scored 75%. The human baseline? 72.4%.

That 2.6-point gap matters more than it sounds. It means we’re no longer comparing AI to a theoretical ceiling. We’re watching it clear the bar that working professionals set. The model also ships with a million-token context window, so it can keep a large codebase, a long legal document, or a research corpus in view while it works.

For students and educators, this shift is worth paying attention to. The AI tools entering classrooms and workplaces aren’t just answering questions anymore — they’re completing tasks. Understanding how these systems think, where they fail, and what guardrails they need is exactly the kind of literacy that will separate informed users from everyone else.

OpenAI has also surpassed $25 billion in annualized revenue and is making early moves toward a public listing, potentially as soon as late 2026. The business of AI is now as significant as the technology itself.