Tests find Anthropic’s Claude Opus 4.8 excels in math and coding but drains tokens and shows little creative writing gain

Anthropic’s latest flagship model, Claude Opus 4.8, arrived six weeks after Opus 4.7 with stronger scores in math, coding, and safety, while holding prices at $5 per million input tokens and $25 per million output tokens. Its test performance matters for teams building crypto trading tools, on‑chain analytics pipelines, and blockchain application infrastructure. In a head‑to‑head battery against its predecessor and lower‑cost Chinese rivals, the model excelled at structured, mechanical work but showed limited progress in open‑ended creativity and revealed practical constraints tied to token consumption and long‑context handling.

Technology Use Case

The review subjected Opus 4.8 to a recurring suite of challenges across creative writing, coding, math, logic, long‑context recall, and non‑math reasoning. The results form a consistent pattern: the model improved where Claude has traditionally been strong and lagged where it has historically stumbled. For crypto developers who value reliability in code generation and deterministic reasoning over stylistic flair, these shifts align with day‑to‑day needs in infrastructure scripting, analytics queries, and systems integration.

In coding, a one‑prompt game build produced a competent “Typing Dead” prototype that surpassed prior Anthropic outputs on splash screen polish, visual design, and mechanics. Notably, the model identified and corrected several of its own bugs mid‑inference—an ability that often determines whether an AI coding assistant remains useful as a project grows. Follow‑up prompts (“multi‑shotting”) consistently refined the codebase instead of destabilizing it, a common failure mode in complex builds. For any crypto‑adjacent environment—where engineers iterate on trading dashboards, risk monitors, or node‑level tooling—this steadier multi‑turn cadence can shorten prototyping cycles and reduce break‑fix churn.

Math showed the clearest generational gain. On a demanding FrontierMath task—constructing a degree‑19 polynomial with specified structural properties and computing p(19)—Opus 4.8 recognized the appropriate construction and delivered the correct result without freezing or hedging. Opus 4.7 did not reach a correct solution after multiple attempts, marking a visible step up. For quantitative research and analytics teams in digital asset markets, stronger mathematical reliability can help when codifying formulae, checking symbolic derivations, or auditing calculations embedded in research notes.

Logic and common sense reasoning also improved. Faced with a classic trap—“Is it lawful for a man to marry his widow’s sister under Falkland Islands law?”—Opus 4.8 surfaced the contradiction (“if a man has a widow, he is dead”), answered the literal question, and then offered the intended legal analysis, referencing the Deceased Wife’s Sister’s Marriage Act 1907 and the Falkland Islands Marriage Ordinance. That explicit framing—identifying a flawed premise before proceeding—maps to safer assistance patterns in crypto compliance or policy write‑ups, where misreading a query can carry downstream consequences.

AI Integration

Two areas underscored the model’s limits. The first was non‑math reasoning: a whodunit scenario required careful timeline tracking to identify the true culprit (Leo). Opus 4.8 assembled an internally coherent case that exonerated Leo and implicated another character (Eric). The reasoning was elegant, but it was wrong. For research workflows in digital assets, this illustrates a known risk: a large language model can produce a persuasive analysis that is confidently mistaken.

The second was long‑context retrieval. A 300K‑token “needle in a haystack” run failed outright: the model collapsed under the context size, which undercuts marketing claims about ultra‑long windows once a heavy real‑world load is applied. An 85K‑token run did succeed in locating two planted lines inside The Devil’s Dictionary and correctly flagged them as interpolations; however, the model then refused to report them, seemingly invoking behavioral safety constraints. That interaction, in which the model completed the task but declined to state the answer, highlights friction crypto teams may face when asking an assistant to extract questionable or adversarial content from long documents.
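A long‑context retrieval check of this kind can be sketched in a few lines of Python. This is an illustrative harness, not the testers’ actual setup: the function names (`build_haystack`, `score_recall`, `run_trial`) and the `ask_model` callable are hypothetical, and the scoring is a plain substring match chosen for simplicity.

```python
from typing import Callable, Sequence


def build_haystack(filler_lines: Sequence[str],
                   needles: Sequence[str],
                   positions: Sequence[float]) -> str:
    """Plant each needle at a fractional depth (0.0 = start, 1.0 = end)."""
    if len(needles) != len(positions):
        raise ValueError("needles and positions must have the same length")
    lines = list(filler_lines)
    # Insert the deepest needle first so shallower indexes are not shifted.
    planted = sorted(zip(needles, positions), key=lambda p: p[1], reverse=True)
    for needle, depth in planted:
        lines.insert(int(depth * len(filler_lines)), needle)
    return "\n".join(lines)


def score_recall(answer: str, needles: Sequence[str]) -> float:
    """Fraction of planted lines quoted in the answer (case-insensitive)."""
    lowered = answer.lower()
    return sum(1 for n in needles if n.lower() in lowered) / len(needles)


def run_trial(ask_model: Callable[[str], str],
              filler_lines: Sequence[str],
              needles: Sequence[str],
              positions: Sequence[float]) -> float:
    """Build the document, query the model once, and score the reply.

    ask_model is a hypothetical stand-in for whatever client call
    sends a prompt to the model and returns its text reply.
    """
    haystack = build_haystack(filler_lines, needles, positions)
    prompt = ("Quote any lines that do not belong in this document.\n\n"
              + haystack)
    return score_recall(ask_model(prompt), needles)
```

Under substring scoring like this, a reply that says the interpolations were found but declines to quote them scores the same as a miss, which is one reason the refusal described above matters for automated evaluation.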

Creative writing results, by contrast, were largely static. Using the same narrative prompt applied to MiMo and Qwen, Opus 4.8 generated vivid prose with clean, closed‑loop plotting, yet it did not surpass prior outputs on fluidity, surprise, or clarity of events. Placed next to Opus 4.7, the improvement was hard to detect, and in single‑pass default settings, it appeared slightly behind. For crypto organizations, where marketing copy and community updates benefit from voice and momentum, this plateau suggests the model’s strongest returns still accrue in technical, not stylistic, tasks.

Market Impact

The biggest practical drawback for hands‑on development is token appetite. According to the testers, a single coding prompt drained the entire Pro quota, rendering Opus 4.8 unwieldy for substantial projects unless users shift to a Max plan or rely on heavy API spend. The review attributes this in part to a deliberately less efficient tokenizer, which consumes more tokens to process the same prompt. For crypto startups and individual builders operating on constrained budgets, that cost profile can force trade‑offs: pause work while waiting for quotas to reset, migrate to Claude Max, or switch to a cheaper provider, such as OpenAI with its more generous quotas or Chinese models that, in these tests, delivered comparable results at under 25% of the cost.
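For teams weighing API spend, the arithmetic can be sketched as follows. The two prices are the ones quoted in this article; the `token_inflation` parameter is a hypothetical multiplier for modeling a tokenizer that emits more tokens for the same text, and the article gives no measured value for it.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_PRICE_PER_MTOK = 5.0
OUTPUT_PRICE_PER_MTOK = 25.0


def estimate_cost(input_tokens: int,
                  output_tokens: int,
                  token_inflation: float = 1.0) -> float:
    """Estimate the USD cost of one request.

    token_inflation is an assumed multiplier (1.0 = no change) applied
    to both token counts to model a less efficient tokenizer.
    """
    if token_inflation <= 0:
        raise ValueError("token_inflation must be positive")
    billed_input = input_tokens * token_inflation
    billed_output = output_tokens * token_inflation
    total = (billed_input * INPUT_PRICE_PER_MTOK
             + billed_output * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative token counts, not figures from the review.
    print(f"${estimate_cost(200_000, 40_000):.2f}")
    print(f"${estimate_cost(200_000, 40_000, token_inflation=1.5):.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, long generated code listings tend to dominate the bill more than long prompts do.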

Many crypto engineering tasks involve rapid, iterative experimentation, such as data wrangling for exchange feeds, pipeline scripts for on‑chain metrics, or UI tweaks for wallet and node dashboards, so hitting quota ceilings mid‑session can slow delivery. The review’s assessment is straightforward: typical coders unwilling to pay $100–$200 per month are more likely to leave for a competitor than to absorb a 10x jump for a model that is not 10x more capable than its predecessor. The testers frame that outcome as the bet Anthropic is making with its base.

Industry Response

Despite these trade‑offs, the evaluation notes that Anthropic appears positioned to go public at a valuation approaching $1 trillion. That backdrop helps explain the product focus visible in the results: Opus 4.8 clearly targets coders—and coders with the budget to prioritize stability, safety, and deterministic performance over creative range. In crypto‑adjacent work, that emphasis aligns with tasks like toolchain maintenance, refactoring, and structured analysis, even if it leaves writers and storytellers to look elsewhere for stylistic lift.

Overall, the verdict across six tests is consistent. Opus 4.8 strengthens Claude in the areas where many blockchain and trading teams need help most—math, code generation, bug‑aware iteration, and transparent handling of logical traps—while showing minimal progress in free‑form writing and exposing costs and context limits that can bottleneck real‑world adoption. The supporting materials, including the full creative story, math solution, and reasoning transcripts, are available on GitHub, and the two playable game builds (“Typing Dead” and its multi‑shot iteration) illustrate how the model improves with follow‑ups. For AI in crypto, the takeaway is pragmatic: Opus 4.8 is a more capable technical assistant with a steeper operational tab and some guardrails that still get in the way.