DeepReinforce Releases Ornith-1.0 on June 25: MIT-Licensed Open-Source Agentic Coding LLMs (9B–397B) Hit Up to 82.4 on SWE-bench Verified

DeepReinforce has released Ornith-1.0, a family of open-source models built specifically for agentic coding in real terminal and repository environments, a targeted push toward the autonomous software development workflows that crypto and other open-source ecosystems depend on. Announced on June 25 and distributed under the MIT license without regional restrictions, the suite spans four sizes (9B, 31B, a 35B mixture-of-experts model, and a 397B mixture-of-experts flagship) and is positioned for developer pipelines rather than general-purpose chat. The smallest model’s headline result is notable: the 9B variant records 69.4 on SWE-bench Verified, surpassing Google’s Gemma 4-31B at 52.0.

AI Integration

Ornith-1.0 is designed for agentic AI, where a model receives a task and takes a sequence of actions to complete it, instead of replying in a single exchange. In coding, that means reading files, running tests, identifying failures, applying fixes, and iterating until the issue is resolved. This action-oriented approach—already central to how autonomous programs are being used in cryptocurrency contexts—shifts the focus from conversational fluency to end-to-end task completion inside constrained, auditable environments like containers and source repositories.
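The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepReinforce's implementation: run_agent_loop, toy_run_tests, and toy_propose_fix are hypothetical names, the repository is a plain dictionary, and the "fix" is a hard-coded string replacement standing in for a model's edit.

```python
from typing import Callable, Dict, List, Tuple

Repo = Dict[str, str]  # hypothetical stand-in: file name -> file contents


def run_agent_loop(
    repo: Repo,
    run_tests: Callable[[Repo], List[str]],
    propose_fix: Callable[[Repo, List[str]], Repo],
    max_steps: int = 5,
) -> Tuple[Repo, bool, List[List[str]]]:
    """Run tests, apply a fix, and repeat until tests pass or the budget ends."""
    history: List[List[str]] = []
    for _ in range(max_steps):
        failures = run_tests(repo)          # observe: which tests fail right now?
        history.append(failures)
        if not failures:                    # done: nothing left to fix
            return repo, True, history
        repo = propose_fix(repo, failures)  # act: edit the repo, then loop again
    final_failures = run_tests(repo)        # verify the last edit before giving up
    history.append(final_failures)
    return repo, not final_failures, history


def toy_run_tests(repo: Repo) -> List[str]:
    """Toy test suite: execute calc.py and check that add(2, 3) == 5."""
    namespace: dict = {}
    exec(repo["calc.py"], namespace)
    return [] if namespace["add"](2, 3) == 5 else ["test_add"]


def toy_propose_fix(repo: Repo, failures: List[str]) -> Repo:
    """Toy 'model': a fixed string replacement instead of a learned edit."""
    fixed = dict(repo)
    fixed["calc.py"] = repo["calc.py"].replace("a - b", "a + b")
    return fixed


buggy = {"calc.py": "def add(a, b):\n    return a - b\n"}
fixed_repo, solved, history = run_agent_loop(buggy, toy_run_tests, toy_propose_fix)
print(solved, history)  # True [['test_add'], []]
```

The point of the sketch is the control flow: the test run is the only signal the loop trusts, and the step budget bounds how long the agent may keep trying.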

Unlike many coding agents that rely on a fixed, human-crafted harness to decide when to call tools or how to decompose problems, Ornith treats the scaffold itself as something to be learned alongside the model’s policy. During reinforcement learning, each training step unfolds in two stages: the model first proposes a refined strategy for the task, then applies that plan to produce a solution. Rewards from outcomes propagate to both stages. Over many iterations, Ornith is not just trained to write better code; it is trained to develop more effective strategies for solving multi-step problems without a hand-written playbook.
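To make the two-stage idea concrete, here is a toy sketch in which one outcome reward credits both the chosen strategy and the solution produced under it. It is an assumption-laden illustration only: the article does not describe DeepReinforce's actual algorithm, and the preference tables, the additive update, and every name below (two_stage_step, sample, toy_reward, the strategy and solution labels) are hypothetical stand-ins for a real policy-gradient setup.

```python
import math
import random
from typing import Callable, Dict, Tuple

Prefs = Dict[str, float]  # hypothetical: option name -> learned preference


def sample(prefs: Prefs, rng: random.Random) -> str:
    """Draw an option with probability proportional to exp(preference)."""
    names = list(prefs)
    weights = [math.exp(prefs[name]) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def two_stage_step(
    strategy_prefs: Prefs,
    solution_prefs: Dict[str, Prefs],
    reward_fn: Callable[[str, str], float],
    rng: random.Random,
    lr: float = 0.5,
) -> Tuple[str, str, float]:
    """One step: pick a strategy, solve under it, then credit both choices."""
    strategy = sample(strategy_prefs, rng)            # stage 1: propose a strategy
    solution = sample(solution_prefs[strategy], rng)  # stage 2: apply the strategy
    reward = reward_fn(strategy, solution)            # outcome comes from the task
    strategy_prefs[strategy] += lr * reward           # reward reaches stage 1 ...
    solution_prefs[strategy][solution] += lr * reward # ... and stage 2
    return strategy, solution, reward


def toy_reward(strategy: str, solution: str) -> float:
    """Toy task: only one strategy/solution pair counts as solving it."""
    return 1.0 if (strategy, solution) == ("reproduce_first", "small_patch") else 0.0


rng = random.Random(0)
strategy_prefs = {"reproduce_first": 0.0, "patch_blind": 0.0}
solution_prefs = {name: {"small_patch": 0.0, "rewrite": 0.0} for name in strategy_prefs}
for _ in range(200):
    two_stage_step(strategy_prefs, solution_prefs, toy_reward, rng)
print(strategy_prefs)
```

Because only the rewarded pair ever receives credit in this toy, the preference for every other option stays at zero; a real system would use a proper gradient estimator and a learned model rather than lookup tables.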

Technology Use Case

The intended arena for these models is the developer workflow that runs inside terminals and code repositories—exactly the kind of setup used to maintain open-source projects and, increasingly, to manage complex pipelines that touch cryptocurrency tooling. The emphasis is on real execution in containerized environments and on test-driven resolution of issues, including debugging asynchronous behavior and addressing security vulnerabilities, as captured in task suites used to evaluate agentic coding.

To mitigate reward hacking—a risk when an agent can influence the very scaffold it learns—DeepReinforce describes three defenses. First, the environment and test suite are fixed and remain outside the model’s control. Second, a deterministic monitor flags any attempt to access restricted paths or alter verification scripts. Third, a frozen judge model sits above the automated verifier with veto power. Together, these controls are intended to keep the agent’s incentives aligned with actually solving the task, rather than gaming the validation process.
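The second and third defenses can be pictured as gates in front of the reward. The sketch below is one possible reading, not DeepReinforce's code: the restricted paths, the 0.0/1.0 reward values, and the names touches_restricted and gated_reward are all assumptions, and the first defense (a fixed environment and test suite) is a property of the setup rather than something this function enforces.

```python
import posixpath
from pathlib import PurePosixPath
from typing import Iterable

# Hypothetical restricted locations; the article does not name the real ones.
RESTRICTED = (PurePosixPath("/tests"), PurePosixPath("/verify"))


def touches_restricted(path: str) -> bool:
    """Deterministic check: is the path inside (or equal to) a restricted dir?"""
    # Collapse ".." segments first so "/repo/../tests/x" cannot slip through.
    normalized = PurePosixPath(posixpath.normpath(path))
    return any(normalized == root or root in normalized.parents for root in RESTRICTED)


def gated_reward(
    accessed_paths: Iterable[str],
    verifier_passed: bool,
    judge_approves: bool,
) -> float:
    """Grant reward only if the monitor, the verifier, and the judge all agree."""
    if any(touches_restricted(p) for p in accessed_paths):
        return 0.0  # monitor: restricted access voids the episode
    if not verifier_passed:
        return 0.0  # automated verifier: the task was not solved
    if not judge_approves:
        return 0.0  # frozen judge: veto over the verifier's pass
    return 1.0


print(gated_reward(["/repo/src/app.py"], True, True))       # 1.0
print(gated_reward(["/repo/../tests/run.sh"], True, True))  # 0.0
print(gated_reward(["/repo/src/app.py"], True, False))      # 0.0
```

Ordering the checks this way means a passing verifier can never outweigh a monitor flag or a judge veto, which matches the "veto power" described above.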

The Numbers

The 397B flagship posts 82.4 on SWE-bench Verified, an evaluation that hands an AI a real bug from an open-source GitHub repository and scores success by whether the fix passes a test suite the model never sees. That result edges out Claude Opus 4.7’s 80.8 and DeepSeek-V4-Pro’s 80.6 on the same benchmark. On Terminal Bench 2.1, a set of 89 containerized terminal tasks covering areas from debugging to security, the flagship registers 77.5, compared with 70.3 for Claude Opus 4.7.
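For readers unfamiliar with how such a score is formed, the sketch below assumes it is simply the percentage of tasks resolved. The 500-task size used here for SWE-bench Verified is not stated in this article and is an assumption, and the 412 resolved tasks is only the count that would yield 82.4 under that assumption, not a reported figure; resolve_rate is a hypothetical helper.

```python
from typing import Sequence


def resolve_rate(resolved: Sequence[bool]) -> float:
    """Percentage of tasks whose hidden tests passed, rounded to one decimal."""
    if not resolved:
        raise ValueError("need at least one task result")
    return round(100 * sum(resolved) / len(resolved), 1)


# Assumption: 500 tasks in total, 412 of them resolved.
outcomes = [True] * 412 + [False] * 88
print(resolve_rate(outcomes))  # 82.4
```

Under this reading, the 1.6-point gap to the 80.8 quoted for Claude Opus 4.7 would amount to eight tasks out of 500.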

Given public concerns about SWE-bench contamination, raised earlier this year over the possibility that models learned benchmark solutions during training, Ornith also reports performance on SWE-bench Pro, a more demanding variant built to reduce leakage from training data. On that test, the 397B model records 62.2. While meaningfully lower than its SWE-bench Verified score, it remains competitive and still ahead of DeepSeek-V4-Pro on the same measure.

The 9B model stands out for its efficiency. Its 69.4 score on SWE-bench Verified not only tops Gemma 4-31B’s 52.0 but also sits close to Qwen 3.5-35B’s 70, even though the model is three to four times smaller than either by parameter count. In practical terms, that means a compact model targeted at agentic coding can deliver results that rival or surpass larger peers on the same coding tasks, which is relevant for teams running self-hosted infrastructure or edge deployments.
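The size and score comparisons above reduce to simple arithmetic, worked through here using only the figures quoted in this article. The dictionary keys and helper names are illustrative labels, and parameter counts are taken at face value from the model names.

```python
# Parameter counts in billions and SWE-bench Verified scores, as quoted above.
PARAMS_B = {"Ornith-1.0-9B": 9, "Gemma 4-31B": 31, "Qwen 3.5-35B": 35}
SCORES = {"Ornith-1.0-9B": 69.4, "Gemma 4-31B": 52.0, "Qwen 3.5-35B": 70.0}

BASELINE = "Ornith-1.0-9B"


def size_ratio(model: str) -> float:
    """How many times larger a model is than the 9B baseline."""
    return round(PARAMS_B[model] / PARAMS_B[BASELINE], 1)


def score_gap(model: str) -> float:
    """Baseline score minus the other model's score, in benchmark points."""
    return round(SCORES[BASELINE] - SCORES[model], 1)


for model in ("Gemma 4-31B", "Qwen 3.5-35B"):
    print(model, size_ratio(model), score_gap(model))
# Gemma 4-31B 3.4 17.4
# Qwen 3.5-35B 3.9 -0.6
```

So the 9B model leads a peer about 3.4 times its size by 17.4 points and trails one about 3.9 times its size by 0.6 points.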

Industry Relevance

The lab positions Ornith-1.0 as a “self-improving” open-source family for agentic coding. The framing underscores where the most commercially relevant progress is happening in 2026: systems that can sustain multi-step development work with minimal supervision. For software efforts that underpin blockchain networks, crypto exchanges, and developer tools, the ability to operate inside repositories and terminals and repeatedly close the loop from failure to fix aligns with how these projects are maintained in practice.

The comparison headlines require context. Ornith-1.0-397B outperforms Claude Opus 4.7 on specific coding benchmarks, but Anthropic’s current flagship, Claude Opus 4.8, scores higher. The meaningful comparison is within the open-source category and at comparable parameter counts on agentic, coding-specific tasks—precisely the slice of the market where organizations look for self-hosted options and where incremental improvements can translate into faster iteration on codebases that matter to crypto and other open infrastructure.

Who It’s For—and Who It Isn’t

Ornith-1.0 is not a general-purpose assistant. Its own documentation notes that performance may fall off on tasks outside agentic coding. It is optimized for developer pipelines in which an AI agent receives a task, works inside a code repository or terminal session, and completes multi-step work without ongoing human intervention. That also means it is built for teams that have already set up agent infrastructure and automated evaluation rather than for users seeking help with everyday writing, document summaries, or academic work.

For developers building self-hosted coding pipelines, agentic infrastructure, or similarly focused workloads, the smaller and medium models may be the most practical entries in the Ornith lineup. Their footprint and specialization make them candidates for environments where running on local or edge hardware is part of the requirement, while the flagship model’s scores show what the approach can deliver at larger scales.

Market Impact

DeepReinforce’s prior work on CUDA-L1 and the IterX code-agent optimization loop sets the backdrop for Ornith-1.0’s release on Hugging Face. In the current landscape, where every lab is chasing performance on agentic coding evaluations—“because that’s where the useful performance differences live,” as Decrypt reported—the arrival of a model family purpose-built for real terminal and repository operations adds another open-source option to the tools that crypto and open-source teams already use to test, repair, and harden code. The MIT license and absence of regional restrictions further lower the barrier to experimentation for organizations that want to evaluate agentic coding in environments aligned with their security and compliance practices.

For now, the story is narrow and clear. Ornith-1.0 focuses on agentic coding rather than general chat, treats scaffolding as part of what the model can learn, and posts competitive results on benchmarks designed to reflect real engineering work. In a year where agentic AI is defining how software—and by extension the infrastructure of cryptocurrency—gets built and maintained, DeepReinforce’s release slots directly into the workflows that matter, with performance claims spelled out and caveats about scope made explicit.