How it works
Trinity Models use a Sparse Mixture of Experts (MoE) architecture: for each token, only a subset of the parameters (the experts selected by a router) is activated. This keeps latency and compute cost well below those of a dense model with the same total parameter count, especially when processing long contexts. The models are trained on diverse, high-quality data that is carefully filtered, classified, and augmented with synthetic examples. That training targets the behaviors agents depend on: precise tool calling, strict JSON schema adherence, error recovery, and maintaining conversational flow over many turns.
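To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain NumPy. The layer sizes, router, and expert weights are illustrative stand-ins, not Trinity's actual configuration; the point is simply that each token pays for only k of the n expert matmuls.

```python
import numpy as np

# Minimal sketch of sparse MoE routing (illustrative, not Trinity's implementation):
# a router scores all experts per token, but only the top-k experts actually run.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2          # toy sizes, not Trinity's

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen experts per token
    out = np.zeros_like(x)
    for t, (tok, idx) in enumerate(zip(x, top)):
        gate = np.exp(logits[t, idx])
        gate /= gate.sum()                             # softmax over selected experts only
        for g, e in zip(gate, idx):
            out[t] += g * (tok @ experts[e])           # only k of n_experts matmuls execute
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_forward(tokens).shape)                       # (5, 64)
```

With top_k=2 of 8 experts, each token touches a quarter of the expert parameters, which is where the latency and cost savings come from.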
Trinity supports a context window of up to 128K tokens for the Nano and Mini variants and 512K tokens for the Large Preview, enabling deep understanding and extended memory for applications. It offers native function calling and structured outputs that conform to user-defined JSON schemas. Developers can deploy the open weights with popular inference frameworks such as vLLM, SGLang, and llama.cpp on their own or cloud infrastructure, or use Arcee's managed, OpenAI-compatible API for quick integration.
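Because the API is OpenAI-compatible, the standard openai Python client should work against it. The base_url, API key, model identifier, and the get_weather tool below are placeholder assumptions for illustration; substitute the values from Arcee's documentation or from your own vLLM/SGLang deployment.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",   # placeholder: your server or Arcee's endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",               # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="trinity-mini",                    # placeholder model identifier
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)    # model-selected function + JSON arguments
```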
Why use it
Trinity Models are engineered for agent reliability: they select the right function, generate valid parameters, produce schema-conformant JSON, and recover gracefully from tool failures, which makes them well suited to building robust AI agents. They hold coherent multi-turn conversations, retaining context and goals over extended sessions without requiring repeated explanations.
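One way to exercise that error-recovery behavior is an agent loop that executes each tool call and, on failure, returns the error text to the model as a tool message so it can repair its arguments and retry. The loop below is our own sketch, reusing the placeholder client, tools, and get_weather stub from the earlier example; it is not an Arcee-provided harness.

```python
import json

def get_weather(city: str) -> str:
    if not city:
        raise ValueError("city must be non-empty")
    return f"Sunny in {city}"                     # stub result for illustration

messages = [{"role": "user", "content": "Weather in Lisbon, please."}]
for _ in range(5):                                # cap turns to avoid infinite loops
    resp = client.chat.completions.create(
        model="trinity-mini", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:                        # model answered directly; done
        print(msg.content)
        break
    messages.append(msg)                          # keep the assistant turn in history
    for call in msg.tool_calls:
        try:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
        except Exception as err:                  # surface the failure to the model
            result = f"ERROR: {err}"              # so it can correct itself and retry
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result}
        )
```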
A key advantage is consistent capabilities across sizes: workloads can migrate from lightweight edge devices (Nano) to powerful cloud environments (Mini, Large) without altering prompts or playbooks. The efficient attention mechanism reduces the cost of operating over long contexts, while strong context utilization keeps responses relevant and grounded. Open weights and a managed API give flexible deployment options across diverse needs, from on-device applications to high-throughput cloud services and voice assistants.
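Under that consistent-capabilities claim, moving a workload between sizes should amount to swapping the model identifier while the prompts stay fixed. The identifiers below are placeholders, not confirmed model names, and client is the placeholder from the earlier example.

```python
# Same prompt, same code path; only the model string changes per deployment tier.
for model in ("trinity-nano", "trinity-mini", "trinity-large-preview"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize our last session."}],
    )
    print(model, "->", resp.choices[0].message.content[:60])
```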