AWS has deployed Cerebras CS-3 systems, increasing token throughput fivefold. It is the first significant breakthrough in inference infrastructure after a year dominated by the inference performance crisis.

AWS has announced the deployment of specialized Cerebras CS-3 systems through the AWS Bedrock service, enabling record-breaking AI inference speed. The new architecture uses AWS Trainium for the prefill phase and Cerebras WSE for decoding, delivering a fivefold increase in token throughput. This solution is critical for companies scaling AI applications in production.

Inference Architecture: Division of Labor Between Specialized Chips

The traditional approach to AI inference uses general-purpose GPUs that handle every stage of computation. AWS and Cerebras have taken a fundamentally different approach: splitting the process into two phases and running each on hardware optimized for it. AWS Trainium handles the prefill phase (processing the input context and populating the key-value cache), while Cerebras WSE specializes in the decode phase (generating output tokens). This architecture eliminates bottlenecks inherent in monolithic systems. The result is a fivefold increase in token throughput: the same hardware can serve five times as many user requests. For businesses, this translates to a lower cost per request and the ability to scale without a proportional increase in infrastructure spending.
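
To make the division of labor concrete, here is a minimal conceptual sketch of a disaggregated prefill/decode pipeline. The class names and handoff interface are purely illustrative assumptions; AWS and Cerebras have not published implementation details.

```python
# Conceptual sketch of disaggregated inference: one backend fills the KV cache,
# another generates tokens against it. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Keys/values computed during prefill and consumed during decode."""
    tokens: list[int] = field(default_factory=list)

class PrefillBackend:
    """Stands in for AWS Trainium: processes the whole prompt in one parallel pass."""
    def run(self, prompt: list[int]) -> KVCache:
        return KVCache(tokens=list(prompt))

class DecodeBackend:
    """Stands in for Cerebras WSE: emits one token per step, latency-sensitive."""
    def step(self, cache: KVCache) -> int:
        next_token = cache.tokens[-1] + 1  # toy stand-in, not real sampling
        cache.tokens.append(next_token)
        return next_token

def generate(prompt: list[int], max_new_tokens: int = 4) -> list[int]:
    cache = PrefillBackend().run(prompt)            # phase 1: prefill
    decoder = DecodeBackend()
    return [decoder.step(cache) for _ in range(max_new_tokens)]  # phase 2: decode

print(generate([1, 2, 3]))  # -> [4, 5, 6, 7]
```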

Integration with AWS Bedrock and Support for Open Models

Cerebras CS-3 capacity is offered through AWS Bedrock, a managed service that provides access to various LLMs through a unified API. Companies can use open-weight models as well as Amazon Nova and other LLMs without managing the underlying infrastructure themselves. This lowers the entry barrier for small and medium-sized businesses that previously could not afford optimized inference systems. AWS Bedrock provides scalability, security, and compliance with corporate requirements, giving companies access to high-performance inference without hiring GPU optimization specialists or purchasing expensive equipment.
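
For illustration, a minimal call against Bedrock's unified Converse API via boto3 might look like the sketch below. The region and model ID are placeholders; the article does not specify how (or whether) callers explicitly select Cerebras-backed capacity, so this only shows the provider-agnostic interface.

```python
# Minimal Bedrock inference call using boto3's Converse API.
# Region and model ID are illustrative; swap in whatever your account uses.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-lite-v1:0",  # example Amazon Nova model ID
    messages=[{"role": "user", "content": [{"text": "Summarize AWS Bedrock in one sentence."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```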

Context: The Crisis of Inference Infrastructure in the AI Industry

The deployment of Cerebras on AWS comes amid a growing industry understanding that current inference infrastructure is inadequate. Google researcher Xiaoyu Ma and Turing Award winner David Patterson recently published an article in IEEE Computer showing that the crisis in AI is not about training models, but about deploying them. The hardware used for inference was designed for entirely different tasks and is not optimized for LLM workloads, creating performance bottlenecks and inefficient use of resources. The Cerebras-AWS solution addresses exactly this issue with a specialized architecture designed around the computational profile of transformer inference.

Practical Application for Business: Reducing Costs and Improving User Experience

The fivefold increase in throughput has direct economic implications. For companies running AI chatbots, recommendation systems, or other LLM-based applications, it means serving more users on the same infrastructure. Alternatively, a company can keep its current user base and cut infrastructure costs fivefold. For startups and mid-sized companies, this can be the deciding factor in whether to scale an AI application at all. Faster inference also improves the user experience: responses arrive sooner, which matters most in interactive applications. Companies like Alashed IT (it.alashed.kz), which help Kazakhstani and Central Asian businesses implement AI solutions, can now offer their clients more cost-effective and efficient deployment options.
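
The arithmetic behind that claim is simple. The sketch below uses invented dollar and throughput figures purely to show the mechanics; actual AWS pricing has not been published.

```python
# Illustrative cost math only; the figures are made up to show the mechanics,
# not actual AWS pricing.
hourly_cost = 40.0          # hypothetical cost of an inference instance, $/hour
baseline_tps = 1_000        # hypothetical tokens/second on a GPU baseline
speedup = 5                 # the fivefold throughput increase cited above

for label, tps in [("baseline", baseline_tps), ("Cerebras-backed", baseline_tps * speedup)]:
    cost_per_m_tokens = hourly_cost / (tps * 3600) * 1e6
    print(f"{label}: ${cost_per_m_tokens:.2f} per million tokens")
# baseline: $11.11 per million tokens
# Cerebras-backed: $2.22 per million tokens
```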

What This Means for Kazakhstan

For companies in Kazakhstan and Central Asia considering AI application deployment, the arrival of Cerebras on AWS is directly relevant. AWS operates regional data centers serving the Asia-Pacific region, ensuring low latency and compliance with local data residency requirements. The fivefold increase in inference performance means Kazakhstani companies can deploy AI solutions with less infrastructure investment. This matters especially for financial institutions, telecom operators, and government agencies that treat AI as a strategic priority. Cloud computing in the region remains more expensive than in developed markets, so inference efficiency directly affects the economic viability of AI projects. Companies deploying AI solutions through AWS Bedrock with Cerebras support gain a competitive advantage through lower operating costs.

Fivefold increase in token throughput using Cerebras CS-3 on AWS Bedrock.

The deployment of Cerebras CS-3 on AWS Bedrock represents a significant step in solving the inference infrastructure problem that has become a bottleneck in the AI industry. The specialized architecture with divided prefill and decode phases allows for record performance while maintaining the manageability and scalability of the cloud service. For businesses, this means the ability to deploy AI applications with lower costs and better performance.

Frequently Asked Questions

What is Cerebras CS-3 and how is it different from regular GPUs?

Cerebras CS-3 is a system built around the Wafer Scale Engine (WSE-3), a processor designed specifically for transformer and LLM workloads. Unlike general-purpose GPUs such as NVIDIA's A100 and H100, which must optimize for many kinds of computation, the WSE is optimized for the matrix operations typical of neural networks. It packs roughly 4 trillion transistors and about 900,000 compute cores onto a single wafer-scale chip, delivering significantly higher throughput for inference-specific operations.

How does the architecture with separate prefill and decode improve performance?

The prefill phase processes the input context and populates the key-value (KV) cache, which demands high computational throughput. The decode phase then generates output tokens one at a time, which demands low latency. AWS Trainium is optimized for the first workload and Cerebras WSE for the second. Separating them avoids the compromises of tuning one device for both, which is what yields the fivefold increase in overall throughput.
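
A back-of-envelope calculation shows why the two phases favor different hardware. The model size and bandwidth figures below are illustrative assumptions for a single unbatched request, not measurements of the AWS deployment.

```python
# Back-of-envelope illustration (hypothetical numbers): why prefill and decode
# stress hardware differently. Assumes a 70B-parameter model in 16-bit weights.
params = 70e9
bytes_per_weight = 2
prompt_tokens = 4096

# Prefill: all prompt tokens go through one batched pass, so roughly
# 2 * params FLOPs per token can run in parallel -> compute-bound.
prefill_flops = 2 * params * prompt_tokens
print(f"prefill work: {prefill_flops:.2e} FLOPs, parallel over {prompt_tokens} tokens")

# Decode: each new token needs every weight read from memory once
# (single request, no batching), so speed is capped by memory bandwidth.
bytes_per_token = params * bytes_per_weight
bandwidth = 3.35e12  # ~3.35 TB/s, an H100-class HBM figure, purely illustrative
print(f"decode floor: {bytes_per_token / bandwidth * 1000:.1f} ms/token from memory traffic alone")
```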

Which models are supported through AWS Bedrock with Cerebras?

AWS Bedrock with Cerebras support works with open-weight LLMs and Amazon Nova models. Companies can therefore run models provided by Amazon alongside open models they develop and optimize themselves, avoiding dependency on a single vendor.

How much does it cost to use Cerebras on AWS Bedrock?

AWS has not released detailed pricing information for Cerebras on Bedrock. However, given the fivefold increase in throughput, the cost of processing a single token can be expected to be significantly lower than on standard GPUs. Companies should contact AWS for pricing and usage terms.

When will the deployment of Cerebras on AWS be available to all companies?

The deployment of Cerebras CS-3 on AWS Bedrock was announced as a current initiative (March 2026). AWS typically expands access to new services gradually, starting with early adopters before opening general availability. Interested companies should contact AWS about availability in their region.

Photo source: llm-stats.com