Cerebras had a good IPO today. Their technology is very fast, but in terms of throughput, Nvidia NVL72 is 7x cheaper. FYI, Huawei's current and last generation are even cheaper than Nvidia per token. One advantage Cerebras has, though, is that production is not dependent on HBM3/HBM4 memory. A big disadvantage is that their software is difficult, and they don't offer any models newer than about a year old, as they are slow to implement bleeding-edge optimizations. Long contexts are also relatively slow on Cerebras.
Their 2nd customer is OpenAI, under a $20B, 3-year lease for up to 750MW of compute (equal to six years of 250MW blocks). The most optimistic cost per token possible for OpenAI is $10.53/M at 100% utilization. Realistic optimism is 20% of capacity, which means over $50/M. OpenAI is initially using the cluster to run Codex-Spark 5.3, which it charges customers $14/M tokens for. OpenAI also has the privilege of paying all OPEX: power alone, at just 7c/kWh, adds 50c/M tokens in the ideal case.
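A quick sanity check on those figures. The dollar amounts are from the post; the fleet-wide throughput is not published anywhere, it is just what the $10.53/M figure implies, so treat it as a derived assumption:

```python
# Back-of-envelope check of the lease arithmetic.
# Dollar figures are from the post; throughput is implied, not published.

LEASE_TOTAL = 20e9        # $20B total lease
LEASE_YEARS = 3
BEST_COST_PER_M = 10.53   # $/M tokens at 100% utilization (from the post)

annual_lease = LEASE_TOTAL / LEASE_YEARS  # ~$6.67B/yr

# Implied annual output at 100% utilization, in millions of tokens,
# then converted to a fleet-wide tokens/sec rate:
implied_m_tokens_per_year = annual_lease / BEST_COST_PER_M
implied_tokens_per_sec = implied_m_tokens_per_year * 1e6 / (365 * 24 * 3600)
print(f"implied output: {implied_tokens_per_sec / 1e6:.1f}M tokens/sec fleet-wide")

# At 20% utilization, the same lease cost spreads over 1/5 as many tokens:
for util in (1.0, 0.2):
    print(f"utilization {util:.0%}: ${BEST_COST_PER_M / util:.2f}/M tokens")
# 20% utilization works out to $52.65/M, i.e. "over $50/M"
```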
Their first customer was G42, a group owned by the UAE monarchy. Even if the UAE has permission to buy Nvidia, Cerebras offers quicker delivery, and the UAE helped develop (and controls) the software. Apparently Arabic has advantages on the chip, but they are still planning Nvidia-dominated expansions, with Patriot air defense guarding the systems.
I made a math mistake. The theoretical minimum cost to OpenAI is $3.15/M tokens ($3.30/M with electricity), as Cerebras has fixed context windows per user, and Codex-Spark allows 3.33 concurrent users per node. That is still a $16.50/M optimistic cost (at 20% of theoretical capacity) against $14/M revenue.
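Re-running the numbers with that concurrency correction, using only figures stated in the post (note the post rounds $3.16 to $3.15 and $16.56 to $16.50):

```python
# Corrected cost arithmetic: the $10.53/M figure assumed one user per node,
# but fixed context windows let 3.33 Codex-Spark users share a node.

single_user_cost = 10.53   # $/M tokens at 100% utilization, one user per node
concurrency = 3.33         # concurrent users per node (the post's claim)
power_adder = 0.50         # $/M tokens at 7c/kWh, single-user basis

per_user = single_user_cost / concurrency    # ~$3.16/M theoretical minimum
power_per_user = power_adder / concurrency   # ~$0.15/M
with_power = per_user + power_per_user       # ~$3.31/M

realistic = with_power / 0.20                # at 20% of theoretical capacity
print(f"theoretical: ${per_user:.2f}/M, with power: ${with_power:.2f}/M")
print(f"at 20% utilization: ${realistic:.2f}/M vs $14/M revenue")
```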
I guess there is a market for very-fast-response tasks. OpenAI does have a routing system that charges a high cost per token but gets most of the work done by its smaller/cheaper models behind the scenes.
But this turns out not to be ultra-stupid if OpenAI has enough internal training/improvement token workload to completely saturate the datacenter for its own use. Cerebras does have a training advantage over Nvidia; its immature software stack is only a problem for cutting-edge inference techniques.

