Anders Thoresson

Link from stratechery.com:

DeepSeek FAQ


> Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.

If your news feed looks anything like mine, there is a lot about DeepSeek in it right now. Of all the rundowns on the recent release of the reasoning model DeepSeek R1, Ben Thompson's is the best I have read.

In it, he breaks down the story behind the model and its impact in at least three dimensions:

  1. How the model was trained.
  2. Possible business consequences for Nvidia as well as for the big AI labs.
  3. Possible long-term consequences for how AI is used.

That it was the US chip ban (and how it limited DeepSeek's access to Nvidia's latest GPUs) that forced DeepSeek to innovate in other ways has been mentioned in other reports as well. But the details Thompson highlights in the quote above go into the specifics: to get more out of the H800s DeepSeek is allowed to use, its engineers left Nvidia's developer framework CUDA behind for parts of the code and dropped down to PTX, Nvidia's low-level, assembly-like instruction set, instead.
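To make concrete what "dropping down to PTX" means, here is a minimal sketch of a CUDA kernel that embeds a single inline PTX instruction where plain CUDA C++ would normally do. It is my own illustration, not DeepSeek's code; their actual PTX-level work targeted far more involved things, like dedicating parts of each GPU to cross-chip communication.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds 1 to its element, but the addition is written as an
// inline PTX instruction instead of the plain C++ `result = value + 1;`.
__global__ void add_one(int *data) {
    int idx = threadIdx.x;
    int value = data[idx];
    int result;
    // add.s32: 32-bit signed integer add, expressed directly in PTX.
    asm volatile("add.s32 %0, %1, 1;" : "=r"(result) : "r"(value));
    data[idx] = result;
}

int main() {
    int host[4] = {0, 1, 2, 3};
    int *dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    add_one<<<1, 4>>>(dev);
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    for (int v : host) printf("%d ", v);  // prints: 1 2 3 4
    printf("\n");
    return 0;
}
```

The point in DeepSeek's case was not single instructions like this, but the fine-grained control PTX gives over how the GPU is used, such as reserving 20 of the H800's 132 processing units for cross-chip communication, something Thompson notes is impossible to do from within CUDA's abstractions.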

Even though the chip ban is about physical goods, I am reminded of the failed "Crypto Wars", when the US tried to restrict cryptography (i.e. math) through regulation. Regulating knowledge turned out to be hard, if not impossible, back then as well.

What does all this mean for AI? Thompson's analysis includes a tweet from Microsoft's CEO Satya Nadella, who frames it as an example of the Jevons paradox. Quoting from Wikipedia:

> In economics, the Jevons paradox occurs when technological advancements make a resource more efficient to use (thereby reducing the amount needed for a single application), however, as the cost of using the resource drops, overall demand increases causing total resource consumption to rise.
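To put made-up numbers on that mechanism (my illustration, not Thompson's or Wikipedia's): if an efficiency breakthrough like DeepSeek's cuts the compute needed per AI query to a tenth of what it was, but the cheaper queries lead people to run fifty times as many of them, total compute consumption ends up five times higher than before, not lower.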

Visit the source: stratechery.com