How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking – OpenAI Triton And PyTorch 2.0

Over the last decade, the landscape of machine learning software development has undergone significant changes. Many frameworks have come and gone, but most have relied heavily on leveraging Nvidia’s CUDA and performed best on Nvidia GPUs. However, with the arrival of PyTorch 2.0 and OpenAI’s Triton, Nvidia’s dominant position in this field, mainly due to its software moat, is being disrupted.

This report will touch on topics such as why Google’s TensorFlow lost out to PyTorch, why Google hasn’t been able to capitalize publicly on its early leadership of AI, the major components of machine learning model training time, the memory capacity/bandwidth/cost wall, model optimization, why other AI hardware companies haven’t been able to make a dent in Nvidia’s dominance so far, why hardware will start to matter more, how Nvidia’s competitive advantage in CUDA is wiped away, and a major win one of Nvidia’s competitors has at a large cloud for training silicon.

The 1,000-foot summary is that the default software stack for machine learning models will no longer be Nvidia’s closed-source CUDA. The ball was in Nvidia’s court, and they let OpenAI and Meta take control of the software stack. That ecosystem built its own tools because of Nvidia’s failure with their proprietary tools, and now Nvidia’s moat will be permanently weakened.




TensorFlow vs. PyTorch
A handful of years ago, the framework ecosystem was quite fragmented, but TensorFlow was the frontrunner. Google looked poised to control the machine learning industry. They had a first-mover advantage with the most commonly used framework, TensorFlow, and by designing and deploying the only successful AI application-specific accelerator, the TPU.

Instead, PyTorch won. Google failed to convert its first-mover advantage into dominance of the nascent ML industry. Nowadays, Google is somewhat isolated within the machine learning community because of its lack of use of PyTorch and GPUs in favor of its own software stack and hardware. In typical Google fashion, they even have a second framework, Jax, that competes directly with TensorFlow.

There’s even endless talk of Google’s dominance in search and natural language processing waning due to large language models, particularly those from OpenAI and the various startups that utilize OpenAI APIs or are building similar foundation models. While we believe this doom and gloom is overblown, that story is for another day. Despite these challenges, Google is still at the forefront of the most advanced machine learning models. They invented transformers and remain state-of-the-art in many areas (PaLM, LaMDA, Chinchilla, MUM, TPU).

Back to why PyTorch won. While there was an element of wrestling control away from Google, PyTorch’s victory was primarily due to its increased flexibility and usability versus TensorFlow. Boiled down to a first-principles level, PyTorch differed from TensorFlow in using “Eager mode” rather than “Graph mode.”

Eager mode can be thought of as a standard scripting execution method. The deep learning framework executes each operation immediately, as it is called, line by line, like any other piece of Python code. This makes debugging and understanding your code easier, as you can inspect the results of intermediate operations and observe how your model behaves.

In contrast, graph mode has two phases. The first phase is the definition of a computation graph representing the operations to perform. A computation graph is a series of interconnected nodes representing operations or variables, and the edges between nodes represent the data flow between them. The second phase is the deferred execution of an optimized version of the computation graph.

This two-phase approach makes it more challenging to understand and debug your code, as you cannot see what is happening until the end of the graph execution. This is analogous to “interpreted” vs. “compiled” languages, like Python vs. C++. It’s easier to debug Python, largely because it’s interpreted.
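The two execution models can be sketched in plain Python. The following is a toy illustration, not any framework’s real API: eager execution computes each line the moment it runs, while graph mode first records deferred operations as nodes and only computes when the graph is explicitly executed.

```python
# Toy sketch of eager vs. graph execution (illustrative only, not a real framework API).

# Eager: every expression is computed the moment it is written.
def eager_model(x):
    a = x * 2          # computed now; the value is inspectable right here
    b = a + 3          # computed now
    return b

# Graph: phase 1 builds a graph of deferred operations...
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

def graph_model(x_node):
    a = Node("mul2", x_node)
    b = Node("add3", a)
    return b           # nothing has been computed yet

# ...and phase 2 executes (a potentially optimized version of) the graph.
def run(node):
    if node.op == "leaf":
        return node.inputs[0]
    (arg,) = (run(i) for i in node.inputs)
    return arg * 2 if node.op == "mul2" else arg + 3

print(eager_model(5))                     # 13, computed line by line
print(run(graph_model(Node("leaf", 5))))  # 13, but only at graph execution time
```

In the eager case an error or an unexpected intermediate value is visible at the exact line that produced it; in the graph case nothing is observable until `run` is called, which is the debugging gap the article describes.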

While TensorFlow now defaults to Eager mode, the research community and most large tech firms have settled on PyTorch. This is exemplified by the fact that nearly every generative AI model that has made the news is based on PyTorch. Google’s generative AI models are based on Jax, not TensorFlow.

Of course, there is a long tail of image models using other frameworks like TensorFlow and Keras, but the compute budgets for new model development are all flowing to PyTorch models. For a deeper explanation of why PyTorch won, see here. In general, if you walk the halls of NeurIPS (the main AI conference), all the generative AI and non-Google work is done with PyTorch.










Machine Learning Training Components
If we boil machine learning model training to its most simplistic form, there are two major time components in a machine learning model’s training time.

Compute (FLOPS): Running dense matrix multiplication within each layer

Memory (Bandwidth): Waiting for data or layer weights to get to the compute resources. Common examples of bandwidth-constrained operations are various normalizations, pointwise operations, SoftMax, and ReLU.

In the past, the dominant factor in machine learning training time was compute time, waiting for matrix multiplies. As Nvidia’s GPUs continued to develop, this quickly faded away from being the primary concern.

Nvidia’s FLOPS have increased by multiple orders of magnitude, partly by leveraging Moore’s Law but primarily through architectural changes such as the tensor core and lower-precision floating-point formats. In contrast, memory has not followed the same path.

If we go back to 2018, when the BERT model was state of the art, and the Nvidia V100 was the most advanced GPU, we could see that matrix multiplication was no longer the primary factor for improving a model’s performance. Since then, the most advanced models have grown 3 to 4 orders of magnitude in parameter count, and the fastest GPUs have grown an order of magnitude in FLOPS.

Even in 2018, purely compute-bound workloads made up 99.8% of FLOPS but only 61% of the runtime. The normalization and pointwise ops achieve 250x and 700x fewer FLOPS than matrix multiplications, respectively, yet they consume nearly 40% of the model’s runtime.
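A back-of-envelope arithmetic-intensity calculation shows why matmuls are compute-bound while pointwise ops are bandwidth-bound. The fp16 element size and the simple traffic model (read every input once, write every output once, ignore caching) are assumptions for illustration:

```python
# Arithmetic intensity = FLOPs per byte moved between memory and compute.
# High intensity -> compute-bound; low intensity -> memory-bandwidth-bound.

BYTES = 2  # assumed fp16 element size

def matmul_intensity(n):
    # (n x n) @ (n x n): 2*n^3 FLOPs; traffic = read two matrices + write one.
    flops = 2 * n**3
    traffic = 3 * n * n * BYTES
    return flops / traffic

def pointwise_intensity(n_elems):
    # e.g. ReLU: ~1 FLOP per element; traffic = one read + one write per element.
    return n_elems / (2 * n_elems * BYTES)

print(matmul_intensity(4096))        # ~1365 FLOPs/byte: heavily compute-bound
print(pointwise_intensity(4096**2))  # 0.25 FLOPs/byte: heavily bandwidth-bound
```

The gap of several thousand-fold in FLOPs per byte is what lets a handful of low-intensity ops eat 40% of runtime despite contributing a rounding error of the FLOPS.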







The Memory Wall
As models continue to soar in size, large language models require hundreds of gigabytes, if not terabytes, for the model weights alone. Production recommendation networks deployed by Baidu and Meta require dozens of terabytes of memory for their massive embedding tables. A huge chunk of the time in large model training/inference is spent not computing matrix multiplies, but waiting for data to get to the compute resources. The obvious question is why architects don’t put more memory closer to the compute. The answer is $$$.

Memory follows a hierarchy from close and fast to slow and cheap. The nearest shared memory pool is on the same chip and is generally made of SRAM. Some machine-learning ASICs attempt to utilize huge pools of SRAM to hold model weights, but there are issues with this approach. Even Cerebras’ ~$2,500,000 wafer scale chips only have 40GB of SRAM on the chip. There isn’t enough memory capacity to hold the weights of a 100B+ parameter model.

Nvidia’s architecture has always used a much smaller amount of memory on the die. The current generation A100 has 40MB, and the next generation H100 has 50MB. 1GB of SRAM on TSMC’s 5nm process node would require ~200mm^2 of silicon. Once the associated control logic/fabric are implemented, that would require over 400mm^2 of silicon, or about 50% of the total logic area of an Nvidia datacenter GPU. Given that an A100 GPU costs $10k+ and the H100 is more like $20k+, economically, this is infeasible. Even when you ignore Nvidia’s ~75% gross margin on datacenter GPUs (~4x markup), the cost per GB of SRAM memory would still be in the $100s for a fully yielded product.
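The “$100s per GB” figure can be sanity-checked with rough arithmetic. The ~$17,000 wafer price and the ~15% edge/scribe loss below are assumptions for illustration, not sourced figures; the 400mm² per GB comes from the text above:

```python
import math

# Rough sanity check of SRAM cost per GB on a leading-edge node.
WAFER_PRICE = 17_000     # assumed 5nm wafer price in USD (illustrative)
WAFER_DIAMETER = 300     # mm, standard wafer
AREA_PER_GB = 400        # mm^2 per GB of SRAM incl. control logic (from the text)

wafer_area = math.pi * (WAFER_DIAMETER / 2) ** 2   # ~70,686 mm^2
gb_per_wafer = int(wafer_area / AREA_PER_GB * 0.85)  # assume ~15% edge loss
print(WAFER_PRICE / gb_per_wafer)  # roughly $110+ per GB, before any margin
```

Even under these generous assumptions, on-die SRAM lands two orders of magnitude above commodity DRAM per GB, which is the economic wall the article describes.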

Furthermore, the cost of on-chip SRAM memory will not decrease much through conventional Moore’s Law process technology shrinks. The same 1GB of memory actually costs more with the next-generation TSMC 3nm process technology. While 3D SRAM will help with SRAM costs to some degree, that is only a temporary bend of the curve.

The next step down in the memory hierarchy is tightly coupled off-chip memory, DRAM. DRAM has an order of magnitude higher latency than SRAM (~>100 nanoseconds vs. ~10 nanoseconds), but it’s also much cheaper ($1s per GB vs. $100s per GB).

DRAM followed the path of Moore’s Law for many decades. When Gordon Moore coined the term, Intel’s primary business was DRAM. His economic prediction about the density and cost of transistors generally held true for DRAM until ~2009. Since ~2012, though, the cost of DRAM has barely improved.

The demands for memory have only increased. DRAM now comprises 50% of the total server’s cost. This is the memory wall, and it has shown up in products. Comparing Nvidia’s 2016 P100 GPU to their 2022 H100 GPU that is just starting to ship, there is a 5x increase in memory capacity (16GB -> 80GB) but a 46x increase in FP16 performance (21.2 TFLOPS -> 989.5 TFLOPS).
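Working through the P100-to-H100 figures above shows how much faster compute has grown than capacity:

```python
# Ratios from the figures in the text: 2016 P100 vs. 2022 H100.
p100 = {"fp16_tflops": 21.2, "mem_gb": 16}
h100 = {"fp16_tflops": 989.5, "mem_gb": 80}

compute_growth = h100["fp16_tflops"] / p100["fp16_tflops"]  # ~46.7x
capacity_growth = h100["mem_gb"] / p100["mem_gb"]           # 5.0x
print(compute_growth / capacity_growth)                     # ~9.3x
```

In other words, FLOPS per GB of on-package memory rose roughly 9x across those six years, which is the memory wall expressed as a single number.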

While capacity is a significant bottleneck, it is intimately tied to the other major bottleneck, bandwidth. Increased memory bandwidth is generally obtained through parallelism. While standard DRAM is only a few dollars per GB today, to get the massive bandwidth machine learning requires, Nvidia uses HBM memory, a device comprised of 3D stacked layers of DRAM that requires more expensive packaging. HBM is in the $10 to $20 a GB range, including packaging and yield costs.

The cost constraints of memory bandwidth and capacity show up in Nvidia’s A100 GPUs constantly. The A100 tends to have very low FLOPS utilization without heavy optimization. FLOPS utilization measures the total computed FLOPS required to train a model vs. the theoretical FLOPS the GPUs could compute in a model’s training time.

Even with heavy optimizations from leading researchers, 60% FLOPS utilization is considered a very high utilization rate for large language model training. The rest of the time is overhead, idle time spent waiting for data from another calculation/memory, or recomputing results just in time to reduce memory bottlenecks.
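The utilization metric defined above is straightforward to compute. The run below uses hypothetical numbers chosen only to illustrate the calculation (the 312 TFLOPS figure is the A100’s fp16 tensor-core peak; the model FLOPs and cluster size are invented):

```python
# FLOPS utilization: FLOPs the model required, divided by what the cluster
# could theoretically have delivered in the same wall-clock time.

def flops_utilization(model_flops, wallclock_s, n_gpus, peak_flops_per_gpu):
    theoretical = n_gpus * peak_flops_per_gpu * wallclock_s
    return model_flops / theoretical

# Hypothetical run: 1e23 FLOPs on 1,000 A100s (~312 TFLOPS fp16 peak each)
# finishing in ~6.2 days of wall-clock time.
u = flops_utilization(1e23, 6.2 * 86_400, 1_000, 312e12)
print(f"{u:.0%}")  # ~60%, i.e. the "very high" end for LLM training
```

Everything below 100% is the overhead the article describes: idle time waiting on memory or other calculations, plus deliberate recomputation.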

From the current generation A100 to the next generation H100, the FLOPS grow by more than 6X, but memory bandwidth only grows by 1.65x. This has led to many fears of low utilization for H100. The A100 required many tricks to get around the memory wall, and more will need to be implemented with the H100.

The H100 brings distributed shared memory and L2 multicast to Hopper. The idea is that different SMs (think cores) can write directly to another SM’s SRAM (shared memory/L1 cache). This effectively increases the size of the cache and reduces the required bandwidth of DRAM reads/writes. Future architectures will rely on sending fewer operations to memory to minimize the impact of the memory wall. It should be noted that larger models tend to achieve higher utilization rates, as FLOPS demands scale super-linearly while memory bandwidth and capacity demands tend to scale linearly.













Operator Fusion – The Workaround
Just like with training ML models, knowing what regime you’re in allows you to narrow in on the optimizations that matter. For example, if you’re spending all of your time doing memory transfers (i.e., you are in a memory-bandwidth-bound regime), then increasing the FLOPS of your GPU won’t help. On the other hand, if you’re spending all of your time performing big chonky matmuls (i.e., a compute-bound regime), then rewriting your model logic into C++ to reduce overhead won’t help.


Referring back to why PyTorch won: it was the increased flexibility and usability of Eager mode. But moving to Eager mode isn’t all sunshine and rainbows. When executing in Eager mode, each operation’s inputs are read from memory, computed, and the result is written back to memory before the next operation is handled. Without heavy optimization, this significantly increases memory bandwidth demands.

As such, one of the principal optimization methods for a model executed in Eager mode is called operator fusion. Instead of writing each intermediate result to memory, operations are fused, so multiple functions are computed in one pass to minimize memory reads/writes. Operator fusion improves operator dispatch, memory bandwidth, and memory size costs.
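A toy traffic model shows why fusion pays off. The assumptions are deliberately simple: each unfused op reads its full input from memory and writes its full output back, while a fused kernel reads once and writes once regardless of how many pointwise ops it chains:

```python
# Toy model of memory traffic for a chain of pointwise ops, fp16 elements.

def traffic_unfused(n_elements, n_ops, bytes_per_elem=2):
    # every op: one full read + one full write of the tensor
    return n_ops * 2 * n_elements * bytes_per_elem

def traffic_fused(n_elements, n_ops, bytes_per_elem=2):
    # one read and one write total; intermediates stay in registers/SRAM
    return 2 * n_elements * bytes_per_elem

n = 1 << 20  # 1M-element activation tensor
print(traffic_unfused(n, 3) / traffic_fused(n, 3))  # 3.0: fusing 3 ops cuts traffic 3x
```

In a bandwidth-bound regime, that traffic reduction translates almost directly into runtime reduction, which is why fusion is the principal Eager-mode optimization.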

This optimization often involves writing custom CUDA kernels, which is much more difficult than writing simple Python scripts. As a built-in compromise, PyTorch steadily implemented more and more operators natively over time. Many of these operators were simply multiple commonly used operations fused into a single, more complex function.

The increase in operators made both creating the model within PyTorch easier and the performance of Eager mode faster due to having fewer memory read/writes. The downside was that PyTorch ballooned to over 2,000 operators over a few years.

We would say software developers are lazy, but let’s be honest, almost all people are lazy. If they get used to one of the new operators within PyTorch, they will continue to use that. The developer may not even recognize the performance improvement but instead, use that operator because it means writing less code.

Additionally, not all operations can be fused. A significant amount of time is often spent deciding which operations to fuse and which operations to assign to specific compute resources at the chip and cluster levels. The strategy of which operations to fuse where, although generally similar, does vary significantly depending on the architecture.

Nvidia Is King
The growth in operators and position as the default has helped Nvidia as each operator was quickly optimized for their architecture but not for any other hardware. If an AI hardware startup wanted to fully implement PyTorch, that meant supporting the growing list of 2,000 operators natively with high performance.

The talent level required to train a massive model with high FLOPS utilization on a GPU grows increasingly higher because of all the tricks needed to extract maximum performance. Eager mode execution plus operator fusion means that software, techniques, and models that are developed are pushed to fit within the ratios of compute and memory that the current generation GPU has.

Everyone developing machine learning chips is beholden to the same memory wall. ASICs are beholden to supporting the most commonly used frameworks. ASICs are beholden to the default development methodology, GPU-optimized PyTorch code with a mix of Nvidia and external libraries. An architecture that eschews a GPU’s various non-compute baggage in favor of more FLOPS and a stiffer programming model makes very little sense in this context.

Ease of use is king.

The only way to break the vicious cycle is for the software that runs models on Nvidia GPUs to transfer seamlessly to other hardware with as little effort as possible. As model architectures stabilize and abstractions from PyTorch 2.0, OpenAI Triton, and MLOps firms such as MosaicML become the default, the architecture and economics of the chip solution start to become the biggest driver of the purchase rather than the ease of use afforded by Nvidia’s superior software.














PyTorch 2.0
The PyTorch Foundation was established and moved out from under Meta’s wing just a few months ago. Alongside this change to an open development and governance model, 2.0 was released for early testing, with full availability in March. PyTorch 2.0 brings many changes, but the primary difference is that it adds a compiled solution supporting a graph execution model. This shift will make properly utilizing various hardware resources much easier.

PyTorch 2.0 brings an 86% performance improvement for training on Nvidia’s A100 and 26% on CPUs for inference! This dramatically reduces the compute time and cost required for training a model. These benefits could extend to other GPUs and accelerators from AMD, Intel, Tenstorrent, Luminous Computing, Tesla, Google, Amazon, Microsoft, Marvell, Meta, Graphcore, Cerebras, SambaNova, etc.

The performance improvements from PyTorch 2.0 will be larger for currently unoptimized hardware. Meta and other firms’ heavy contribution to PyTorch stems from the fact that they want to make it easier to achieve higher FLOPS utilization with less effort on their multi-billion-dollar training clusters made of GPUs. They are also motivated to make their software stacks more portable to other hardware to introduce competition to the machine learning space.

PyTorch 2.0 also brings advancements to distributed training with better API support for data parallelism, sharding, pipeline parallelism, and tensor parallelism. In addition, it supports dynamic shapes natively through the entire stack, which, among many other things, makes varying sequence lengths for LLMs much easier to support. This is the first time a major compiler has supported dynamic shapes from training to inference.

Writing a performant backend for PyTorch that fully supports all 2,000+ operators has been difficult for every machine learning ASIC except for Nvidia GPUs. PrimTorch brings the number of operators down to ~250 primitive operators while also keeping usability unchanged for end users of PyTorch. PrimTorch makes the implementation of different, non-Nvidia backends to PyTorch much simpler and more accessible. Custom hardware and system vendors can bring up their software stacks more easily.

Moving to graph mode requires a robust graph definition. Meta and PyTorch spent ~5 years attempting to implement this, but every solution they came up with had significant drawbacks. They finally cracked the puzzle with TorchDynamo. TorchDynamo ingests any PyTorch user script, including those that call outside third-party libraries, and generates an FX graph.

Dynamo lowers all complex operations to the ~250 primitive operations in PrimTorch. Once the graph is formed, unused operations are discarded, and the graph determines which intermediate operations need to be stored or written to memory and which can potentially be fused. This dramatically reduces the overhead within a model while also being seamless for the user.
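The Dynamo capture described above is exposed to users through `torch.compile`, with the user’s code unchanged. A minimal sketch, using the built-in `"eager"` debug backend (which captures the graph but skips codegen, so it runs on any machine with PyTorch 2.0, no GPU needed):

```python
import torch

# The user writes ordinary eager-mode PyTorch; Dynamo captures the FX graph
# behind the scenes when the function is wrapped with torch.compile.
def model(x):
    return torch.nn.functional.relu(x * 2) + 1

compiled = torch.compile(model, backend="eager")  # "eager": capture-only debug backend

x = torch.randn(8)
assert torch.allclose(compiled(x), model(x))  # identical results, graph captured
```

In real use the default backend is TorchInductor, which performs the fusion and code generation discussed below; the point here is that the call site and model code are untouched.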

TorchDynamo already works for over 99% of the 7,000 PyTorch models tested, including those from OpenAI, HuggingFace, Meta, Nvidia, Stability.AI, and more, without any changes to the original code. The 7,000 models tested were indiscriminately chosen from the most popular projects using PyTorch on GitHub.

Google’s TensorFlow/Jax and other graph mode execution pipelines generally require the user to ensure their model fits into the compiler architecture so that the graph can be captured. Dynamo changes this by enabling partial graph capture, guarded graph capture, and just-in-time recapture.

Partial graph capture allows the model to include unsupported/non-python constructs. When a graph cannot be generated for that portion of the model, a graph break is inserted, and the unsupported constructs will be executed in eager mode between the partial graphs.

Guarded graph capture checks whether the captured graph is valid for execution; a guard describes a condition under which recompilation would be required. This is important because it means running the same code multiple times won’t trigger recompilation each time.

Just-in-time recapture allows the graph to be recaptured if the captured graph is invalid for execution.
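Partial graph capture can be seen with a small sketch. A `print` call is a side effect Dynamo cannot put into a graph, so a graph break is inserted and that line runs in eager mode between the captured partial graphs, with results unchanged (this example again uses the `"eager"` debug backend so it runs without a GPU):

```python
import torch

def f(x):
    x = x * 2
    print("this line runs in eager mode")  # unsupported construct -> graph break
    return x + 1

compiled = torch.compile(f, backend="eager")

x = torch.ones(4)
assert torch.equal(compiled(x), f(x))  # same output despite the graph break
```

Graph-mode-first frameworks would typically reject or silently mis-handle such code; Dynamo’s fallback is what lets it capture the long tail of real-world scripts.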

PyTorch’s goal is to create a unified front end with a smooth UX that leverages Dynamo to generate graphs. The user experience of this solution would be unchanged, but the performance can be significantly improved. Capturing the graph means execution can be parallelized more efficiently over a large base of compute resources.

Dynamo and AOT Autograd then pass the optimized FX graphs to the PyTorch native compiler level, TorchInductor. Hardware companies can also take this graph and input it into their own backend compilers.











TorchInductor is a Python-native deep learning compiler that generates fast code for multiple accelerators and backends. Inductor takes the FX graphs, which have ~250 operators, and lowers them to ~50 operators. Inductor then moves to a scheduling phase where operators are fused and memory planning is determined.

Inductor then goes to the “Wrapper Codegen,” which generates code that runs on the CPU, GPU, or other AI accelerators. The wrapper codegen replaces the interpreter part of a compiler stack and can call kernels and allocate memory. The backend code-generation portion leverages OpenAI Triton for GPUs and outputs PTX code. For CPUs, an Intel compiler generates C++ (it works on non-Intel CPUs too).

More hardware will be supported going forward, but the key is that Inductor dramatically reduces the amount of work a compiler team must do when making a compiler for their AI hardware accelerator. Furthermore, the code is more optimized for performance. There are significant reductions in memory bandwidth and capacity requirements.

We didn’t want to build a compiler that only supported GPUs. We wanted something that could scale to support a wide variety of hardware back ends, and having a C++ as well as [OpenAI] Triton forces that generality.

Jason Ansel – Meta AI

OpenAI Triton
OpenAI’s Triton takes a very disruptive angle toward Nvidia’s closed-source software moat for machine learning. Triton takes in Python directly or feeds through the PyTorch Inductor stack; the latter will be the most common use case. Triton then converts the input to an LLVM intermediate representation and generates code. For Nvidia GPUs, it directly generates PTX code, skipping Nvidia’s closed-source CUDA libraries, such as cuBLAS, in favor of open-source libraries, such as cutlass.

CUDA is commonly used by those specializing in accelerated computing, but it is less well-known among machine learning researchers and data scientists. It can be challenging to use efficiently and requires a deep understanding of the hardware architecture, which can slow down the development process. As a result, machine learning experts may rely on CUDA experts to modify, optimize, and parallelize their code.

Triton bridges the gap, enabling higher-level languages to achieve performance comparable to that of lower-level ones. Triton kernels themselves are quite legible to the typical ML researcher, which is huge for usability. Triton automates memory coalescing, shared memory management, and scheduling within SMs. Triton is not particularly helpful for element-wise matrix multiplies, which are already done very efficiently, but it is incredibly useful for costly pointwise operations and for reducing the overhead of more complex operations, such as Flash Attention, that involve matrix multiplies as a portion of a larger fused operation.
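Triton’s programming model is blocked: each “program instance” handles one tile of elements, with a mask guarding the ragged tail. The sketch below emulates that pattern in pure Python for a vector add; it is an illustration of the index/mask arithmetic Triton expresses, not Triton’s actual API (real kernels use `triton.jit`, `tl.program_id`, and masked `tl.load`/`tl.store`, and Triton additionally automates coalescing and shared-memory management):

```python
# Pure-Python emulation of Triton's blocked kernel model (illustration only).
BLOCK = 4

def add_kernel(x, y, out, n, pid):
    # Each program instance covers BLOCK contiguous elements...
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    # ...with a mask disabling out-of-bounds lanes in the final block.
    mask = [o < n for o in offsets]
    for o, m in zip(offsets, mask):
        if m:
            out[o] = x[o] + y[o]

def launch(x, y):
    n = len(x)
    out = [0.0] * n
    grid = (n + BLOCK - 1) // BLOCK  # number of program instances to launch
    for pid in range(grid):
        add_kernel(x, y, out, n, pid)
    return out

print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

A researcher writes roughly this much logic per kernel; the scheduling, vectorization, and memory-hierarchy decisions that a CUDA expert would hand-tune are left to the Triton compiler.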

OpenAI Triton only officially supports Nvidia GPUs today, but that is changing in the near future. Multiple other hardware vendors will be supported in the future, and this open-source project is gaining incredible steam. The ability for other hardware accelerators to integrate directly into the LLVM IR that is part of Triton dramatically reduces the time to build an AI compiler stack for a new piece of hardware.

Nvidia’s colossal software organization lacked the foresight to take its massive advantage in ML hardware and software and become the default compiler for machine learning. Its lack of focus on usability is what enabled outsiders at OpenAI and Meta to create a software stack that is portable to other hardware. Why aren’t they the ones building a “simplified” CUDA like Triton for ML researchers? Stuff like Flash Attention: why does it come out of Ph.D. students and not Nvidia?

The rest of this report will point out the specific hardware accelerator that has a huge win at Microsoft, as well as multiple companies’ hardware that is quickly being integrated into the PyTorch 2.0/OpenAI Triton software stack. Furthermore, it will share the opposing view as a defense of Nvidia’s moat/strength in the AI training market.












Graph-based computation: In TensorFlow, computations are represented as a graph structure. Nodes represent numerical operations, and edges represent the flow of data. This graph structure makes it easier to optimize computations and distribute processing.

Flexible model building: TensorFlow supports building a wide variety of models: convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative models, reinforcement learning, and more. It also integrates with Keras, so models can be built easily through a high-level API.

Automatic differentiation: TensorFlow supports automatic differentiation, which allows a model’s gradients to be computed efficiently. Gradients are used for parameter optimization (training) and play an essential role in improving model performance.

Graph execution and optimization: TensorFlow provides an execution engine for running computation graphs. It also includes graph optimization and distributed-processing capabilities, enabling fast, efficient computation.

Multi-language support: TensorFlow’s primary development language is Python, but it can also be used from other languages such as C++, Java, and Go, so it works across a range of development environments and platforms.

Model deployment: TensorFlow supports training and deployment in an integrated way. Trained models can be deployed to production with TensorFlow Serving, and models optimized for mobile and embedded devices can be deployed with TensorFlow Lite.

Community and support: TensorFlow has a very large community, with extensive documentation, tutorials, sample code, and pretrained models, as well as strong support through online communities such as Stack Overflow and GitHub.






Interpreted languages: An interpreted language executes a program by interpreting the source code sequentially at run time. Because each line is interpreted and executed one at a time, developers can modify a program incrementally and see the results immediately. Representative interpreted languages include Python and Ruby.

Compiled languages: A compiled language converts source code into binary code (machine code or an intermediate language) before the program runs. Because the converted binary code is executed directly at run time, execution is fast. Representative compiled languages include C and C++.



CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for GPUs provided by NVIDIA.

GPU programming: CUDA is a framework for programming GPUs directly. It exploits the GPU’s high degree of parallelism, far beyond a conventional CPU’s, to execute data-parallel tasks efficiently.

CUDA cores: CUDA executes computations using parallel processing units within the GPU called CUDA cores. Because CUDA cores can process large amounts of data simultaneously, they are very well suited to parallel computation.

Programming model: In the CUDA programming model, parallel computation on the GPU is written in common programming languages such as C and C++. A CUDA program coordinates work between the host (CPU) and the device (GPU).

CUDA runtime and driver: CUDA consists of the CUDA runtime and the CUDA driver. The runtime provides the libraries and tools for executing CUDA programs, while the driver controls communication with the GPU.

CUDA ecosystem: CUDA has a broad ecosystem and is used across many fields. High-performance applications and libraries built on CUDA have been developed for machine learning, scientific computing, image processing, virtual reality, and more.