Quoted from: Software 2.0

I sometimes see people refer to neural networks as just “another tool in your machine learning toolbox”. They have some pros and cons, they work here or there, and sometimes you can use them to win Kaggle competitions. Unfortunately, this interpretation completely misses the forest for the trees. Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software. They are Software 2.0.

The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer identifies a specific point in program space with some desirable behavior.

In contrast, Software 2.0 is written in a much more abstract, human-unfriendly language, such as the weights of a neural network. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried).

Instead, our approach is to specify some goal on the behavior of a desirable program (e.g., “satisfy a dataset of input output pairs of examples”, or “win a game of Go”), write a rough skeleton of the code (i.e. a neural net architecture) that identifies a subset of program space to search, and use the computational resources at our disposal to search this space for a program that works. In the case of neural networks, we restrict the search to a continuous subset of the program space where the search process can be made (somewhat surprisingly) efficient with backpropagation and stochastic gradient descent.
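As a rough illustration of this recipe, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the target behavior (y = 2x + 1), the two-weight "architecture" `program`, and the hyperparameters in `train`. It only shows the shape of the idea: a dataset states the goal, a skeleton leaves the weights open, and stochastic gradient descent searches for weights that work.

```python
import random

# Goal: a dataset of input-output pairs the program should satisfy.
# (Hypothetical target behavior chosen for illustration: y = 2x + 1.)
dataset = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]


def program(x, w, b):
    """Skeleton of the code: a tiny 'architecture' with two unknown weights."""
    return w * x + b


def train(dataset, steps=2000, lr=0.1, seed=0):
    """Search the (w, b) program space with stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(dataset)      # one random example per step
        err = program(x, w, b) - y      # d(0.5 * err**2) / d(output)
        w -= lr * err * x               # gradient step for w
        b -= lr * err                   # gradient step for b
    return w, b


w, b = train(dataset)
print(f"learned program: y = {w:.3f} * x + {b:.3f}")
```

Nobody typed the final values of `w` and `b`; they were found by the search, which is the sense in which the weights are the "code".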


To make the analogy explicit, in Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work. In Software 2.0 most often the source code comprises 1) the dataset that defines the desirable behavior and 2) the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into the binary — the final neural network. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active “software development” takes the form of curating, growing, massaging and cleaning labeled datasets. This is fundamentally altering the programming paradigm by which we iterate on our software, as the teams split in two: the 2.0 programmers (data labelers) edit and grow the datasets, while a few 1.0 programmers maintain and iterate on the surrounding training code infrastructure, analytics, visualizations and labeling interfaces.

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (or more generally, identify a desirable behavior) than to explicitly write the program. Because of this and many other benefits of Software 2.0 programs that I will go into below, we are witnessing a massive transition across the industry where a lot of 1.0 code is being ported into 2.0 code. Software (1.0) is eating the world, and now AI (Software 2.0) is eating software.



Ongoing transition
Let’s briefly examine some concrete examples of this ongoing transition. In each of these areas we’ve seen improvements over the last few years when we give up on trying to address a complex problem by writing explicit code and instead transition the code into the 2.0 stack.

Visual Recognition used to consist of engineered features with a bit of machine learning sprinkled on top at the end (e.g., an SVM). Since then, we discovered much more powerful visual features by obtaining large datasets (e.g. ImageNet) and searching in the space of Convolutional Neural Network architectures. More recently, we don’t even trust ourselves to hand-code the architectures and we’ve begun searching over those as well.

Speech recognition used to involve a lot of preprocessing, Gaussian mixture models and hidden Markov models, but today consists almost entirely of neural net stuff. A very related, often cited humorous quote attributed to Fred Jelinek from 1985 reads “Every time I fire a linguist, the performance of our speech recognition system goes up”.

Speech synthesis has historically been approached with various stitching mechanisms, but today the state-of-the-art models are large ConvNets (e.g. WaveNet) that produce raw audio signal outputs.

Machine Translation has usually been approached with phrase-based statistical techniques, but neural networks are quickly becoming dominant. My favorite architectures are trained in the multilingual setting, where a single model translates from any source language to any target language, and in weakly supervised (or entirely unsupervised) settings.

Games. Explicitly hand-coded Go playing programs have been developed for a long while, but AlphaGo Zero (a ConvNet that looks at the raw state of the board and plays a move) has now become by far the strongest player of the game. I expect we’re going to see very similar results in other areas, e.g. DOTA 2, or StarCraft.

Databases. More traditional systems outside of Artificial Intelligence are also seeing early hints of a transition. For instance, “The Case for Learned Index Structures” replaces core components of a data management system with a neural network, outperforming cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory.

You’ll notice that many of my links above involve work done at Google. This is because Google is currently at the forefront of re-writing large chunks of itself into Software 2.0 code. “One model to rule them all” provides an early sketch of what this might look like, where the statistical strength of the individual domains is amalgamated into one consistent understanding of the world.




The benefits of Software 2.0
Why should we prefer to port complex programs into Software 2.0? Clearly, one easy answer is that they work better in practice. However, there are a lot of other convenient reasons to prefer this stack. Let’s take a look at some of the benefits of Software 2.0 (think: a ConvNet) compared to Software 1.0 (think: a production-level C++ code base). Software 2.0 is:

Computationally homogeneous. A typical neural network is, to the first order, made up of a sandwich of only two operations: matrix multiplication and thresholding at zero (ReLU). Compare that with the instruction set of classical software, which is significantly more heterogeneous and complex. Because you only have to provide a Software 1.0 implementation for a small number of the core computational primitives (e.g. matrix multiply), it is much easier to make various correctness/performance guarantees.

Simple to bake into silicon. As a corollary, since the instruction set of a neural network is relatively small, it is significantly easier to implement these networks much closer to silicon, e.g. with custom ASICs, neuromorphic chips, and so on. The world will change when low-powered intelligence becomes pervasive around us. E.g., small, inexpensive chips could come with a pretrained ConvNet, a speech recognizer, and a WaveNet speech synthesis network all integrated in a small protobrain that you can attach to stuff.

Constant running time. Every iteration of a typical neural net forward pass takes exactly the same number of FLOPs. There is zero variability based on the different execution paths your code could take through some sprawling C++ code base. Of course, you could have dynamic compute graphs but the execution flow is normally still significantly constrained. This way we are also almost guaranteed to never find ourselves in unintended infinite loops.

Constant memory use. Related to the above, there is no dynamically allocated memory anywhere so there is also little possibility of swapping to disk, or memory leaks that you have to hunt down in your code.

It is highly portable. A sequence of matrix multiplies is significantly easier to run on arbitrary computational configurations compared to classical binaries or scripts.

It is very agile. If you had some C++ code and someone wanted you to make it twice as fast (at the cost of performance if needed), it would be highly non-trivial to tune the system for the new spec. However, in Software 2.0 we can take our network, remove half of the channels, retrain, and there: it runs exactly at twice the speed and works a bit worse. It’s magic. Conversely, if you happen to get more data/compute, you can immediately make your program work better just by adding more channels and retraining.

Modules can meld into an optimal whole. Our software is often decomposed into modules that communicate through public functions, APIs, or endpoints. However, if two Software 2.0 modules that were originally trained separately interact, we can easily backpropagate through the whole. Think about how amazing it could be if your web browser could automatically re-design the low-level system instructions 10 stacks down to achieve a higher efficiency in loading web pages. Or if the computer vision library (e.g. OpenCV) you imported could be auto-tuned on your specific data. With 2.0, this is the default behavior.

It is better than you. Finally, and most importantly, a neural network is a better piece of code than anything you or I can come up with in a large fraction of valuable verticals, which currently at the very least involve anything to do with images/video and sound/speech.
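The "computationally homogeneous" and "constant running time" points above can be made concrete with a small sketch. The 3 -> 4 -> 2 network, its weight values, and the helper names (`matmul`, `relu`, `forward`, `forward_flops`) are all made up for illustration, and counting a multiply-add as 2 operations is just one common convention. The point is only that the forward pass is nothing but matrix multiplies and ReLUs, and that its operation count depends on the shapes, not on the input values.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Threshold every entry at zero."""
    return [[max(0.0, v) for v in row] for row in m]


def forward(x, weights):
    """A 'sandwich' of matrix multiplies and ReLUs (no ReLU after the last layer)."""
    h = x
    for i, w in enumerate(weights):
        h = matmul(h, w)
        if i < len(weights) - 1:
            h = relu(h)
    return h


def forward_flops(batch, weights):
    """Operation count of the matrix multiplies, 2 per multiply-add.

    Only the shapes enter the formula; the input values never do.
    """
    return sum(2 * batch * len(w) * len(w[0]) for w in weights)


# Hypothetical 3 -> 4 -> 2 network with arbitrary fixed weights.
weights = [
    [[0.1, -0.2, 0.3, 0.0], [0.5, 0.1, -0.4, 0.2], [-0.3, 0.2, 0.1, 0.6]],
    [[0.2, -0.1], [0.0, 0.3], [-0.5, 0.4], [0.1, 0.1]],
]
x1 = [[1.0, 2.0, 3.0]]
x2 = [[-7.0, 0.0, 42.0]]

print(forward(x1, weights))
print(forward(x2, weights))
print(forward_flops(1, weights))  # identical for x1 and x2: 2*3*4 + 2*4*2 = 40
```

Two very different inputs go through exactly the same sequence of operations, which is what makes the cost of a forward pass predictable ahead of time.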

The limitations of Software 2.0
The 2.0 stack also has some of its own disadvantages. At the end of the optimization we’re left with large networks that work well, but it’s very hard to tell how. Across many application areas, we’ll be left with a choice of using a 90% accurate model we understand, or a 99% accurate model we don’t.

The 2.0 stack can fail in unintuitive and embarrassing ways, or worse, it can “silently fail”, e.g., by silently adopting biases in its training data, which are very difficult to properly analyze and examine when dataset sizes are easily in the millions in most cases.

Finally, we’re still discovering some of the peculiar properties of this stack. For instance, the existence of adversarial examples and attacks highlights the unintuitive nature of this stack.

Programming in the 2.0 stack
Software 1.0 is code we write. Software 2.0 is code written by the optimization based on an evaluation criterion (such as “classify this training data correctly”). It is likely that any setting where the program is not obvious but one can repeatedly evaluate its performance (e.g., did you classify some images correctly? Do you win games of Go?) will be subject to this transition, because the optimization can find much better code than what a human can write.

The lens through which we view trends matters. If you recognize Software 2.0 as a new and emerging programming paradigm instead of simply treating neural networks as a pretty good classifier in the class of machine learning techniques, the extrapolations become more obvious, and it’s clear that there is much more work to do.

In particular, we’ve built up a vast amount of tooling that assists humans in writing 1.0 code, such as powerful IDEs with features like syntax highlighting, debuggers, profilers, go to def, git integration, etc. In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets. For example, when the network fails in some hard or rare cases, we do not fix those predictions by writing code, but by including more labeled examples of those cases. Who is going to develop the first Software 2.0 IDEs, which help with all of the workflows in accumulating, visualizing, cleaning, labeling, and sourcing datasets? Perhaps the IDE bubbles up images that the network suspects are mislabeled based on the per-example loss, or assists in labeling by seeding labels with predictions, or suggests useful examples to label based on the uncertainty of the network’s predictions.
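The three IDE features suggested above can be sketched in a few lines. This is only one possible reading, with assumptions stated plainly: the network outputs a probability per class, "per-example loss" is taken to be cross-entropy, "uncertainty" is taken to be the entropy of the prediction, and all function names and numbers are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Per-example loss: negative log-probability of the recorded label."""
    return -math.log(max(probs[label], 1e-12))


def entropy(probs):
    """Predictive uncertainty: Shannon entropy of the predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def suspected_mislabels(predictions, labels, top_k):
    """Indices of labeled examples with the highest loss, worst first."""
    losses = [cross_entropy(p, y) for p, y in zip(predictions, labels)]
    return sorted(range(len(labels)), key=lambda i: losses[i], reverse=True)[:top_k]


def seed_label(probs):
    """Pre-fill a label with the network's most likely class."""
    return max(range(len(probs)), key=lambda c: probs[c])


def most_uncertain(predictions, top_k):
    """Indices of unlabeled examples the network is least sure about."""
    order = sorted(range(len(predictions)),
                   key=lambda i: entropy(predictions[i]), reverse=True)
    return order[:top_k]


# Hypothetical softmax outputs for a 3-class problem.
labeled_preds = [[0.9, 0.05, 0.05], [0.02, 0.97, 0.01], [0.6, 0.3, 0.1]]
labels = [0, 0, 0]  # the second label disagrees strongly with the network
unlabeled_preds = [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33], [0.7, 0.2, 0.1]]

print(suspected_mislabels(labeled_preds, labels, top_k=1))  # example to re-check
print([seed_label(p) for p in unlabeled_preds])             # suggested labels
print(most_uncertain(unlabeled_preds, top_k=1))             # example to label next
```

A high loss does not prove that a label is wrong (the network itself may be mistaken), so a tool like this would surface candidates for a human to review rather than change labels automatically.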

Similarly, GitHub is a very successful home for Software 1.0 code. Is there space for a Software 2.0 GitHub? In this case repositories are datasets and commits are made up of additions and edits of the labels.
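To make the "repositories are datasets, commits are label edits" idea a bit more tangible, here is a toy sketch. It does not describe any existing tool; `LabelCommit`, `DatasetRepo`, and the example ids are invented for illustration, and a real system would also need to version the underlying examples, not just their labels.

```python
from dataclasses import dataclass, field


@dataclass
class LabelCommit:
    """One commit: labels added or edited, keyed by example id."""
    message: str
    changes: dict  # example id -> new label


@dataclass
class DatasetRepo:
    """A dataset whose history is a list of label commits."""
    commits: list = field(default_factory=list)

    def commit(self, message, changes):
        self.commits.append(LabelCommit(message, dict(changes)))

    def checkout(self, revision=None):
        """Replay the first `revision` commits (None means all of them)."""
        labels = {}
        for c in self.commits[:revision]:
            labels.update(c.changes)
        return labels

    def diff(self, old, new):
        """Map example id -> (old label, new label) for labels that differ."""
        a, b = self.checkout(old), self.checkout(new)
        return {k: (a.get(k), b.get(k))
                for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


repo = DatasetRepo()
repo.commit("initial labels", {"img_001": "cat", "img_002": "dog"})
repo.commit("fix img_002, add img_003", {"img_002": "cat", "img_003": "bird"})

print(repo.checkout(1))   # labels as of the first commit
print(repo.checkout())    # current labels
print(repo.diff(1, 2))    # what the second commit changed
```

Here the "diff" a reviewer would read is a list of relabeled and newly labeled examples instead of changed lines of source code.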

Traditional package managers and related serving infrastructure like pip, conda, docker, etc. help us more easily deploy and compose binaries. How do we effectively deploy, share, import and work with Software 2.0 binaries? What is the conda equivalent for neural networks?

In the short term, Software 2.0 will become increasingly prevalent in any domain where repeated evaluation is possible and cheap, and where the algorithm itself is difficult to design explicitly. There are many exciting opportunities to consider the entire software development ecosystem and how it can be adapted to this new programming paradigm. And in the long run, the future of this paradigm is bright because it is increasingly clear that when we develop AGI, it will certainly be written in Software 2.0.


