【英語学習：Software 2.0】ソフトウェア2.0

引用：Software 2.0

I sometimes see people refer to neural networks as just “another tool in your machine learning toolbox”. They have some pros and cons, they work here or there, and sometimes you can use them to win Kaggle competitions. Unfortunately, this interpretation completely misses the forest for the trees. Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software. They are Software 2.0.

The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer identifies a specific point in program space with some desirable behavior.

In contrast, Software 2.0 is written in a much more abstract, human-unfriendly language, such as the weights of a neural network. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried).

Instead, our approach is to specify some goal on the behavior of a desirable program (e.g., “satisfy a dataset of input output pairs of examples”, or “win a game of Go”), write a rough skeleton of the code (i.e. a neural net architecture) that identifies a subset of program space to search, and use the computational resources at our disposal to search this space for a program that works. In the case of neural networks, we restrict the search to a continuous subset of the program space where the search process can be made (somewhat surprisingly) efficient with backpropagation and stochastic gradient descent.
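A minimal sketch of this kind of search, assuming a toy "architecture" with only two weights (y = w*x + b) and a four-example dataset. The names `train` and `dataset`, the learning rate, and the step count are illustrative choices, not anything from the essay, and the gradient of the squared error is written out by hand rather than produced by a general backpropagation routine.

```python
import random

def train(dataset, steps=2000, lr=0.05, seed=0):
    """Search the (w, b) plane for a program y = w*x + b that fits the dataset."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(dataset)   # stochastic: one example per step
        err = (w * x + b) - y        # prediction error on that example
        # Gradient of 0.5 * err**2 with respect to w and b.
        w -= lr * err * x
        b -= lr * err
    return w, b

# The "specification": input/output pairs generated by y = 2x + 1.
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(dataset)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Nobody typed the values of `w` and `b`; the programmer supplied only the examples and the skeleton, and the optimizer filled in the rest.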

To make the analogy explicit, in Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work. In Software 2.0 most often the source code comprises 1) the dataset that defines the desirable behavior and 2) the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into the binary — the final neural network. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active “software development” takes the form of curating, growing, massaging and cleaning labeled datasets. This is fundamentally altering the programming paradigm by which we iterate on our software, as the teams split in two: the 2.0 programmers (data labelers) edit and grow the datasets, while a few 1.0 programmers maintain and iterate on the surrounding training code infrastructure, analytics, visualizations and labeling interfaces.

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (or more generally, identify a desirable behavior) than to explicitly write the program. Because of this and many other benefits of Software 2.0 programs that I will go into below, we are witnessing a massive transition across the industry where a lot of 1.0 code is being ported into 2.0 code. Software (1.0) is eating the world, and now AI (Software 2.0) is eating software.

ニューラルネットワークを「機械学習の道具箱の中の1つのツール」と呼ぶ人を時々見かけます。ニューラルネットワークには長所と短所があり、あちこちで活躍し、時にはKaggleコンペティションで勝つために使うこともあります。残念ながら、この解釈は木を見て森を見ずです。ニューラルネットワークは単なる分類器ではなく、私たちがソフトウェアを開発する方法における根本的な転換の始まりを象徴しているのです。つまり、ソフトウェア2.0なのです。

ソフトウェア1.0の「古典的なスタック」は、私たちがよく知っているもので、PythonやC++などの言語で書かれています。これは、プログラマーによって書かれたコンピュータへの明示的な指示で構成されています。プログラマは、コードの各行を記述することで、プログラム空間における特定のポイントと、ある望ましい振る舞いを特定するのです。

これに対し、ソフトウェア2.0は、より抽象的で人間に馴染みにくい言語で書かれている。例えば、ニューラルネットワークの重みのようなものです。このコードを書くのに人間は関与しません。なぜなら、重みは大量にあり（典型的なネットワークは数百万個あるかもしれません）、重みで直接コーディングするのはちょっと難しいからです（私も試しました）。

その代わりに、私たちのアプローチは、望ましいプログラムの動作に関する何らかの目標（例えば、「入力と出力のペアの例のデータセットを満たす」、「囲碁のゲームに勝つ」）を指定し、探索すべきプログラム空間のサブセットを特定するコード（すなわち、ニューラルネットのアーキテクチャ）のラフスケルトンを書き、自由に使える計算資源を使ってこの空間を探索して、動作するプログラムを探すことである。ニューラルネットの場合、バックプロパゲーションと確率的勾配降下により、プログラム空間の連続的な部分集合に探索を限定し、探索プロセスを（驚くほど）効率的にすることができる。

このアナロジーを明確にするために、ソフトウェア1.0では、人間が設計したソースコード（例えば、いくつかの.cppファイル）が、役に立つ仕事をするバイナリにコンパイルされます。ソフトウェア2.0では、多くの場合、ソースコードは、1）望ましい動作を定義するデータセットと、2）コードの大まかな骨格を与えるが、多くの詳細（重み）を埋める必要があるニューラルネットのアーキテクチャで構成されます。ニューラルネットの学習プロセスでは、データセットがバイナリにコンパイルされ、最終的なニューラルネットとなります。今日、ほとんどの実用的なアプリケーションでは、ニューラルネットのアーキテクチャと学習システムはますます標準化されて商品化されているので、アクティブな「ソフトウェア開発」のほとんどは、ラベル付きデータセットをキュレーションし、成長させ、マッサージし、クリーニングする形をとっている。2.0プログラマー（データラベラー）がデータセットを編集して成長させる一方で、少数の1.0プログラマーが周辺のトレーニングコード基盤、分析、視覚化、ラベリングインターフェースを保守して反復しているのです。

実世界の問題の大部分は、明示的にプログラムを書くよりも、データを収集する（より一般的には望ましい動作を特定する）方がはるかに簡単であるという性質を持っていることが判明した。このことや、以下で説明するソフトウェア2.0プログラムの他の多くの利点のために、私たちは業界全体で、多くの1.0コードが2.0コードに移植される大規模な移行を目の当たりにしているのです。ソフトウェア（1.0）が世界を食べ、そして今、AI（ソフトウェア2.0）がソフトウェアを食べているのです。

Ongoing transition
Let’s briefly examine some concrete examples of this ongoing transition. In each of these areas we’ve seen improvements over the last few years when we give up on trying to address a complex problem by writing explicit code and instead transition the code into the 2.0 stack.

Visual Recognition used to consist of engineered features with a bit of machine learning sprinkled on top at the end (e.g., an SVM). Since then, we discovered much more powerful visual features by obtaining large datasets (e.g. ImageNet) and searching in the space of Convolutional Neural Network architectures. More recently, we don’t even trust ourselves to hand-code the architectures and we’ve begun searching over those as well.

Speech recognition used to involve a lot of preprocessing, Gaussian mixture models and hidden Markov models, but today consists almost entirely of neural net stuff. A very related, often cited humorous quote attributed to Fred Jelinek from 1985 reads “Every time I fire a linguist, the performance of our speech recognition system goes up”.

Speech synthesis has historically been approached with various stitching mechanisms, but today the state of the art models are large ConvNets (e.g. WaveNet) that produce raw audio signal outputs.

Machine Translation has usually been approached with phrase-based statistical techniques, but neural networks are quickly becoming dominant. My favorite architectures are trained in the multilingual setting, where a single model translates from any source language to any target language, and in weakly supervised (or entirely unsupervised) settings.

Games. Explicitly hand-coded Go playing programs have been developed for a long while, but AlphaGo Zero (a ConvNet that looks at the raw state of the board and plays a move) has now become by far the strongest player of the game. I expect we’re going to see very similar results in other areas, e.g. DOTA 2, or StarCraft.

Databases. More traditional systems outside of Artificial Intelligence are also seeing early hints of a transition. For instance, “The Case for Learned Index Structures” replaces core components of a data management system with a neural network, outperforming cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory.

You’ll notice that many of my links above involve work done at Google. This is because Google is currently at the forefront of re-writing large chunks of itself into Software 2.0 code. “One model to rule them all” provides an early sketch of what this might look like, where the statistical strength of the individual domains is amalgamated into one consistent understanding of the world.

進行中の変遷
この継続的な移行について、いくつかの具体的な例を簡単に見てみましょう。いずれの分野でも、複雑な問題に対処するために明示的なコードを書くことをあきらめ、代わりにコードを2.0スタックに移行することで、ここ数年で改善が見られました。

視覚認識では、以前は工学的な特徴量と、最後に機械学習（SVMなど）を少し加えるだけで構成されていました。その後、大規模なデータセット（ImageNetなど）を取得し、畳み込みニューラルネットワークのアーキテクチャの空間を探索することで、より強力な視覚的特徴を発見してきました。最近では、アーキテクチャをハンドコーディングすることさえ信用できなくなり、それらも含めて探索するようになりました。

音声認識には、以前は多くの前処理、ガウス混合モデル、隠れマルコフモデルが必要でしたが、現在ではほとんどニューラルネットのものだけで構成されています。1985年にFred Jelinekが発表した「私が言語学者をクビにするたびに、音声認識システムの性能は上がる」というユーモラスな言葉がありますが、これと非常に関連性があります。

音声合成は歴史的に様々なステッチングメカニズムでアプローチされてきましたが、今日、最先端のモデルは生の音声信号出力を生成する大規模なConvNets（例：WaveNet）です。

機械翻訳は、フレーズベースの統計的手法でアプローチするのが一般的でしたが、最近ではニューラルネットワークが主流になりつつあります。私のお気に入りのアーキテクチャは、多言語設定、つまり単一のモデルであらゆるソース言語からあらゆるターゲット言語への翻訳、および弱い教師あり（または完全に教師なし）設定で訓練されたものです。

ゲーム。明示的にハンドコードされた囲碁のプレイプログラムは長い間開発されてきましたが、AlphaGo Zero（盤面の生の状態を見て手を打つConvNet）は現在、このゲームの圧倒的な最強プレイヤーになっています。他の分野でも、例えばDOTA 2やStarCraftなど、非常に似たような結果が出るのではないかと期待しています。

データベース。人工知能以外の、より伝統的なシステムにも、移行の初期段階でのヒントが見受けられます。例えば、「The Case for Learned Index Structures」では、データ管理システムのコアコンポーネントをニューラルネットワークに置き換え、キャッシュに最適化されたB-Treeを最大70%上回るスピードと1桁多いメモリ使用量の削減を実現しています。

上記のリンクの多くが、Googleで行われた仕事であることにお気づきでしょう。これは、Googleが現在、自分自身の大きな塊をSoftware 2.0コードに書き換える最前線にいるからです。「One model to rule them all」は、それがどのようなものになるかを示す初期のスケッチであり、そこでは個々のドメインの統計的な強さが、世界に関する1つの一貫した理解に統合されるのです。

The benefits of Software 2.0
Why should we prefer to port complex programs into Software 2.0? Clearly, one easy answer is that they work better in practice. However, there are a lot of other convenient reasons to prefer this stack. Let’s take a look at some of the benefits of Software 2.0 (think: a ConvNet) compared to Software 1.0 (think: a production-level C++ code base). Software 2.0 is:

Computationally homogeneous. A typical neural network is, to the first order, made up of a sandwich of only two operations: matrix multiplication and thresholding at zero (ReLU). Compare that with the instruction set of classical software, which is significantly more heterogeneous and complex. Because you only have to provide a Software 1.0 implementation for a small number of the core computational primitives (e.g. matrix multiply), it is much easier to make various correctness/performance guarantees.
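The "sandwich" can be sketched in a few lines of plain Python, assuming a tiny two-layer network with no bias terms. The helper names `matmul`, `relu`, and `forward` and the weight values are made up for illustration. Everything the network does at inference time reduces to these two primitives.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Threshold every entry at zero."""
    return [[max(0.0, v) for v in row] for row in m]

def forward(x, weights):
    """Alternate matmul and ReLU; the final layer is left linear."""
    h = x
    for i, w in enumerate(weights):
        h = matmul(h, w)
        if i < len(weights) - 1:
            h = relu(h)
    return h

x = [[1.0, -2.0]]                            # one input row with two features
w1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]    # 2 x 3 weight matrix
w2 = [[1.0], [2.0], [3.0]]                   # 3 x 1 weight matrix
out = forward(x, [w1, w2])
print(out)  # [[7.5]]
```

Only `matmul` and `relu` would need a carefully verified and optimized Software 1.0 implementation; the network itself is data passed to `forward`.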

Simple to bake into silicon. As a corollary, since the instruction set of a neural network is relatively small, it is significantly easier to implement these networks much closer to silicon, e.g. with custom ASICs, neuromorphic chips, and so on. The world will change when low-powered intelligence becomes pervasive around us. E.g., small, inexpensive chips could come with a pretrained ConvNet, a speech recognizer, and a WaveNet speech synthesis network all integrated in a small protobrain that you can attach to stuff.

Constant running time. Every iteration of a typical neural net forward pass takes exactly the same number of FLOPs. There is zero variability based on the different execution paths your code could take through some sprawling C++ code base. Of course, you could have dynamic compute graphs but the execution flow is normally still significantly constrained. This way we are also almost guaranteed to never find ourselves in unintended infinite loops.

Constant memory use. Related to the above, there is no dynamically allocated memory anywhere so there is also little possibility of swapping to disk, or memory leaks that you have to hunt down in your code.

It is highly portable. A sequence of matrix multiplies is significantly easier to run on arbitrary computational configurations compared to classical binaries or scripts.

It is very agile. If you had C++ code and someone wanted you to make it twice as fast (at the cost of performance if needed), it would be highly non-trivial to tune the system for the new spec. However, in Software 2.0 we can take our network, remove half of the channels, retrain, and there: it runs exactly at twice the speed and works a bit worse. It’s magic. Conversely, if you happen to get more data/compute, you can immediately make your program work better just by adding more channels and retraining.
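A rough way to see the speed side of this trade, assuming a fully connected network whose cost is dominated by multiply-adds. The function `forward_flops` and the 784 -> 512 -> 10 layer widths are hypothetical. The count depends only on layer shapes, never on input values, which is also why the running time is constant.

```python
def forward_flops(layer_sizes):
    """Multiply-adds for one forward pass through dense layers of these widths."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

full = forward_flops([784, 512, 10])   # hypothetical 784 -> 512 -> 10 network
half = forward_flops([784, 256, 10])   # same network with half the hidden channels
print(full, half, full / half)         # 406528 203264 2.0
```

With a single hidden layer, the arithmetic halves exactly. When several consecutive layers are narrowed, both the input and output width of the inner layers shrink, so the cost falls by more than half. This counts arithmetic only; wall-clock speed also depends on the hardware.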

Modules can meld into an optimal whole. Our software is often decomposed into modules that communicate through public functions, APIs, or endpoints. However, if two Software 2.0 modules that were originally trained separately interact, we can easily backpropagate through the whole. Think about how amazing it could be if your web browser could automatically re-design the low-level system instructions 10 stacks down to achieve a higher efficiency in loading web pages. Or if the computer vision library (e.g. OpenCV) you imported could be auto-tuned on your specific data. With 2.0, this is the default behavior.
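To make "backpropagate through the whole" concrete, here is a small sketch assuming two one-weight modules, an `encoder` computing a * x and a `head` computing b * h. The names, the data, and the learning rate are illustrative assumptions. The error signal crosses the module boundary by the chain rule, so both modules are tuned against the same end-to-end objective.

```python
def joint_step(a, b, x, y, lr):
    """One gradient step through head(encoder(x)) = b * (a * x)."""
    h = a * x              # encoder module
    pred = b * h           # head module
    err = pred - y         # d(loss)/d(pred) for loss = 0.5 * err**2
    grad_b = err * h       # gradient for the head's weight
    grad_h = err * b       # gradient flowing back across the module boundary
    grad_a = grad_h * x    # gradient for the encoder's weight
    return a - lr * grad_a, b - lr * grad_b

a, b = 1.0, 1.0                                 # weights the modules start with
data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]   # target behavior: y = 6x
for _ in range(200):
    for x, y in data:
        a, b = joint_step(a, b, x, y, lr=0.01)
print(round(a * b, 4))  # the composed program now computes roughly 6 * x
```

Neither module was told what its own output should be; only the combined behavior was specified.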

It is better than you. Finally, and most importantly, a neural network is a better piece of code than anything you or I can come up with in a large fraction of valuable verticals, which currently at the very least involve anything to do with images/video and sound/speech.

The limitations of Software 2.0
The 2.0 stack also has some of its own disadvantages. At the end of the optimization we’re left with large networks that work well, but it’s very hard to tell how. Across many application areas, we’ll be left with a choice of using a 90% accurate model we understand, or a 99% accurate model we don’t.

The 2.0 stack can fail in unintuitive and embarrassing ways, or worse, it can “silently fail”, e.g., by silently adopting biases in its training data, which are very difficult to properly analyze and examine when their sizes are easily in the millions in most cases.

Finally, we’re still discovering some of the peculiar properties of this stack. For instance, the existence of adversarial examples and attacks highlights the unintuitive nature of this stack.

ソフトウェア2.0のメリット
なぜ、複雑なプログラムをSoftware 2.0に移植することが好ましいのでしょうか？明らかに、1つの簡単な答えは、その方が実際によく動くからです。しかし、このスタックを好む便利な理由は他にもたくさんあります。ソフトウェア2.0（ConvNetを考えてみてください）とソフトウェア1.0（プロダクションレベルのC++コードベースを考えてみてください）を比較して、その利点をいくつか見ていきましょう。ソフトウェア2.0は

計算機的に均質であること。典型的なニューラルネットワークは、一次的には、行列の乗算とゼロでの閾値（ReLU）というたった2つの演算のサンドイッチで構成されています。古典的なソフトウェアの命令セットはもっと異質で複雑である。ソフトウェア1.0の実装は、少数のコアな計算プリミティブ（例えば行列の乗算）だけを提供すればよいので、様々な正しさや性能を保証することがはるかに容易です。

シリコンに焼き付けるのも簡単。補足すると、ニューラルネットワークの命令セットは比較的小さいので、カスタムASICやニューロモルフィックチップなどで、これらのネットワークをシリコンに近い形で実装することが格段に容易になる。低消費電力のインテリジェンスが私たちの周りに普及したとき、世界は変わるでしょう。例えば、小型で安価なチップには、事前に学習させたConvNet、音声認識器、WaveNet音声合成ネットワークがすべて統合され、小さな原始脳に搭載され、物に取り付けることができるようになるかもしれません。

一定の実行時間。典型的なニューラルネットのフォワードパスの各反復は、まったく同じ量のFLOPSを必要とします。C++の膨大なコードベースの中で、コードが取りうるさまざまな実行経路に基づく変動はありません。もちろん、動的な計算グラフを持つこともできますが、それでも通常、実行フローはかなり制約されます。この方法では、意図しない無限ループに陥ることがないこともほぼ保証されています。

一定のメモリ使用量。上記と関連して、動的に割り当てられるメモリはどこにもないので、ディスクへのスワップや、コード内で探し出さなければならないメモリリークの可能性もほとんどない。

移植性が高い。一連の行列乗算は、古典的なバイナリやスクリプトと比較して、任意の計算機構成で実行することが非常に容易です。

非常に俊敏である。C++のコードを持っていて、誰かがそれを（必要なら性能を犠牲にして）2倍速くすることを望んだとしたら、新しい仕様に合わせてシステムを調整することは非常に非自明なことでしょう。しかし、Software 2.0では、ネットワークからチャンネルの半分を削除して再トレーニングすれば、ちょうど2倍の速度で動作し、少しだけ性能が落ちます。これは魔法のようなものです。逆に、データや計算量が増えたら、チャンネルを増やして再トレーニングを行うだけで、すぐにプログラムをより良く動作させることができます。

モジュールは最適な全体像に融合できる。私たちのソフトウェアは、多くの場合、パブリック関数、API、またはエンドポイントを通じて通信するモジュールに分解されています。しかし、もともと別々に訓練された2つのSoftware 2.0モジュールが相互作用すれば、簡単に全体をバックプロパゲートすることができるのです。ウェブブラウザが、ウェブページの読み込み効率を上げるために、10スタック下の低レベルのシステム命令を自動的に再設計してくれたら、どんなに素晴らしいか考えてみてください。あるいは、インポートしたコンピュータビジョンライブラリ（OpenCVなど）が、特定のデータに対して自動的にチューニングされるとしたら。2.0では、これがデフォルトの動作になっています。

あなたより優れているのです。最後に、そして最も重要なことですが、ニューラルネットワークは、現在少なくとも画像/ビデオと音/音声に関係するものを含む価値ある垂直方向の大部分において、あなたや私が考え付くものより優れたコード片です。

ソフトウェア2.0の限界
2.0スタックには、それなりのデメリットもあります。最適化の果てに、うまく機能している大規模なネットワークが残されていますが、その仕組みを説明するのは非常に困難です。多くのアプリケーション領域で、90%の精度で理解できるモデルを使うか、99%の精度で理解できないモデルを使うかの選択を迫られることになるのです。

2.0スタックは、直感的でない恥ずかしい方法で失敗することがあり、さらに悪いことには、例えば学習データに含まれるバイアスを密かに取り込むなどして、「静かに失敗」することもあります。こうしたバイアスは、その規模がほとんどの場合に数百万に達するため、適切に分析・検証することが非常に困難です。

最後に、このスタックの特異な性質について、我々はまだ発見中です。例えば、敵対的な例や攻撃が存在することは、このスタックの直感的でない性質を強調しています。

Programming in the 2.0 stack
Software 1.0 is code we write. Software 2.0 is code written by the optimization based on an evaluation criterion (such as “classify this training data correctly”). It is likely that any setting where the program is not obvious but one can repeatedly evaluate the performance of it (e.g. — did you classify some images correctly? do you win games of Go?) will be subject to this transition, because the optimization can find much better code than what a human can write.

The lens through which we view trends matters. If you recognize Software 2.0 as a new and emerging programming paradigm instead of simply treating neural networks as a pretty good classifier in the class of machine learning techniques, the extrapolations become more obvious, and it’s clear that there is much more work to do.

In particular, we’ve built up a vast amount of tooling that assists humans in writing 1.0 code, such as powerful IDEs with features like syntax highlighting, debuggers, profilers, go to def, git integration, etc. In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets. For example, when the network fails in some hard or rare cases, we do not fix those predictions by writing code, but by including more labeled examples of those cases. Who is going to develop the first Software 2.0 IDEs, which help with all of the workflows in accumulating, visualizing, cleaning, labeling, and sourcing datasets? Perhaps the IDE bubbles up images that the network suspects are mislabeled based on the per-example loss, or assists in labeling by seeding labels with predictions, or suggests useful examples to label based on the uncertainty of the network’s predictions.
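The three IDE features suggested here can be sketched with a few small functions, assuming each example carries the network's predicted class probabilities. The record layout, the example ids, and the function names are hypothetical, and cross-entropy and entropy are one reasonable choice of loss and uncertainty measure rather than the only one.

```python
import math

def cross_entropy(probs, label):
    """Per-example loss: negative log-probability of the assigned label."""
    return -math.log(max(probs[label], 1e-12))

def entropy(probs):
    """Uncertainty of a predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def suspect_labels(examples, k):
    """Labeled examples with the highest loss: candidates for a labeling error."""
    return sorted(examples, key=lambda e: cross_entropy(e["probs"], e["label"]), reverse=True)[:k]

def worth_labeling(unlabeled, k):
    """Unlabeled examples the network is least sure about."""
    return sorted(unlabeled, key=lambda e: entropy(e["probs"]), reverse=True)[:k]

def seed_label(probs):
    """Pre-fill a label with the network's most likely class for a human to confirm."""
    return max(range(len(probs)), key=probs.__getitem__)

labeled = [
    {"id": "img_001", "label": 0, "probs": [0.95, 0.05]},
    {"id": "img_002", "label": 1, "probs": [0.98, 0.02]},  # network confidently disagrees
    {"id": "img_003", "label": 1, "probs": [0.30, 0.70]},
]
unlabeled = [
    {"id": "img_101", "probs": [0.99, 0.01]},
    {"id": "img_102", "probs": [0.52, 0.48]},
]
print([e["id"] for e in suspect_labels(labeled, 1)])    # ['img_002']
print([e["id"] for e in worth_labeling(unlabeled, 1)])  # ['img_102']
print(seed_label(unlabeled[0]["probs"]))                # 0
```

A high loss does not prove a label is wrong; it only ranks where a human reviewer should look first.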

Similarly, GitHub is a very successful home for Software 1.0 code. Is there space for a Software 2.0 GitHub? In this case repositories are datasets and commits are made up of additions and edits of the labels.
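One possible shape for such a repository, sketched under the assumption that a commit is nothing more than a message plus a map of label changes. `LabelCommit`, `DatasetRepo`, and the example ids are invented for illustration and do not describe any existing tool.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LabelCommit:
    message: str
    changes: dict   # example id -> new label; None removes the example

@dataclass
class DatasetRepo:
    commits: list = field(default_factory=list)

    def commit(self, message, changes):
        """Record a set of label additions and edits."""
        self.commits.append(LabelCommit(message, dict(changes)))

    def checkout(self, n_commits=None):
        """Replay the first n_commits commits (all by default) into a label map."""
        labels = {}
        for c in self.commits[:n_commits]:
            for example_id, label in c.changes.items():
                if label is None:
                    labels.pop(example_id, None)
                else:
                    labels[example_id] = label
        return labels

repo = DatasetRepo()
repo.commit("add first batch", {"img_001": "cat", "img_002": "dog"})
repo.commit("fix mislabeled img_002", {"img_002": "cat"})
print(repo.checkout())    # {'img_001': 'cat', 'img_002': 'cat'}
print(repo.checkout(1))   # {'img_001': 'cat', 'img_002': 'dog'}
```

Replaying a prefix of the history reproduces the exact dataset an earlier model was trained on, the 2.0 counterpart of checking out an old revision of source code.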

Traditional package managers and related serving infrastructure like pip, conda, docker, etc. help us more easily deploy and compose binaries. How do we effectively deploy, share, import and work with Software 2.0 binaries? What is the conda equivalent for neural networks?

In the short term, Software 2.0 will become increasingly prevalent in any domain where repeated evaluation is possible and cheap, and where the algorithm itself is difficult to design explicitly. There are many exciting opportunities to consider the entire software development ecosystem and how it can be adapted to this new programming paradigm. And in the long run, the future of this paradigm is bright because it is increasingly clear that when we develop AGI, it will certainly be written in Software 2.0.

2.0スタックでのプログラミング
ソフトウェア1.0は、私たちが書いたコードです。ソフトウェア2.0は、ある評価基準（例えば「この学習データを正しく分類せよ」）に基づき、最適化によって書かれたコードです。最適化は人間が書くよりもずっと良いコードを見つけることができるので、プログラムが明白ではないが、そのパフォーマンスを繰り返し評価できるような設定（例えば、いくつかの画像を正しく分類したか、囲碁のゲームに勝つか）は、この移行の対象となる可能性が高いです。

トレンドを見るレンズは重要です。ニューラルネットを単に機械学習技術の中のかなり優れた分類器として扱うのではなく、ソフトウェア2.0を新しく出現したプログラミングパラダイムとして認識すれば、その外挿はより明白になり、やるべきことがたくさんあることが明らかになるのです。

特に、シンタックスハイライトなどの機能を持つ強力なIDE、デバッガ、プロファイラ、go to def、git統合など、1.0のコードを書くために人間を補助する膨大なツールを構築してきました。2.0スタックでは、プログラミングは、データセットを蓄積し、マッサージし、クリーニングすることで行われます。例えば、ネットワークが難しいケースや稀なケースで失敗した場合、コードを書くことによってそれらの予測を修正するのではなく、それらのケースのラベル付けされた例をより多く含めることによって修正するのです。データセットの蓄積、視覚化、クリーニング、ラベル付け、ソーシングにおけるすべてのワークフローを支援する、最初のソフトウェア2.0 IDEを誰が開発するのだろうか？おそらくIDEは、サンプルごとの損失に基づいてネットワークがラベルの誤りを疑う画像をバブルアップしたり、ラベルに予測値をシードすることでラベル付けを支援したり、ネットワークの予測値の不確実性に基づいてラベル付けに有用なサンプルを提案したりする。

同様に、GithubはSoftware 1.0のコードのための非常に成功したホームである。ソフトウェア2.0のGithubのためのスペースはあるのだろうか？この場合、リポジトリはデータセットであり、コミットはラベルの追加と編集で構成されています。

従来のパッケージマネージャや、pip、conda、dockerなどの関連するサービングインフラは、バイナリをより簡単にデプロイし、構成するのに役立っています。Software 2.0のバイナリを効果的にデプロイ、共有、インポート、作業するにはどうしたらよいでしょうか。ニューラルネットに相当するcondaは何でしょうか？

短期的には、繰り返しの評価が可能で安価であり、アルゴリズム自体を明示的に設計することが困難なあらゆる領域で、Software 2.0がますます普及することになるでしょう。ソフトウェア開発のエコシステム全体と、この新しいプログラミングパラダイムにどのように適応させるかを検討する、多くのエキサイティングな機会が存在します。そして、長い目で見れば、このパラダイムの未来は明るい。なぜなら、我々がAGIを開発するとき、それは確実にソフトウェア2.0で書かれていることがますます明らかになってきているからである。