From Grassroots Project to Industry Standard: The Legendary Rise of a Small Open Source Project

從草根工程到行業標準：一個開源小項目的進化神話

There are countless open source projects with crazy names in the software world today, but the vast majority of them never make it onto enterprises’ collective radar. Hadoop is an exception of pachydermic proportions.

如今的軟件界有着數不清的開源項目，它們擁有瘋狂的名字，但其中的大多數從來都沒有入過企業的法眼，只有Hadoop是個例外。

Named after a child’s toy elephant, Hadoop is now powering big data applications at companies such as Yahoo and Facebook; more than half of the Fortune 50 use it, providers say.

Hadoop的名字來源於一個小孩的玩具，如今已被用於雅虎（Yahoo）和Facebook等公司的大數據程序中。供應商表示，《財富》50強中有半數以上的公司都在用它。

The software’s “refreshingly unique approach to data management is transforming how companies store, process, analyze and share big data,” according to Forrester analyst Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”

根據弗雷斯特研究公司（Forrester）分析師麥克·瓜爾蒂耶裏的說法，這個軟件“在數據管理上採用了令人耳目一新的獨特方法，改變了各公司存儲、處理、分析和分享大數據的方式。”弗雷斯特認爲Hadoop會成爲大型企業必備的架構。Hadoop在2012年的全球市值爲15億美元，而到2020年，人們估計它的價值將會達到502億美元。

Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020, it is expected to reach $50.2 billion.

一個草根的開源項目最終成了行業標準，並不是一件常有的事。Hadoop是如何做到的？

It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?

“一個擁有迫切需求的市場”

‘A market that was in desperate need’

分析公司RedMonk共同創始人和首席分析師史蒂芬·奧格雷迪說：“Hadoop是根本上與衆不同的技術、採用寬鬆許可的開源代碼庫，以及迫切需要解決數據爆炸問題的市場三者結合而成的幸運巧合。從這一點上來說，它的成功並不令人意外。”

“Hadoop was a happy coincidence of a fundamentally differentiated technology, a permissively licensed open source codebase and a market that was in desperate need of a solution for exploding volumes of data,” said RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”

這個軟件的創造者是道格·卡廷和麥克·卡法雷拉。它與許多其他發明一樣，都是應需而生。2002年，兩人都在爲一個叫做Nutch的開源搜索引擎工作。卡廷說：“我們取得了一些進展，在小範圍的機器上運行了它。但我們仍然不清楚要怎麼擴大它的使用範圍，讓它像谷歌（Google）一樣被成千上萬的機器使用。”

Created by Doug Cutting and Mike Cafarella, the software—like so many other inventions—was born of necessity. In 2002, the pair were working on an open source search engine called Nutch. “We were making progress and running it on a small cluster, but it was hard to imagine how we’d scale it up to running on thousands of machines the way we suspected Google was,” Cutting said.

之後不久，谷歌就谷歌文件系統（Google File System）和MapReduce發表了一系列學術論文，卡法雷拉說：“於是我們很快就清楚了，Nutch需要擁有一些類似的架構。”

Shortly thereafter Google published a series of academic papers on its own Google File System and MapReduce infrastructure systems, and “it was immediately clear that we needed some similar infrastructure for Nutch,” Cafarella said.

卡廷解釋道：“谷歌處理問題的方法與衆不同，十分有用。”目前爲止，人們通常認爲“你需要爲每一個想要完成的分佈式任務建立專門的系統”，而在這一點上，谷歌提供了一個通用的自動化架構來完成分佈式計算。卡廷說：“它能夠處理分佈式計算中的那些困難的部分，如此一來，人們就可以專心編寫自己的程序。”

“The way Google was approaching things was different and powerful,” Cutting explained. Whereas up to that point “you had to build a special-purpose system for each distributed thing you wanted to do,” Google’s approach offered instead a general-purpose automated framework for distributed computing. “It took care of the hard part of distributed computing so you could focus just on your application,” Cutting said.

卡廷和卡法雷拉【如今分別是Cloudera首席架構師和密歇根大學（University of Michigan）計算機科學和工程專業的助理教授】知道，他們得做出自己的架構——不僅是爲了Nutch，也是爲了造福其他業內人士——他們明白自己想把它做成開源。

Both Cutting and Cafarella (who are now chief architect at Cloudera and University of Michigan assistant professor of computer science and engineering, respectively) knew they wanted to make a version of their own—not just for Nutch, but for the benefit of others as well—and they knew they wanted to make it open source.

卡廷說：“我不喜歡商業的那些事，我只是個搞技術的。我喜歡寫代碼，與同事合作解決問題，完善我們的產品，而不是試着把它賣掉。我更願意告訴別人‘這一點上它做得不錯，那一點上太糟糕了，也許我們可以改進一下。’能夠當一個徹底誠實的人感覺很好，而在商業環境中，你很難保持這一點。”

“I don’t enjoy the business aspects,” Cutting said. “I’m a technical guy. I enjoy working on the code, tackling the problems with peers and trying to improve it, not trying to sell it. I’d much rather tell people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can make it better.’ To be able to be brutally honest is really nice—it’s much harder to be that way in a commercial setting.”

但是這兩人知道，這項技術一旦取得成功，將會具有巨大的潛力。卡廷說：“如果我沒判斷錯，這是項很有用的技術，許多人都想用，那我就能付得起房租了，而且不必冒着傾家蕩產的風險去創業。”

But the pair knew that the potential upside of success could be staggering. “If I was right and it was useful technology that lots of people wanted to use, I’d be able to pay my rent—and without having to risk my shirt on a startup,” Cutting said.

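For Cafarella, open-sourcing Nutch was partly about wanting to see search engine technology break free of the control of a handful of companies, but it was also a strategic decision: it gave the pair the best chance of getting help from engineers at large companies. They deliberately chose an open source license that would make it as easy as possible for other companies to participate.

對卡法雷拉而言，“將Nutch開源，部分原因是想要看到搜索引擎技術擺脫少數幾家公司的壟斷，但這也是一項戰略決定。如此一來，我們就最可能得到來自大公司的工程師的幫助。我們特地選擇了一個能讓其他公司最輕鬆地參與進來的開源許可。”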

It was a good decision. “Hadoop would not have become a big success without large investments from Yahoo and other firms,” Cafarella said.

這是一項英明的決定。卡法雷拉說：“如果沒有雅虎和其他公司的大量投資，Hadoop可能不會這麼成功。”

‘How would you compete with open source?’

“你怎麼拼得過開源產品？”

So Hadoop borrowed an idea from Google, made the concept open source, and both encouraged and got investment from powerhouses like Yahoo. But that wasn’t all that drove its success. Luck—in the form of sheer, unanticipated market demand—also played a key role.

所以Hadoop借用了一個來自谷歌的點子，把這個概念開源，然後得到了雅虎等大公司的鼓勵和投資。但這並不是導致它成功的全部因素。運氣——完全沒有預想到的市場需求——也在其中起到了關鍵作用。

“I knew other people would probably have similar problems, but I had no idea just how many other people,” Cutting said. “I thought it would be mostly people building text search engines. I didn’t see it being used by folks in insurance, banking, oil discovery—all these places where it’s being used today.”

卡廷說：“我知道其他人可能會碰到類似的問題，但我不知道居然這麼多人都有。我覺得大部分用戶都會是文本搜索引擎的開發人員，可沒料到許多從事保險業、銀行業和石油勘探業的人也會用它——它已經在這些領域得到了應用。”

Looking back, “my conjecture is that we were early enough, and that the combination of being first movers and being open source and being a substantial effort kept there from being a lot of competitors early on,” he said. “Mike and I got so far, but it took tens of engineers from Yahoo several more years to make it stable.”

回首往昔，卡廷說：“我猜是因爲我們起步足夠早，既是先行者，做的又是開源產品，投入的力量也相當可觀，這幾點加在一起使得早期沒有出現太多競爭者。麥克和我只走到了一定的程度，之後雅虎的幾十位工程師又花了好幾年時間才讓它變得穩定。”

And even if a competitor did manage to catch up, “how would you compete with something open source?” Cutting said. “Competing against open source is a tough game—everybody else is collaborating on it; the cost is zero. It’s easier to join than to fight.”

卡廷表示，即便有競爭者想要迎頭趕上，“你又怎麼能拼得過開源產品呢？和開源產品競爭是非常困難的事——其他所有人都會爲它做貢獻，他們沒有成本。加入他們比對抗他們更容易。”

IBM, Microsoft, and Oracle are among the large companies that chose to collaborate with Hadoop.

國際商業機器公司（IBM）、微軟（Microsoft）和甲骨文（Oracle）就在那些選擇同Hadoop合作的大公司之列。

Though Cafarella isn’t surprised that Web companies use Hadoop, he is astonished at “how many people now have data management problems that 12 years ago were exceedingly rare,” he said. “Everyone now has the problems that used to belong to just Yahoo and Google.”

儘管卡法雷拉並不奇怪網絡公司會使用Hadoop，但他表示，他對“這麼多人都碰到了12年前極爲罕見的數據管理問題”感到震驚。“曾經只有雅虎和谷歌才存在的問題，現在困擾着每一個人。”

Hadoop represents “somewhat of a turning point in the primary drivers of open source software technology,” said Jay Lyman, a senior analyst for enterprise software with 451 Research. Before, open source software such as the Linux operating system was best known for offering a cost-effective alternative to proprietary software like Microsoft’s Windows. “Cost savings and efficiency drove much of the enterprise use,” Lyman said.

信息技術研究公司451 Research的企業軟件高級研究員傑伊·萊曼表示，Hadoop代表了“一種開源軟件技術的主要推動者的轉折點。”在這之前，開源軟件比如Linux操作系統，是因爲提供了微軟Windows這類專有軟件之外的合算選擇，才聲名鵲起。“企業使用它們，大部分都是出於節約成本、提高效益的考量。”

With the advent of NoSQL databases and Hadoop, however, “we saw innovation among the primary drivers of adoption and use,” Lyman said. “When it comes to NoSQL or Hadoop technology, there is not really a proprietary alternative.”

不過，隨着非關係型數據庫（NoSQL）和Hadoop的出現，萊曼說，“我們看到創新成爲了採用和使用的主要推動力之一。就NoSQL或Hadoop技術而言，其實並不存在專有的替代品。”

Hadoop’s success has come as a pleasant surprise to its creators. “I didn’t expect an open source project would ever take over an industry like this,” Cutting said. “I’m overjoyed.”

Hadoop的成功對創造者來說是一種驚喜。卡廷說：“我沒有想到一個開源項目能夠像這樣引領着行業。我太高興了。”

And it’s still on a roll. “Hadoop is now much bigger than the original components,” Cafarella said. “It’s an entire stack of tools, and the stack keeps growing. Individual components might have some competition—mainly MapReduce—but I don’t see any strong alternative to the overall Hadoop ecosystem.”

它仍然發展得如火如荼。卡法雷拉說：“比起最早的組件，Hadoop現在龐大多了。它已經成了一整套工具，而且還在繼續擴充。單個的組件也許會遭遇競爭者——主要是MapReduce——但我沒有見過能夠取代整個Hadoop系統的強大對手。”

The project’s adaptability “argues for its continued success,” RedMonk’s O’Grady said. “Hadoop today is a very different, and more versatile, project than it was even a year or two ago.”

RedMonk的奧格雷迪說，這個項目的適應性“能夠讓它不斷成功。現在的Hadoop非常與衆不同，比起一年或者兩年前，它的功能更加強大了。”

But there’s plenty of work to be done. Looking ahead, Cutting—with the support of Cloudera—has begun to focus on the policy needed to accommodate big data technology.

不過未來還有許多工作要做。接下來，在Cloudera的支持下，卡廷要開始專注於研究與大數據技術配套的法律政策。

“Now that we have this technology and so much digitization of just about every aspect of commerce and government and we have these tools to process all this digital data, we need to make sure we’re using it in ways we think are in the interests of society,” he said. “In many ways, the policy needs to catch up with the technology.

卡廷說：“現在我們有了這項技術，商業和政府的方方面面幾乎都已經大幅數字化了，我們也有處理所有這些數據的工具。我們現在需要保證使用它們是出於造福社會的目的。從許多方面看，政策都需要緊跟技術的腳步。”

“One way or other, we are going to end up with laws. We want them to be the right ones.”

“不管怎樣，我們最終都會有相關的法律。我們希望它們是正確的法律。”