There are countless open source projects with crazy names in the software world today, but the vast majority of them never make it onto enterprises’ collective radar. Hadoop is an exception of pachydermic proportions.
如今的软件界有着数不清的开源项目,它们拥有疯狂的名字,但其中的大多数从来都没有入过企业的法眼,只有Hadoop是个例外。
Named after a child’s toy elephant, Hadoop is now powering big data applications at companies such as Yahoo YHOO 2.57% and Facebook FB -0.46% ; more than half of the Fortune 50 use it, providers say.
Hadoop的名字来源于一个小孩的玩具,如今已被用于雅虎(Yahoo)和Facebook等公司的大数据程序中。供应商表示,《财富》50强中有半数以上的公司都在用它。
The software’s “refreshingly unique approach to data management is transforming how companies store, process, analyze and share big data,” according toForrester analyst Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”
根据弗雷斯特研究公司(Forrester)分析师麦克o瓜尔蒂耶里的说法,这个软件“在数据管理上采用了令人耳目一新的独特方法,改变了各公司存储、处理、分析和分享大数据的方式。”弗雷斯特认为Hadoop会成为大型企业必备的架构。Hadoop在2012年的全球市值为15亿美元,而到2020年,人们估计它的价值将会达到502亿美元。
Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020, it is expected to reach $50.2 billion.
一个草根的开源项目最终成了行业标准,并不是一件常有的事。Hadoop是如何做到的?
It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?
“一个拥有迫切需求的市场”
‘A market that was in desperate need’
分析公司RedMonk共同创始人和首席分析师史蒂芬o奥格雷迪说:“Hadoop是由基础的差异化技术、获得许可的开源代码库和迫切需要解决数据爆炸的方法的市场三者结合形成的巧合。从这一点上来说,它的成功并不令人意外。”
“Hadoop was a happy coincidence of a fundamentally differentiated technology, a permissively licensed open source codebase and a market that was in desperate need of a solution for exploding volumes of data,” said RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”
这个软件的创造者是道格o卡廷和麦克o卡法雷拉。它与许多其他发明一样,都是应需而生。2002年,两人都在为一个叫做Nutch的开源搜索引擎工作。卡廷说:“我们取得了一些进展,在小范围的机器上运行了它。但我们仍然不清楚要怎么扩大它的使用范围,让它像谷歌(Google)一样被成千上万的机器使用。”
Created by Doug Cutting and Mike Cafarella, the software—like so many other inventions—was born of necessity. In 2002, the pair were working on an open source search engine called Nutch. “We were making progress and running it on a small cluster, but it was hard to imagine how we’d scale it up to running on thousands of machines the way we suspected Google was,” Cutting said.
之后不久,谷歌就谷歌文件系统(Google File System)和MapReduce发表了一系列学术论文,卡法雷拉说:“于是我们很快就清楚了,Nutch需要拥有一些类似的架构。”
Shortly thereafter Google GOOG -0.34% published a series of academic papers on its own Google File System and MapReduce infrastructure systems, and “it was immediately clear that we needed some similar infrastructure for Nutch,” Cafarella said.
卡廷解释道:“谷歌处理问题的方法与众不同,十分有用。”目前为止,人们通常认为“你需要为每一个想要完成的分布式任务建立专门的系统”,而在这一点上,谷歌提供了一个通用的自动化架构来完成分布式计算。卡廷说:“它能够处理分布式计算中的那些困难的部分,如此一来,人们就可以专心编写自己的程序。”
“The way Google was approaching things was different and powerful,” Cutting explained. Whereas so far at that point “you had to build a special-purpose system for each distributed thing you wanted to do,” Google’s approach offered instead a general-purpose automated framework for distributed computing. “It took care of the hard part of distributed computing so you could focus just on your application,” Cutting said.
卡廷和卡法雷拉【如今分别是Cloudera首席架构师和密歇根大学(University of Michigan)计算机科学和工程专业的助理教授】知道,他们得做出自己的架构——不仅是为了Nutch,也是为了造福其他业内人士——他们明白自己想把它做成开源。
Both Cutting and Cafarella (who are now chief architect at Cloudera and University of Michigan assistant professor of computer science and engineering, respectively) knew they wanted to make a version of their own—not just for Nutch, but for the benefit of others as well—and they knew they wanted to make it open source.
卡廷说:“我不喜欢商业的那些事,我只是个搞技术的。我喜欢写代码,与同事合作解决问题,完善我们的产品,而不是试着把它卖掉。我更愿意告诉别人‘这一点上它做得不错,那一点上太糟糕了,也许我们可以改进一下。’能够当一个彻底诚实的人感觉很好,而在商业环境中,你很难保持这一点。”
“I don’t enjoy the business aspects,” Cutting said. “I’m a technical guy. I enjoy working on the code, tackling the problems with peers and trying to improve it, not trying to sell it. I’d much rather tell people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can make it better.’ To be able to be brutally honest is really nice—it’s much harder to be that way in a commercial setting.”
但是这两人知道,这项技术一旦取得成功,将会具有巨大的潜力。卡廷说:“如果我没判断错,这是项很有用的技术,许多人都想用,那我就能付我的房租了,我们的初创公司也就没那么大风险了。”
But the pair knew that the potential upside of success could be staggering. “If I was right and it was useful technology that lots of people wanted to use, I’d be able to pay my rent—and without having to risk my shirt on a startup,” Cutting said.
对卡法雷拉而言,“将Nutch开源,部分原因是想要看到搜索引擎技术摆脱少数几家公司的垄断,但这也是一项战略决定。如此一来,我们就最可能得到来自大公司的工程师的帮助。我们特地选择了一个能让其他公司最轻松地参与进来的开源许可。”
It was a good decision. “Hadoop would not have become a big success without large investments from Yahoo and other firms,” Cafarella said.
这是一项英明的决定。卡法雷拉说:“如果没有雅虎和其他公司的大量投资,Hadoop可能不会这么成功。”
‘How would you compete with open source?’
“没谁拼得过开源产品?”
So Hadoop borrowed an idea from Google, made the concept open source, and both encouraged and got investment from powerhouses like Yahoo. But that wasn’t all that drove its success. Luck—in the form of sheer, unanticipated market demand—also played a key role.
所以Hadoop借用了一个来自谷歌的点子,把这个概念开源,然后得到了雅虎等大公司的鼓励和投资。但这并不是导致它成功的全部因素。运气——完全没有预想到的市场需求——也在其中起到了关键因素。
“I knew other people would probably have similar problems, but I had no idea just how many other people,” Cutting said. “I thought it would be mostly people building text search engines. I didn’t see it being used by folks in insurance, banking, oil discovery—all these places where it’s being used today.”
卡廷说:“我知道其他人可能会碰到类似的问题,但我不知道居然这么多人都有。我觉得大部分用户都会是文本搜索引擎的开发人员,可没料到许多从事保险业、银行业和石油勘探业的人也会用它——它已经在这些领域得到了应用。”
Looking back, “my conjecture is that we were early enough, and that the combination of being first movers and being open source and being a substantial effort kept there from being a lot of competitors early on,” he said. “Mike and I got so far, but it took tens of engineers from Yahoo several more years to make it stable.”
回首往昔,卡廷说:“我猜我们开展得足够早,作为第一批推动者,我们做的又是开源产品,也付出了大量努力,这一切让我们与许多早期竞争者区分了开来。麦克和我已经研发了很久,不过来自雅虎的几十位工程师又花了好几年时间才让这个架构变得稳定。”
And even if a competitor did manage to catch up, “how would you compete with something open source?” Cutting said. “Competing against open source is a tough game—everybody else is collaborating on it; the cost is zero. It’s easier to join than to fight.”
卡廷表示,即便有竞争者想要迎头赶上,“你又怎么能拼得过开源产品呢?和开源产品竞争是非常困难的事——其他所有人都会为它做贡献,他们没有成本。加入他们比对抗他们更容易。”
IBM IBM -0.24% , Microsoft MSFT -1.30% , and Oracle ORCL 0.00% are among the large companies that chose to collaborate with Hadoop.
国际商业机器公司(IBM)、微软(Microsoft)和甲骨文(Oracle)就在那些选择同Hadoop合作的大公司之列。
Though Cafarella isn’t surprised that Web companies use Hadoop, he is astonished at “how many people now have data management problems that 12 years ago were exceedingly rare,” he said. “Everyone now has the problems that used to belong to just Yahoo and Google.”
尽管卡法雷拉并不奇怪网络公司会使用Hadoop,但他表示,他对“这么多人都碰到了12年前极为罕见的数据管理问题”感到震惊。“曾经只有雅虎和谷歌才存在的问题,现在困扰着每一个人。”
Hadoop represents “somewhat of a turning point in the primary drivers of open source software technology,” said Jay Lyman, a senior analyst for enterprise software with 451 Research. Before, open source software such as the Linux operating system were best known for offering a cost-effective alternative to proprietary software like Microsoft’s Windows. “Cost savings and efficiency drove much of the enterprise use,” Lyman said.
信息技术研究公司451 Research的企业软件高级研究员杰伊o莱曼表示,Hadoop代表了“一种开源软件技术的主要推动者的转折点。”在这之前,开源软件比如Linux操作系统,是因为提供了微软Windows这类专有软件之外的合算选择,才声名鹊起。“企业使用它们,大部分都是出于节约成本、提高效益的考量。”
With the advent of NoSQL databases and Hadoop, however, “we saw innovation among the primary drivers of adoption and use,” Lyman said. “When it comes to NoSQL or Hadoop technology, there is not really a proprietary alternative.”
不过,随着非关系型数据库(NoSQL)和Hadoop的出现,莱曼说,“我们看到使用者中出现了有创新之举的推动者。非关系型数据库和Hadoop技术并不真正属于专有技术之外的其他选择。”
Hadoop’s success has come as a pleasant surprise to its creators. “I didn’t expect an open source project would ever take over an industry like this,” Cutting said. “I’m overjoyed.”
Hadoop的成功对创造者来说是一种惊喜。卡廷说:“我没有想到一个开源项目能够像这样引领着行业。我太高兴了。”
And it’s still on a roll. “Hadoop is now much bigger than the original components,” Cafarella said. “It’s an entire stack of tools, and the stack keeps growing. Individual components might have some competition—mainly MapReduce—but I don’t see any strong alternative to the overall Hadoop ecosystem.”
它仍然发展得如火如荼。卡法雷拉说:“比起最早的组件,Hadoop现在庞大多了。它已经成了一整套工具,而且还在继续扩充。单个的组件也许会遭遇竞争者——主要是MapReduce——但我没有见过能够取代整个Hadoop系统的强大对手。”
The project’s adaptability “argues for its continued success,” RedMonk’s O’Grady said. “Hadoop today is a very different, and more versatile, project than it was even a year or two ago.”
RedMonk的奥格雷迪说,这个项目的适应性“能够让它不断成功。现在的Hadoop非常与众不同,比起一年或者两年前,它的功能更加强大了。”
But there’s plenty of work to be done. Looking ahead, Cutting—with the support of Cloudera—has begun to focus on the policy needed to accommodate big data technology.
不过未来还有许多工作要做。接下来,在Cloudera的支持下,卡廷要开始专注于研究与大数据技术配套的法律政策。
“Now that we have this technology and so much digitization of just about every aspect of commerce and government and we have these tools to process all this digital data, we need to make sure we’re using it in ways we think are in the interests of society,” he said. “In many ways, the policy needs to catch up with the technology.
卡廷说:“现在我们有了这项技术,商业和政府的方方面面几乎都已经大幅数字化了,我们也有处理所有这些数据的工具。我们现在需要保证使用它们是出于造福社会的目的。从许多方面看,政策都需要紧跟技术的脚步。”
“One way or other, we are going to end up with laws. We want them to be the right ones.”
“不管怎样,我们最终都要涉及法律。我们希望它们用在正当的地方。”