Big Data is on every CIO’s mind this quarter, and for good reason. Companies will have spent $4.3 billion on Big Data technologies by the end of 2012.

But here’s where it gets interesting. Those initial investments will in turn trigger a domino effect of upgrades and new initiatives that are valued at $34 billion for 2013, per Gartner. Over a five-year period, spend is estimated at $232 billion.

What you’re seeing right now is only the tip of a gigantic iceberg.
Big Data is presently synonymous with technologies like Hadoop and the “NoSQL” class of databases, including Mongo (a document store) and Cassandra (a key-value store). Today it’s possible to stream real-time analytics with ease. Spinning clusters up and down is a (relative) cinch, accomplished in 20 minutes or less. We have table stakes.
But there are new, untapped advantages and non-trivially large opportunities beyond these usual suspects.

Did you know that there are over 250K viable open source technologies on the market today? Innovation is all around us. The increasing complexity of systems, in fact, looks something like this:


We have a lot of…choices, to say the least.

What’s on our own radar, and what’s coming down the pipe for Fortune 2000 companies? What new projects are the most viable candidates for production-grade usage? Which deserve your undivided attention?

We did all the research and testing so you don’t have to. Let’s look at five new technologies that are shaking things up in Big Data. Here is the newest class of tools that you can’t afford to overlook, coming soon to an enterprise near you.

STORM AND KAFKA

Storm and Kafka are the future of stream processing, and they are already in use at a number of high-profile companies including Groupon, Alibaba, and The Weather Channel.

Born inside of Twitter, Storm is a “distributed real-time computation system”. Storm does for real-time processing what Hadoop did for batch processing. Kafka, for its part, is a messaging system developed at LinkedIn to serve as the foundation for their activity stream and the data processing pipeline behind it.

When paired together, you get the stream, you get it in real time, and you get it at linear scale.

Why should you care?
With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.

Stream processing solutions like Storm and Kafka have caught the attention of many enterprises due to their superior approach to ETL (extract, transform, load) and data integration.
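
To make the stream-processing idea a little more concrete, here is a minimal sketch in plain Python. It does not use the Storm or Kafka APIs at all: `message_log`, `parse`, `count_actions` and `run_pipeline` are hypothetical names invented for this illustration, and the `user,action` message format is an assumption. It only shows the shape of the approach, where each message is extracted, transformed, and folded into a running result as it arrives rather than waiting for a batch job.

```python
from collections import Counter, deque
from typing import Dict, Iterable, Iterator


def message_log(events: Iterable[str]) -> Iterator[str]:
    """Stand-in for a message queue: hands out events one at a time, in order."""
    queue = deque(events)
    while queue:
        yield queue.popleft()


def parse(stream: Iterator[str]) -> Iterator[Dict[str, str]]:
    """Transform step: turn 'user,action' strings into records."""
    for raw in stream:
        parts = raw.split(",")
        if len(parts) != 2:
            continue  # malformed message; simply dropped in this sketch
        user, action = (part.strip() for part in parts)
        yield {"user": user, "action": action}


def count_actions(stream: Iterator[Dict[str, str]]) -> Iterator[Counter]:
    """Aggregate step: emit an updated running count after every record."""
    totals: Counter = Counter()
    for record in stream:
        totals[record["action"]] += 1
        yield totals.copy()


def run_pipeline(events: Iterable[str]) -> Counter:
    """Wire the steps together and return the final running count."""
    latest: Counter = Counter()
    for latest in count_actions(parse(message_log(events))):
        pass  # a real consumer would act on each intermediate result here
    return latest
```

Because every step is a generator, a result is available after each message instead of only at the end of a job. A real deployment would add what this toy leaves out: distribution across machines, delivery guarantees, and fault tolerance.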

Storm and Kafka are also great at in-memory analytics and real-time decision support. Companies are quickly realizing that batch processing in Hadoop does not support real-time business needs. Real-time streaming analytics is a must-have component in any enterprise Big Data solution or stack because of how elegantly it handles the “three V’s”: volume, velocity and variety.

Storm and Kafka are the two technologies on the list that we’re most committed to at Infochimps, and it is reasonable to expect that they’ll be a formal part of our platform soon.

DRILL AND DREMEL

Drill and Dremel make large-scale, ad hoc querying of data possible, with radically lower latencies that are especially apt for data exploration. They make it possible to scan over petabytes of data in seconds, to answer ad hoc queries and, presumably, to power compelling visualizations.

Drill and Dremel put power in the hands of business analysts, and not just data engineers. The business side of the house will love Drill and Dremel.

Drill is the open source version of what Google is doing with Dremel (Google also offers Dremel-as-a-Service with its BigQuery offering). Companies are going to want to make the tool their own, which is why Drill is the thing to watch most closely. Although it’s not quite there yet, strong interest by the development community is helping the tool mature rapidly.

Why should you care?
Drill and Dremel compare favorably to Hadoop for anything ad-hoc. Hadoop is all about batch processing workflows, which creates certain disadvantages.

The Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall to Pig and Hive, many interface layers have been built on top of Hadoop to make it more friendly, and business-accessible. Yet, for all of the SQL-like familiarity, these abstraction layers ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (read: running jobs, or “workflows”).

What if you’re not worried about running jobs? What if you’re more concerned with asking questions and getting answers: slicing and dicing, looking for insights?

That’s “ad hoc exploration” in a nutshell: if you assume the data has already been processed, how can you optimize for speed? You shouldn’t have to run a new job and wait, sometimes for a considerable length of time, every time you want to ask a new question.

In stark contrast to workflow-based methodology, most business-driven BI and analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Writing MapReduce workflows is prohibitive for many business analysts. Waiting minutes for jobs to start and hours for workflows to complete is not conducive to an interactive experience of data: the comparing and contrasting, and the zooming in and out, that ultimately create fundamentally new insights.

Some data scientists even speculate that Drill and Dremel may actually be better than Hadoop in the wider sense, and a potential replacement, even. That’s a little too edgy a stance to embrace right now, but there is merit in an approach to analytics that is more query-oriented and low latency.

At Infochimps we like the Elasticsearch full-text search engine and database for doing high-level data exploration, but for truly capable Big Data querying at the (relative) seat level, we think that Drill will become the de facto solution.

R

R is an open source statistical programming language. It is incredibly powerful. Over two million (and counting) analysts use R. It’s been around since the 1990s, if you can believe it. It is a modern implementation of the S language for statistical computing that originally came out of Bell Labs. Today, R is quickly becoming the new standard for statistics.

R performs complex data science at a much smaller price (both literally and figuratively). R is making serious headway in ousting SAS and SPSS from their thrones, and has become the tool of choice for the world’s best statisticians (and data scientists, and analysts too).

Why should you care?
Because it has an unusually strong community around it, you can find R libraries for almost anything under the sun, making virtually any kind of data science capability accessible without new code. R is exciting because of who is working on it, and how much net-new innovation is happening on a daily basis. The R community is one of the most thrilling places to be in Big Data right now.

R is also a wonderful way to future-proof your Big Data program. In the last few months, literally thousands of new features have been introduced, replete with publicly available knowledge bases for every analysis type you’d want to do as an organization.

Also, R works very well with Hadoop, making it an ideal part of an integrated Big Data approach.
To keep an eye on: Julia is an interesting and growing alternative to R, because it tackles R’s notoriously slow interpreter. The community around Julia isn’t nearly as strong right now, but if you have a need for speed…

GREMLIN AND GIRAPH

Gremlin and Giraph help empower graph analysis, and are often used together with graph databases like Neo4j or InfiniteGraph or, in the case of Giraph, with Hadoop. Golden Orb is another high-profile example of a graph-based project picking up steam.

Graph databases are pretty cutting edge. They have interesting differences with relational databases, which means that sometimes you might want to take a graph approach rather than a relational approach from the very beginning.

The common analogue for graph-based approaches is Google’s Pregel, of which Gremlin and Giraph are open source alternatives. In fact, here’s a great read on how mimicry of Google technologies is a cottage industry unto itself.

Why should you care?
Graphs do a great job of modeling computer networks, and social networks, too — anything that links data together. Another common use is mapping, and geographic pathways — calculating shortest routes for example, from place A to place B (or to return to the social case, tracing the proximity of stated relationships from person A to person B).
Graphs are also popular for bioscience and physics use cases for this reason — they can chart molecular structures unusually well, for example.
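
The “shortest route from A to B” case is easy to sketch. The snippet below is plain Python, not Gremlin or Giraph; the `shortest_path` function and the small `friends` network are made up for illustration. It stores the graph as an adjacency list and uses breadth-first search, which finds a path with the fewest hops when every link counts the same.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search: return a fewest-hops path, or None if unreachable."""
    if start == goal:
        return [start]
    previous: Dict[str, str] = {}  # node -> the node we reached it from
    visited = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the back-pointers from goal to start, then reverse.
                path = [goal]
                while path[-1] != start:
                    path.append(previous[path[-1]])
                return path[::-1]
            frontier.append(neighbor)
    return None


# Hypothetical social network: who is directly connected to whom.
friends = {
    "A": ["C", "D"],
    "C": ["A", "E"],
    "D": ["A", "B"],
    "E": ["C", "B"],
    "B": ["D", "E"],
}

print(shortest_path(friends, "A", "B"))  # ['A', 'D', 'B']
```

Weighted routes, such as road distances, would need a different algorithm (Dijkstra’s, for example), and graphs too large for one machine are where the distributed frameworks discussed here come in.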

Big picture, graph databases and analysis languages and frameworks are a great illustration of how the world is starting to realize that Big Data is not about having one database or one programming framework that accomplishes everything. Graph-based approaches are a killer app, so to speak, for anything that involves large networks with many nodes, and many linked pathways between those nodes.

The most innovative scientists and engineers know to apply the right tool for each job, making sure everything plays nice and can talk to each other (the glue in this sense becomes the core competence).

SAP HANA

SAP Hana is an in-memory analytics platform that includes an in-memory database and a suite of tools and software for creating analytical processes and moving data in and out, in the right formats.

Why should you care?
SAP is going against the grain of most entrenched enterprise mega-players by providing a very powerful product, free for development use. And that’s not all: SAP is also creating meaningful incentives for startups to embrace Hana. They are authentically fostering community involvement, and there is uniformly positive sentiment around Hana as a result.

Hana greatly benefits applications with unusually fast processing needs, such as financial modeling and decision support, website personalization, and fraud detection, among many other use cases.

The biggest drawback of Hana is that “in-memory” means, by definition, that it keeps its working data in main memory (RAM), which has clear speed advantages but is much more expensive than conventional disk storage.

For organizations that don’t mind the added operational cost, Hana means incredible speed for very low-latency big data processing.

D3

D3 doesn’t make the list quite yet, but it’s close, and worth mentioning for that reason.

D3 is a JavaScript document visualization library that revolutionizes how powerfully and creatively we can visualize information and make data truly interactive. It was created by Michael Bostock and came out of his work at the New York Times, where he is the Graphics Editor.
For example, you can use D3 to generate an HTML table from an array of numbers. Or, you can use the same data to create an interactive bar chart with smooth transitions and interaction.
Here’s an example of D3 in action, making President Obama’s 2013 budget proposal understandable, and navigable.
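
The table case mentioned above is simple enough to sketch without D3. The snippet below is Python, not D3 or JavaScript, and `numbers_to_html_table` is a hypothetical helper written only for this illustration. It shows the underlying idea of mapping each element of an array to one row of markup; D3 applies the same data-to-element mapping inside the browser and adds interaction and transitions on top.

```python
from html import escape
from typing import Sequence


def numbers_to_html_table(values: Sequence[float], header: str = "Value") -> str:
    """Render one table row per number, under a single header row."""
    head = f"  <tr><th>{escape(header)}</th></tr>"
    rows = "\n".join(f"  <tr><td>{escape(str(v))}</td></tr>" for v in values)
    body = f"{head}\n{rows}" if rows else head
    return f"<table>\n{body}\n</table>"


print(numbers_to_html_table([4, 8, 15, 16, 23, 42]))
```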

With D3, programmers can create dashboards galore. Organizations of all sizes are quickly embracing D3 as a superior visualization platform to the heads-up displays of yesteryear.

Editor’s note: Tim Gasper is the Product Manager at Infochimps, the #1 Big Data platform in the cloud. He leads product marketing, product development, and customer discovery. Previously, he was co-founder and CMO at Keepstream, a social media curation and analytics company that Infochimps acquired in August of 2010. You should follow him on Twitter here.



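I have been following Storm and Kafka since 2011. Storm also runs in some second-tier applications at Alibaba. Overall, Storm, which has only just turned one year old, is becoming more and more stable under nathanmarz’s polishing, and it now has some production deployments. So I am personally quite optimistic about this technology. At the moment we cannot do real-time processing with Hadoop, and we use HBase as our main database, which gets us by for now, but I would still like to try Storm. I have paid less attention to Kafka, though the two are said to work very well together; I have not run that combination myself.

Drill is an Apache open source project. I previously read Google’s Dremel paper but could not really follow it. I have not yet run into an environment that calls for it, and the community is only just heating up, so I have not had much time to follow it and have shelved it for now.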










2019-03-02 23:43





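This post is a translation of an article that gives a forward-looking introduction to technologies in the Big Data field. I think the article is well written and the technologies it recommends are reasonably representative, so I translated it for anyone who is interested. I have not been working in Big Data processing for long myself, and my first real project is still in development, but the field has drawn me in, which is why I wanted to write about it. Original link: http://techcrunch.com/2012/10/27/big-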

