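Welcome to a special Monday edition of Hadoop Weekly. This week brings plenty of technical news about Spark, Kafka, Beam, and Kudu. If you are looking for something more cutting-edge, Apache Metron (incubating) has published its first release. Metron is an evolving general-purpose security system built on Hadoop.
Technical News
This article describes how to build stream processing systems on AWS. It covers simple combinations such as Amazon Kinesis, AWS Lambda, and the Kinesis S3 connector, as well as a more complex setup for real-time analytics on AWS.
This article shows how to use Spark Testing Base, a Spark testing framework written in Scala that can be called from Java. The sample code shows how to refactor Spark code so that the logic can be tested in isolation, and how to work with some of the clunkier Scala APIs from Java.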
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
The Altiscale blog outlines the trade-offs between building thin and uber jars for Spark, and demonstrates building both kinds with Maven and SBT.
https://www.altiscale.com/blog/spark-on-hadoop-thin-jars/
LinkedIn describes its Kafka ecosystem, which includes a specialized Kafka producer, a REST API for non-Java clients, an Avro schema registry, and Gobblin (a tool for loading data into Hadoop), among other components.
https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin
This Spark Streaming tutorial shows how to pull tweets through the twitter4j API, filter them by hashtag, and run sentiment analysis on them.
https://www.mapr.com/blog/spark-streaming-and-twitter-sentiment-analysis
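Apache Kudu (incubating) is an excellent companion to Apache Impala (incubating) because it efficiently serves both broad analytic queries and targeted lookups. This article describes technical details of the integration, such as how Kudu's design enables efficient queries and how to run insert, update, and delete operations against Kudu through Impala.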
http://blog.cloudera.com/blog/2016/04/how-to-use-impala-and-kudu-together-for-analytic-workloads/
MapR has a post on scaling out an existing scikit-learn model with spark-sklearn. It walks through building a model on an Airbnb dataset and running cross-validation with spark-sklearn.
https://www.mapr.com/blog/predicting-airbnb-listing-prices-scikit-learn-and-apache-spark
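The AWS Big Data Blog has a tutorial on using HBase and Hive on Amazon EMR. It introduces HBase, describes how to restore an HBase table from S3, and demonstrates how Hive integrates with HBase.
This article describes the challenge of giving students hands-on experience in a big data course. After several iterations and alternatives, the author settled on a good solution: Altiscale's Hadoop-as-a-Service.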
https://www.altiscale.com/blog/hadoop-as-a-service-in-the-classroom/
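In a guest post on the Cloudera blog, the author compares Parquet and Avro across two datasets, one narrow (3 columns) and one wide (103 columns). After testing queries and operations with Spark and Spark SQL, the author found that Parquet and Avro sometimes perform similarly when querying serialized data, although in most cases Parquet was faster to query and produced smaller serialized data.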
http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/
This article describes how to use SparkR in a distribution such as CDH, even though this setup is not officially supported. With R installed locally on the YARN workers, jobs run after minor changes.
http://www.nodalpoint.com/sparkr-in-cloudera-hadoop/
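Many open-source frameworks can run MapReduce-style work through higher-level programming models. Historically each was tied to a single execution framework (for example MapReduce or Storm), but recent changes have opened that up. Apache Beam (incubating) goes a step further by spanning both batch and streaming execution modes, with a richer built-in computation model.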
http://www.datanami.com/2016/04/22/apache-beam-emerges-ambitious-goal-unify-big-data-development/
The Apache blog has published a seven-part series comparing HBase write performance on HDD, SSD, and RAMDISK. Based on the analysis, the authors identify and propose several features that are not yet implemented in HBase and HDFS.
https://blogs.apache.org/hbase/entry/hdfs_hsm_and_hbase_part
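Other News
Tom White, author of "Hadoop: The Definitive Guide", writes about how he got into Apache Hadoop. His early contributions centered on integrating Hadoop with Amazon Web Services, and AWS has since become an important part of the Hadoop project's success.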
http://vision.cloudera.com/how-i-got-into-hadoop/
Fluo, a distributed processing engine for Apache Accumulo, has submitted a proposal to the Apache Incubator.
https://wiki.apache.org/incubator/FluoProposal
Apache Phoenix, the SQL-on-HBase system, has announced a conference to be held after HBaseCon. The half-day event will cover Phoenix internals and use cases.
http://hortonworks.com/blog/announcing-first-annual-phoenixcon-apache-phoenix-user-conference/
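Releases
Apache Metron, a security framework built on Hadoop, has released version 0.1. Hortonworks supports it as a technical preview, and its posts describe how to get started, how to contribute, how to use the Metron UI, and more.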
http://hortonworks.com/blog/apache-metron-tech-preview-1-come-get/
http://hortonworks.com/blog/apache-metron-use-case-finding-needle-haystack/
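Apache NiFi released version 0.6.1 this week, a maintenance release that fixes more than 10 bugs.
Apache Flink released version 1.0.2 this week. The release includes bug fixes, performance improvements when using RocksDB, and some documentation improvements.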
http://flink.apache.org/news/2016/04/22/release-1.0.2.html
Amazon released a new version of Amazon EMR, which now supports HBase 1.2.
https://aws.amazon.com/blogs/aws/amazon-emr-update-apache-hbase-1-2-is-now-available/
Events
China
None
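April 17, 2016
Compiled and translated by 启明星辰 (Venustech), Platform and Big Data Group
Hortonworks made several announcements at this week's Hadoop Summit Europe, and they run throughout this issue. Apache Storm released version 1.0.0 with impressive new features. In technical news, there are a number of articles on building large-scale services with Kafka and on testing distributed systems. If you missed Hadoop Summit, don't worry: the talk videos are already online.
Technical News
Smyte writes about its infrastructure for detecting spam and fraud in real time from event streams. The original event-processing system was built on Kafka, Redis, Secor, and S3. To handle growing scale at lower cost, they moved to a disk-based design that speaks the Redis protocol on top of RocksDB and uses Kafka for replication.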
https://medium.com/the-smyte-blog/counting-with-domain-specific-databases-73c660472da
This article combines rsyslog, Kafka, and AWS with the ELK stack (ElasticSearch, Logstash, Kibana) to address issues such as backpressure, scale, and maintenance. It covers tips for integrating rsyslog with Kafka and for handling schemas, and describes running Kafka and Zookeeper in auto scaling groups on AWS at scale.
https://www.bashton.com/blog/2016/elk-on-ark/
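Hortonworks writes about the data governance features coming to Apache Atlas and Apache Ranger: classification-based access control, data expiry policies, location-specific policies, prohibition of dataset combinations, and cross-component lineage (for example tracking data from Kafka to Storm to Hive).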
http://hortonworks.com/blog/the-next-generation-of-hadoop-based-security-data-governance/
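Apache HAWQ (incubating) is a Greenplum-based SQL engine for querying data in HDFS. This article discusses its classic design and the many improvements in the new version, including how it differs from Spark and MapReduce, the challenges Hadoop poses for a classic MPP design, and how HAWQ's new design combines MPP and batch-processing techniques to get the benefits of both.
The Cloudera blog describes AgenTEST, a test tool for fault injection and network partitioning in distributed systems such as Hadoop. It can inject network faults (for example packet loss), resource exhaustion (for example CPU, IO, or disk space), and more. For network-partition testing, it can evaluate ring partitions, bridge partitions, and others.
The Hortonworks blog previews HDP 2.4.2, which will include new versions of Spark and Zeppelin. Both a Spark 2.0 preview and new Zeppelin features are covered.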
http://hortonworks.com/blog/apache-spark-apache-zeppelin-whats-coming-in-hdp-2-4-2/
Cask writes about how it uses long-running tests to verify the correctness of a distributed system before and after rare events such as HBase region compaction.
http://blog.cask.co/2016/04/long-running-tests-in-cdap/
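This article shows how to combine SparkR with Amazon EMR for geospatial analysis. Using SparkR's Hive integration, data in S3 can be mapped directly to a Hive external table. From there the data can be loaded into memory and analyzed in R, which makes high-quality data visualization easy.
MapR has a tutorial on analyzing Major League Baseball team statistics with Pig and Hive. Pig handles the initial data preparation and Hive provides a SQL-based query environment. With the Hive ODBC driver and the Hive server, Microsoft Excel can also be used to fetch and analyze the data.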
https://www.mapr.com/blog/using-hive-and-pig-baseball-statistics
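SignalFX processes more than 70 billion messages per day on a 27-node Kafka cluster. Drawing on that experience of running Kafka at scale, they share tips on tuning Kafka, on alerting (for example on increased log flush latency), and on scaling Kafka out.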
http://www.confluent.io/blog/how-we-monitor-and-run-kafka-at-scale-signalfx
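The data Artisans blog has a post assessing Flink's efficiency, low latency, and correctness for stream processing. To demonstrate efficiency, it runs the recent Yahoo! streaming benchmark under high throughput. On correctness, it highlights Flink's ability to distinguish event time from processing time (using the Star Wars film chronology as an analogy). Finally, it describes in-memory querying planned for a future Flink version.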
http://data-artisans.com/counting-in-streams-a-hierarchy-of-needs/
This tutorial shows how to turn text data from a TCP socket into a Spark Streaming source.
https://medium.com/@anicolaspp/spark-custom-streaming-sources-e7d52da72e80
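The core idea behind such a source, reading newline-delimited text from a socket and handing it downstream in small batches, can be sketched in plain Python. This is only an illustration under stated assumptions and does not use the Spark API: the names `socket_lines`, `batches`, and `demo` are hypothetical, and the fixed-size batches stand in for Spark Streaming's time-based micro-batches.

```python
import socket
from typing import Iterator, List


def socket_lines(conn: socket.socket) -> Iterator[str]:
    """Yield newline-delimited text records from a connected socket until EOF."""
    with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in reader:
            yield line.rstrip("\n")


def batches(records: Iterator[str], size: int) -> Iterator[List[str]]:
    """Group records into lists of at most `size` items.

    Assumption: count-based batches are a simplification; Spark Streaming
    cuts batches by time interval instead.
    """
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def demo() -> List[List[str]]:
    """Send three lines through a local socket pair and collect the batches."""
    sender, receiver = socket.socketpair()
    sender.sendall(b"alpha\nbeta\ngamma\n")
    sender.close()  # closing the sender signals EOF to the reader
    try:
        return list(batches(socket_lines(receiver), size=2))
    finally:
        receiver.close()
```

Calling `demo()` pushes three lines through an in-process socket pair, so no network port is needed to try it out.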
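This article describes how to avoid accidentally committing AWS credentials to a patch or git repository when building Hadoop. Beyond Hadoop itself, it recommends the "git-secrets" tool to prevent accidental commits of access and secret keys. If you use Hadoop with S3, it also points to new patches to evaluate.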
http://steveloughran.blogspot.co.uk/2016/04/testing-against-s3-and-object-stores.html
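The general idea of pattern-based credential scanning can be sketched in a few lines of Python. This is not how git-secrets is implemented; it is a minimal illustration that assumes AWS access key IDs are 20 uppercase alphanumeric characters starting with "AKIA". The function name `find_access_key_ids` is hypothetical, and the sample key is the placeholder used in AWS documentation.

```python
import re
from typing import List, Tuple

# Assumption: an access key ID is "AKIA" followed by 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, match) pairs for strings that look like access key IDs."""
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((number, match.group(1)))
    return hits


# Illustrative input: a settings fragment that should never be committed.
sample = (
    "fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE\n"
    "fs.s3a.endpoint=s3.amazonaws.com\n"
)
hits = find_access_key_ids(sample)
```

A check like this could run over a patch file before it is attached to an issue; a dedicated tool remains the better choice for real repositories.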
Big Data & Brews interviews MapR's Ted Dunning and Jacques Nadeau; Apache Arrow is among the topics covered.
https://www.youtube.com/watch?v=l3mDDKjDjMk
https://www.youtube.com/watch?v=Xo9CO0a0VJI
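Other News
DataEngConf was recently held in San Francisco. This article summarizes talks from Uber, Stripe, Microsoft, Instacart, and Jawbone, and discusses the conference theme: "data science in the real world is a product and engineering discipline".
Hortonworks made a strong showing at last week's Hadoop Summit Europe in Dublin. ZDNet covers the highlights, including an expanded partnership with Pivotal (which now resells HDP), a reseller agreement with Syncsort, and technical previews of Atlas, Ranger, Zeppelin, and Metron. The report also describes how Hortonworks' products differ from those of Cloudera and MapR.
Flink Forward 2016 will be held in Berlin, Germany in September. The call for submissions closes at the end of June.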
http://flink.apache.org/news/2016/04/14/flink-forward-announce.html
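Videos of the talks from Hadoop Summit Dublin have been posted on YouTube. As expected, they cover every part of the Hadoop ecosystem.
Releases
Metascope is a new tool that works with Schedoscope to manage metadata in a Hadoop cluster. Through its web interface it gives insight into large amounts of data, including data lineage. It also offers search, inline documentation, a REST API, and more.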
https://github.com/ottogroup/metascope
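Apache HBase 1.2.1 was released this week, resolving 27 issues on top of 1.2.0. The release announcement highlights four high-priority issues.
The Apache Mahout machine learning library released version 0.12.0. In this version, the "Samsara" math environment adds support for Apache Flink and is platform-independent. The release announcement covers the Flink integration, known issues, and the project roadmap.
Apache Storm 1.0.0 was released this week. Highlights include performance improvements (generally more than 3x), a new distributed cache API, high availability for Nimbus, automatic backpressure, dynamic worker profiling, and more.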
http://storm.apache.org/2016/04/12/storm100-released.html
Apache Kudu (incubating) released version 0.8.0 this week. The release adds an Apache Flume sink, several feature improvements, and a number of bug fixes.
http://getkudu.io/releases/0.8.0/docs/release_notes.html
Cloudbreak, which provisions Docker-based Hadoop clusters in cloud environments, released version 1.2 this week. New features include OpenStack support and custom provisioning scripts for servers.
http://hortonworks.com/blog/announcing-cloudbreak-1-2/
Cloudera released Cloudera Enterprise 5.4.10, which includes Flume, Hadoop, HBase, Hive, Impala, and other components.
Presto Accumulo is a new project that provides a Presto connector for data in Accumulo.
https://github.com/bloomberg/presto-accumulo
Events
China
None
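Issue 165, April 10, 2016
Compiled and translated by 启明星辰 (Venustech), Platform and Big Data Group
This week saw a number of major releases, including new open-source projects from LinkedIn and Airbnb. The technical section of this issue centers on stream processing (Spark, Flink, Kafka, and others), and the news section covers the agendas for Spark Summit and HBaseCon.
Zalando has published an article on how it chose Apache Flink as its stream processing framework. It lays out the conclusions reached after checking candidates against the evaluation criteria and explains the main reasons for choosing Flink: low latency at high throughput, true stream processing, and developer support.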
https://tech.zalando.com/blog/apache-showdown-flink-vs.-spark/
The Cloudera blog carries a post from Wargaming.net on how they built a real-time processing infrastructure with Kafka, HBase, Drools, and Spark. On the data pipeline side, they describe optimizations to HBase lookups and serialization, data locality between HBase and Spark, and Spark computation.
http://blog.cloudera.com/blog/2016/04/inside-wargamings-data-driven-real-time-rules-engine/
InfoQ has posted a video presentation on large-scale stream processing with the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka). It discusses why the SMACK stack is simpler than the Lambda architecture for the same problems.
http://www.infoq.com/presentations/stream-analytics-scalability
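Confluent's "Log Compaction" blog series has a new installment covering what happened in the Kafka project in March. There is plenty of exciting development, including progress on rack awareness, Kerberos support, and time-based indexing, along with other recent work that you (and I) have not had time to follow closely.
Apache Flink 1.0 introduced a new complex event processing (CEP) library. In brief, CEP provides a way to detect patterns in streams of events. Using sensor data collected from data center servers, this article illustrates Flink's CEP pattern API with a possible anomaly-detection use case.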
http://flink.apache.org/news/2016/04/06/cep-monitoring.html
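To make the notion of an event pattern concrete, here is a minimal plain-Python sketch; it does not use Flink or its CEP API. It assumes a hypothetical `Reading` record and flags a rack when two consecutive readings exceed a temperature threshold within a time window. The threshold, the window, and all names are illustrative assumptions, and readings are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Reading:
    rack: int           # hypothetical rack identifier
    timestamp: float    # seconds since an arbitrary start
    temperature: float  # degrees, unit is illustrative


def detect_warnings(
    readings: Iterable[Reading],
    threshold: float = 100.0,
    window: float = 10.0,
) -> List[Tuple[int, float, float]]:
    """Return (rack, first_ts, second_ts) for each pair of consecutive
    over-threshold readings on the same rack at most `window` seconds apart."""
    previous_hot: Dict[int, Reading] = {}
    warnings: List[Tuple[int, float, float]] = []
    for reading in readings:
        if reading.temperature < threshold:
            # A normal reading breaks the "two hot readings in a row" pattern.
            previous_hot.pop(reading.rack, None)
            continue
        previous = previous_hot.get(reading.rack)
        if previous is not None and reading.timestamp - previous.timestamp <= window:
            warnings.append((reading.rack, previous.timestamp, reading.timestamp))
        previous_hot[reading.rack] = reading
    return warnings


stream = [
    Reading(1, 0, 95), Reading(1, 2, 101), Reading(2, 3, 105),
    Reading(1, 5, 103), Reading(1, 7, 90), Reading(1, 9, 102),
    Reading(2, 20, 110),
]
found = detect_warnings(stream)
```

Here only rack 1 triggers a warning: its readings at seconds 2 and 5 are both hot and close together, while rack 2's hot readings are 17 seconds apart. A CEP library expresses the same kind of rule declaratively and handles state and time for you.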
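The Genome Analysis Toolkit (GATK) recently announced that its next version (currently in alpha) will support Apache Spark. This article briefly introduces the tool and shows how Spark is used to detect duplicate DNA fragments.
InfoWorld reports on Spark 2.0's plans for structured streaming. Micro-batching will remain, alongside new features such as Infinite DataFrames and support for repeated queries.
The AWS Big Data Blog has a post on loading data into S3 and Redshift using encryption keys stored in AWS Key Management Service (KMS). Besides the required steps, it explains how to encrypt data in S3 with KMS keys.
The Confluent blog shows how to write a non-trivial "hello world" program with Kafka Connect and Kafka Streams. Specifically, the example pulls Wikipedia data from IRC, parses the messages, and computes several statistics. The post includes code for each step of the implementation.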
http://www.confluent.io/blog/hello-world-kafka-connect-kafka-streams
This article translates a simple schema from Postgres to Cassandra and describes the main differences: replication, data types (Cassandra does not support JSON), primary keys, and eventual consistency.
http://neovintage.org/2016/04/07/data-modeling-in-cassandra-from-a-postgres-perspective/
The ESG blog reports on the recent Strata+Hadoop World conference, with a focus on topics such as Spark's strong momentum, machine learning, and cloud services.
http://blog.esg-global.com/riding-high-at-stratahadoop-world
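InformationWeek also covers the Strata conference, with slides on MapR and Pivotal, artificial intelligence, and more.
The agenda for Spark Summit 2016 has been finalized. The event takes place June 6-8 in San Francisco, with two days of talks across five tracks.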
https://databricks.com/blog/2016/04/04/agenda-announced-for-sparksummit-2016-in-san-francisco.html
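Forbes interviews Cloudera CEO Tom Reilly, who discusses the company's opportunities, the competitive market, IPO plans, and more.
Datanami writes about the rise of Apache Kafka as a backbone for stream processing. The article includes an interview with Confluent co-founder and CTO Neha Narkhede, who talks about the upcoming Kafka Connect and Kafka Streams.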
http://www.datanami.com/2016/04/06/real-time-rise-apache-kafka/
HBaseCon takes place on May 24 in San Francisco, and the agenda was recently announced. There are more than 20 talks across three tracks.
http://blog.cloudera.com/blog/2016/04/hbasecon-2016-speaker-lineup-announced/
Apache HBase 0.98.18 and 1.1.4 were both released recently. 1.1.4 contains a number of fixes, including nine for correctness and related issues. HBase 0.98.18 resolves 50 issues (bugs, improvements, and two new features).
http://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3CCANZa%3DGu-mAxKEtfoRjctHcE0KD7z52oE010Fgsf6AMmW2tDZLA%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3CCA%2BRK%3D_CtZ1L07nS6Og2ekfVwet0qTE7jw-bmyD2pp5UPweUehQ%40mail.gmail.com%3E
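Apache Lens released 2.5.0-beta. As a unified analytics interface, it supports execution engines and data stores across the Hadoop ecosystem. This release resolves 87 issues, mainly bug fixes and new features.
Airbnb has open-sourced Caravel, a data exploration system (data visualization platform). Caravel supports many features usually found only in commercial products and can connect to any system that speaks a SQL dialect. In particular, it supports real-time analytics on Druid.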
https://medium.com/airbnb-engineering/caravel-airbnb-s-data-exploration-platform-15a72aa610e5
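MapR announced support for Apache Drill 1.6 in its distribution. Highlights include a new storage plugin for MapR-DB, support for new SQL window functions, and end-to-end security. The announcement page includes examples of loading data through the MapR-DB API and querying it with Drill.
Apache Flink released 1.0.1, a bug-fix release in the 1.0.x series. It resolves 23 issues, and all 1.0.0 users are encouraged to upgrade.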
http://flink.apache.org/news/2016/04/06/release-1.0.1.html
Cloudera Enterprise 5.7 has been released with upgraded versions of Spark, HBase, Impala, Kafka, and other components. Highlights include Hive-on-Spark and HBase-Spark, both newly promoted from Cloudera Labs, significant Impala performance improvements, and support for the HBase WAL on SSD.
http://blog.cloudera.com/blog/2016/04/cloudera-enterprise-5-7-is-released/
Apache Tajo, a data warehouse system built on Hadoop, released version 0.11.2. The new version adds Kerberos support and fixes Hive support for ORC tables, among other changes.
http://tajo.apache.org/releases/0.11.2/announcement.html
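LinkedIn has open-sourced Dr. Elephant, a tool for diagnosing performance problems in Hadoop and Spark jobs. Using metrics collected from the YARN ResourceManager for completed jobs, Dr. Elephant evaluates them and produces a diagnostic report covering issues such as data skew and GC overhead. LinkedIn says it helps resolve 80% of problems.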
China
None