Overview of Big Data Technology

Contact

R20/Consultancy

+31 70 3978466

Title: Overview of Big Data and Fast Data Technology

Subtitle: Hadoop, Spark, Kafka, NoSQL, and many more

Introduction

The rise of big data brought a tsunami of new technologies for data storage, processing, and transportation. Hadoop, Spark, Kafka, NoSQL, MapReduce, Hive, and SQL-on-Hadoop are just a few of the countless technologies that have become available for developing big data systems. And with streaming data and the Internet of Things, fast data has attracted the attention of many organizations as well.

There are so many new technologies, but which ones do you pick? Due to this waterfall of new technologies, it’s becoming harder and harder for organizations to select the right tools. Which technologies are relevant? Are they mature? What are their use cases? These are all valid questions, but they are difficult to answer.

This seminar gives a clear, extensive, and critical overview of all the new key technologies for developing big data and fast data systems. Technologies are explained, market overviews are presented, strengths and weaknesses are discussed, and guidelines and best practices are given. It’s the perfect update for those interested in knowing how to develop big data and fast data systems.

Subjects

1. Introduction to Big Data and Fast Data

New analytical needs, including data science, investigative analytics, and streaming analytics
Deploying big data to get a competitive advantage
Differences between semi-structured, poly-structured, multi-structured, and unstructured data
Examples of big data: sensor data, (micro-)event data, and clickstream data
Fast data = big data + fast analytics + fast reactions
The importance of scalability and query performance

2. The World of Hadoop, NoSQL, and Spark Explained

The Hadoop stack: HDFS, MapReduce, Hive, Spark, HBase, YARN, ZooKeeper, Pig, HCatalog, and so on
Alternative implementations from MapR, Amazon (Hadoop as a service), and ScaleOut (in-memory Hadoop)
MapReduce or Spark for analytics and reporting?
Classification of NoSQL products: key-value stores, document stores, column-family stores, and graph data stores
Market overview including: Apache HBase, Cassandra, CouchDB, Cloudera, DataStax, MongoDB, Neo4j, and Riak
Using Spark for big in-memory analytical processing
The interfaces of Spark: SQL, R, Scala, Python
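
The four NoSQL categories named above can be illustrated with a minimal sketch that models one record in each style using plain Python structures. The customer record, the key names, and the `neighbors` helper are hypothetical and chosen only for illustration; no real NoSQL product or API is used here.

```python
import json

# Key-value store: the value is opaque to the store; the application decodes it.
key_value = {"customer:42": b'{"name": "Ann", "city": "Utrecht"}'}
city_from_kv = json.loads(key_value["customer:42"])["city"]

# Document store: the value is a nested document whose fields can be addressed.
document = {
    "_id": 42,
    "name": "Ann",
    "address": {"city": "Utrecht"},
    "orders": [1001, 1002],
}

# Column-family store: row key -> column family -> column -> value.
column_family = {
    "42": {
        "profile": {"name": "Ann", "city": "Utrecht"},
        "orders": {"1001": "shipped", "1002": "open"},
    }
}

# Graph store: nodes plus typed edges between them.
nodes = {42: {"label": "Customer", "name": "Ann"}, 1001: {"label": "Order"}}
edges = [(42, "PLACED", 1001)]


def neighbors(node_id, edge_type):
    """Return the targets of all edges of the given type leaving node_id."""
    return [dst for src, typ, dst in edges if src == node_id and typ == edge_type]
```

The same facts appear in every variant; what differs is how much of the structure the store itself can see and query.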

2. Overview of Analytical SQL Database Servers

Are classic SQL database servers more suitable for data warehousing?
Important performance-improving features: column-oriented storage, in-database analytics
Market overview of analytical SQL database servers, including Apache Greenplum, Exasol, HP Vertica, IBM PureData Systems for Analytics, InfoBright, JustOneDB, Kognitio WX2, Microsoft PDW, Oracle In-Memory, SAP HANA and Sybase IQ, SnowflakeDB, Teradata Appliance, and Teradata Aster Database

3. Big SQL Solutions: SQL-on-Hadoop, NewSQL, and Analytical SQL Database Servers

How mature are the current SQL-on-Hadoop engines?
Market overview of SQL-on-Hadoop engines, including Apache Drill, Apache Hive, Apache Phoenix, Cloudera Impala, HP Vertica, JethroData, Spark SQL, and Splice Machine
Classification of analytical SQL database servers
The pros and cons of column-based data storage
What is in-database analytics and what's the relationship with Google’s MapReduce?
Market overview of analytical database servers, including Apache Greenplum, Exasol, HP Vertica, IBM PureData Systems for Analytics, InfoBright, JustOneDB, Kognitio WX2, Microsoft PDW, Oracle In-Memory, SAP HANA and Sybase IQ, SnowflakeDB, Teradata Appliances, and Teradata Aster Database
NewSQL means high-performance transaction-oriented SQL systems
Simpler transaction mechanisms to scale out
Market overview of NewSQL systems, including Akiban, Clustrix, GenieDB, NuoDB, and VoltDB
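
The trade-off behind column-based storage can be sketched in a few lines. This is only an illustration with made-up sales data and a hypothetical `fetch_record` helper, not how any of the products listed above are implemented: an aggregate over one column touches a single list in the columnar layout, while reassembling a full record has to visit every column.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.0},
    {"id": 3, "region": "EU", "amount": 5.0},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Pro: an analytical aggregate reads only the column it needs.
total_from_columns = sum(columns["amount"])

# The row layout must walk every full record to compute the same total.
total_from_rows = sum(row["amount"] for row in rows)


def fetch_record(cols, index):
    """Con: rebuilding one complete record touches every column."""
    return {name: values[index] for name, values in cols.items()}
```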

4. Technologies for Fast Data and Streaming Analytics

The key use case for fast data: the Internet of Things (IoT)
IoT implies streaming data and fast analysis of data - analytics at the speed of business
IoT devices: smartphones and smartwatches, RFID sensors, machines, general sensors, cameras, pacemakers, and so on
The challenge: real-time reactions to streaming data
The difference between big data and fast big data
Technologies for streaming data: Apache Kafka, Apache ActiveMQ, Amazon Kinesis, Kestrel, RabbitMQ, and ZeroMQ
Differences between these new technologies and traditional message queuing products
Products for big data streaming: Apache Storm and Flink, IBM InfoSphere Streams, Informatica for Streaming Analytics, Software AG Apama, and Spark Streaming
How to integrate fast data with the enterprise data warehouse?
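
Reacting to streaming data as it arrives can be sketched with a sliding-window average that raises an alert while events are still flowing. This is a minimal, single-process illustration; the `alerts` function, the window size, the threshold, and the sensor readings are all assumptions, and a real deployment would read from one of the streaming products listed above rather than from a Python list.

```python
from collections import deque


def alerts(events, window_size=3, threshold=80.0):
    """Yield (timestamp, average) whenever the sliding-window average
    of the most recent readings exceeds the threshold."""
    window = deque(maxlen=window_size)  # oldest reading drops out automatically
    for timestamp, value in events:
        window.append(value)
        if len(window) == window_size:
            average = sum(window) / window_size
            if average > threshold:
                yield timestamp, average


# Hypothetical sensor readings as (timestamp, temperature) pairs.
readings = [(1, 70.0), (2, 75.0), (3, 85.0), (4, 90.0), (5, 95.0)]
alert_times = [timestamp for timestamp, _ in alerts(readings)]
```

Because `alerts` is a generator, each event is processed once and only the last `window_size` readings are kept in memory, which is the property that lets this style of analysis keep up with an unbounded stream.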

5. Developing Data Lakes with Big Data Technology

What is a data lake?
Which technologies are suitable for developing data lakes?
Is it realistic to develop one large physical data lake containing big data?
Developing a virtual or logical data lake with data virtualization servers
How to deal with technical and business metadata?

6. Data Science, Big Data Technology, and the Data Warehouse

What is data science and why is it different from analytics?
What do MapReduce and Spark have to offer data scientists?
Can we use popular BI tools, such as QlikView and Tableau, together with Spark?
Hadoop as sandbox for advanced forms of analytics
The value of graph databases, such as AllegroGraph, InfiniteGraph, and Neo4j, for data science

7. Data Modeling for Big Data, Hadoop, and NoSQL

Explanation of non-relational concepts, such as column families, hierarchies, sets, and lists
Is storing unstructured and semi-structured data really more flexible?
The differences between schema-on-read and schema-on-write
Rules for transforming classic data models to NoSQL concepts
Application needs influence database design
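
The difference between schema-on-write and schema-on-read can be made concrete with a small sketch. The `SCHEMA` mapping and both function names are hypothetical: in the first approach a record is validated before it is stored, in the second the raw text is stored untouched and structure is imposed only when the data is read.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"id": int, "name": str}


def write_validated(store, record):
    """Schema-on-write: reject non-conforming records before storing them."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} is missing or not {expected_type.__name__}")
    store.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: raw lines were stored as-is; interpret them at query time."""
    for line in raw_lines:
        doc = json.loads(line)
        # Structure is applied now: cast the id, default a missing name,
        # and ignore any fields this particular reader does not care about.
        yield {"id": int(doc["id"]), "name": str(doc.get("name", ""))}
```

With schema-on-read, a line such as `{"id": "2", "extra": true}` is accepted at load time and only interpreted when queried, whereas `write_validated` would refuse the equivalent record. That is the flexibility the approach offers, at the price of moving the validation work to every reader.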

8. Concluding Remarks

Related Whitepapers:

SQL Syntax for Apache Drill; Using SQL for the SQL-on-Everything Engine; December 2015; sponsored by DZone

How Drill Enriches Self-Service Analytics; The Added Value of a SQL-on-Everything Engine; November 2015; sponsored by MapR Technologies

SQL-on-Hadoop Engines Explained; May 2014; sponsored by MapR Technologies

SAP HANA and Data Virtualization: Competitors or Complements?; September 2012; sponsored by Cisco (Composite Software)

Mixed, Shifting, and High-Concurrency Workloads in Data Warehouse Systems; July 2012; sponsored by Teradata Corporation

Using SQL-MapReduce for Advanced Analytical Queries - Second Edition; September 2011; sponsored by Teradata

InfiniteGraph: Extending Business, Social, and Government Intelligence with Graph Analytics; September 2010; sponsored by InfiniteGraph