Contact

R20/Consultancy

+31 70 3978466

info@r20.nl

Title: Overview of Big Data and Fast Data Technology

Subtitle: Hadoop, Spark, Kafka, NoSQL, and many more

Introduction

The rise of big data brought a tsunami of new technologies for data storage, processing, and transportation. Hadoop, Spark, Kafka, NoSQL, MapReduce, Hive, and SQL-on-Hadoop are just a few of the countless technologies that have become available for developing big data systems. And with streaming data and the Internet of Things, fast data has attracted the attention of many organizations as well.

But which of these new technologies do you pick? Due to this flood of new technologies, it’s becoming harder and harder for organizations to select the right tools. Which technologies are relevant? Are they mature? What are their use cases? These are all valid questions, but they are difficult to answer.

This seminar gives a clear, extensive, and critical overview of all the new key technologies for developing big data and fast data systems. Technologies are explained, market overviews are presented, strengths and weaknesses are discussed, and guidelines and best practices are given. It’s the perfect update for those interested in knowing how to develop big data and fast data systems.

Subjects

1. Introduction to Big Data and Fast Data

  • New analytical needs, including data science, investigative analytics, and streaming analytics
  • Deploying big data to get a competitive advantage
  • Differences between semi-structured, poly-structured, multi-structured, and unstructured data
  • Examples of big data: sensor data, (micro-)event data, and clickstream data
  • Fast data = big data + fast analytics + fast reactions
  • The importance of scalability and query performance

2. The World of Hadoop, NoSQL, and Spark Explained

  • The Hadoop stack: HDFS, MapReduce, Hive, Spark, HBase, YARN, ZooKeeper, Pig, HCatalog, and so on
  • Alternative implementations from MapR, Amazon (Hadoop as a service), and ScaleOut (in-memory Hadoop)
  • MapReduce or Spark for analytics and reporting?
  • Classification of NoSQL products: key-value stores, document stores, column-family stores, and graph data stores
  • Market overview, including Apache HBase, Cassandra, CouchDB, Cloudera, DataStax, MongoDB, Neo4j, and Riak
  • Using Spark for big in-memory analytical processing
  • The interfaces of Spark: SQL, R, Scala, Python

3. Big SQL Solutions: SQL-on-Hadoop, NewSQL, and Analytical SQL Database Servers

  • Are classic SQL database servers more suitable for data warehousing?
  • Important performance-improving features: column-oriented storage and in-database analytics
  • How mature are the current SQL-on-Hadoop engines?
  • Market overview of SQL-on-Hadoop engines, including Apache Drill, Apache Hive, Apache Phoenix, Cloudera Impala, HP Vertica, JethroData, Spark SQL, and Splice Machine
  • Classification of analytical SQL database servers
  • The pros and cons of column-based data storage
  • What is in-database analytics and what's the relationship with Google’s MapReduce?
  • Market overview of analytical database servers, including Apache Greenplum, Exasol, HP Vertica, IBM PureData Systems for Analytics, InfoBright, JustOneDB, Kognitio WX2, Microsoft PDW, Oracle In-Memory, SAP HANA and Sybase IQ, SnowflakeDB, Teradata Appliances, and Teradata Aster Database
  • NewSQL means high-performance transaction-oriented SQL systems
  • Simpler transaction mechanisms to scale out
  • Market overview of NewSQL systems, including Akiban, Clustrix, GenieDB, NuoDB, and VoltDB

4. Technologies for Fast Data and Streaming Analytics

  • The key use case for fast data: the Internet of Things (IoT)
  • IoT implies streaming data and fast analysis of data - analytics at the speed of business
  • IoT devices: smartphones, smartwatches, RFID sensors, machines, general sensors, cameras, pacemakers, and so on
  • The challenge: real-time reactions to streaming data
  • The difference between big data and fast big data
  • Technologies for streaming data: Apache Kafka, Apache ActiveMQ, Amazon Kinesis, Kestrel, RabbitMQ, and ZeroMQ
  • Differences between these new technologies and traditional message queuing products
  • Products for big data streaming: Apache Storm and Flink, IBM InfoSphere Streams, Informatica for Streaming Analytics, Software AG Apama, and Spark Streaming
  • How to integrate fast data with the enterprise data warehouse?

5. Developing Data Lakes with Big Data Technology

  • What is a data lake?
  • Which technologies are suitable for developing data lakes?
  • Is it realistic to develop one large physical data lake containing big data?
  • Developing a virtual or logical data lake with data virtualization servers
  • How to deal with technical and business metadata?

6. Data Science, Big Data Technology, and the Data Warehouse

  • What is data science and why is it different from analytics?
  • What do MapReduce and Spark have to offer data scientists?
  • Can we use popular BI tools, such as QlikView and Tableau, together with Spark?
  • Hadoop as sandbox for advanced forms of analytics
  • The value of graph databases, such as AllegroGraph, InfiniteGraph, and Neo4j, for data science

7. Data Modeling for Big Data, Hadoop, and NoSQL

  • Explanation of non-relational concepts, such as column families, hierarchies, sets, and lists
  • Is storing unstructured and semi-structured data really more flexible?
  • The differences between schema-on-read and schema-on-write
  • Rules for transforming classic data models to NoSQL concepts
  • Application needs influence database design

8. Concluding Remarks

Related Whitepapers:

SQL Syntax for Apache Drill; Using SQL for the SQL-on-Everything Engine; December 2015; sponsored by DZone

How Drill Enriches Self-Service Analytics; The Added Value of a SQL-on-Everything Engine; November 2015; sponsored by MapR Technologies

SQL-on-Hadoop Engines Explained; May 2014; sponsored by MapR Technologies

SAP HANA and Data Virtualization: Competitors or Complements?; September 2012; sponsored by Cisco (Composite Software)

Mixed, Shifting, and High-Concurrency Workloads in Data Warehouse Systems; July 2012; sponsored by Teradata Corporation

Using SQL-MapReduce for Advanced Analytical Queries - Second Edition; September 2011; sponsored by Teradata

InfiniteGraph: Extending Business, Social, and Government Intelligence with Graph Analytics; September 2010; sponsored by InfiniteGraph