There are plenty of public datasets available, and their number is growing. This presentation showcases a few of the most recent and useful tools in the Big Data ecosystem: Apache Zeppelin (incubating), Apache Spark and Juju.
Scaling Traffic from 0 to 139 Million Unique Visitors (Yelp Engineering)
This document summarizes the traffic history and infrastructure changes at Yelp from 2005 to the present. It outlines the key milestones and technology changes over time as Yelp grew from handling around 200k searches per day with 1 database in 2005-2007 to serving traffic across 29 countries in 2014 with a distributed, scalable infrastructure utilizing technologies like Elasticsearch, Kafka, and Pyleus for real-time processing.
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz (Databricks)
This document discusses using Hadoop for archiving, e-discovery, and supervision. It outlines the key components of each task and highlights traditional shortcomings. Hadoop provides strengths like speed, ease of use, and security. An architectural overview shows how Hadoop can be used for ingestion, processing, analysis, and machine learning. Examples demonstrate surveillance use cases. While some obstacles remain, partners can help address areas like user interfaces and compliance storage.
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal... (Thoughtworks)
The document discusses Tuplejump, a data engineering startup with a vision to simplify data engineering. It summarizes Tuplejump's big data pipeline platform which collects, transforms, predicts, stores, explores and visualizes data using various tools like Hydra, Spark, Cassandra, MinerBot, Shark, UberCube and Pissaro. It advocates using Scala as the primary language due to its object oriented and functional capabilities. It also discusses advantages of Tuplejump's platform and how tools like Akka, Spark, Play, SBT, ScalaTest, Shapeless and Scalaz are leveraged.
Apache Zeppelin is an emerging open-source tool for data visualization that allows for interactive data analytics. It provides a web-based notebook interface that allows users to write and execute code in languages like SQL and Scala. The tool offers features like built-in visualization capabilities, pivot tables, dynamic forms, and collaboration tools. Zeppelin works with backends like Apache Spark and uses interpreters to connect to different data processing systems. It is predicted to influence big data visualization in the coming years.
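To give a feel for that workflow, here is a minimal sketch of two Zeppelin paragraphs: a Scala (%spark) paragraph that loads a dataset and a %sql paragraph whose result is rendered with the built-in charts, pivot table and a dynamic form. The file path, table and column names are hypothetical, and the sketch assumes a notebook bound to a Spark 2.x interpreter.

```scala
// Paragraph 1 - %spark (Scala): load data and expose it to SQL.
// The path and column names are hypothetical.
val visits = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/public/site_visits.csv")

visits.createOrReplaceTempView("visits")

// Paragraph 2 - %sql: rendered with Zeppelin's chart/pivot toolbar;
// ${minYear=2014} is a Zeppelin dynamic form with a default value.
// %sql
// SELECT country, year, count(*) AS visits
// FROM visits
// WHERE year >= ${minYear=2014}
// GROUP BY country, year
```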
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira (Databricks)
This document discusses patterns for modern data integration using streaming data. It outlines an evolution from data warehouses to data lakes to streaming data. It then describes four key patterns: 1) Stream all things (data) into one place, 2) Keep schemas compatible so downstream consumers can keep processing the data, 3) Enable ridiculously parallel single-message transformations, and 4) Perform streaming data enrichment to add additional context to events. Examples are provided of using Apache Kafka and Kafka Connect to implement these patterns for a large hotel chain integrating various data sources and performing real-time analytics on customer events.
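As an illustration of the first pattern ("stream all things"), the sketch below publishes a single event to a Kafka topic from Scala using the plain producer API. The broker address, topic and payload are hypothetical; the talk itself leans on Kafka Connect and a schema registry rather than hand-rolled producers.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CustomerEventProducer {
  def main(args: Array[String]): Unit = {
    // Broker address, topic name and event payload are hypothetical.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // "Stream all things": every source system publishes its events to one place.
    producer.send(new ProducerRecord[String, String](
      "customer-events", "customer-42", """{"type":"check-in","hotel":"NYC-01"}"""))
    producer.flush()
    producer.close()
  }
}
```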
Large Scale Lakehouse Implementation Using Structured Streaming (Databricks)
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion's technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion's production data lake on AWS executes 4,000+ streaming jobs and hosts over 4,000 tables.
Scala eXchange: Building robust data pipelines in Scala (Alexander Dean)
Over the past couple of years, Scala has become a go-to language for building data processing applications, as evidenced by the emerging ecosystem of frameworks and tools including LinkedIn's Kafka, Twitter's Scalding and our own Snowplow project (https://github.com/snowplow/snowplow).
In this talk, Alex will draw on his experiences at Snowplow to explore how to build rock-solid data pipelines in Scala, highlighting a range of techniques including:
* Translating the Unix stdin/out/err pattern to stream processing
* "Railway oriented" programming using the Scalaz Validation
* Validating data structures with JSON Schema
* Visualizing event stream processing errors in ElasticSearch
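A minimal sketch of the "railway" idea, using the standard library's Either (right-biased in Scala 2.12+) in place of the Scalaz Validation used in the talk: each step either stays on the success track or switches to the failure track, and a failure short-circuits the rest of the pipeline. The event shape and the steps are hypothetical.

```scala
final case class RawEvent(payload: String)
final case class Event(userId: String, action: String)

// Each step returns Right on the success track, Left on the failure track.
def parse(raw: RawEvent): Either[String, Event] =
  raw.payload.split(",") match {
    case Array(user, action) => Right(Event(user, action))
    case _                   => Left(s"Unparseable payload: ${raw.payload}")
  }

def validate(e: Event): Either[String, Event] =
  if (e.userId.nonEmpty) Right(e) else Left("Missing user id")

def enrich(e: Event): Either[String, Event] =
  Right(e.copy(action = e.action.toLowerCase))

// The for-comprehension chains the steps; the first Left stops the pipeline.
def process(raw: RawEvent): Either[String, Event] =
  for {
    parsed    <- parse(raw)
    validated <- validate(parsed)
    enriched  <- enrich(validated)
  } yield enriched

// process(RawEvent("42,Click")) == Right(Event("42", "click"))
// process(RawEvent("garbage"))  == Left("Unparseable payload: garbage")
```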
Alex's talk draws on his experiences working with event streams in Scala over the last two and a half years at Snowplow, and on his recent work writing Unified Log Processing, a Manning book.
Spark Summit EU talk by Oscar Castaneda (Spark Summit)
This document discusses running an Elasticsearch server inside a Spark cluster for development purposes. It presents the problem of Elasticsearch running outside of Spark making development cumbersome. The solution is to start an Elasticsearch server inside the Spark cluster that can be accessed natively from Spark jobs. Data can then be written to and read from this local Elasticsearch server. Additionally, snapshots of the Elasticsearch indices can be stored in S3 for backup and restoration. A demo application is shown that indexes tweets in real-time to the local Elasticsearch server running inside the Spark cluster.
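A minimal sketch of what writing to and reading from such an Elasticsearch server looks like from a Spark job, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the classpath and the server answers on localhost:9200; the index name and document fields are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // implicits from the elasticsearch-hadoop connector

object TweetsToEs {
  def main(args: Array[String]): Unit = {
    // Point the connector at the Elasticsearch node(s); here we assume the
    // local server described in the talk listens on localhost:9200.
    val conf = new SparkConf()
      .setAppName("tweets-to-es")
      .set("es.nodes", "localhost")
      .set("es.port", "9200")
    val sc = new SparkContext(conf)

    // Hypothetical documents; in the demo these would come from a live tweet stream.
    val tweets = sc.makeRDD(Seq(
      Map("user" -> "alice", "text" -> "spark + es = <3"),
      Map("user" -> "bob",   "text" -> "indexing from inside the cluster")
    ))

    tweets.saveToEs("tweets/tweet")          // write to index "tweets", type "tweet"
    val readBack = sc.esRDD("tweets/tweet")  // read documents back as (id, map) pairs
    println(readBack.count())

    sc.stop()
  }
}
```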
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri... (HostedbyConfluent)
Kafka is widely positioned as the proverbial "central nervous system" of the enterprise. In this session, we explore how the central nervous system can be used to build a mesh topology & unified catalog of enterprise wide events, enabling development teams to build event driven architectures faster & better.
The central theme of this topic is also aligned to seeking idioms from API Management, Service Meshes, Workflow management and Service orchestration. We compare how these approaches can be harmonized with Kafka.
We will also touch upon the topic of how this relates to Domain Driven Design, CQRS & other patterns in microservices.
Some potential takeaways for the discerning audience:
1. Opportunities in a platform approach to Event Driven Architecture in the enterprise
2. Adopting a product mindset around Data & Event Streams
3. Seeking harmony with allied enterprise applications
Progress® DataDirect® Spark SQL ODBC and JDBC drivers deliver fast, high-performance connectivity so your existing BI and analytics applications can access Big Data in Apache Spark.
Lessons Learned - Monitoring the Data Pipeline at Hulu (DataWorks Summit)
This document summarizes lessons learned from monitoring the data pipeline at Hulu. It discusses how the initial monitoring approach fell short, both from the users' perspective and in detecting problems. A new approach is proposed that uses a graph data structure for contextual troubleshooting, connecting any issue to its impact on business units and user needs. This approach aims to make troubleshooting easier by querying the relationships between different components and resources. Small independent services would also be easier to create and maintain within this approach.
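A toy sketch of the graph idea (not Hulu's implementation): model pipeline components and the business units they feed as a directed graph, then walk it from a failing component to list everything impacted. All component names are hypothetical.

```scala
object ImpactGraph {
  // Edges: component -> the things that depend on it (hypothetical names).
  val dependents: Map[String, Set[String]] = Map(
    "kafka-ingest"       -> Set("hourly-aggregation"),
    "hourly-aggregation" -> Set("ads-reporting", "recommendations"),
    "ads-reporting"      -> Set("ad-sales-team"),
    "recommendations"    -> Set("homepage")
  )

  // Breadth-first walk downstream from a failing component.
  def impactedBy(failed: String): Set[String] = {
    @annotation.tailrec
    def walk(frontier: Set[String], seen: Set[String]): Set[String] =
      if (frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(dependents.getOrElse(_, Set.empty)) -- seen
        walk(next, seen ++ next)
      }
    walk(Set(failed), Set.empty)
  }

  def main(args: Array[String]): Unit =
    println(impactedBy("kafka-ingest"))  // everything downstream of the ingest failure
}
```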
Taboola's data processing architecture has evolved over time from directly writing to databases to using Apache Spark for scalable real-time processing. Spark allows Taboola to process terabytes of data daily across multiple data centers for real-time recommendations, analytics, and algorithm calibration. Key aspects of Taboola's architecture include using Cassandra for event storage, Spark for distributed computing, Mesos for cluster management, and Zookeeper for coordination across a large Spark cluster.
The story of one project's architecture evolution from zero to a Lambda Architecture. Also includes information on how we scaled the cluster once the architecture was in place.
Contains nice performance charts after every architecture change.
A presentation prepared for Data Stack as a part of their Interview process on July 20.
This short presentation in Ignite format features 10 items that you might not know about the Spark 1.0 release.
This document discusses Pinterest's data architecture and use of Pinball for workflow management. Pinterest processes 3 petabytes of data daily from their 60 billion pins and 1 billion boards across a 2000 node Hadoop cluster. They use Kafka, Secor and Singer for ingesting event data. Pinball is used for workflow management to handle their scale of hundreds of workflows, thousands of jobs and 500+ jobs in some workflows. Pinball provides simple abstractions, extensibility, reliability, debuggability and horizontal scalability for workflow execution.
This document discusses Apache Zeppelin, an open-source web-based notebook that enables interactive data analytics. It provides an overview of Zeppelin's history and architecture, including how interpreters and notebook storage are pluggable. The document also outlines Zeppelin's roadmap for improving enterprise support through features like multi-tenancy, impersonation, job management and frontend performance.
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka... (HostedbyConfluent)
The Apache Kafka ecosystem is very rich with components and pieces that make for designing and implementing secure, efficient, fault-tolerant and scalable event stream processing (ESP) systems. Using real-world examples, this talk covers why Apache Kafka is an excellent choice for cloud-native and hybrid architectures, how to go about designing, implementing and maintaining ESP systems, best practices and patterns for migrating to the cloud or hybrid configurations, when to go with PaaS or IaaS, what options are available for running Kafka in cloud or hybrid environments and what you need to build and maintain successful ESP systems that are secure, performant, reliable, highly-available and scalable.
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi... (Databricks)
Predictive intelligence from machine learning has the potential to change everything in our day-to-day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications, such as those that enhance the user experience, can benefit from real-time, robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
The document discusses building data pipelines in the cloud. It covers serverless data pipeline patterns using services like BigQuery, Cloud Storage, Cloud Dataflow, and Cloud Pub/Sub. It also compares Cloud Dataflow and Cloud Dataproc for ETL workflows. Key questions around ingestion and ETL are discussed, focusing on volume, variety, velocity and veracity of data. Cloud vendor offerings for streaming and ETL are also compared.
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin... (Databricks)
During the last year, the team at IBM Research in Ireland has been using Apache Spark to perform analytics on large volumes of sensor data. These applications need to be executed on a daily basis, so it was essential for them to understand Spark resource utilization. They found it cumbersome to manually consume and efficiently inspect the CSV files for the metrics generated at the Spark worker nodes.
Although using an external monitoring system like Ganglia would automate this process, they were still plagued with the inability to derive temporal associations between system-level metrics (e.g. CPU utilization) and job-level metrics (e.g. job or stage ID) as reported by Spark. For instance, they were not able to trace back the root cause of a peak in HDFS Reads or CPU usage to the code in their Spark application causing the bottleneck.
To overcome these limitations, they developed SparkOscope. To take advantage of the job-level information already available and to minimize source-code pollution, they use the existing Spark Web UI to monitor and visualize job-level metrics of a Spark application (e.g. completion time). More importantly, they extend the Web UI with a palette of system-level metrics for the server/VM/container that each of the Spark job's executors ran on. Using SparkOscope, you can navigate to any completed application and identify application-logic bottlenecks by inspecting plots with in-depth time series for all relevant system-level metrics related to the Spark executors, while easily associating them with the stages, jobs and even source code lines incurring the bottleneck.
They have made SparkOscope available as a standalone module, and also extended the available sinks (MongoDB, MySQL).
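SparkOscope itself plugs into the Web UI and the metrics sinks, but the general mechanism of capturing job-level metrics programmatically can be sketched with a plain SparkListener; this is an illustration of that mechanism, not SparkOscope's code.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Log per-stage timings, which could then be correlated with
// system-level metrics such as CPU utilization or HDFS reads.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    val runtimeMs = for {
      start <- info.submissionTime
      end   <- info.completionTime
    } yield end - start
    println(s"Stage ${info.stageId} '${info.name}' took ${runtimeMs.getOrElse(-1L)} ms")
  }
}

// Usage, given an existing SparkContext `sc`:
//   sc.addSparkListener(new StageTimingListener)
```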
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum... (Spark Summit)
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
MLflow: Infrastructure for a Complete Machine Learning Life Cycle (Databricks)
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
The document profiles Alberto Paro and his experience, including a Master's Degree in Computer Science Engineering from Politecnico di Milano, work as Big Data Practice Leader at NTT DATA Italia, authoring 4 books on ElasticSearch, and expertise in technologies like Apache Spark, Playframework, Apache Kafka, and MongoDB. He is also an evangelist for the Scala and Scala.js languages.
The document then provides an overview of data streaming architectures, popular message brokers like Apache Kafka, RabbitMQ, and Apache Pulsar, streaming frameworks including Apache Spark, Apache Flink, and Apache NiFi, and streaming libraries such as Reactive Streams.
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters... (HostedbyConfluent)
This document discusses streaming data between Confluent Cloud and MongoDB Atlas. It provides an overview of MongoDB Atlas and its fully managed database capabilities in the cloud. It then demonstrates how to stream data from a Python generator application to MongoDB Atlas using Confluent Cloud and its connectors. The presentation concludes by providing a reference architecture for connecting Confluent Platform to MongoDB.
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin (Databricks)
The document discusses preparing data for machine learning by applying data quality techniques in Spark. It introduces concepts of data quality and machine learning data formats. The main part shows how to use User Defined Functions (UDFs) in Spark SQL to automate transforming raw data into the required formats for machine learning, making the process more efficient and reproducible.
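A minimal sketch of the UDF approach, assuming Spark 2.x: register small cleaning functions once and reuse them in SQL so the raw-to-features transformation stays declarative and repeatable. The table, columns and cleaning rules are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object PrepWithUdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("data-prep").getOrCreate()

    // Cleaning functions registered as SQL UDFs (hypothetical rules).
    spark.udf.register("toDouble", (s: String) =>
      Try(s.trim.toDouble).getOrElse(Double.NaN))
    spark.udf.register("toLabel", (s: String) =>
      if (s != null && s.trim.equalsIgnoreCase("yes")) 1.0 else 0.0)

    // Hypothetical raw table with free-text "age" and "churned" columns.
    spark.read.option("header", "true").csv("/data/raw/customers.csv")
      .createOrReplaceTempView("raw_customers")

    // UDFs make the raw -> ML-ready transformation repeatable in plain SQL.
    val features = spark.sql(
      """SELECT toDouble(age) AS age, toLabel(churned) AS label
        |FROM raw_customers""".stripMargin)
    features.show()
  }
}
```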
Amazon Athena is a serverless query service that allows users to run interactive SQL queries on data stored in Amazon S3 without having to load the data into a database. It uses Presto to allow ANSI SQL queries on data in formats like CSV, JSON, and columnar formats like Parquet and ORC. For a dataset of 1 billion rows of sales data stored in S3, Athena queries performed comparably to a basic Redshift cluster and much faster than loading and querying the data in Redshift, making Athena a cost-effective solution for ad-hoc queries on data in S3.
SMACK is a combination of Spark, Mesos, Akka, Cassandra and Kafka. It is used to build a pipelined data architecture for real-time data analysis, putting each technology in the right place to form an efficient data pipeline.
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark (Carolyn Duby)
ODSC East 2017 - How to use Zeppelin and Spark to document your research.
Reproducible research documents not just the findings of a study but the exact code required to produce those findings. Reproducible research is a requirement for study authors to reliably repeat their analysis or accelerate new findings by applying the same techniques to new data. The increased transparency allows peers to quickly understand and compare the methods of the study to other studies and can lead to higher levels of trust, interest and eventually more citations of your work. Big data introduces some new challenges for reproducible research. As our data universe expands and the open data movement grows, more data is available than ever to analyze, and the possible combinations are infinite. Data cleaning and feature extraction often involve lengthy sequences of transformations. The space allotted for publications is not adequate to effectively describe all the details, so they can be reviewed and reproduced by others. Fortunately, the open source community is addressing this need with Apache Spark, Zeppelin and Hadoop. Apache Spark 2.0 makes it even simpler and faster to harness the power of a Hadoop computing cluster to clean, analyze, explore and train machine learning models on large data sets. Zeppelin web-based notebooks capture and share code and interactive visualizations with others. After this session you will be able to create a reproducible data science pipeline over large data sets using Spark, Zeppelin, and a Hadoop distributed computing cluster. Learn how to combine Spark with other supported interpreters to codify your results from cleaning to exploration to feature extraction and machine learning. Discover how to share your notebooks and data with others using the cloud. This talk will cover Spark and show examples, but it is not intended to be a complete tutorial on Spark.
4. Building a Data Product using Apache Zeppelin - Apache Kylin Meetup @Shanghai (Luke Han)
The document discusses building a data product using Apache Zeppelin (incubating) that sends email notifications about new open source projects on GitHub. It outlines the steps to download data from the GitHub archive, explore and filter the data to focus on interesting companies, join additional data from the GitHub API, generate an HTML template to visualize the results, and schedule sending the email notifications.
Data Science at Scale with Apache Spark and Zeppelin Notebook (Carolyn Duby)
How to get started with Data Science at scale using Apache Spark to clean, analyze, discover, and build models on large data sets. Use Zeppelin to record analysis to encourage peer review and reuse of analysis techniques.
Big Data visualization with Apache Spark and Zeppelin (prajods)
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open source tool for data discovery, exploration and visualization. It supports REPLs for shell, SparkSQL, Spark (Scala), Python and Angular. This presentation was made on the Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
The other Apache Technologies your Big Data solution needs (gagravarr)
The document discusses many Apache projects relevant to big data solutions, including projects for loading and querying data like Pig and Gora, building MapReduce jobs like Avro and Thrift, cloud computing with LibCloud and DeltaCloud, and extracting information from unstructured data with Tika, UIMA, OpenNLP, and cTakes. It also mentions utility projects like Chemistry, JMeter, Commons, and ManifoldCF.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
If You Have The Content, Then Apache Has The Technology! (gagravarr)
This document provides a summary of 46 Apache content-related projects and 8 content-related incubating projects. It discusses projects for transforming and reading content like Apache PDFBox, POI, and Tika; projects for text and language analysis like UIMA, OpenNLP, and Mahout; projects that work with structured data and linked data like Any23, Stanbol, and Jena; projects for data management and processing on Hadoop like MRQL, DataFu, and Falcon; projects for serving content like HTTPD Server, TrafficServer, and Tomcat; projects that focus on generating content like OpenOffice, Forrest, and Abdera; and projects for working with hosted content like Chemistry and ManifoldCF.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data... (Geoffrey Fox)
Advances in high-performance/parallel computing in the 1980s and 90s were spurred by the development of quality high-performance libraries, e.g. SCALAPACK, as well as by well-established benchmarks such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics. We hope this will encourage debate.
The document discusses several NoSQL database types including key-value stores, document stores, column family stores, and graph databases. It also covers MapReduce, sharding, MongoDB, Cassandra, HBase, BigTable, DynamoDB, Riak, CouchDB, and Neo4j. Additional topics include Hive, Pig, Flume, R, Lucene, Solr, Mahout, ZooKeeper, serialization formats like JSON, BSON, Thrift and Avro.
Architecting an Open Source AI Platform 2018 edition (David Talby)
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
AsterixDB is an open source "Big Data Management System" (BDMS) that provides flexible data modeling, efficient query processing, and scalable analytics on large datasets. It uses a native storage layer built on LSM trees with indexing and transaction support. The Asterix Data Model (ADM) supports semistructured data types like records, lists, and bags. Queries are written in the Asterix Query Language (AQL) which supports features like spatial and temporal predicates. AsterixDB is being used for applications like social network analysis, education analytics, and more.
This document discusses enabling Apache Zeppelin and Spark for data science in the enterprise. It outlines current issues with Zeppelin and Spark integration including secure data access, multi-tenancy, and fault tolerance. It then describes how using Livy Server as a session management service solves these issues by providing secure, isolated sessions for each user. The document concludes by covering near term improvements like session management and long term goals like controlled sharing and model deployment.
Apache Zeppelin and Spark for Enterprise Data Science (Bikas Saha)
Apache Zeppelin and Spark are turning out to be useful tools in the toolkit of the modern data scientist when working on large scale datasets for machine learning. Zeppelin makes Big Data accessible with minimal effort using web browser based notebooks to interact with data in Hadoop. It enables data scientists to interactively explore and visualize their data and collaborate with others to develop models. Zeppelin has great integration with Apache Spark that delivers many machine learning algorithms out of the box to Zeppelin users as well as providing a fast engine to run custom machine learning on Big Data. The talk will describe the latest in Zeppelin and focus on how it has been made ready for the enterprise. With support for secure Hadoop clusters, LDAP/AD integration, user impersonation and session separation, Zeppelin can now be confidently used in secure and multi-tenant enterprise domains.
This introductory-level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in open source.
With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4th generation of Big Data analytics frameworks, providing the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is Apache Flink stack and how it fits into the Big Data ecosystem?
2. How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment?
3. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark.
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
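For a concrete taste of the API, here is the canonical batch word count in Flink's Scala API; a minimal sketch, assuming Flink 1.x with the Scala API dependency on the classpath.

```scala
import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Tiny in-memory input; in practice this would be readTextFile(...) on HDFS.
    val text = env.fromElements("to be or not to be")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)   // group by the word
      .sum(1)       // sum the counts

    counts.print()
  }
}
```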
Report based on the findings of quantitative research conducted by the research agency New Image Marketing Group, commissioned by the NGO Detector Media, compiled by PhD in Sociology Marta Naumova.
apidays New York 2025 - Building Green Software by Marissa Jasso & Katya Drey... (apidays)
Building Green Software: How Cloud-Native Platforms Can Power Sustainable App Development
Katya Dreyer-Oren, Lead Software Engineer at Heroku (Salesforce)
Marissa Jasso, Product Manager at Heroku (Salesforce)
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool... (apidays)
Boost API Development Velocity with Practical AI Tooling
Sumit Amar, VP of Engineering at WEX
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R... (apidays)
Spring Modulith Design for Microservices
Renjith Ramachandran, Senior Solutions Architect at BJS Wholesale Club
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays Singapore 2025 - What exactly are AI Agents by Aki Ranin (Earthshots ... (apidays)
What exactly are AI Agents?
Aki Ranin, Head of AI at Earthshots Collective | Deep Tech & AI Investor | 2x Founder | Published Author
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf (Anass Nabil)
AI chatbot: design of a multilingual AI assistant to optimize agricultural practices in Morocco.
Delivery service status checking.
Mobile architecture + orchestrator LLM + expert agents (RAG, weather, sensors).
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri... (apidays)
Unifying OpenAPI & AsyncAPI: Designing JSON Schemas+Examples for Reuse
Naresh Jain, Co-founder & CEO at Specmatic
Hari Krishnan, Co-founder & CTO at Specmatic
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
At Opsio, we specialize in delivering advanced cloud services that enable businesses to scale, transform, and modernize with confidence. Our core offerings focus on cloud management, digital transformation, and cloud modernization — all designed to help organizations unlock the full potential of their technology infrastructure. We take a client-first approach, blending industry-leading hosted technologies with strategic expertise to create tailored, future-ready solutions. Leveraging AI, automation, and emerging technologies, our services simplify IT operations, enhance agility, and accelerate business outcomes. Whether you're migrating to the cloud or optimizing existing cloud environments, Opsio is your partner in achieving sustainable, measurable success.
SAP Extended Warehouse Management (EWM) is a part of SAP S/4HANA offering advanced warehouse and logistics capabilities. It enables efficient handling of goods movement, storage, and inventory in real-time.
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps... (apidays)
Why an SDK is Needed to Protect APIs from Mobile Apps
Pearce Erensel, Global VP of Sales at Approov Mobile Security
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda) (apidays)
Two tales of API Change Management from my time at Google
Eric Koleda, Developer Advocate at Coda
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ... (apidays)
The FINOS Common Domain Model for Capital Markets
Tom Healey, Founder & Director at FINXIS LLC
Daniel Schwartz, Managing Partner at FT Advisory LLC
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
PSUG 7 - 2025-06-03 - David Bianco on Splunk SURGe (Tomas Moser)
Meet David Bianco, Staff Strategist with Splunk’s elite SURGe team, live in Prague. Get ready for an engaging deep dive into the cutting edge of cybersecurity—straight from the experts driving Splunk’s global security research.
Tableau Cloud - what to consider before making the move update 2025.pdf (elinavihriala)
Thinking of moving your data infrastructure to the cloud? This presentation will break down the critical things to consider—performance, security, scalability, and those "gotchas" nobody talks about. Think of this as your roadmap to a successful (and smooth!) migration.
2. Alexander
Software Engineer at NFLabs, Seoul, South Korea
Co-organizer of SeoulTech Society
Committer and PPMC member of Apache Zeppelin (Incubating)
@seoul_engineer
github.com/bzz
4. CONTEXT
The size of even public data is huge and growing
More research, applications and data products could be built using that data
The quality and number of free tools available to the public for crunching that data are constantly improving
Cloud brings affordable computation at scale
19. TOOLS OVERVIEW
Generic: Grep, Python, Ruby, JVM - all good, but hard to scale beyond a single machine or data format
High-performance: MPI, Hadoop, HPC - awesome but complex, not easy, problem-specific, not very accessible
New, scalable: Spark, Flink, Zeppelin - easy (not simple) and robust
21. Apache Software Foundation
1999 - 21 founders of 1 project
2016 - 9-member Board of Directors
600 Foundation members
2000+ committers across 171 projects (+55 incubating)
Keywords: meritocracy, community over code, consensus
Provides: infrastructure, legal support, a way of building software
http://www.apache.org/foundation/
22. NEW, SCALABLE TOOLS
Apache Spark: Scala, Python, R
Apache Zeppelin: modern web GUI, plays nicely with Spark, Flink, Elasticsearch, etc. Easy to set up.
Warcbase: Spark library for saved crawl data (WARC)
Juju: scales; integration with Spark, Zeppelin, AWS, Ganglia
23. APACHE SPARK
From Berkeley AMPLab, since 2010
Databricks founded in 2013; joined Apache in 2014
1000+ contributors
REPL + Java, Scala, Python, R APIs
http://spark.apache.org
24. APACHE SPARK
Parallel collections API (similar to FlumeJava, Crunch, Cascading)
Has much more: GraphX, MLlib, SQL
https://spark.apache.org/examples.html
http://spark.apache.org
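To make the "parallel collections" point concrete, here is the classic word count written against the RDD API; a minimal sketch, with a hypothetical input path.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))

    // The input path is hypothetical; any text dataset works.
    val counts = sc.textFile("hdfs:///data/public/commoncrawl-sample.txt")
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // the work is distributed across the cluster

    counts.take(20).foreach(println)
    sc.stop()
  }
}
```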
25. APACHE ZEPPELIN (INCUBATING)
• Notebook-style GUI on top of a backend processing system
• Plays nicely with the whole ecosystem: Spark, Flink, SQL, Elasticsearch, etc.
• Easy to set up
http://zeppelin.incubator.apache.org
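A single hypothetical Zeppelin paragraph under the %spark (Scala) interpreter, showing how a result handed to the ZeppelinContext is rendered with the built-in table and chart toolbar; `spark`, `sc` and `z` are provided by the interpreter (assuming a Spark 2.x binding), and the dataset is made up.

```scala
// One paragraph under the %spark (Scala) interpreter.
// The path and the "language" column are hypothetical.
val languages = spark.read
  .option("header", "true")
  .csv("/data/github/repos.csv")
  .groupBy("language")
  .count()

z.show(languages)   // rendered with Zeppelin's table / chart toolbar
```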
30. WARCBASE
Spark library for WARC (Web ARChive) data processing
* text analysis
* site link structure
https://github.com/lintool/warcbase
http://lintool.github.io/warcbase-docs
31. JUJU
Service modeling at scale
Deployment & configuration automation
+ Integration with Spark, Zeppelin, Ganglia, etc.
+ AWS, GCE, Azure, LXC, etc.
https://jujucharms.com/