Mining Public Datasets
Using Open Source Tools
At scale & on a budget
by Alexander
Software Engineer at NFLabs, Seoul,
South Korea

Co-organizer of SeoulTech Society

Committer and PPMC member of
Apache Zeppelin (Incubating)

@seoul_engineer
github.com/bzz
IS DATA IMPORTANT?
Content Streaming Services
Taxi
Housing
Web Search
Kuaidi
OpenHouse
IoT
CONTEXT
Even public data is huge, and it keeps growing

More research, applications, and data products
could be built using that data

The quality and number of free tools available to the
public for crunching that data are constantly improving

The cloud brings affordable computation at scale
PUBLIC DATA = OPPORTUNITY
AGENDA
Datasets
Tools
Scale and Budget
• Internet archives

• Web application logs (Wikipedia activity, GitHub activity)

• Genome

• AdClicks

• Webserver access logs 

• Network traffic

• Scientific datasets

• Images, Songs (Million Song Dataset)

• Reviews

• Social media

• Flight timetables

• Taxis

• n-gram language model
DATASETS
DATASETS
• 300 GB compressed

• A collaboration between Google and GitHub engineers

• Events on pull requests, repos, issues, and comments, in JSON
http://githubarchive.org
http://www.commitlogsfromlastnight.com/
http://sideeffect.kr/popularconvention/
https://www.gitlive.net/
http://zoom.it/kCsU
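As a rough illustration of working with JSON event records like these, the sketch below counts events per type. It assumes one JSON object per line with a top-level "type" field; the three sample records are hypothetical and reduced to the fields used here, and the function name is made up for this example.

```python
import json
from collections import Counter

# Hypothetical sample: three events, one JSON object per line,
# reduced to the fields used below. Real records carry many more fields.
SAMPLE_LINES = [
    '{"type": "PushEvent", "repo": {"name": "apache/zeppelin"}}',
    '{"type": "IssuesEvent", "repo": {"name": "apache/spark"}}',
    '{"type": "PushEvent", "repo": {"name": "apache/spark"}}',
]


def count_event_types(lines):
    """Count events per "type" field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Events without a "type" field are grouped under "unknown".
        counts[event.get("type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    for event_type, n in count_event_types(SAMPLE_LINES).most_common():
        print(event_type, n)
```

The same function works on any iterable of lines, so it can be pointed at an open file once a real archive has been downloaded and decompressed.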
Common Crawl
https://commoncrawl.org
Nonprofit, by Factual

On AWS S3 in WARC, WAT, and WET formats

Since 2013, monthly: ~150 TB compressed, 2+ billion URLs
URL Index by Ilya Kreymer of @webrecorder_io 

http://index.commoncrawl.org/
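A minimal sketch of how a lookup against the URL index might be prepared and its answer parsed. It assumes the index accepts "url" and "output=json" query parameters on a per-crawl endpoint and answers with one JSON object per line; the collection name "CC-MAIN-2015-48" is only an example, and both function names are hypothetical. No network call is made here.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "http://index.commoncrawl.org"


def build_index_query(collection, url_pattern):
    """Build a query URL asking one crawl's index for captures of url_pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{params}"


def parse_index_response(body):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Example collection name; pick a real one from the index home page.
    print(build_index_query("CC-MAIN-2015-48", "example.com/*"))
```

Each parsed record is expected to point at a file name, offset, and length, which is what lets a job fetch a single page instead of a whole crawl segment.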
https://about.commonsearch.org
Datasets
Tools
Scale and Budget
TOOLS OVERVIEW
Generic:
Grep, Python, Ruby, JVM - all good, but hard to
scale beyond single machine or data format

High-performance:

MPI, Hadoop, HPC - awesome but complex, not
easy, problem specific, not very accessible

New, scalable:
Spark, Flink, Zeppelin - easy (not simple) and
robust
Apache Software Foundation
http://www.apache.org/foundation/
1999 - 21 founders of 1 project

2016 - 9 members on the Board of Directors

600 Foundation members

2000+ committers of 171 projects (+55 incubating)
Keywords: meritocracy, community over code, consensus
Provide: infrastructure, legal support, way of building
software
Apache Spark
Scala, Python, R

Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink,
Elasticsearch, etc. Easy to set up.

Warcbase
Spark library for saved crawl data (WARC)

Juju
Scales, integration with Spark, Zeppelin, AWS, Ganglia

NEW, SCALABLE TOOLS
APACHE SPARK
From the UC Berkeley AMPLab, since 2010

Its creators founded Databricks in 2013; an Apache
top-level project since 2014

1000+ contributors 

REPL + Java, Scala, Python, R APIs
http://spark.apache.org
APACHE SPARK
Has much more: GraphX, MLlib, SQL

https://spark.apache.org/examples.html
http://spark.apache.org
Parallel collections API (similar to FlumeJava, Crunch, Cascading)
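To give a feel for the parallel-collections style, here is a plain-Python sketch of a word count built from flat-map and reduce-by-key steps. It does not use Spark: flat_map, reduce_by_key, and word_count are hypothetical local helpers that run on one machine, named after the flatMap and reduceByKey operations that Spark RDDs offer.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Apply func to each item and flatten the resulting iterables."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines):
    """Count lowercase words across lines using the two helpers above."""
    words = flat_map(str.split, lines)
    pairs = [(word.lower(), 1) for word in words]
    return reduce_by_key(lambda a, b: a + b, pairs)


if __name__ == "__main__":
    print(word_count(["Spark and Zeppelin", "spark and Flink"]))
```

The appeal of this style is that the pipeline reads the same whether the collection is a local list or a dataset partitioned across a cluster; only the engine underneath changes.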
• Notebook style GUI on top of backend
processing system

• Plays nicely with the whole ecosystem: Spark,
Flink, SQL, Elasticsearch, etc.

• Easy to set up
APACHE ZEPPELIN (INCUBATING)
http://zeppelin.incubator.apache.org
12.2012 Commercial app using AMP Lab Shark 0.5
08.2013 NFLabs internal project (Hive/Shark)
10.2013 Prototype (Hive/Shark)
12.2014 Enters ASF incubation
01.2016 3 major releases
APACHE ZEPPELIN PROJECT TIMELINE
http://zeppelin.incubator.apache.org
Interactive Visualization
APACHE ZEPPELIN
Pluggable Interpreters
APACHE ZEPPELIN
Spark library for WARC (Web ARChive) data processing

* text analysis

* site link structure

WARCBASE
https://github.com/lintool/warcbase
http://lintool.github.io/warcbase-docs
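The site link structure analysis mentioned above can be sketched on a single page with the Python standard library. This is not the Warcbase API: LinkCollector and domain_links are hypothetical names, and the HTML is assumed to have already been pulled out of a WARC response record.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_links(page_url, html):
    """Return a Counter of (source host, target host) pairs for one page."""
    collector = LinkCollector()
    collector.feed(html)
    source = urlparse(page_url).netloc
    edges = Counter()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before taking the host.
        target = urlparse(urljoin(page_url, href)).netloc
        if target:  # skips links without a host, such as mailto:
            edges[(source, target)] += 1
    return edges
```

Summing these per-page counters over a whole crawl gives a host-to-host link graph, which is the kind of aggregate a cluster job would compute.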
Service modeling at scale

Deployment/configuration automation

+ Integration with Spark, Zeppelin, Ganglia, etc

+ AWS, GCE, Azure, LXC, etc

JUJU
https://jujucharms.com/
$ apt-get install juju-core juju-quickstart
# or
$ brew install juju juju-quickstart
$ juju generate-config
#LXC, AWS, GCE, Azure, VMWare, OpenStack
$ juju bootstrap
$ juju quickstart apache-hadoop-spark-zeppelin
$ juju expose spark zeppelin
$ juju add-unit -n4 slave
JUJU
http://bigdata.juju.solutions/getstarted
7 node cluster designed to scale out
Datasets
Tools
Scale and Budget
1 core
10s of PCs
1000 instances
APPROACH: SCALE AND BUDGET
Prototype - your laptop

Estimate the cost - AWS spot instances

Scale out - deployment automation
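The "estimate the cost" step is mostly arithmetic: time a prototype run on a sample, then extrapolate. The sketch below is one way to do that under strong assumptions that are not from the talk: throughput scales linearly with the number of instances, each instance performs like the prototype machine, and billing is per whole hour. The function name and the price used in the example are hypothetical.

```python
import math


def estimate_cluster_cost(dataset_gb, sample_gb, sample_seconds,
                          instances, price_per_instance_hour):
    """Extrapolate a single-machine prototype run to a cluster.

    Assumes linear scaling and whole-hour billing, which real jobs and
    real price lists rarely match exactly; treat the result as a rough
    lower bound, not a quote.
    """
    if sample_gb <= 0 or sample_seconds <= 0 or instances <= 0:
        raise ValueError("sample size, sample time and instances must be positive")
    # Time for one machine to process everything, split evenly across instances.
    wall_clock_seconds = dataset_gb / sample_gb * sample_seconds / instances
    wall_clock_hours = wall_clock_seconds / 3600
    billed_hours = math.ceil(wall_clock_hours)  # assumption: whole-hour billing
    cost = billed_hours * instances * price_per_instance_hour
    return wall_clock_hours, cost


if __name__ == "__main__":
    # 300 GB dataset, 1 GB sample processed in 120 s on a laptop,
    # 10 instances at a made-up price of 0.05 per instance-hour.
    hours, cost = estimate_cluster_cost(300, 1, 120, 10, 0.05)
    print(f"{hours:.2f} h wall clock, about {cost:.2f} in instance charges")
```

With the example numbers this comes to one hour of wall-clock time and 0.50 in instance charges; storage, data transfer, and spot price swings are left out.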
TAKEAWAY
There are plenty of free tools out there

To crunch the data for fun and profit

They are easy (not simple) to learn and generic enough
Thank you
Questions?
@seoul_engineer
Alexander
github.com/bzz
