Near real-time anomaly detection at Lyft, by Mark Grover and Thomas Weise at Strata NY 2018.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/69155
5. Data platform users
[Diagram: platform user groups - Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers - spanning analytics, biz ops, building apps, and experimentation]
6. Data Platform architecture
[Diagram: custom apps, services (e.g. ETA, Pricing), operational data stores (e.g. Dynamo), models + applications (e.g. ETA, Pricing), Flyte]
11. Anomaly Detection use cases
● Security Ops
● Payment fraud
● Customer service
● Accident detection
12. 2. Data Scientist
Responsibility: Extract knowledge and insights from data (to build better products)
Requirements:
● Prototype in a language of choice (Python, R, SQL)
● Quick and simple ways of “cleaning” data
24. What is the problem?
● Data preparation - everyone needs it, examples:
○ Write raw data from stream to S3 for batch consumers
○ Filter, aggregate, … the usual ETL stuff
● Enable teams to focus on business problems, not worry
about “getting data in”
● Data ingress is still surprisingly difficult
○ Really?
○ Give our users a service that shields them from
infrastructure complexity
25. Dryft
Fully managed data processing engine, powering real-time features and events
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine using streaming or bulk data
○ Add automation to make it super simple to launch and maintain feature generation programs
at scale
https://www.slideshare.net/SeattleApacheFlinkMeetup/streaminglyft-greg-fee-seattle-apache-flink-meetup-104398613/#11
26. Dryft Program
Configuration file:
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {
    "stream": "declridecompleted"
  },
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}
decl_ride_completed.sql:
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
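As a sanity check on what the query computes, the same COALESCE-then-count logic can be written as plain Python (a sketch only; the sample events are illustrative, not Lyft's actual schema):

```python
from collections import Counter

def coalesce(*values):
    """Return the first non-None value, mirroring SQL COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None

def n_total_rides(events):
    """Count rides per user, keyed like the Dryft query above."""
    counts = Counter()
    for e in events:
        user_id = coalesce(e.get("user_lyft_id"),
                           e.get("passenger_lyft_id"),
                           e.get("passenger_id"), -1)
        counts[user_id] += 1
    return dict(counts)

events = [
    {"user_lyft_id": 7, "ride_id": "a"},
    {"user_lyft_id": None, "passenger_lyft_id": 7, "ride_id": "b"},
    {"ride_id": "c"},  # no identifier at all -> bucketed under -1
]
print(n_total_rides(events))  # {7: 2, -1: 1}
```

The fallback to -1 means rides with no resolvable user still land in one bucket instead of being dropped.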
27. Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB
[Diagram: S3 source → SQL → sink; Kinesis/Kafka source → SQL → sink]
28. Bootstrapping
● Read historic data from S3
● Transition to reading real-time data
● https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
[Diagram: the S3 source supplies events with timestamps < target time, the Kinesis/Kafka source supplies events >= target time; both feed the business logic and sink]
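The target-time handover in the diagram can be sketched as a merged source that serves historic records strictly below a cut-over timestamp and live records from there on (a toy generator with an assumed record shape, not the actual Flink implementation):

```python
def bootstrapped_stream(historic, live, target_time):
    """Replay historic (S3) records with timestamps below target_time,
    then hand over to the live (Kinesis/Kafka) stream. Records that
    appear in both sources around the boundary are deduplicated by
    the timestamp cut, not by content."""
    for record in historic:
        if record["ts"] < target_time:
            yield record
    for record in live:
        if record["ts"] >= target_time:
            yield record

historic = [{"ts": 1, "v": "old"}, {"ts": 2, "v": "old"}, {"ts": 3, "v": "dup"}]
live = [{"ts": 2, "v": "dup"}, {"ts": 3, "v": "new"}, {"ts": 4, "v": "new"}]
merged = list(bootstrapped_stream(historic, live, target_time=3))
print([r["ts"] for r in merged])  # [1, 2, 3, 4]
```

Each event is emitted exactly once because every timestamp falls on exactly one side of the target time.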
29. When to Dryft
• Feature Generation as original driver
• Declarative Streaming ETL
‒ Stream to Table / Stream
• SQL - Simplicity <> Power tradeoff
‒ Flink SQL supports UDFs (written in Java)
‒ A UDF could also do a service call, but..
30. When we need Programming
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/programming-model.html
31. Flink Streaming Options
• SQL - Dryft
• Java DataStream API - the usual starting point
‒ Sources, Sinks, Windowing, Implicit State Management
‒ Fluent style, high abstraction level
• ProcessFunction for advanced logic
‒ User code controlled state and timers
• Nice fit when Java is already established
‒ Forced language switch is hard sell, time to value long and less predictable
‒ Initial Flink Deployments at Lyft
‒ But we do a lot of stuff in Python..
33. What is the problem?
● Flink APIs primarily target Java developers
○ Most of our teams that want to solve streaming use
cases don’t work with Java
● Enable streaming native to the language ecosystem
○ Python is the primary option for ML
○ (Use cases not addressed by Dryft/Flink SQL)
34. Streaming Options for Python
• Jython != Python
‒ Flink's Python API and a few more
• Jep (Java Embedded Python)
• KCL workers, Kafka consumers as standalone services
• Spark PySpark
‒ Not so much streaming, different semantics
‒ Different deployment story
• Faust
‒ Kafka Streams inspired
‒ No out of the box deployment story
35. Apache Beam
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: SDKs (Beam Java, Beam Python, other languages) feed pipeline construction in the Beam model; Fn runners execute pipelines on Apache Flink, Apache Spark, or Cloud Dataflow]
https://s.apache.org/apache-beam-project-overview
37. Python on Flink via Beam
• Beam model and Flink go well together
‒ Flink Runner most advanced OSS option for Beam Java SDK
• Python SDK already available on Dataflow
• Beam Language Portability allows Python (and Go) SDK to work
with JVM-based runners
‒ Flink Runner is first to support portability
• Flink Deployment Story
‒ Extend to run Python via Beam on Flink
38. Python on Flink via Beam
[Diagram: the Job Service (with artifact staging) submits to the cluster's Job Manager; the runner talks to the SDK harness running the Python UDFs through the Fn services - provision, control, data, artifact retrieval, state, logging; extra dependencies optional]
python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/py-wordcount-direct \
  --experiments=beam_fn_api \
  --runner=PortableRunner \
  --sdk_location=container \
  --job_endpoint=localhost:8099 \
  --streaming
https://s.apache.org/streaming-python-beam-flink
41. Future of Deployment
• Flink embraces containerization
‒ Reactive vs. Active Flink Container Mode
(resources supplied externally vs. actively requested)
• Kubernetes Operator
‒ Resource Elasticity
‒ Improved Resource Utilization
‒ Auto-Scaling Support
‒ Automate (stateful) upgrade
42. Learnings
• Integration
‒ Things work well in isolation, but..
‒ Flink Kinesis Consumer
‣ Connectors that work reliably at scale are hard, not easy
• Things we find at scale
‒ Intermittent AWS service errors (Kinesis, S3)
‣ Retry vs. topology reset
‒ S3 hotspotting with Flink checkpointing for large jobs (FLINK-9061)
‒ Naive pubsub consumption can lead to massive state buffering
‣ Align watermarks across source partitions
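On the last point, the idea behind aligning reads across source partitions can be sketched as a merge that always drains the partition that is furthest behind in event time (a toy illustration with an assumed record shape, not Flink's actual watermark mechanics):

```python
import heapq

def aligned_read(partitions):
    """Merge per-partition event streams by timestamp, always draining
    the partition whose next event is oldest. Consuming partitions
    naively (e.g. by arrival order) lets a fast partition run far
    ahead of a slow one, forcing downstream operators to buffer
    large amounts of state until the slow partition catches up."""
    iters = [iter(p) for p in partitions]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first["ts"], i, first))
    while heap:
        ts, i, event = heapq.heappop(heap)
        yield event
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt["ts"], i, nxt))

fast = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
slow = [{"ts": 1}, {"ts": 4}]
print([e["ts"] for e in aligned_read([fast, slow])])  # [1, 1, 2, 3, 4]
```

The partition index in the heap tuple breaks timestamp ties, so two events with the same timestamp never need to be compared directly.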