More and more major companies are building their own data pipelines to better manage data. To recall, a data pipeline is the software that consolidates data from multiple sources (often billions or trillions of events per day) and routes it to the systems that need it, either continuously or at specified intervals. This data typically powers internal analytics and product features.
Companies are building their own data pipelines from scratch to best fit their needs for data allocation. Many data practitioners are realizing there are huge gaps between different data infrastructure components, and they are trying to fill those gaps. Here, we will briefly discuss how several major companies approach their data allocation problems.
Back in 2011, Yelp needed a solution to scale its business-review service out to new markets. The company had been relying on older legacy IT solutions alongside a giant monolithic app (referred to as “yelp-main”) with millions of lines of code, but response times had become far too slow to meet consumer needs. During the overhaul, Yelp embarked on a major engineering effort that broke that single monolithic app up into 150 individual production services. This migration to a service-oriented architecture (SOA) would, ostensibly, make it easier for the company to serve new markets by scaling the services individually. It would also reduce the amount of time required to get Yelp user insights into the hands of the sales team.
While the SOA improved developer productivity, Yelp's data engineers still faced the problem of communication between the various services, particularly when it was done over RESTful HTTP connections. In response, Yelp built a real-time streaming data platform simply dubbed Data Pipeline: a unified system that lets producer and consumer applications stream information between each other efficiently and scalably. It does this by connecting applications via a common message bus and a standardized message format, so that database changes and log events can be streamed into any service or system that needs them.
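Yelp's actual platform rides on Kafka, but the core idea, every producer wraps its payload in a standard envelope before publishing to a shared bus that any consumer can subscribe to, can be sketched in plain Python. All names below are illustrative, not Yelp's API:

```python
import json
import time
import uuid
from collections import defaultdict

class MessageBus:
    """Toy in-memory stand-in for the common message bus (Kafka, in Yelp's case)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def publish(self, topic, envelope):
        # Deliver the enveloped message to every consumer of this topic.
        for handler in self.subscribers[topic]:
            handler(envelope)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

def make_envelope(schema_id, payload):
    """Wrap a payload in the standardized format every service agrees on."""
    return {
        "message_id": str(uuid.uuid4()),
        "schema_id": schema_id,      # which registered schema the payload conforms to
        "timestamp": time.time(),
        "payload": json.dumps(payload),
    }

bus = MessageBus()
received = []
bus.subscribe("db.users", received.append)  # a downstream consumer service
bus.publish("db.users", make_envelope(schema_id=7, payload={"user_id": 42, "op": "UPDATE"}))
```

Because the envelope is the same for a database change or a log event, any new consumer only needs to understand one format, which is what decouples the 150 services from each other.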
Yelp has open sourced the following components of Data Pipeline:
- MySQL Streamer tails MySQL binlogs for table changes. The Streamer is responsible for capturing individual database changes, enveloping those changes into Kafka messages (with an updated schema if necessary), and publishing to Kafka topics.
- The Schematizer service tracks the schemas associated with each message. When a new schema is encountered, the Schematizer handles registration and generates migration plans for downstream tables.
- The Data Pipeline clientlib provides the interface for producing messages to and consuming messages from Kafka.
- The Data Pipeline Avro utility package provides a Pythonic interface for reading and writing Avro schemas and also provides an enum class for metadata like primary key info.
- The Yelp Kafka library extends the kafka-python package to provide features such as a multiprocessing consumer group.
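The Schematizer's two jobs, registering schemas and planning downstream migrations, can be illustrated with a minimal registry. The class and method names here are hypothetical, not Yelp's actual service API:

```python
class SchemaRegistry:
    """Toy stand-in for a Schematizer-like service: assigns IDs to new schemas
    and derives a migration plan when a schema gains or loses columns."""

    def __init__(self):
        self._schemas = {}   # schema_id -> frozenset of field names
        self._next_id = 1

    def register(self, fields):
        fields = frozenset(fields)
        for schema_id, known in self._schemas.items():
            if known == fields:
                return schema_id          # schema already registered; reuse its ID
        schema_id = self._next_id
        self._next_id += 1
        self._schemas[schema_id] = fields
        return schema_id

    def migration_plan(self, old_id, new_id):
        """Columns a downstream table must add or drop to follow the new schema."""
        old, new = self._schemas[old_id], self._schemas[new_id]
        return {
            "add_columns": sorted(new - old),
            "drop_columns": sorted(old - new),
        }

registry = SchemaRegistry()
v1 = registry.register(["id", "name"])
v2 = registry.register(["id", "name", "email"])
plan = registry.migration_plan(v1, v2)
```

The real Schematizer works with Avro schemas and compatibility rules; this sketch only shows why tracking schemas centrally lets downstream tables be migrated mechanically.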
Netflix uses a combination of Kafka, Samza, Docker, and Linux to implement Keystone, a petabyte-scale real-time event-streaming system that processes 700 billion events per day in the Amazon AWS cloud. Three major iterations of the pipeline, leading to what Netflix now uses, can be broken down as follows:
- A Chukwa-to-S3 data ingest for later processing by Elastic MapReduce (EMR).
- The same pipeline with a Kafka fronted branch supporting the S3 to EMR flow, as well as real time stream processing with tools like Druid and Elasticsearch via a new routing service.
- A RESTful Kafka implementation streaming to a newer version of the routing service managed by a control plane, then consumed by Consumer Kafka clusters, Elasticsearch, or other stream consumers such as Mantis or Spark.
Netflix Keystone contains multiple Kafka producers, each producing to a Fronting Kafka cluster for sink-level isolation. The two main types of clusters are referred to as Fronting Kafka and Consumer Kafka. Fronting Kafka clusters are responsible for getting messages from the producers, which are virtually every application instance in Netflix; their roles are data collection and buffering for downstream systems. Consumer Kafka clusters contain a subset of topics routed by Samza to real-time consumers. Netflix set up one topic per event type/stream to provide isolation and buffering against downstream sinks that could otherwise impact upstream publishing services.
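The benefit of one topic per event type is that a stalled or failing sink backs up only its own stream's buffer. A minimal sketch of that isolation property, with load shedding per stream (the buffer size and drop policy here are assumptions for illustration, not Keystone's actual behavior):

```python
from collections import defaultdict, deque

class FrontingCluster:
    """Toy model of per-event-type topics: each stream gets its own bounded
    buffer, so backpressure on one sink never touches other event streams."""

    def __init__(self, max_buffer=3):
        self.buffers = defaultdict(deque)
        self.max_buffer = max_buffer
        self.dropped = defaultdict(int)

    def produce(self, event_type, event):
        buf = self.buffers[event_type]
        if len(buf) >= self.max_buffer:
            self.dropped[event_type] += 1   # shed load for this stream only
        else:
            buf.append(event)

    def drain(self, event_type):
        """A downstream sink consuming its topic's buffered events."""
        buf = self.buffers[event_type]
        out = list(buf)
        buf.clear()
        return out

cluster = FrontingCluster(max_buffer=3)
for i in range(5):
    cluster.produce("playback", {"n": i})   # this topic's sink is stalled
cluster.produce("search", {"q": "dune"})    # unaffected by the playback backlog
```

The "search" stream publishes and drains normally even while "playback" is shedding load, which is the buffering-and-isolation role the Fronting clusters play.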
The new branch exposed 30 percent of their event traffic in real time to Elasticsearch, to other Kafka streams for transformation, or to Spark for general data processing needs. This enabled interactive data analysis in Python, Scala, and R in real time.
The routing service consists of multiple Docker containers running Samza jobs across EC2 instances. Separate EC2 fleets managed the S3, Elasticsearch, and Kafka routes and generated container-level performance monitoring data. Statistical summaries of latencies, lags, and sinks were also generated to profile the various parts of the pipeline.
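The kind of statistical summary used to profile a pipeline stage can be sketched with the standard library. The choice of percentile and the nearest-rank method are assumptions for illustration, not Netflix's actual monitoring code:

```python
import statistics

def summarize(samples):
    """Summary stats of the sort used to profile pipeline stages (latencies in ms)."""
    ordered = sorted(samples)
    def pct(p):
        # Nearest-rank percentile; adequate for a monitoring sketch.
        return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": pct(95),
        "max": ordered[-1],
    }

# End-to-end latencies with a couple of slow outliers, as a tail-heavy
# pipeline stage would produce.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 15, 900]
stats = summarize(latencies_ms)
```

Note how the median stays near 14 ms while the mean and p95 expose the tail; that gap is exactly what per-route summaries surface when one sink starts lagging.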
SeatGeek uses a variety of sources such as ElasticSearch, MySQL, Redis, and S3 as its main production storage, along with Google Analytics and various APIs for external data. To process and integrate data across all of its applications, the company relies on three components for its pipeline:
- Looker: A business intelligence service that sits on top of the datastore and uses its own data-modeling language to provide a visually appealing frontend layer to the data. It can connect to any relational database, such as Amazon Redshift or Google BigQuery, and automatically generates a data model from the schema that visually presents metrics relevant to SeatGeek's business calculations.
- Redshift: Amazon’s cloud-based analytical datastore, a columnar store based on PostgreSQL. A main benefit of columnar datastores is that column-stored data is far better optimized than row-stored data for the ‘many rows, few columns’ summary queries that analysts are interested in running.
- Luigi: An open-source Python framework created by Spotify for building complex pipelines of batch jobs. It handles dependency resolution and workflow management; Spotify originally used it to generate recommendations and top lists.
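Luigi itself defines tasks with `requires()`, `output()`, and `run()` methods; its core idea, a task runs only after its requirements and is skipped once its output exists, can be sketched without the library. This mimic is illustrative and is not Luigi's actual API:

```python
class Task:
    """Minimal mimic of Luigi's model: tasks declare requirements, run after
    them, and are skipped if already complete (idempotent batch jobs)."""

    def requires(self):
        return []

    def complete(self, outputs):
        return type(self).__name__ in outputs

    def run(self, outputs):
        raise NotImplementedError

def build(task, outputs, log):
    """Depth-first dependency resolution, as a workflow scheduler would do."""
    if task.complete(outputs):
        return
    for dep in task.requires():
        build(dep, outputs, log)           # satisfy dependencies first
    task.run(outputs)
    outputs.add(type(task).__name__)       # mark this task's output as existing
    log.append(type(task).__name__)

class ExtractSales(Task):
    def run(self, outputs):
        pass  # e.g. dump production MySQL tables to S3

class LoadRedshift(Task):
    def requires(self):
        return [ExtractSales()]

    def run(self, outputs):
        pass  # e.g. COPY the extracted files into Redshift

outputs, log = set(), []
build(LoadRedshift(), outputs, log)
```

Running `build` again with the same `outputs` set does nothing, because every task reports complete; that idempotence is what makes Luigi-style pipelines safe to re-run after failures.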
Pinterest has recently designed a real-time data pipeline that ingests data into MemSQL using Spark Streaming, helping it achieve higher-performance event logging, reliable log transport and storage, and faster query execution on real-time data. Pinterest has also open sourced a few of its projects:
- Pintrace: A Spark job that reads spans from Kafka, aggregates them into traces and stores them in an Elasticsearch backend.
- Pinball: A workflow manager that handles a wide range of data processing pipelines composed of jobs ranging from simple shell scripts to elaborate Hadoop workloads.
- Secor: Persists Kafka logs to long-term storage such as Amazon S3. It’s not affected by S3’s weak eventual consistency model, incurs no data loss, scales horizontally, and optionally partitions data based on date.
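Secor's date-based partitioning amounts to mapping each Kafka message to a deterministic object key, so a retried upload overwrites the same object instead of duplicating data. A sketch of that mapping follows; the path layout is an assumption for illustration, not Secor's exact format:

```python
from datetime import datetime, timezone

def s3_key(topic, partition, offset, timestamp_ms):
    """Map a Kafka message to a date-partitioned S3 key, Secor-style.

    The key depends only on (topic, partition, offset, date), so re-uploading
    after a retry is idempotent: weak eventual consistency on the store
    cannot cause data loss or duplication."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return (
        f"raw_logs/{topic}/dt={dt:%Y-%m-%d}/"
        f"{partition}_{offset:020d}"      # zero-padded so keys sort by offset
    )

key = s3_key("click_events", partition=3, offset=1234,
             timestamp_ms=1_500_000_000_000)
```

Date partitioning also means a batch job can read exactly one day of logs by listing a single `dt=` prefix, which is what makes the S3 output convenient for downstream EMR- or Hive-style processing.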