A Real-Time Data Quality Monitoring Platform Based on Kafka and Spark (PPT)
Speaker: 邢国东 (Guodong Xing), Senior Product Manager @ Microsoft
Contact: WeChat, LinkedIn. We're hiring!

Versions
- Kafka under audit: 0.8.1.1
- Audit pipeline: Kafka 0.8.1.1, Spark 1.6.1, ElasticSearch 1.7.0
- To be open sourced!

What problems does data quality monitoring need to solve?
- SLA, QPS, scale
- Late arrival, out-of-order processing, duplication

A changing Microsoft
- Microsoft Applications and Services Group (ASG)
- ASG data team: big data platform, data analytics

Kafka as data bus
- Data flow: devices and services -> streaming processing and batch processing -> applications
- Scalable pub/sub for NRT (near-real-time) data streams
- Interactive analytics

Fast-growing real-time data
- 1.3 million events per second ingress at peak
- ~1 trillion events per day processed at peak
- 3.5 petabytes processed per day
- 100 thousand unique devices and machines
- 1,300 production Kafka brokers
- 1 second 99th-percentile latency

Data quality assurance upstream and downstream of Kafka
- Pipeline: many Producers -> Kafka + HLC (high-level consumer) -> Destinations
- 100K QPS, 300 Gb per hour
- Data == Money; Lost Data == Lost Money

How it works
- Three audit granularities:
  - File level
  - Batch level
  - Record level
- Audit metadata schema:

  {
    "Action": "Produced or Uploaded",
    "ActionTimeStamp": "action date and time (UTC)",
    "Environment": "environment (cluster) name",
    "Machine": "computer name",
    "StreamID": "type of data (sheep, ducks, etc.)",
    "SourceID": "e.g. file name",
    "BatchID": "a hash of the data in this batch",
    "NumBytes": "size in bytes",
    "NumRecords": "number of records in the batch",
    "DestinationID": "destination ID"
  }

How it works – data and audit flow
- The producer in the data center writes File 1 (Record1, Record2, Record3) and reports to the audit system:
  - Produced: "File 1", 3 records, 24 bytes, timestamp, BatchID=abc123
  - As Record4 and Record5 arrive: Produced: "File 1", 5 records, 40 bytes, timestamp, BatchID=def456
- Destination 1, downstream of Kafka + HLC, reports what it received:
  - Uploaded: "File 1", 3 records, 24 bytes, timestamp, BatchID
- The audit system compares Produced and Uploaded counts to detect loss.

Kibana chart: data latency

Kibana chart: data completeness
- 3 lines:
  - Green: how many records were produced
  - Blue: how many reached …
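The audit metadata above can be sketched as a small producer-side helper. The field names follow the schema on the slide; the choice of MD5 for BatchID, the serialization details, and the function name `make_audit_event` are illustrative assumptions, not the platform's actual implementation.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_audit_event(action, environment, machine, stream_id,
                     source_id, records, destination_id):
    """Build one audit event for a batch of records.

    Field names mirror the metadata schema; hashing and
    timestamp formatting here are illustrative assumptions.
    """
    payload = b"".join(r.encode("utf-8") for r in records)
    return {
        "Action": action,                       # "Produced" or "Uploaded"
        "ActionTimeStamp": datetime.now(timezone.utc).isoformat(),
        "Environment": environment,             # environment (cluster) name
        "Machine": machine,                     # computer name
        "StreamID": stream_id,                  # type of data (sheep, ducks, ...)
        "SourceID": source_id,                  # e.g. file name
        "BatchID": hashlib.md5(payload).hexdigest(),  # hash of this batch's data
        "NumBytes": len(payload),
        "NumRecords": len(records),
        "DestinationID": destination_id,
    }

# Example matching the slide's flow: File 1 with 3 records.
event = make_audit_event("Produced", "prod-cluster-1", "host42", "ducks",
                         "File 1", ["Record1", "Record2", "Record3"],
                         "Destination 1")
print(json.dumps(event, indent=2))
```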
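The completeness check behind the Kibana chart can be illustrated by reconciling Produced against Uploaded record counts per source file, as in the File 1 example above (8 records produced, 3 uploaded). This per-file, in-memory aggregation is a simplified sketch; the actual pipeline aggregates audit events with Spark and indexes the results into ElasticSearch.

```python
from collections import defaultdict

def completeness_report(audit_events):
    """Sum records Produced vs Uploaded per SourceID.

    uploaded < produced suggests data loss;
    uploaded > produced suggests duplication.
    """
    counts = defaultdict(lambda: {"Produced": 0, "Uploaded": 0})
    for ev in audit_events:
        counts[ev["SourceID"]][ev["Action"]] += ev["NumRecords"]
    return {
        source: {
            "produced": c["Produced"],
            "uploaded": c["Uploaded"],
            "missing": c["Produced"] - c["Uploaded"],
        }
        for source, c in counts.items()
    }

# Audit events from the slide's walkthrough of File 1.
events = [
    {"SourceID": "File 1", "Action": "Produced", "NumRecords": 3},
    {"SourceID": "File 1", "Action": "Produced", "NumRecords": 5},
    {"SourceID": "File 1", "Action": "Uploaded", "NumRecords": 3},
]
print(completeness_report(events))
# File 1: produced 8, uploaded 3, missing 5
```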
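The latency chart can be fed the same way: pair each batch's Produced and Uploaded audit events and take the timestamp difference. Matching on BatchID is an assumption drawn from the metadata schema, and ISO-8601 strings stand in for ActionTimeStamp; a minimal sketch under those assumptions:

```python
from datetime import datetime

def batch_latencies(audit_events):
    """End-to-end latency per batch: Uploaded time minus Produced time."""
    produced, uploaded = {}, {}
    for ev in audit_events:
        ts = datetime.fromisoformat(ev["ActionTimeStamp"])
        if ev["Action"] == "Produced":
            produced[ev["BatchID"]] = ts
        elif ev["Action"] == "Uploaded":
            uploaded[ev["BatchID"]] = ts
    # Only batches seen on both sides have a measurable latency.
    return {b: (uploaded[b] - produced[b]).total_seconds()
            for b in produced if b in uploaded}

# Hypothetical timestamps for the batch abc123 from the slide.
events = [
    {"BatchID": "abc123", "Action": "Produced",
     "ActionTimeStamp": "2016-05-01T12:00:00+00:00"},
    {"BatchID": "abc123", "Action": "Uploaded",
     "ActionTimeStamp": "2016-05-01T12:00:00.800000+00:00"},
]
print(batch_latencies(events))  # {'abc123': 0.8}
```

Percentiles over these per-batch values (e.g. the 99th) give the "1 second 99th-percentile latency" figure the deck reports.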