Hundreds of millions of records: how do you query and analyze them simply and efficiently?

Posted Jun 28, 2020 (5 min read)

Summary: During the 618 promotion, Xiao Zhang ran into a difficult problem: he had to complete a joint analysis of the company's e-commerce revenue and its offline store operating data within a week.

What data problems does this create?

  • Data silos: the e-commerce department's data lives in data warehouse A, while the store operating income data lives in data warehouse B. How can a joint analysis across several warehouses be carried out conveniently?
  • PB-level data volume: multiple e-commerce platforms plus offline stores across the country generate TB-level data every day, so the annual data volume reaches the PB level.
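To make the jump from TB per day to PB per year concrete, here is a minimal back-of-the-envelope sketch. The daily rates are hypothetical values chosen for illustration, not figures from Xiao Zhang's company, and decimal units (1 PB = 1000 TB) are assumed.

```python
# Rough yearly data volume from a daily ingest rate.
# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def yearly_volume_pb(daily_tb: float, days: int = 365) -> float:
    """Return the data volume in PB accumulated over `days` at `daily_tb` TB/day."""
    return daily_tb * days / TB_PER_PB


if __name__ == "__main__":
    # Hypothetical daily rates, used only to show the order of magnitude.
    for daily_tb in (1, 3, 10):
        print(f"{daily_tb} TB/day -> {yearly_volume_pb(daily_tb):.3f} PB/year")
```

Under these assumptions, roughly 3 TB per day is already enough to pass 1 PB in a year.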

He contacted the group CTO right away, hoping to have the data of the various departments exported to him within a day.

This put the CTO in a difficult position:

The company's existing resource pool can easily handle TB-level data, but the data Xiao Zhang wants is roughly estimated to reach the PB level, far beyond what the existing pool can handle. With the current pool, the export would only be possible at a large cost in time; the usual alternative, expanding the company's resource pool, would make the overall cost too high.

To address Xiao Zhang's problem, Yunhu Lake recommended Huawei Cloud's big data query and analysis tool, the Data Lake Discovery (DLI) service. DLI can run joint queries over EB-level data volumes. Each CU costs only 0.35 yuan/hour (1 CU = 1 core + 4 GB memory), and a monthly subscription for 1 CU is only 150 yuan.

Data Lake Discovery (DLI) Service 2.0 is a serverless big data computing and analysis service that is fully compatible with the Apache Spark and Apache Flink ecosystems. Users only need standard SQL or their own programs to query and analyze a variety of heterogeneous data sources.
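The two prices quoted above can be compared with a short calculation. This is a sketch only: the function names are made up for this example, only the two rates come from the article, and real bills may include other items (storage, scanned data volume) that are not modeled here.

```python
# Compare pay-per-use and monthly pricing for DLI compute units (CUs).
HOURLY_RATE_YUAN = 0.35    # per CU per hour, as quoted in the article
MONTHLY_RATE_YUAN = 150.0  # per CU per month, as quoted in the article


def pay_per_use_cost(cus: int, hours: float) -> float:
    """Cost in yuan of running `cus` CUs for `hours` hours at the hourly rate."""
    return cus * hours * HOURLY_RATE_YUAN


def break_even_hours() -> float:
    """Hours per month above which the monthly subscription is cheaper for one CU."""
    return MONTHLY_RATE_YUAN / HOURLY_RATE_YUAN


def cheaper_plan(hours_per_month: float) -> str:
    """Return which plan costs less for one CU at the given monthly usage."""
    return "pay-per-use" if hours_per_month < break_even_hours() else "monthly"


if __name__ == "__main__":
    # Hypothetical workload: a 16-CU queue busy for 10 hours.
    print(f"16 CUs x 10 h: {pay_per_use_cost(16, 10):.2f} yuan")
    print(f"Break-even: {break_even_hours():.1f} hours per month")
    print(f"100 h/month: {cheaper_plan(100)}")
    print(f"720 h/month: {cheaper_plan(720)}")
```

At these rates the break-even point is a little under 429 hours per month, so a one-off analysis like Xiao Zhang's fits pay-per-use, while a queue that is busy most of the month fits the subscription.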

How does DLI solve Xiao Zhang's problem?

DLI service architecture: serverless

DLI is a serverless big data query and analysis service. Its advantages are:

(1) Pay-per-use billing: you are billed for actual usage (data scanned or CU time), with no charge when no jobs are running.

(2) Automatic scaling: computing resources are estimated from the business load and scaled out or in automatically.

DLI's serverless architecture makes it easy to cope with tight budgets, insufficient resources, and temporary business needs.

1. DLI core engines: Spark + Flink

Spark is a unified analytics engine for large-scale data processing, focused on query, computation, and analysis. DLI adds extensive performance optimizations and service-oriented changes on top of open source Spark. It remains compatible with the Apache Spark ecosystem and interfaces, and its performance is 2.5 times higher than that of open source Spark. EB-level data can be queried and analyzed within hours. DLI also provides a Flink engine for real-time processing.

2. DLI's ace feature: cross-source analysis

DLI can connect to multiple cloud services, self-built databases, and offline databases. It can run analysis across these data sources directly and build a unified view of the enterprise's data.

Xiao Zhang connects offline data warehouse A and data warehouse B to DLI at the same time and can then run joint queries directly on DLI. This avoids migrating data between the two warehouses or rebuilding a warehouse just for the joint query, and it makes cross-database queries easy.
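The idea of joining two separately stored datasets with one SQL statement can be tried locally. The sketch below is an analogy only: it uses Python's built-in sqlite3 module with two in-memory databases standing in for warehouse A and warehouse B, and it does not use DLI's own API or connection syntax. All table names, column names, and figures are made up for this example.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create two separate in-memory databases and load hypothetical revenue rows."""
    conn = sqlite3.connect(":memory:")  # "main" plays the role of warehouse A
    # A second, independent database plays the role of warehouse B.
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse_b")
    conn.execute("CREATE TABLE main.online_revenue (region TEXT, revenue REAL)")
    conn.execute("CREATE TABLE warehouse_b.store_revenue (region TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO main.online_revenue VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0)],
    )
    conn.executemany(
        "INSERT INTO warehouse_b.store_revenue VALUES (?, ?)",
        [("north", 50.0), ("south", 70.0)],
    )
    conn.commit()
    return conn


# One standard SQL statement joins tables that live in different databases.
JOINT_QUERY = """
SELECT o.region,
       o.revenue AS online,
       s.revenue AS offline,
       o.revenue + s.revenue AS total
FROM main.online_revenue AS o
JOIN warehouse_b.store_revenue AS s ON s.region = o.region
ORDER BY o.region
"""


def joint_revenue(conn: sqlite3.Connection) -> list:
    """Return (region, online, offline, total) rows from the cross-database join."""
    return conn.execute(JOINT_QUERY).fetchall()


if __name__ == "__main__":
    for row in joint_revenue(build_demo_connection()):
        print(row)
```

No data is copied from one database into the other here; the query engine reads both sources in place, which is the same pattern the article describes for cross-source analysis, at a much smaller scale.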

Other advantages of the Data Lake Discovery (DLI) service

  1. Pure SQL operation: DLI provides a standard SQL interface, so users only need SQL to query and analyze massive data.
  2. Separation of storage and compute: storage and compute are decoupled, requested separately, and billed separately, which reduces costs and improves resource utilization.
  3. Enterprise-level multi-tenancy: computing resources are isolated by tenant, and data permissions are controlled down to queues and jobs, helping enterprises share data and manage permissions across departments.
  4. No operations and maintenance, high availability: users do not need to deal with underlying operations, maintenance, or upgrades, and the service provides cross-AZ high availability and cross-AZ active operation.

Application scenarios of the Data Lake Discovery (DLI) service

  1. Database analysis + DLI 2.0: build a warehouse with one click while keeping the ease of use of a database

Pain points:

(1) There are too many databases to analyze the data in full

(2) Complex relationships in the databases cannot be queried

(3) Analysis affects other online data services

With DLI, standard SQL alone is enough to complete big data query and analysis.

  2. Precision marketing + DLI 2.0: intelligent e-commerce recommendations, with cross-database, cross-source queries over massive data in seconds

Pain points:

(1) There are too many data sources to analyze jointly

(2) Smart recommendations have to be delivered within a short time

DLI's cross-source capability easily breaks down data silos. It now supports 10 types of data sources as well as offline self-built data.

  3. Log analysis + DLI 2.0: an essential scenario for every company

Pain points:

(1) Log analysis covers a large time span

(2) Resources are underused, so the utilization rate is low

DLI is billed by usage, and a single CU costs only 0.35 yuan per hour.

  4. Real-time risk control + DLI 2.0: reducing risk events in real-time scenarios such as finance and operations and maintenance

Pain points:

(1) Data is not refreshed in time, so risk events occur frequently

(2) Real-time data analysis requires a deep understanding of Flink's underlying architecture

Risk control systems require high real-time performance. DLI uses high-performance computing resources, and a single CPU can process 1,000 to 20,000 messages per second.
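The quoted per-CPU throughput range can be turned into a rough sizing estimate. This is a sketch under simple assumptions: throughput is taken to scale linearly with CPU count, the target rate is a hypothetical value, and the function name is invented for this example. Actual throughput depends on message size and job logic.

```python
import math
from typing import Tuple

# Per-CPU throughput range quoted in the article (messages per second).
MIN_MSGS_PER_CPU = 1_000
MAX_MSGS_PER_CPU = 20_000


def cpus_needed(target_msgs_per_sec: int) -> Tuple[int, int]:
    """Return (best_case, worst_case) CPU counts for a target message rate.

    Assumes throughput scales linearly with the number of CPUs.
    """
    best_case = math.ceil(target_msgs_per_sec / MAX_MSGS_PER_CPU)
    worst_case = math.ceil(target_msgs_per_sec / MIN_MSGS_PER_CPU)
    return best_case, worst_case


if __name__ == "__main__":
    # Hypothetical risk-control stream of 50,000 messages per second.
    best, worst = cpus_needed(50_000)
    print(f"Between {best} and {worst} CPUs")
```

The wide gap between the two bounds is a reminder that the per-CPU figure is a range, so a real deployment would need to be measured against its own message mix.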

Serverless big data services are built for the future. As today's problems are solved one by one, their share of big data analysis will grow year by year, truly turning big data analysis into a utility that every enterprise can use, just like water and electricity. The Huawei Cloud Data Lake Discovery (DLI) service helps enterprises easily complete batch and stream processing of heterogeneous data sources and explore the value of their data.

For more information, visit the official Huawei Cloud Data Lake Discovery (DLI) service website.
