[Huawei Cloud Technology Sharing] Demystifying how to build a business engine with the Ascend AI computing solution

Posted Jun 15, 2020 · 7 min read

Summary: The Ascend AI computing solution is built on the hard-core strengths of extreme computing power, device-edge-cloud integration, full-stack innovation, and an open ecosystem. Users can implement a business engine through the standard Matrix interface and unleash Ascend's AI acceleration capability.

Starting from matrix multiplication (GEMM) in convolutional neural networks

Any discussion of AI workloads has to mention the classic AlexNet. Proposed in 2012, AlexNet is considered one of the most influential models in computer vision. The network consists of eight layers: the first five are convolutional layers and the last three are fully connected layers, working together with pooling and normalization operations. Tallying the parameter counts of all convolutional and fully connected layers and the floating-point operations of each layer, AlexNet reaches on the order of 60 million parameters and about 720 MFLOPs per forward pass. Compared horizontally with other classic classification networks, both the parameter scale and the computation are enormous, and more than 99% of that computation comes from convolution, which is essentially matrix arithmetic.
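To make these figures concrete, the per-layer computation can be reproduced from the standard convolution FLOPs formula: each output element costs Kh × Kw × Cin multiply-adds, i.e. twice that many floating-point operations. The C++ sketch below (ours, not from the original article) evaluates it for AlexNet's first convolutional layer, whose well-known shape is 96 kernels of 11×11×3 producing a 55×55 output map; this single layer already accounts for roughly 211 of the ~720 MFLOPs.

```cpp
#include <cstdio>

// FLOPs of one convolutional layer: every output element needs
// kh * kw * cin multiply-adds, i.e. 2 * kh * kw * cin floating-point ops.
long long ConvFlops(int kh, int kw, int cin, int hout, int wout, int cout) {
    return 2LL * kh * kw * cin * hout * wout * cout;
}

int main() {
    // AlexNet conv1: 96 kernels of 11x11x3 over a 55x55 output map.
    long long conv1 = ConvFlops(11, 11, 3, 55, 55, 96);
    std::printf("conv1: %lld FLOPs (~%.0f MFLOPs)\n", conv1, conv1 / 1e6);
    return 0;
}
```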

With parameters and computation at this scale, accelerating matrix operations became an urgent problem in visual computing. As an example, take a typical multiplication of two 16×16 matrices: how is it computed on different hardware?

On a CPU, matrix multiplication takes three nested for loops, performing one scalar multiply-add at a time; in theory it needs 16×16×16×2 = 8192 clock cycles. A reference implementation follows.
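Here is the naive triple-loop version described above; for two 16×16 matrices it performs 16×16×16 scalar multiply-adds, i.e. 8192 floating-point operations.

```cpp
#include <array>

constexpr int N = 16;
using Mat = std::array<std::array<float, N>, N>;

// Naive GEMM: three nested loops, one scalar multiply-add per innermost step.
// Total work: N^3 multiply-adds = 16*16*16*2 = 8192 floating-point operations.
Mat MatMul(const Mat& a, const Mat& b) {
    Mat c{};  // zero-initialized accumulator
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}
```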

On a GPU, matrix multiplication is already optimized: the GPU can execute vector multiply-add instructions directly, so the operation above decomposes into 16×16 = 256 vector multiply-adds, requiring 256 clock cycles.

The Ascend processor provides a dedicated matrix multiplication unit that completes one such matrix multiplication in a single clock cycle. With its excellent AI inference performance and ultra-low power consumption, it powers the Ascend AI computing solution.

The Ascend AI computing solution: a new horizon for computing power in the cloud

The Ascend AI computing solution draws on the hard-core strengths of extreme computing power, device-edge-cloud integration, full-stack innovation, and an open ecosystem to help industry customers achieve remarkable results in AI vision workloads such as image classification, object detection, person detection, face recognition, and vehicle detection.

At the IaaS layer, the Ascend AI computing solution provides Ascend AI inference instances, including Ai1 and KAi1, as well as the bare-metal instance KAt1 for AI training.

At the operator layer, the Ascend AI computing solution supports the operators of the mainstream frameworks TensorFlow and Caffe, plus the ability to define custom operators. On top of the operator layer it also provides the standardized Matrix interface, on which users can build an Ascend business engine.

At the same time, users can employ Ascend Serving to expose RESTful API or gRPC requests and decouple services cleanly. Combined with the AI container service, the upper layer can easily achieve elastic scaling and greatly shorten the business deployment cycle.

How to implement a business engine with the Matrix interface

Users can implement the business engine through the standard Matrix interface and unleash Ascend's AI acceleration capability through the SDK.

Matrix is a general business process execution engine that sits above the operating system and below the business applications. It shields operating system differences and provides unified, standardized interfaces for applications, including process orchestration interfaces (supporting C/C++ and Python) and a model manager interface (supporting C++).

A typical business flow usually includes data reading, data pre-processing (image decoding, resizing, and so on), model inference, and data post-processing.

In the Matrix framework, each of the stages above can be abstracted as an Engine, a computing unit with a specific function. Several Engines form a Graph, and the Graph is responsible for managing them. Matrix, as a general business process execution engine, manages the creation, execution, and destruction of Graphs. A minimal Engine sketch follows.
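The sketch below follows the pattern used in the Ascend DDK samples: an Engine subclass with an Init method, plus HIAI_DEFINE_PROCESS/HIAI_IMPL_ENGINE_PROCESS macros that declare and implement its Process body. Header paths, macro signatures, and the "EngineTransT" message type name are assumptions to verify against your SDK version.

```cpp
#include "hiaiengine/engine.h"  // Matrix Engine base class (path per DDK samples)

// A single-input, single-output Engine. Exact macro and method signatures
// are assumptions based on the Ascend DDK samples.
class PreProcessEngine : public hiai::Engine {
public:
    HIAI_StatusT Init(const hiai::AIConfig& config,
                      const std::vector<hiai::AIModelDescription>& modelDescs) {
        return HIAI_OK;        // one-time setup (e.g. DVPP handles) goes here
    }
    HIAI_DEFINE_PROCESS(1, 1)  // declare 1 input port and 1 output port
};

// Matrix invokes this body whenever data arrives on the input port.
HIAI_IMPL_ENGINE_PROCESS("PreProcessEngine", PreProcessEngine, 1)
{
    // arg0 holds the data received on input port 0; decoding/resizing
    // would happen here, then the result is handed to the next Engine.
    return SendData(0, "EngineTransT", std::static_pointer_cast<void>(arg0));
}
```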

Matrix calculation process

We walk through Matrix's calculation process in three phases: creation, execution, and destruction. (The colored arrows below refer to the flow diagram in the original article.)

Creation phase, shown by the red arrows (a code sketch follows the steps below):

Create a Graph object according to the Graph configuration.

Upload the offline model file and configuration file to the Device side.

Initialize the engines; the inference engine loads the model through the Init interface of the offline model manager (AIModelManager).
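Putting the creation steps together, a host-side application typically drives them through the static hiai::Graph interface, as in this sketch modeled on the DDK samples; the config file name and graph id are placeholders, and signatures should be checked against your SDK.

```cpp
#include <cstdio>
#include "hiaiengine/graph.h"  // hiai::Graph (header path per DDK samples)

static const uint32_t kGraphId = 100;  // must match the id in the graph config

// Drives the creation steps above; returns a handle for later SendData calls.
std::shared_ptr<hiai::Graph> CreateBusinessGraph() {
    // Parsing the configuration instantiates the Engines and runs each
    // Engine's Init(); this is where the inference engine loads the offline
    // model through AIModelManager::Init. "graph.prototxt" is a placeholder.
    if (hiai::Graph::CreateGraph("./graph.prototxt") != HIAI_OK) {
        std::fprintf(stderr, "CreateGraph failed\n");
        return nullptr;
    }
    return hiai::Graph::GetInstance(kGraphId);
}
```

At teardown (the destruction phase below), the matching call in the DDK samples is hiai::Graph::DestroyGraph(kGraphId).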

Execution phase, shown by the gray arrows (a sketch of the inference engine's Process body follows these steps):

Input data

The pre-processing Engine calls the DVPP API for data pre-processing, such as encoding/decoding, cropping, and scaling of video and images.

The inference engine calls the Process interface of the offline model manager (AIModelManager) to run the inference computation.

The inference engine calls the SendData interface provided by Matrix to return the inference results to the DestEngine, which hands them back to the app through the callback function.
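The inference step itself typically lives in the inference Engine's Process body and wraps AIModelManager::Process, roughly as below. Tensor handling is elided; the Process signature (context, inputs, outputs, timeout) is an assumption from the DDK samples, and modelManager_ is an assumed member created and Init-ed during the creation phase.

```cpp
// class InferenceEngine mirrors PreProcessEngine above, plus a member:
//     std::shared_ptr<hiai::AIModelManager> modelManager_;
// Matrix invokes this body when pre-processed data arrives.
HIAI_IMPL_ENGINE_PROCESS("InferenceEngine", InferenceEngine, 1)
{
    hiai::AIContext context;
    std::vector<std::shared_ptr<hiai::IAITensor>> inputs;   // wrapped from arg0
    std::vector<std::shared_ptr<hiai::IAITensor>> outputs;  // pre-allocated

    // ... convert the pre-processed buffer in arg0 into input tensors ...

    // Inference on the offline model loaded at Init time (assumed signature).
    HIAI_StatusT ret = modelManager_->Process(context, inputs, outputs, 0);
    if (ret != HIAI_OK) {
        return ret;
    }
    // Forward results; DestEngine ultimately returns them to the app
    // through the registered callback.
    return SendData(0, "ResultT", std::static_pointer_cast<void>(outputs[0]));
}
```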

Destruction phase, shown by the blue arrows:

End the program and destroy the Graph object.

A double BUFF bonus: Matrix data flow and call flow

Zero-copy data flow

We can see that the transmission performance of data streams in the Matrix framework is crucial.

The framework therefore provides its own set of memory allocation and release interfaces, HIAI_DMalloc/HIAI_DFree and HIAI_DVPP_DMalloc/HIAI_DVPP_DFree, supporting C/C++.

Specifically:

The HIAI_DMalloc/HIAI_DFree interfaces are mainly used to allocate memory that, in cooperation with the SendData interface, carries data from the Host side to the Device side;

The HIAI_DVPP_DMalloc/HIAI_DVPP_DFree interfaces are mainly used to allocate the memory used by DVPP on the Device side.

Allocating memory through the HIAI_DMalloc/HIAI_DFree and HIAI_DVPP_DMalloc/HIAI_DVPP_DFree interfaces minimizes copying and shortens processing time.

HIAI_DMalloc performs best in the cross-side transmission and model inference stages. Its main advantages (a usage sketch follows the two lists below):

The allocated memory can be used for data transport, avoiding data copies between Matrix and the transmission module.

The allocated memory directly enables the zero-copy mechanism of model inference, cutting data copy time.

The advantages of the HIAI_DVPP_DMalloc interface are:

The allocated memory can be used by DVPP and, once DVPP is done, passed through transparently to model inference.

If model inference is not required, the data in the allocated memory can be sent straight back to the Host side.
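A hedged host-side sketch of the pattern these interfaces enable: allocate the transfer buffer with HIAI_DMalloc so Matrix can DMA-map it, hand it to SendData, and pair the allocation with HIAI_DFree. The HIAI_DMalloc parameter list, the EnginePortID layout, and the "RawFrameT" message type (registered in the next subsection) are assumptions based on the DDK samples.

```cpp
#include <cstring>
#include <memory>
#include "hiaiengine/graph.h"  // hiai::Graph, EnginePortID (per DDK samples)

// Send one frame from Host to Device with minimal copying.
bool SendFrame(const std::shared_ptr<hiai::Graph>& graph,
               const uint8_t* frame, uint32_t len) {
    // Memory from HIAI_DMalloc can be DMA-mapped, letting Matrix carry it
    // to the Device side without an intermediate copy. The (size, timeout,
    // flag) parameter list is an assumption; check your SDK header.
    auto* buf = static_cast<uint8_t*>(HIAI_DMalloc(len, /*timeout_ms=*/10000, 0));
    if (buf == nullptr) {
        return false;
    }
    std::memcpy(buf, frame, len);  // the only Host-side copy

    hiai::EnginePortID target{};
    target.graph_id  = 100;        // ids must match the graph configuration
    target.engine_id = 1000;       // first Engine on the Device side
    target.port_id   = 0;
    // The deleter pairs the allocation with HIAI_DFree once all references
    // are gone (e.g. if SendData fails before ownership is handed off).
    std::shared_ptr<uint8_t> data(buf, [](uint8_t* p) { HIAI_DFree(p); });
    return graph->SendData(target, "RawFrameT",
                           std::static_pointer_cast<void>(data)) == HIAI_OK;
}
```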

User-friendly Host-Device data transmission

For data transmission between Host and Device, using HIAI_REGISTER_SERIALIZE_FUNC to serialize/deserialize user-defined data types achieves high-performance transmission and saves transmission time.

Matrix describes the data to be transmitted as "control information + data information". The control information is the user-defined data type; the data information is the actual content to be transmitted. To support Host-Device transmission, Matrix provides the following mechanism:

Before transmitting data, the user calls the HIAI_REGISTER_SERIALIZE_FUNC macro to register the user-defined data type along with its serialization and deserialization functions, as sketched below.
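Concretely, registration might look like the following: a message struct whose plain fields are the control information and whose buffer pointer is the data information, plus a serialization function that exposes both to Matrix. Function signatures and the header path follow the DDK sample pattern and should be treated as assumptions; the deserialization counterpart is sketched at the end of this section.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include "hiaiengine/data_type_reg.h"  // HIAI_REGISTER_SERIALIZE_FUNC (per DDK)

// User-defined message: the three plain fields are the control information;
// "data" points at the bulk payload (dataBuf, allocated via HIAI_DMalloc).
struct RawFrameT {
    uint32_t width  = 0;
    uint32_t height = 0;
    uint32_t len    = 0;
    std::shared_ptr<uint8_t> data;  // dataBuf
};

// Serialization: hand Matrix the control bytes (the ctrlBuf content) and
// the dataBuf pointer/length. Signature assumed from the DDK samples.
void RawFrameSerialize(void* input, std::string& ctrl,
                       uint8_t*& dataBuf, uint32_t& dataLen) {
    auto* msg = static_cast<RawFrameT*>(input);
    ctrl.assign(reinterpret_cast<char*>(msg), 3 * sizeof(uint32_t));
    dataBuf = msg->data.get();
    dataLen = msg->len;
}

// Deserialization counterpart, defined in the sketch at the end of the section.
std::shared_ptr<void> RawFrameDeserialize(const char* ctrl, const uint32_t& ctrlLen,
                                          const uint8_t* dataBuf, const uint32_t& dataLen);

// Register the type name plus both functions before any SendData call.
HIAI_REGISTER_SERIALIZE_FUNC("RawFrameT", RawFrameT,
                             RawFrameSerialize, RawFrameDeserialize);
```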

After the user calls the SendData interface on the local end, Matrix handles the transmission as follows:

It calls the user-defined serialization function to serialize the control information and places the result in memory (ctrlBuf).

Through DMA (Direct Memory Access) mapping, a copy of the control information is stored in the peer's memory, and the mapping between the local and peer copies is maintained. The pointer to the data information memory (dataBuf) was already passed in through the SendData input parameters; that dataBuf was allocated by the user via HIAI_DMalloc/HIAI_DVPP_DMalloc, and at allocation time the system DMA-maps it so that a copy of the local data information is kept in the peer's memory, again maintaining the local-peer mapping.

Matrix assembles a message, consisting mainly of the address and size of ctrlBuf and of dataBuf, and sends it to the peer.

When the peer receives the message, Matrix calls the user-defined deserialization function to parse the control information and data information, and delivers the parsed data to the corresponding receiving Engine for processing.

Once the peer has parsed the data, the control information is no longer needed, so the memory holding it (ctrlBuf) can be released. But because that memory was allocated on the local end, the peer must send the local end a message asking it to release ctrlBuf.

After receiving the message, the local end releases ctrlBuf.

Once the receiving Engine has finished processing the data, dataBuf can be released. However, since Matrix cannot know when the user is done with it, the user's deserialization function is required to wrap the returned dataBuf in a smart pointer bound to the destructor hiai::Graph::ReleaseDataBuffer. When the smart pointer's lifecycle ends, the destructor is invoked automatically and sends a dataBuf release message to the local end (see the sketch after these steps).

After receiving the message, the local end releases the dataBuf.
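The deserialization counterpart (completing the registration sketch above) shows the smart-pointer binding just described: dataBuf is wrapped with hiai::Graph::ReleaseDataBuffer as its deleter, so when the receiving Engine drops the last reference, the release message for dataBuf is sent back to the local end automatically. Signatures remain assumptions based on the DDK samples.

```cpp
#include <cstring>
#include <memory>
#include "hiaiengine/graph.h"  // hiai::Graph::ReleaseDataBuffer (per DDK)

std::shared_ptr<void> RawFrameDeserialize(const char* ctrl, const uint32_t& ctrlLen,
                                          const uint8_t* dataBuf, const uint32_t& dataLen) {
    auto msg = std::make_shared<RawFrameT>();
    std::memcpy(msg.get(), ctrl, 3 * sizeof(uint32_t));  // restore control info
    msg->len = dataLen;
    // Key step: bind dataBuf to ReleaseDataBuffer so that destroying the
    // smart pointer triggers the dataBuf release message to the local end.
    msg->data.reset(const_cast<uint8_t*>(dataBuf),
                    hiai::Graph::ReleaseDataBuffer);
    return std::static_pointer_cast<void>(msg);
}
```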

Above, we have walked through how to build a business engine with the Matrix interface. Users can also integrate Ascend Serving to expose standard RESTful API or gRPC requests, providing decoupled, standard inference interfaces, or pair the engine with the AI container service for elastic scaling, easy deployment, and hot model replacement.

Click "Follow" to keep up with the latest Huawei Cloud technologies!