"Demystifying GP" Greenplum Database enters the field of deep learning
Posted Jun 5, 2020
Deep learning has become an increasingly important part of enterprise computing because artificial neural networks are very effective in areas such as natural language processing, image recognition, fraud detection, and recommendation systems. Over the past five to ten years, computing power has grown greatly, and the availability of massive amounts of data has spurred interest in solving problems with deep learning algorithms.
On the other hand, most enterprise business systems are built on SQL infrastructure, with heavy investment in software and employee training. The main innovation in deep learning, however, takes place outside the SQL world, so companies that adopt deep learning algorithms need a separate deep learning infrastructure. Building a deep learning system outside the traditional SQL architecture brings additional cost and workload, as well as the risk of creating new data silos. Moving large data sets between systems is also inefficient. If companies can run popular deep learning frameworks (such as Keras and TensorFlow) inside an MPP relational database, they can leverage their existing investment in SQL, which makes deep learning easier and more approachable.
Another consideration is that many data science problems today call for multiple models. Data scientists often spend a lot of time on data analysis and feature engineering and try several methods on the same problem, so the final result is usually a combination of multiple models. In that case, running all computations on the same engine is more efficient than computing in separate systems and then combining the results. To this end, a set of machine learning and analytics functions is built into the database so that these computations can run in-database, which reduces or even eliminates data movement between computing environments and greatly improves efficiency.
Using GPU-accelerated deep learning algorithms on Greenplum
Figure 1 shows the architecture of Greenplum with GPUs. Standard deep learning libraries such as Keras and TensorFlow are deployed on Greenplum's segment nodes, where the GPUs are also installed. The segments on each node share that node's GPU computing resources.
Figure 1: Greenplum architecture for deep learning
The purpose of this design is to eliminate transfer delays over the interconnect between segments and GPUs. In this architecture, each segment only needs to process its local data to produce a result. Apache MADlib, the open source machine learning library integrated with Greenplum, is responsible for merging the models from each segment into the final model. This approach takes advantage of the horizontal scale-out capability of MPP.
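One way to picture the merge step is as weighted model averaging. The following is a minimal Python sketch, not MADlib's actual implementation: it assumes each segment returns a flat list of weights together with the number of rows it trained on, and that merging means a row-count-weighted average. The function name `merge_segment_models` and the sample numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def merge_segment_models(
    segment_results: Sequence[Tuple[List[float], int]]
) -> List[float]:
    """Return the row-count-weighted average of per-segment weight vectors.

    Each item is (weights, rows_trained_on). Weighted averaging is an
    illustrative ASSUMPTION about how per-segment models could be merged.
    """
    if not segment_results:
        raise ValueError("need at least one segment result")
    total_rows = sum(rows for _, rows in segment_results)
    if total_rows <= 0:
        raise ValueError("total row count must be positive")
    n_weights = len(segment_results[0][0])
    merged = [0.0] * n_weights
    for weights, rows in segment_results:
        if len(weights) != n_weights:
            raise ValueError("all segments must share one model shape")
        for i, w in enumerate(weights):
            # Segments that saw more rows contribute proportionally more.
            merged[i] += w * rows / total_rows
    return merged


# Hypothetical weights from two segments that saw 100 and 300 rows.
merged = merge_segment_models([([1.0, 2.0], 100), ([3.0, 6.0], 300)])
print(merged)
```

Weighting by row count keeps a segment with little local data from dominating the merged model. Other merge rules are possible, and the source does not say which one is used.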
Programming with MADlib is simple: you call Apache MADlib functions from SQL. In the following example, we use an algorithm provided by MADlib to train a model on the CIFAR-10 image dataset. The specific SQL is as follows:
When this SQL finishes, the trained model is stored in the table model_arch_library; the data in model_arch_library is the JSON representation of the model's convolutional neural network (CNN). A CNN is a special kind of neural network that is very good at image classification. The SQL above takes a useful parameter, GPUs per host, which specifies the number of GPUs used to train the model on each segment node. Setting it to 0 means training on the CPU instead of the GPU, so you can debug and do trial runs with shallow neural networks on smaller data sets. The trial run can even take place on PostgreSQL. Once the trial run passes, the work can move to a Greenplum cluster equipped with expensive GPUs to train a deep neural network on the entire data set.
Once training is complete, we can use the trained model for image classification. The specific SQL is as follows:
Performance and Scalability
Modern GPUs have high memory bandwidth and 200 times as many processing units per chip as CPUs, because they are optimized for data-parallel computations such as matrix operations, whereas CPUs are designed to be more versatile and to handle a wider range of tasks. The performance gain from training deep neural networks on GPUs is therefore well known. Figure 2 compares the performance of a simple deep CNN on a conventional CPU-only Greenplum cluster and on a GPU-accelerated Greenplum cluster. In this test, we used a small Greenplum cluster (with 4 segments) and measured training time on the CIFAR-10 dataset. The results are shown in the following figure:
Figure 2: Greenplum Database GPU and CPU training performance*
Using only CPUs, the Greenplum cluster needs more than 30 minutes of training to reach 75% accuracy on the test set, while the GPU-accelerated Greenplum cluster reaches 75% accuracy in less than 15 minutes. CIFAR-10 images are only 32×32 RGB, so the GPU speedup is smaller than it would be for higher-resolution images. For the Places dataset, with 256×256 RGB images, we found that training on GPUs is 6 times faster than training on CPUs only.
The key benefit of GPU-accelerated training is shorter training time, which means that data scientists can iterate on models faster and deploy newly trained models to production sooner. In fraud detection, for example, the immediate benefit of faster training and deployment is reduced financial loss.
Inference means using the trained model to classify new data. MPP databases like Greenplum are well suited to batch processing: throughput increases linearly with the size of the database cluster. For example, using the CNN model trained above, Table 1 shows the time required to perform batch inference on 50,000 new 32×32 RGB images.
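The timings quoted above can be turned into speedup factors with simple arithmetic. The short Python sketch below uses only the bounds stated in the text (more than 30 minutes on CPU, less than 15 minutes on GPU, and the 6x figure for Places); the helper name `speedup` is hypothetical.

```python
def speedup(cpu_minutes: float, gpu_minutes: float) -> float:
    """Return how many times faster the GPU run is than the CPU run."""
    if cpu_minutes <= 0 or gpu_minutes <= 0:
        raise ValueError("times must be positive")
    return cpu_minutes / gpu_minutes


# The text gives bounds (CPU > 30 min, GPU < 15 min), so this ratio is a
# lower bound on the CIFAR-10 speedup, not an exact measurement.
cifar10_lower_bound = speedup(30, 15)

# Speedup reported in the text for the 256x256 Places images.
places_speedup = 6.0

print(f"CIFAR-10 speedup: more than {cifar10_lower_bound:.1f}x")
print(f"Places speedup:   {places_speedup:.1f}x")
```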
Table 1: Batch classification test results on Greenplum database cluster
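To make the linear-scaling claim for batch inference concrete, here is a minimal Python sketch of an idealized throughput model. It is not derived from the measurements in Table 1: the per-segment rate of 500 images per second is a made-up assumption, and the function name `batch_inference_seconds` is hypothetical.

```python
def batch_inference_seconds(
    n_images: int, n_segments: int, images_per_sec_per_segment: float
) -> float:
    """Estimate batch inference time under ideal linear scaling."""
    if n_segments <= 0 or images_per_sec_per_segment <= 0:
        raise ValueError("segments and per-segment rate must be positive")
    # Linear scaling: total throughput is the per-segment rate times segments.
    return n_images / (n_segments * images_per_sec_per_segment)


# ASSUMPTION: 500 images/sec per segment is an illustrative rate,
# not a measurement from the article.
HYPOTHETICAL_RATE = 500.0

for segments in (4, 8, 16):
    seconds = batch_inference_seconds(50_000, segments, HYPOTHETICAL_RATE)
    print(f"{segments} segments: {seconds:.2f} s")
```

Under this model, doubling the number of segments halves the batch time. Real clusters may fall short of the ideal because of data skew or fixed startup costs.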
As part of the Apache MADlib project, the MADlib community plans to add new deep learning features gradually with each release. For example, a common data science workflow is hyperparameter selection and tuning. Tuning covers not only the model's parameters but also its structure, such as the number and composition of network layers. This generally involves training dozens to hundreds of model combinations in order to find the one with the best trade-off between accuracy and training cost. Under such a heavy training load, an MPP database like Greenplum can greatly improve training efficiency through parallel computing.
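The number of candidate models grows quickly once structure and training parameters are tuned together. The Python sketch below enumerates every combination in a small search space. The parameter names and values are hypothetical and are not taken from MADlib; the point is only to show how a handful of options multiplies into many training runs.

```python
from itertools import product
from typing import Dict, List

# ASSUMPTION: a hypothetical search space; names and values are illustrative.
search_space: Dict[str, list] = {
    "num_conv_layers": [2, 3, 4],
    "learning_rate": [0.01, 0.001],
    "batch_size": [64, 128, 256],
}


def enumerate_configs(space: Dict[str, list]) -> List[dict]:
    """Return one dict per combination of the values in the search space."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[k] for k in keys))
    ]


configs = enumerate_configs(search_space)
print(len(configs))  # 3 * 2 * 3 = 18 candidate models
print(configs[0])
```

Each configuration is an independent training job, which is why a workload like this lends itself to parallel execution across a cluster.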
Test environment used in this article
Basic platform: Google Cloud Platform
Greenplum version: Greenplum 5
Segment node configuration: 32 vCPUs, 150 GB memory
Segment node GPU configuration: NVIDIA Tesla P100 GPU
Le Cun, Denker, Henderson, Howard, Hubbard and Jackel, Handwritten digit recognition with a back-propagation network, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 1989, pp. 396-404.