From the heart: you write code every day, but is the direction right?
Posted May 27, 2020 • 7 min read
"Everyone's time is limited, so choosing a technology worth investing that limited time in becomes especially important." I have been working since 2008 — 12 years now — and I have worked with data all the way: on the cores of many underlying big data frameworks (Hadoop, Pig, Hive, Tez, Spark), on upper-layer computing frameworks (Livy, Zeppelin), and for many years on data application development, including data processing, data analysis, and machine learning. I am now an Apache Member and a PMC member of multiple Apache projects. In 2018 I joined Alibaba's real-time computing team to focus on Flink R&D.

Today I want to draw on my professional experience to talk about how to assess whether a technology is worth learning. I have stayed in the big data circle throughout: from the original Hadoop, to its ecosystem projects Pig, Hive, and Tez, to the new-generation compute engine Spark, and most recently to Flink — big data compute engines run through my entire career. Personally, I was lucky: at every stage I happened to be working on the hot technology of the moment, chosen mostly on interest and intuition. Looking back, I think whether a technology is worth learning can be evaluated along three dimensions:

1. Technical depth
2. Ecological breadth
3. Ability to evolve
01 Technical depth

Technical depth refers to whether a technology's foundation is solid — whether its moat is wide and deep enough that it is not easily replaced by other technologies. In plain terms: does this technology solve an important problem that other technologies cannot? Two points matter here:

1. No one else could solve the problem; this technology solved it first.
2. Solving the problem brings significant value.

Take Hadoop, which I studied at the start of my career. Hadoop was revolutionary when it first came out: apart from Google claiming it had GFS and MapReduce internally, no company in the industry had a complete solution for massive data. As the Internet developed, data volumes grew by the day, and the ability to process massive amounts of data became an urgent need — a need that Hadoop's birth met exactly. As the technology matured, people came to take Hadoop's ability to process massive data for granted, while its defects drew constant criticism (poor performance, cumbersome MapReduce programming, and so on). Spark emerged at that moment and cured the chronic illness of the Hadoop MapReduce engine: it surpassed Hadoop's compute performance, and its elegant, concise API catered to what users wanted at the time, making it hugely popular with big data engineers.

Now I work on Flink R&D at Alibaba, mainly because I saw the industry's demand for real-time processing and Flink's dominance in real-time computing. The biggest challenge big data faced before was the sheer scale of the data (hence the name "big data"); after years of industry effort and practice, the scale problem has largely been solved.
In the next few years, the greater challenge is speed — that is, real time. Real time in big data does not mean merely transmitting or processing data in real time; it means real time end to end. If any single step slows down, the real-time behavior of the entire system suffers. In Flink's view, everything is a stream. Flink's stream-centric architecture is unique in the industry, and the superior performance, high scalability, and end-to-end exactly-once semantics that follow from it make Flink the undisputed king of stream computing. There are currently three mainstream stream computing engines: Flink, Storm, and Spark Streaming.
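Flink's end-to-end exactly-once guarantee rests on coordinated checkpointing: operator state and the input position are snapshotted together, so replay after a failure never double-counts events. The following is a toy Python sketch of that idea — not Flink code; the checkpoint interval, counter, and crash point are all made up for illustration:

```python
# Toy sketch of exactly-once via checkpointing (NOT real Flink code).
# State (the count) and the read position (offset) are saved atomically,
# so replaying from the last checkpoint never double-counts events.

class ExactlyOnceCounter:
    def __init__(self):
        self.count = 0
        self.checkpoint = (0, 0)  # (next offset to read, count at that point)

    def process(self, events, start_offset, fail_at=None):
        offset = start_offset
        for e in events[start_offset:]:
            if fail_at is not None and offset == fail_at:
                raise RuntimeError("simulated crash")
            self.count += 1
            offset += 1
            if offset % 2 == 0:            # periodic checkpoint barrier
                self.checkpoint = (offset, self.count)

    def recover(self):
        offset, count = self.checkpoint
        self.count = count                 # roll state back to the snapshot
        return offset                      # resume reading from here

events = list(range(5))
c = ExactlyOnceCounter()
try:
    c.process(events, 0, fail_at=3)        # crash mid-stream
except RuntimeError:
    resume = c.recover()
    c.process(events, resume)              # replay from the checkpoint
assert c.count == len(events)              # each event counted exactly once
```

Without the rollback in `recover()`, the replay would count events 2 and 3 twice — that is the difference between at-least-once and exactly-once.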
Note: for Spark Streaming only the search term could be selected, so strictly speaking the comparison is not rigorous; but since we care about the shape of the curve, the actual impact should be small. As the Google Trends curves above show, Flink is in a period of rapid growth, Storm's popularity is declining year by year, and Spark Streaming has nearly plateaued. This demonstrates Flink's deep foundation in stream computing: at present no one can challenge its dominant position in the field.

02 Ecological breadth

Technical depth alone is not enough, because a single technology can only focus on doing one thing well. To solve the complex problems of the real world, it must integrate with other technologies, and that requires a sufficiently broad ecosystem. Ecological breadth can be measured along two dimensions:

1. Upstream and downstream ecology: the data upstream and downstream, seen from the perspective of data flow.
2. Vertical-domain ecology: integration with a specific segment or application scenario.
When Hadoop first came out, it had only two basic components, HDFS and MapReduce, which solved massive storage and distributed computing respectively. As development continued, the problems to be solved grew more complex, and HDFS and MapReduce alone could no longer handle them easily. Other Hadoop ecosystem projects then emerged — Pig, Hive, HBase, and so on — each solving, from its own angle, a problem Hadoop found hard or impossible to solve. The same is true of Spark. At first, Spark set out to replace the original MapReduce engine; later it developed various language interfaces and upper-layer frameworks such as Spark SQL, Spark Structured Streaming, MLlib, and GraphX, which greatly enriched Spark's usage scenarios and expanded its vertical-domain ecology. Spark's support for a wide range of data sources also allied the compute engine with storage, establishing a strong upstream-downstream ecosystem and laying the foundation for end-to-end solutions.

The ecology of the Flink project I work on is still in its infancy. When I joined Alibaba, I was drawn not only by Flink's dominance as a stream computing engine but also by the opportunity in its ecosystem. If you look at my career, you will notice a shift: I started out focused on the core framework layer of big data and have slowly moved toward the surrounding ecosystem projects. A main reason is my judgment on the whole big data industry: the first half of the battle, fought over the underlying frameworks, is now nearing its end. There will not be many new technologies and frameworks at the bottom layer; each sub-sector will see survival of the fittest, becoming more mature and more consolidated.
The focus of the second half of the battle moves from the bottom to the top — to the ecosystem. Past big data innovation leaned toward IaaS and PaaS; in the future you will see more SaaS-style big data products and innovations.
Whenever I talk about the big data ecosystem, I bring out the picture above. It basically covers the big data scenarios you deal with every day: from the data producers on the far left, through data collection and data processing, to data applications (BI + AI). You will find that Flink can be applied at every step — not only in big data but also in AI. But Flink's strength lies in stream processing; its ecology in the other fields is still in its infancy. My own job is to improve Flink's end-to-end capabilities across this picture.

03 Ability to evolve

If a technology's depth and ecological breadth both check out, that shows it is at least worth learning right now. But investing in a technology must also be weighed over time: you certainly don't want what you have learned to be obsolete soon, forcing you to learn a new technology every year. A technology worth investing in must therefore have a lasting ability to evolve. I first learned Hadoop more than ten years ago, and it is still widely used. Although many public cloud vendors are now eating into the Hadoop market, you have to admit that when a company sets up a big data department, the first thing it does is probably build a Hadoop cluster. And when we say "Hadoop" today, it is no longer the original Hadoop but a general term for the whole Hadoop ecosystem. You can read this article by Cloudera CPO Arun Murthy, with which I agree: https://medium.com/@acmurthy/hadoop-is-dead-long-live-hadoop-f22069b264ac

The Spark project goes without saying. Spark exploded in 2014 and 2015 and has now entered a plateau, but it is still evolving and still embracing change: Spark on K8s is the best proof that Spark embraces cloud native.
Now Delta, the hottest project in the Spark community, and MLflow are further proof of Spark's powerful ability to evolve. Today's Spark is no longer just the Spark that replaced MapReduce, but a general-purpose compute engine suitable for many scenarios. I joined Alibaba in 2018, about a year and a half ago, and in that time I have witnessed Flink's ability to evolve firsthand. First, over several major releases Flink has absorbed most of Blink's functionality, greatly improving the capabilities of Flink SQL. Second, Flink's support for K8s, Python, and AI keeps demonstrating its own powerful ability to evolve.

Tips

Beyond the three dimensions above, here are a few tips I use when evaluating a new technology.

1. Use Google Trends. Google Trends reflects a technology's momentum well. The trend chart above, comparing the three major stream computing engines Flink, Spark Streaming, and Storm, makes it easy to conclude that Flink is the king of stream computing.
2. Check the awesome list on GitHub. One popularity indicator for a technology is its awesome list on GitHub: look at the list's star count, and take a weekend to read through its contents — it is basically the distilled essence of the technology, and from it you can roughly judge the technology's value.
3. See whether technical evangelists on technology sites endorse the technology (I personally read medium.com often). Every technology circle has a group of people who are persistent and discerning about technology; if a technology is really good, evangelists will endorse it for free and share how they use it.
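For tip 2, the star count of a repository can be read programmatically from GitHub's public REST API (`https://api.github.com/repos/<owner>/<repo>`, field `stargazers_count`). A small stdlib-only sketch — the helper names and the sample payload are my own, made up for illustration:

```python
# Look up the star count of an awesome list via GitHub's /repos API.
# extract_stars() is pure parsing; fetch_stars() does the live call.
import json
from urllib.request import urlopen

def extract_stars(payload: str) -> int:
    """Parse a GitHub /repos API JSON response and return the star count."""
    return json.loads(payload)["stargazers_count"]

def fetch_stars(owner: str, repo: str) -> int:
    """Live lookup (needs network), e.g. fetch_stars("sindresorhus", "awesome")."""
    with urlopen(f"https://api.github.com/repos/{owner}/{repo}") as resp:
        return extract_stars(resp.read().decode())

# Offline example with a trimmed sample payload:
sample = '{"full_name": "sindresorhus/awesome", "stargazers_count": 12345}'
print(extract_stars(sample))  # → 12345
```

Comparing the star counts of two technologies' awesome lists this way gives a quick, rough popularity signal to go alongside the Google Trends curve.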
04 Summary

Everyone's time is limited, and choosing a technology worth investing that limited time in becomes especially important. The above are my thoughts on how to assess whether a technology is worth learning — also a small review of the technology choices in my own career. I hope these thoughts can be of some help to everyone's career.

About the author: Zhang Jianfeng (Jian Feng), an open-source veteran, GitHub ID @zjffdu, Apache Member; formerly of Hortonworks, now a senior technical expert in Alibaba's Computing Platform Division; PMC member of the Apache Tez, Livy, and Zeppelin open-source projects, and a Committer of Apache Pig. I was fortunate to get involved with big data and open source very early, and I hope to contribute to big data and data science in the open-source field.