To some, cloud-native may seem high-tech and trendy. Any discussion of it tends to bring along a host of attractive words: resilient, observable, robust, sustainable, and so on. In practice, however, achieving those qualities does not happen overnight.
Cloud-native transformation is an important initiative that affects the long-term development of an enterprise. Zuoyebang’s fundamental technical architecture has undergone a successful transformation, and the challenges faced along the way were far from simple. So, what can we learn from that experience?
We invited Dong Xiaocong, head of the architects at Zuoyebang, to share his thinking and exploration on the road to multi-cloud. We hope that developers and managers who are about to start a digital transformation, or are already in the middle of one, will find inspiration here.
Q: What made Zuoyebang choose cloud-native?
A: I joined Zuoyebang in 2019 and found two characteristics of the underlying technical architecture at that time.
The first is scale. There are thousands of application services online, many of which have a large number of service instances, running on hundreds of thousands of computing cores. The second is complexity. The overall technology stack of Zuoyebang is relatively diverse. Golang and PHP account for the largest share, and a large number of modules are written in C++, Python, Java, and other languages.
In addition, business features and team characteristics vary greatly. Traffic-style products tend to have conservative technology stacks, while the industrial Internet business is domain-driven and relies more on microservice architectures.
Zuoyebang also faces many challenges in terms of stability, efficiency, and cost.
When it comes to stability, traditional Internet companies seldom have direct contact with users, and our perception of users there was mostly UV and PV numbers. Online education is different. We meet students face-to-face through live streaming or in other forms, and every stability incident may affect their studies and cause irreparable losses. Therefore, Zuoyebang's requirements for stability can only be higher. What happens when a single machine, a single machine group, or a single cloud fails? Can we quickly stop the damage when code changes cause business interruptions?
Then there is the issue of efficiency. Because the offline and online deliverables are different (e.g., offline is a container, but online is a virtual machine), the environments on both sides are also heterogeneous, which leads to an exponential increase in the cycle and cost of R&D, operation, and testing.
Whenever there is network jitter or a service failure, we have to coordinate constantly with all parties, waiting for R&D, operations, and cloud providers to restore service, which gives users a very bad experience.
Cost is another major factor. IT spending has to be weighed together with business considerations and negotiations with multiple vendors.
In summary, based on stability, cost, efficiency, and other issues, Zuoyebang chose cloud-native and multi-cloud.
The overall benefits of the transformation are rather obvious. First of all, stability has improved: the overall impact of a machine failure has been shortened from the minute level to the second level, and the quality of delivery and deployment has improved substantially. There are also obvious gains on the cost side.
Q: Zuoyebang has accumulated a lot of patents in the process of cloud-native transformation. Can you briefly introduce them?
A: In recent years, Zuoyebang has accumulated some achievements in cloud-native, and we are very happy to share them and exchange ideas with the industry. Some of them include:
At the resource layer, Zuoyebang has connected the networks of the various clouds and developed a computing lifecycle platform focused on connectivity, high reliability, and awareness and control capability.
At the container level, we have developed a multi-cloud distribution platform.
For service governance, we have deployed a distributed log query engine that costs only 1/10 of ES and delivers higher overall query efficiency. It takes less than 5 seconds to query 1 TB of logs, which greatly improves R&D efficiency. In addition, our traffic control solution has reduced the P90 latency overhead to 0.8 ms, while open-source solutions are generally at around 3 ms.
At the application level, Zuoyebang has built its own multi-cloud system that can be switched freely between clouds. It is worth mentioning that the outbound call system has been built as a multi-active architecture.
Q: How do you see the development of cloud-native?
A: Cloud-native provides three key capabilities: containerization, service mesh, and multi-active. The ultimate purpose of all three is to release the potential that was previously confined to the cloud. The first, containerization, is the base. Only when it is achieved 100% can the capabilities of the upper layers be released.
The second is the service mesh. At present, Istio is the industry mainstream, and there are also in-house solutions from BAT (Baidu, Alibaba, Tencent). These are well accepted among long-tail enterprises, but some mechanism and performance problems remain. The industry has not yet reached a unified set of standards for the mesh. Just as Kubernetes emerged as the standard for containers, a mesh standard will also have to come out of collision, communication, and exploration within the industry.
Personally, I am optimistic about the Multi-Runtime idea behind Microsoft’s Dapr. It offloads more runtime capabilities to the sidecar, which essentially further decouples middleware from business code.
Third, for the upper layer of multi-cloud multi-active, an Application Multi-active whitepaper was released at Alibaba’s Cloud Native Summit. We can see that enterprises are demanding more and more from cloud-native, and its specifications and standards are becoming clearer and more explicit.
Q: Could you tell us about GPU containerization and multi-cloud migration?
A: The optimization of GPU scheduling originated from the fact that Zuoyebang uses a lot of resources for AI inference and image recognition, and GPUs are relatively expensive. By researching solutions and talking with cloud providers, we learned that the mainly recommended approach to GPU containerization brings a performance loss of at least 15%, which was unacceptable to us. However, we found that most GPU services use relatively fixed resources. Therefore, Zuoyebang built scheduling strategies based on computing power and video memory. Matching these services to resources is treated as the classic knapsack problem. The scheduler also makes predictions and reschedules at night. If failures occur in the process, we also execute transformation-related policies. Currently, our GPU services are 100% containerized.
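The knapsack idea mentioned above can be illustrated with a minimal sketch. This is not Zuoyebang's scheduler. The `Service` type, the capacity values, and the choice to maximize used compute are all assumptions made for illustration. Each service has a fixed compute share and video memory demand, one GPU is filled with a two-dimensional 0/1 knapsack, and the step is repeated greedily for the remaining services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    compute: int  # assumed unit: percent of one GPU's compute, must be > 0
    memory: int   # assumed unit: GiB of video memory


def pack_one_gpu(services, compute_cap=100, memory_cap=16):
    """2D 0/1 knapsack: pick the services that maximize used compute on one GPU."""
    # best[c][m] = (used compute, chosen names) within budgets c and m
    best = [[(0, ()) for _ in range(memory_cap + 1)]
            for _ in range(compute_cap + 1)]
    for s in services:
        # Iterate budgets downward so each service is placed at most once.
        for c in range(compute_cap, s.compute - 1, -1):
            for m in range(memory_cap, s.memory - 1, -1):
                prev_value, prev_names = best[c - s.compute][m - s.memory]
                candidate = prev_value + s.compute
                if candidate > best[c][m][0]:
                    best[c][m] = (candidate, prev_names + (s.name,))
    return best[compute_cap][memory_cap]


def schedule(services, compute_cap=100, memory_cap=16):
    """Greedy heuristic: fill one GPU at a time until every service is placed."""
    remaining = {s.name: s for s in services}
    gpus = []
    while remaining:
        _, names = pack_one_gpu(list(remaining.values()), compute_cap, memory_cap)
        if not names:
            raise ValueError("a remaining service does not fit on an empty GPU")
        gpus.append(names)
        for name in names:
            del remaining[name]
    return gpus


# Hypothetical workloads, for illustration only.
demo = [
    Service("ocr", 50, 6),
    Service("asr", 30, 8),
    Service("nlp", 40, 4),
    Service("cv", 20, 10),
]
plan = schedule(demo)
```

Filling GPUs one at a time is a heuristic and does not guarantee the minimum number of GPUs overall. A nightly rescheduling pass of the kind described in the interview could rerun such a routine with updated demand estimates.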
The migration to multi-cloud was difficult for Zuoyebang at that time. Because we were doing the containerization transformation simultaneously, layering the two efforts on top of each other was not easy. Our approach was to unify service registration, which essentially bridges the gap between containers and virtual machines. The migration between clouds is done in steps: the business to be migrated is decoupled at the service discovery layer, and then it can be moved in batches.
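As a rough illustration of the unified registration idea, the sketch below keeps virtual machine and container instances under one service name, so callers resolve by name only and traffic can be shifted between clouds in batches. It is a toy in-memory model under stated assumptions. The `Registry` class, its methods, the weight scheme, and the cloud names are hypothetical and do not describe Zuoyebang's actual system.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    address: str
    runtime: str       # "vm" or "container"; callers never need to care
    cloud: str         # hypothetical cloud label, e.g. "cloud-a"
    weight: int = 100  # relative share of traffic; 0 means drained


class Registry:
    """Toy in-memory registry shared by VM and container instances."""

    def __init__(self):
        self._services = {}

    def register(self, service, instance):
        self._services.setdefault(service, []).append(instance)

    def resolve(self, service, rng=random):
        """Pick one live instance by weight, regardless of runtime or cloud."""
        live = [i for i in self._services.get(service, []) if i.weight > 0]
        if not live:
            raise LookupError(f"no live instance for {service}")
        return rng.choices(live, weights=[i.weight for i in live], k=1)[0]

    def shift(self, service, target_cloud, percent):
        """Move `percent` of the traffic share to instances in target_cloud."""
        for i in self._services[service]:
            i.weight = percent if i.cloud == target_cloud else 100 - percent


registry = Registry()
registry.register("homework-api", Instance("10.0.0.1:8080", "vm", "cloud-a"))
registry.register("homework-api", Instance("10.1.0.1:8080", "container", "cloud-b"))

# Migrate in batches: 10%, then 50%, then all traffic to cloud-b.
for step in (10, 50, 100):
    registry.shift("homework-api", "cloud-b", step)

chosen = registry.resolve("homework-api")
```

Because callers depend only on the service name, the instances behind it can change runtime or cloud without any change on the caller side, which is what allows the migration to proceed one batch at a time.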
Q: What changes will Zuoyebang's cloud-native transformation bring to technical management?
A: The most obvious impact is on the way operations work. For operations roles, it is difficult for medium-sized companies to accept this new type of architecture. There is less manual work, and more attention goes to the capability of the infrastructure, which means the role is no longer limited to repetitive and mechanical work.
The change in technology is like the change from wagons to trains. If you can migrate to the new technology in time, I believe it can bring new growth.
For technology managers, here is a call to actively join this vast world of cloud-native transformation. Cloud-native itself represents openness, not a fight between open source and vendors. I hope everyone can participate and work together to make this industry even better. Today we all push cloud-native one step forward, and tomorrow we can get a huge return from the upgrade of this technology.
At the same time, enterprises in the process of cloud-native transformation should not blindly pursue mainstream technology solutions but must be informed by the actual business situation when making choices, so as to obtain practical benefits. Team managers should actively guide the team during the cloud-native transformation to maintain a positive mindset of embracing changes. There will also be a series of objective problems, such as incomplete facilities, all of which need to be given a certain amount of time to solve.
Q: What progress has Zuoyebang made in open source?
A: Zuoyebang has always been active in giving back to the open-source community, for example with our open-source logging solution. As for the next step, open-sourcing the overall project, we hope to make the project a little better and more universal first, and then open-source it. We look forward to exchanging ideas with friends in the industry on those open-sourced projects.
Containerization, service mesh, and multi-active architecture are arguably the three most important features of cloud-native development to date, which are the result of countless cloud developers working together.
As Mr. Dong said, the world of cloud-native is vast. Only more developers and enterprises participating together can help cloud-native bear fruit and change the digital intelligence world that we are living in.
Dong Xiaocong is an MVP of Alibaba Cloud and a TVP of Tencent Cloud. He has been in charge of the infrastructure of Zuoyebang since he joined the company in 2019, leading the work of architecture R&D, operation, DBA, and other security-related sectors. His previous working experience includes architecture and technical management at Baidu and Didi. Dong excels at the construction and iteration of mid-platforms for business, technology, and R&D.
Founded in 2015, Zuoyebang aims to promote inclusive education by means of technology with two main business sectors. The first one is Zuoyebang App, which is a typical traffic-style Internet product. The second is Zuoyebang Air Class, which covers nearly all the education streaming domains, such as researching, teaching, educational administration, and personal tutoring.