Why split databases and tables?
Posted May 28, 2020 • 9 min read
Why split a database into multiple databases and tables (that is, how do you design the database layer of a high-concurrency system)? Which middleware have you used for splitting databases and tables? What are the advantages and disadvantages of each? How do you split a database vertically or horizontally?
This topic is closely tied to high concurrency, because databases and tables are split precisely to support high concurrency and large data volumes. Frankly, interviews at Internet companies almost always get here. Splitting databases and tables is such a common technical topic that interviewers are bound to ask about it, and not knowing it is hard to excuse.
Analysis of interview questions
Why split databases and tables? (How do you design the database layer of a high-concurrency system?)
To put it bluntly, splitting databases and splitting tables are two different things. Don't confuse them. You may split databases without splitting tables, or split tables without splitting databases.
Let me start with a scenario.
Suppose we are a small startup (or a newly formed department at one of the BAT companies) with 200,000 registered users, 10,000 daily active users, 1,000 new rows per day in a single table, and at most 10 concurrent requests per second at peak. For a system like this, you just need someone with a few years of experience leading a few fresh trainees, and you can build it however you like.
Then our luck turns out better than expected. We get a CEO who puts us on the road to success, and the business grows fast. A few months later there are 20 million registered users, 1 million daily active users, 100,000 new rows per day in a single table, and 1,000 requests per second at peak. The company also closes two rounds of financing worth hundreds of millions of yuan, and its valuation reaches hundreds of millions of dollars. A little unicorn in the making!
Now everyone feels some pressure. Why? With 100,000 new rows a day, that is 3 million more rows a month. A single table already holds millions of rows and will soon pass 10 million, though it is barely holding on. Peak load is now 1,000 requests per second; we have deployed several machines behind a load balancer, and the database can still handle 1,000 QPS. But everyone is starting to worry about what comes next...
Over the next few months the CEO keeps delivering: the company reaches 100 million users and raises billions of RMB more. Its valuation hits billions of dollars, and it becomes the hottest startup in China this year. We are so lucky.
But we are also unlucky, because there are now tens of millions of daily active users and up to 500,000 new rows per day in a single table. That table already holds 20 to 30 million rows, and it can't cope. Database disk space keeps shrinking, and peak concurrency reaches 5,000 to 8,000 requests per second. No kidding: I guarantee your system would not have survived this long. It would already be down.
By now you can probably see why databases and tables get split. It follows from business growth: the better the business does, the more users, the more data, and the more requests you have, until a single database can no longer carry the load.
For example, can a single table with tens of millions of rows hold up? Absolutely not. Too much data in one table seriously hurts SQL performance, and your queries may eventually run very slowly. In my experience, performance is already relatively poor once a single table reaches several million rows, and that is when you should split the table.
What does splitting tables mean? You spread the data of one table across several tables, and a query then touches only one of them. For example, split by user id so that all of a user's data lands in one table; every operation for that user then goes to that table. This keeps each table's data volume within a manageable limit, for example under 2 million rows per table.
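As a rough illustration of splitting a table by user id, the sketch below maps a user to one physical table with a simple modulo. The function name, the `user_order` base name, and the table count of 16 are all hypothetical; modulo is only one possible rule.

```python
def table_for_user(user_id: int, table_count: int = 16,
                   base_name: str = "user_order") -> str:
    """Return the physical table that holds every row for this user.

    Assumption: a plain modulo on the numeric user id decides the table.
    """
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return f"{base_name}_{user_id % table_count}"


# All reads and writes for one user go to the same table.
print(table_for_user(35))   # user_order_3
print(table_for_user(16))   # user_order_0
```

Because the same user id always yields the same table name, a query for one user only ever needs to touch one table.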
What does splitting databases mean? In our experience, a single database can handle at most about 2,000 concurrent requests before it must be scaled out, and a healthy level for a single database is around 1,000 per second. So you split the data of one database across several databases, and each access goes to just one of them.
That is database and table splitting, and now you know why it is done.
| # | Before splitting databases and tables | After splitting databases and tables |
|---|---|---|
| Concurrency support | Single-machine MySQL deployment, cannot handle high concurrency | MySQL goes from a single machine to multiple machines, and the concurrency it can handle increases many times over |
| Disk usage | Disk on the single MySQL machine is almost full | Split across multiple databases, so disk usage on each database server drops sharply |
| SQL execution performance | Too much data in a single table, so SQL runs slower and slower | Less data per table, so SQL executes significantly faster |
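To make the sizing concrete, here is a back-of-the-envelope sketch that applies the two rules of thumb from this article (about 1,000 requests per second per database, about 2 million rows per table). The function name and its parameters are made up for illustration, and real capacity planning would also leave headroom for growth.

```python
def estimate_shards(peak_qps: int, total_rows: int,
                    qps_per_db: int = 1000,
                    rows_per_table: int = 2_000_000) -> tuple[int, int]:
    """Return (database_count, tables_per_database) for the given load.

    Assumption: load and data spread evenly across databases and tables.
    """
    def ceil_div(a: int, b: int) -> int:
        return (a + b - 1) // b

    db_count = max(1, ceil_div(peak_qps, qps_per_db))
    table_count = max(1, ceil_div(total_rows, rows_per_table))
    tables_per_db = ceil_div(table_count, db_count)
    return db_count, tables_per_db


# The scenario above: 8,000 requests per second at peak, 30 million rows.
print(estimate_shards(8000, 30_000_000))   # (8, 2)
```

With these defaults, the early-stage startup from the scenario (10 requests per second, a few thousand rows) comes out at one database with one table, which matches the point that splitting only becomes necessary as the business grows.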
Which middleware have you used for splitting databases and tables? What are the advantages and disadvantages of each?
This question checks which database and table splitting middleware you know, what the strengths and weaknesses of each are, and which ones you have actually used.
The more common ones include:
Cobar
Developed and open-sourced by Alibaba's B2B team, Cobar is a proxy-layer solution that sits between the application servers and the database servers. The application accesses the Cobar cluster through the JDBC driver; Cobar breaks the SQL down according to the SQL and the sharding rules, then dispatches it to different database instances in the MySQL cluster for execution. It was usable in its early years, but it has not been updated recently and is basically abandoned. It also does not support read-write splitting, stored procedures, cross-database joins, or pagination.
TDDL
Developed by the Taobao team, TDDL is a client-layer solution. It supports basic CRUD syntax and read-write splitting, but not joins, multi-table queries, and similar syntax. It is not widely used today, partly because it depends on Taobao's diamond configuration management system.
Atlas
Open-sourced by 360, Atlas is a proxy-layer solution. Some companies used it in the past, but a major problem is that the community last maintained it 5 years ago, so few companies use it today.
Sharding-jdbc
Open-sourced by Dangdang, Sharding-jdbc is a client-layer solution. It is the client-layer solution of ShardingSphere, which also provides a proxy-layer solution, Sharding-Proxy. It has indeed been widely used: it supports a broad range of SQL syntax without many restrictions, and as of April 2019 it had reached version 4.0.0-RC1, supporting database and table splitting, read-write splitting, distributed id generation, and flexible transactions (best-effort delivery transactions and TCC transactions). Many companies have adopted it (companies register on the official website, which shows many adopters since 2017), and the community is still actively developing and maintaining it. I personally consider it a viable option today.
Mycat
Built on top of Cobar, Mycat is a proxy-layer solution. Its feature set is very complete, and it is currently a very popular database middleware. The community is very active, and a number of companies have started using it. It is, however, younger and less battle-tested than Sharding-jdbc.
Summary
In summary, the two worth considering are Sharding-jdbc and Mycat.
As a client-layer solution, Sharding-jdbc has these advantages: nothing to deploy, low operations cost, no extra forwarding hop through a proxy layer, and high performance. The drawback is that an upgrade means upgrading every system and releasing each one again, because every system is coupled to the Sharding-jdbc dependency.
As a proxy-layer solution, Mycat has the drawback that it must be deployed and you must operate a set of middleware yourself, so the operations cost is high. The advantage is that it is transparent to every project: an upgrade only involves the middleware itself.
Both solutions are usable, but I personally recommend Sharding-jdbc for small and medium-sized companies: a client-layer solution is lightweight, cheap to maintain, and needs no dedicated staff, and such companies have simpler systems and fewer projects. Medium and large companies are better off with a proxy-layer solution such as Mycat: they tend to have many systems and projects, large teams, and enough staff, so it pays to assign people to study and maintain Mycat, which all those projects can then use directly and transparently.
How exactly do you split the database vertically or horizontally?
Horizontal splitting means spreading the data of one table across multiple tables in multiple databases. The table structure is the same in every database, but each database's table holds different data, and the data of all of them together makes up the full data set. The point of horizontal splitting is to spread data evenly across more databases, so that multiple databases carry higher concurrency and provide more storage capacity.
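A small sketch can show the two properties of horizontal splitting described here: every shard has the same schema, and the shards together hold the full data set. It uses in-memory SQLite databases from Python's standard library as stand-ins for separate MySQL instances; the `orders` schema and the modulo rule on `user_id` are assumptions made for the example.

```python
import sqlite3

# Identical schema on every shard (hypothetical columns).
SCHEMA = ("CREATE TABLE orders "
          "(order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")


def build_shards(n: int) -> list[sqlite3.Connection]:
    """Create n independent in-memory databases with the same table."""
    shards = [sqlite3.connect(":memory:") for _ in range(n)]
    for db in shards:
        db.execute(SCHEMA)
    return shards


def insert_order(shards, order_id: int, user_id: int, amount: float) -> None:
    """Write the row to exactly one shard, chosen by user_id."""
    db = shards[user_id % len(shards)]
    db.execute("INSERT INTO orders VALUES (?, ?, ?)",
               (order_id, user_id, amount))


def rows_in(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


shards = build_shards(2)
for i in range(10):
    insert_order(shards, order_id=i, user_id=i, amount=9.9)

print([rows_in(db) for db in shards])      # [5, 5]
print(sum(rows_in(db) for db in shards))   # 10
```

Each shard stores only part of the rows, and no row appears twice, so adding up all shards recovers the complete table.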
Vertical splitting means splitting a table with many fields into multiple tables, possibly in multiple databases. Each resulting table has a different structure and contains a subset of the fields. Typically, rarely accessed fields go into one table and frequently accessed fields into another. Because the database caches rows, the fewer fields your frequently accessed rows have, the more rows fit in the cache and the better the performance. This is usually done at the table level.
This is actually quite common, and many of you have probably done it yourselves: breaking one large table into, say, an order table, an order payment table, and an order item table.
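The order example can be sketched with Python's built-in SQLite module. The table and column names below are invented for illustration; the point is only that one wide order record becomes three narrower tables linked by `order_id`, and that a frequent lookup reads just the small core table.

```python
import sqlite3


def create_split_schema(db: sqlite3.Connection) -> None:
    """Three narrow tables instead of one wide order table (hypothetical columns)."""
    db.executescript("""
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
        CREATE TABLE order_payment (order_id INTEGER PRIMARY KEY, pay_channel TEXT, paid_amount REAL);
        CREATE TABLE order_item (order_id INTEGER, sku TEXT, quantity INTEGER);
    """)


def save_order(db, order_id, user_id, status, pay_channel, paid_amount, items):
    """Write one logical order across the three tables in a single transaction."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO orders VALUES (?, ?, ?)",
                   (order_id, user_id, status))
        db.execute("INSERT INTO order_payment VALUES (?, ?, ?)",
                   (order_id, pay_channel, paid_amount))
        db.executemany("INSERT INTO order_item VALUES (?, ?, ?)",
                       [(order_id, sku, qty) for sku, qty in items])


def order_status(db, order_id):
    """Hot-path query: touches only the narrow core table."""
    row = db.execute("SELECT status FROM orders WHERE order_id = ?",
                     (order_id,)).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
create_split_schema(db)
save_order(db, 1, 42, "PAID", "card", 19.9, [("sku-a", 2), ("sku-b", 1)])
print(order_status(db, 1))   # PAID
```

Payment and item details are still available through a join on `order_id` when needed, but the status lookup never has to read them.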
There is also table-level splitting, which turns one table into N tables to keep each table's data volume within a certain range and preserve SQL performance. Otherwise, the more data a single table holds, the worse SQL performs. About 2 million rows per table is a common guideline, but it depends on your workload; it might be 5 million or 1 million. The more complex your SQL, the fewer rows each table should hold.
Whether you split databases or tables, the middleware mentioned above supports it. Once the databases and tables are split, the middleware takes a field value you specify, such as userid, and automatically routes each request to the corresponding database and then to the corresponding table.
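The two-step routing (database first, then table) might look roughly like the sketch below. This is a hand-rolled illustration, not the configuration or API of Sharding-jdbc or Mycat; the `order_db_` and `orders_` names and the counts of 4 databases with 8 tables each are assumptions.

```python
def route(user_id: int, db_count: int = 4,
          tables_per_db: int = 8) -> tuple[str, str]:
    """Map a user id to (database name, table name).

    Assumption: one global slot number is derived from the id, then
    split into a database index and a table index.
    """
    slot = user_id % (db_count * tables_per_db)
    db_index = slot // tables_per_db
    table_index = slot % tables_per_db
    return f"order_db_{db_index}", f"orders_{table_index}"


print(route(35))   # ('order_db_0', 'orders_3')
print(route(31))   # ('order_db_3', 'orders_7')
```

Deriving both indexes from a single slot number is a deliberate choice. Taking `user_id % db_count` and `user_id % tables_per_db` separately would leave tables unused when the two counts share a factor: with 4 databases and 8 tables, ids that land in database 0 could only ever reach tables 0 and 4.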
You have to work out how to split databases and tables in your own project. In general, do vertical splitting at the table level, breaking up tables that have many fields. Do horizontal splitting when a database cannot handle the concurrency or cannot hold the data volume, and decide carefully which field to split by. Split tables when, even after data is spread across databases and concurrency and capacity are fine, the tables in each database are still too large; splitting them keeps each table's data volume manageable.
And there are two ways to split databases and tables:
- One is by range: each database holds a contiguous slice of data, typically split by time range. This is used less often, because it easily creates hot spots, with most traffic hitting the latest data.
- The other is to distribute data evenly by the hash of a chosen field, which is more common.
The advantage of range splitting is that scaling out is simple: you just prepare one database per month, and when a new month arrives, writes naturally go to the new database. The disadvantage is that most requests access the latest data. Whether to use range splitting in production depends on the scenario.
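A one-database-per-month range rule can be sketched in a few lines. The `order_db` prefix and the naming pattern are hypothetical; the sketch only shows that the target database follows directly from the record's date.

```python
from datetime import date


def db_for_date(created: date, prefix: str = "order_db") -> str:
    """Pick the monthly database for a record, e.g. order_db_202005."""
    return f"{prefix}_{created:%Y%m}"


print(db_for_date(date(2020, 5, 28)))   # order_db_202005
print(db_for_date(date(2020, 6, 1)))    # order_db_202006
```

Nothing has to move when a new month starts, which is why scaling out is simple, but every record created today maps to the same database, which is where the hot spot comes from.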
The advantage of hash distribution is that data volume and request load are spread evenly across databases. The disadvantage is that scaling out is more troublesome, because it involves data migration: hash values of existing data must be recomputed and the data redistributed across the databases or tables.
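To get a feel for the migration cost, the sketch below counts how many keys change shard when the shard count changes. It assumes integer keys routed by a plain modulo, which is only one way to hash; the function name is made up for the example.

```python
from typing import Sequence


def moved_fraction(keys: Sequence[int], old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard index changes after resharding.

    Assumption: shard index = key % shard_count.
    """
    moved = sum(1 for k in keys if k % old_count != k % new_count)
    return moved / len(keys)


keys = range(1000)
print(moved_fraction(keys, 4, 8))   # 0.5
print(moved_fraction(keys, 4, 5))   # 0.8
```

Under this modulo rule, doubling the shard count moves half of the keys, while growing from 4 to 5 shards moves most of them, so the choice of new shard count affects how much data has to be migrated.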