DataX Web 2.1.2 released

Posted Jun 16, 2020 · 10 min read

DataX-Web

When Alibaba open-sourced DataX, it did not ship any visual interface. In day-to-day use, the JSON configuration file has to be placed under the DataX job path; as the business grows, these files become hard to manage and migrate, and a command has to be recorded for every execution.
At present DataX only runs stand-alone and cannot coordinate work across multiple nodes. The goal of this project is a data synchronization tool with a friendly visual interface, scheduled tasks, and distributed execution.
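Without the web UI, every DataX run means invoking the stand-alone launcher by hand with a job file, for example (paths assume a default DataX unpacking under `datax/`):

```shell
python datax/bin/datax.py datax/job/mysql2mysql.json
```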

v-2.1.2

New

  1. Add a project management module to manage tasks by category;
  2. Add batch task creation for RDBMS data sources: select a data source and tables, and DataX synchronization tasks are generated in batches from a template;
  3. JSON construction adds ClickHouse data source support;
  4. Graphical monitoring page for executor CPU, memory, and load;
  5. RDBMS incremental extraction adds an auto-increment primary key mode and optimized page parameter configuration;
  6. Change the connection method of the MongoDB data source and refactor the JSON building block of the HBase data source;
  7. Script-type tasks add a stop function;
  8. RDBMS JSON construction adds postSql and supports multiple preSql and postSql statements;
  9. Merge the datax-registry module into datax-rpc;
  10. Change the data source information encryption algorithm and optimize the code;
  11. Time-based incremental synchronization supports more time formats;
  12. The log page adds DataX execution result statistics;
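Item 8 refers to the preSql/postSql hooks of DataX writer plugins. As a hedged illustration (the table names and SQL statements are made up), a mysqlwriter parameter block with multiple preSql and postSql statements might look like:

```json
"writer": {
  "name": "mysqlwriter",
  "parameter": {
    "username": "root",
    "password": "******",
    "preSql": [
      "truncate table ods_user_tmp"
    ],
    "postSql": [
      "insert into ods_user select * from ods_user_tmp",
      "drop table ods_user_tmp"
    ],
    "column": ["id", "name"],
    "connection": [{
      "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/ods",
      "table": ["ods_user_tmp"]
    }]
  }
}
```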

Upgrade:

  1. JSON construction for PostgreSQL, SQL Server, and Oracle data sources adds schema name selection;
  2. Optimize handling of field names in DataX JSON that coincide with data source keywords;
  3. Optimize button display on the task management page;
  4. Add task description information to the log management page;
  5. Fix the problem that the front-end JSON form could not cache data;
  6. Hive JSON construction adds head and tail options parameters;

Remarks:

Upgrading from version 2.1.1 is not recommended: the data source encryption method has changed, so previously encrypted data sources can no longer be decrypted and their tasks will fail to run.
If you do need to upgrade, please recreate your data sources and tasks.

System Requirements

  • Language: Java 8 (JDK 1.8.0_201 or later recommended)

    Python 2.7 (to support Python 3, replace the three Python files under datax/bin; the replacement files are under doc/datax-web/datax-python3)

  • Environment: macOS, Windows, Linux

  • Database: MySQL 5.7

Features

    1. Build DataX JSON through the web UI;
    2. DataX JSON is stored in the database, making tasks easy to migrate and manage;
    3. View extraction logs in real time on the web, similar to Jenkins' console output;
    4. Display DataX run records; a running DataX job can be stopped from the page;
    5. Support DataX scheduled tasks; changing task status, starting/stopping tasks, and terminating running tasks all take effect immediately;
    6. Centralized scheduling design with support for cluster deployment;
    7. Distributed task execution; task "executors" support cluster deployment;
    8. Executors register themselves periodically, and the scheduling center automatically discovers registered executors and triggers execution;
    9. Routing strategies: a rich set of strategies when the executor cluster is deployed, including first, last, round-robin, random, consistent hash, least frequently used, least recently used, failover, and busy transfer;
    10. Blocking handling strategies: how to handle scheduling that arrives faster than the executor can process; strategies include single-machine serial (default), discard subsequent scheduling, and override earlier scheduling;
    11. Task timeout control: a custom timeout can be set per task, and a task is interrupted automatically when it runs over time;
    12. Task failure retry: a custom retry count can be set per task; when a task fails, it is retried automatically up to that count;
    13. Task failure alarm: email alerting is provided by default, and an extension interface is reserved so that channels such as SMS and DingTalk can be added easily;
    14. User management: system users can be managed online, with two roles, administrator and ordinary user;
    15. Task dependency: subtask dependencies can be configured; when the parent task completes successfully, its subtasks are triggered automatically (multiple subtasks are separated by commas);
    16. Running reports: real-time running data and scheduling reports, such as scheduling date distribution and scheduling success distribution charts;
    17. Specify an increment field and the scheduled task automatically obtains each data interval; failed tasks are retried to keep the data safe;
    18. DataX startup JVM parameters can be configured on the page;
    19. Data sources can be tested manually after configuration;
    20. Templates can be configured for common tasks; after building the JSON, select an associated template to create the task;
    21. JDBC adds Hive data source support; selecting the data source on the JSON construction page generates column information and simplifies configuration;
    22. The DataX file directory is obtained from environment variables first, so JSON and log directories need not be specified during cluster deployment;
    23. Hive partitions can be specified through dynamic parameters and, combined with incremental extraction, incremental data can be inserted into partitions dynamically;
    24. Task types are extended from the original DataX task to Shell, Python, and PowerShell tasks;
    25. Add HBase data source support; JSON construction can obtain hbaseConfig and column from the HBase data source;
    26. Add MongoDB data source support; users only need to select the collectionName to complete JSON construction;
    27. Add monitoring pages for executor CPU, memory, and load;
    28. Add DataX JSON configuration examples for 24 types of plugins;
    29. Public fields (creation time, creator, modification time, modifier) are filled automatically on insert and update;
    30. Token verification on the Swagger interface;
    31. Task timeout: the DataX process of a timed-out task is killed and, combined with the retry strategy, prevents DataX from hanging on network problems;
    32. Add a project management module to manage tasks by category;
    33. Add batch task creation for RDBMS data sources: select a data source and tables, and DataX synchronization tasks are generated in batches from a template;
    34. JSON construction adds ClickHouse data source support;
    35. Graphical monitoring page for executor CPU, memory, and load;
    36. RDBMS incremental extraction adds an auto-increment primary key mode and optimized page parameter configuration;
    37. Change the connection method of the MongoDB data source and refactor the JSON building block of the HBase data source;
    38. Script-type tasks add a stop function;
    39. RDBMS JSON construction adds postSql and supports multiple preSql and postSql statements;
    40. Change the data source information encryption algorithm and optimize the code;
    41. Add DataX execution result statistics to the log page;
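Feature 17's incremental extraction can be pictured as the scheduler handing each run a half-open [lastTime, currentTime) window on the increment field. A minimal sketch of the idea (not the project's actual code; all names are illustrative):

```python
from datetime import datetime, timedelta

def next_window(last_time: datetime, now: datetime):
    """Each scheduled run extracts rows whose increment field falls in
    [last_time, now); `now` becomes `last_time` for the next run, so
    consecutive windows never overlap and never leave a gap."""
    return last_time, now

def render_where(field: str, start: datetime, end: datetime) -> str:
    """Render the window as a SQL predicate on the increment field."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return f"{field} >= '{start.strftime(fmt)}' and {field} < '{end.strftime(fmt)}'"

start = datetime(2020, 6, 16, 12, 0, 0)
end = start + timedelta(minutes=30)
print(render_where("update_time", *next_window(start, end)))
# -> update_time >= '2020-06-16 12:00:00' and update_time < '2020-06-16 12:30:00'
```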

Quick Start:

Please click: Quick Start
Please click: One-click deployment

Introduction:

1. Executor configuration (based on the open source project xxl-job)

    1. "Scheduling Center OnLine:" shows the list of online scheduling centers on the right. After a task finishes, the executor calls back a scheduling center in failover mode to report the result, avoiding a single point of failure in the callback;
    2. The "Executor List" shows the online executors, and the cluster machines behind an executor can be viewed through "OnLine Machine";

Executor attribute description

1. AppName: (the same value as datax.job.executor.appname in the datax-executor application.yml)
   The unique name of each executor cluster. Executors register periodically under their AppName, so the scheduling center can automatically discover successfully registered executors for task scheduling;
2. Name: the display name of the executor. Because AppName is restricted to letters and numbers, it is not very readable; the name improves readability;
3. Sorting: the sort order of executors. Wherever the system needs an executor, for example when adding a task, the list of available executors is read in this order;
4. Registration method: how the scheduling center obtains executor addresses;
    Automatic registration: executors register themselves, and the scheduling center dynamically discovers executor machine addresses through the underlying registry;
    Manual entry: executor address information is entered by hand, separated by commas, for the scheduling center to use;
5. Machine address: effective when the registration method is "Manual entry"; executor address information is maintained manually;
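For reference, the AppName above maps to this property in the executor's application.yml (only the datax.job.executor.appname key is taken from the text above; the value is an example):

```yaml
datax:
  job:
    executor:
      # Must match the AppName entered in the scheduling center's executor config
      appname: datax-executor-example
```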

2. Create a data source

This is used in step 4 (building the JSON script).

3. Create a task template

This is used in step 4 (building the JSON script).

4. Building a JSON script

    1. In steps one and two, select the data sources created in section 2 above. JSON construction currently supports Hive, MySQL, Oracle, PostgreSQL, SQL Server, HBase, MongoDB, and ClickHouse data sources; JSON construction for other data sources is under development and must be written by hand for now.

    2. Field mapping

    3. Click Build to generate the JSON. You can copy the JSON, create a DataX task, and paste the JSON into the text box; or click Select Template to generate the task directly.
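The generated script follows DataX's standard job layout (a reader and a writer under job.content). A hedged example of what the builder might emit for a MySQL-to-MySQL sync (connection details, credentials, and columns are placeholders):

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 3 }
    },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "reader",
          "password": "******",
          "column": ["id", "name", "update_time"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/source_db"],
            "table": ["user"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "writer",
          "password": "******",
          "column": ["id", "name", "update_time"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/target_db",
            "table": ["user"]
          }]
        }
      }
    }]
  }
}
```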

5. Create tasks in batches


6. Create a task

  • Task type: DataX task, Shell task, Python task, and PowerShell task are currently supported;

  • Blocking strategy: how scheduling is handled when triggers arrive faster than the executor can process them;

    • Single-machine serial (default): scheduling requests enter a FIFO queue on the executor and run serially;
    • Discard subsequent scheduling: if the executor already has this scheduled task running, the new request is discarded and marked as failed;
    • Override earlier scheduling: if the executor already has this scheduled task running, the running task is terminated, the queue is cleared, and the new request runs;
  • For incremental tasks, it is recommended to set the blocking strategy to "discard subsequent scheduling" or "single-machine serial";

    • With single-machine serial, pay attention to the retry count (failure retries × execution time < the task's schedule period): too many retries cause duplicate data. For example, if a task runs every 30 seconds, each run takes 20 seconds, and retries are set to 3: when the task fails, the first retry covers the interval 1577755680-1577756680, but if a new run starts before the retries finish, the new run's interval becomes 1577755680-1577758680, so the same data is extracted twice.
  • Incremental parameter settings

  • Partition parameter settings
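The rule of thumb in the single-machine serial note above (retries × execution time must stay under the schedule period) can be checked in a few lines; a sketch using the example numbers from the text:

```python
def retries_are_safe(retry_count: int, exec_seconds: int, period_seconds: int) -> bool:
    """All retries of a failed run must finish before the next scheduled
    run starts, otherwise two runs extract overlapping data intervals."""
    return retry_count * exec_seconds < period_seconds

# The example above: a task scheduled every 30s, each run taking 20s.
print(retries_are_safe(3, 20, 30))  # -> False (3 retries need 60s > 30s: duplicates)
print(retries_are_safe(1, 20, 30))  # -> True  (a single 20s retry fits in the period)
```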

7. Task list

8. Click a task to view its log, receive log information in real time, and terminate the running DataX process



9. Task resource monitoring

10. Administrators can create users and edit user information

UI

Front-end github address

Contributing

Contributions are welcome! Open a pull request to fix a bug, or open an Issue to discuss a new feature or change.

Copyright and License

MIT License

Copyright(c) 2020 WeiYe

The product is open source and free, and free community technical support will continue to be provided. Individuals and companies may access and use it freely.

You are welcome to register at the registration address; registration is used only for product promotion and to strengthen community development.

v-2.1.1

New

  1. Add HBase data source support; JSON construction can obtain hbaseConfig and column from the HBase data source;
  2. Add MongoDB data source support; users only need to select the collectionName to complete JSON construction;
  3. Add a monitoring page for executor CPU, memory, and load;
  4. Add DataX JSON configuration examples for 24 types of plugins;
  5. Public fields (creation time, creator, modification time, modifier) are filled automatically on insert and update;
  6. Token verification on the Swagger interface;
  7. Add a task timeout: the DataX process of a timed-out task is killed and, combined with the retry strategy, prevents DataX from hanging on network problems.

Upgrade:

  1. Data source management encrypts the username and password to improve security;
  2. Encrypt the username and password in the JSON file and decrypt them when the DataX task is executed;
  3. Interaction optimizations: page menu arrangement, icon upgrades, prompt messages, etc.;
  4. Log output drops irrelevant information such as project class names, reducing file size; large file output and page display are optimized;
  5. logback now reads the log path configuration from the yml file

Fix:

  1. Fix an error and request timeout when viewing a very large task log;

Project planning

Contact us

QQ communication group 795380631
