Big data interview questions summary

Posted May 28, 2020 · 4 min read

1. Big Data Technology

1.1 Describe the checkpoint process?
A checkpoint is a consistent snapshot of the state of all tasks at a certain point in time; that point in time should be when every task has just finished processing the same input record.
1.2 Describe the two-phase submission?
For each checkpoint, the sink task starts a transaction and adds all received records to it, writing them to the external sink system as they arrive but without committing them — this is the pre-commit phase. When the sink receives notification that the checkpoint has completed, it formally commits the transaction, making the results visible in one atomic step. This achieves true exactly-once semantics, but it requires an external sink system that supports transactions; Flink provides the TwoPhaseCommitSinkFunction interface for this purpose.
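The pre-commit/commit flow can be sketched as a toy transactional sink in plain Scala. This is an illustrative model with hypothetical names, not Flink's actual TwoPhaseCommitSinkFunction:

```scala
import scala.collection.mutable

// Toy sketch of the two-phase commit pattern: records are buffered in an open
// transaction (pre-commit) and only become visible in the "external system"
// when the checkpoint-complete notification arrives (commit).
object ToyTwoPhaseSink {
  private val openTxn = mutable.Buffer[String]() // pre-committed, not yet visible
  val committed = mutable.Buffer[String]()       // visible in the external system

  def invoke(record: String): Unit =
    openTxn += record                            // write into the open transaction

  def notifyCheckpointComplete(): Unit = {       // commit on checkpoint completion
    committed ++= openTxn
    openTxn.clear()                              // start a fresh transaction
  }

  def abort(): Unit = openTxn.clear()            // on failure, discard uncommitted data
}
```

If a failure happens before the commit, the open transaction is aborted and the records are replayed from the last checkpoint, so they never become visible twice.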
1.3 Describe the principle of Spark DAG?
DAG (Directed Acyclic Graph): Spark uses a DAG to model the relationships between RDDs and describe their dependencies; this relationship is also called lineage. RDD dependencies are maintained through the Dependency class and are divided into wide dependencies (a partition of the parent RDD is used by partitions of multiple child RDDs, requiring a shuffle) and narrow dependencies (each partition of the parent RDD is used by at most one partition of the child RDD). The component that schedules the DAG in Spark is the DAGScheduler.
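A minimal sketch of lineage and the two dependency kinds, using hypothetical case classes rather than Spark's real Dependency hierarchy — stage boundaries fall on the wide (shuffle) dependencies:

```scala
// Toy model of RDD lineage: each node records its parents and whether the
// dependency is narrow (one-to-one partitions) or wide (requires a shuffle).
sealed trait Dep { def parent: Node }
case class NarrowDep(parent: Node) extends Dep // e.g. map, filter
case class WideDep(parent: Node) extends Dep   // e.g. groupByKey, reduceByKey

case class Node(name: String, deps: List[Dep]) {
  // Count the shuffles on the lineage: each wide dependency is a stage boundary.
  def shuffleCount: Int =
    deps.map {
      case WideDep(p)   => 1 + p.shuffleCount
      case NarrowDep(p) => p.shuffleCount
    }.sum
}
```

Walking such a graph backwards from the final node is essentially how stages are cut: a chain of narrow dependencies stays in one stage, and every wide dependency starts a new one.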
1.4 Briefly describe the difference between spark and flink
Their technical concepts differ: Spark uses micro-batches to simulate stream computation, while Flink is event-driven — a stateful stream processing engine — which is true stream computing.
1.5 How does flink handle delayed data?
Flink uses the watermark mechanism. A watermark is a timestamp inserted into the data stream that declares how far event time has progressed; by holding the watermark back by an allowed delay, Flink can still include records that arrive late within that bound.
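Assuming the common bounded-out-of-orderness strategy, a watermark generator can be sketched as the maximum event time seen so far minus the allowed delay. This is an illustrative class, not Flink's actual API:

```scala
// Sketch of a bounded-out-of-orderness watermark generator: the watermark
// trails the maximum event time seen so far by a fixed allowed delay.
class BoundedLatenessWatermark(maxDelayMs: Long) {
  private var maxEventTime = Long.MinValue

  def onEvent(eventTimeMs: Long): Unit =
    maxEventTime = math.max(maxEventTime, eventTimeMs)

  // The watermark asserts: no event with timestamp <= this value is still expected.
  def currentWatermark: Long = maxEventTime - maxDelayMs
}
```

A window ending at time T can fire once the watermark passes T; events later than the allowed delay are then treated as late data.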
1.6 How to guarantee flink fault tolerance mechanism?
Through the checkpoint mechanism: Flink periodically takes consistent snapshots of all task state, and on failure restores the job from the most recent completed checkpoint and resumes processing from there.
1.7 What time semantics does flink support?
(1) Processing Time: the system time of the machine on which the record is processed.
(2) Event Time: the time at which the event was produced.
(3) Ingestion Time: the time at which the event enters Flink.
1.8 What are the action operations of spark and flink?
(1) Get elements
collect(), first(), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering])
(2) Counting elements
count(), countByKey()
(3) Aggregating and iterating over elements
reduce(func), foreach(func)
(4) Saving elements
saveAsTextFile(path), saveAsSequenceFile(path), saveAsObjectFile(path)
Flink has no corresponding concept of actions; a Flink program is built lazily into a dataflow and only runs when execute() is called on the execution environment.
1.9 Hadoop HA high availability?
Hadoop high availability runs an Active NameNode and a Standby NameNode, using the ZooKeeper service to fail over between the active and standby NN. The NameNode manages the HDFS metadata (namespace and block locations) and directs clients to the DataNodes that hold the data.
1.10 Spark tuning?
You can set the number of executors, their memory, and their cores through runtime parameters, e.g. spark-submit's --num-executors, --executor-memory, and --executor-cores.
1.11 Describe the mapreduce calculation model?
MapReduce splits a job into a map phase (each input split is transformed into intermediate key-value pairs), a shuffle phase (pairs are grouped by key and delivered to reducers), and a reduce phase (each key's values are aggregated into the final output).
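The map/shuffle/reduce phases can be simulated in plain Scala (no Hadoop; the shuffle is emulated with groupBy), using word count as the classic example:

```scala
// Word count expressed in the MapReduce model, simulated in plain Scala:
// map emits (word, 1) pairs, the shuffle groups them by key,
// and reduce sums the counts for each word.
object ToyMapReduce {
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  def reduce(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(l => map(l))).map { case (k, vs) => reduce(k, vs) }
}
```

In real Hadoop the three phases run distributed across nodes, but the data flow is the same as this single-machine sketch.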
1.12 What is the difference between flink checkpoint and savepoint?
Savepoints and checkpoints work on the same mechanism; the difference is that a savepoint is triggered manually (e.g. for upgrades or rescaling), while a checkpoint is triggered automatically for failure recovery.


2. Java

2.1 Understanding of the JVM?
As of Java SE 7, the JVM runtime data areas are divided into five regions:
(1) Data areas shared by all threads: the method area (class metadata and code produced by the compiler) and the heap (object instances and arrays).
(2) Thread-private data areas: the virtual machine stack, the native method stack, and the program counter.

An RDD's records in Spark are ultimately Java objects, so they are stored on the JVM heap. Since the heap holds object and array instances, garbage collection happens mainly in the heap (and may also touch the method area).
2.2 What can reflection do?
Java's reflection mechanism allows a program, at runtime, to discover all the fields and methods of any class, and to call any method or access any field of any object. This ability to obtain information and invoke methods dynamically is called the reflection mechanism of the Java language.
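A minimal sketch of reflection, called from Scala through Java's java.lang.reflect API — Greeter, callGreet, and methodNames are illustrative names:

```scala
// Minimal reflection sketch: inspect a class at runtime and invoke a method
// by name, using Java's reflection API from Scala.
class Greeter {
  def greet(name: String): String = s"hello, $name"
}

object ReflectionDemo {
  // Dynamically look up and invoke `greet` on an arbitrary object.
  def callGreet(target: AnyRef, arg: String): String = {
    val method = target.getClass.getMethod("greet", classOf[String]) // lookup by name
    method.invoke(target, arg).asInstanceOf[String]                  // dynamic call
  }

  // List the names of all methods declared by the object's class.
  def methodNames(obj: AnyRef): Set[String] =
    obj.getClass.getDeclaredMethods.map(_.getName).toSet
}
```

This is the mechanism frameworks rely on to instantiate and wire classes known only by name at runtime.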
2.3 Polymorphic understanding?
The three necessary conditions for Java polymorphism: (1) inheritance, (2) method overriding, (3) a parent-class reference pointing to a child-class object.
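The three conditions can be shown in a small sketch (class names are illustrative): a parent-type reference dispatches at runtime to the overriding method of the actual object:

```scala
// Polymorphism: inheritance + overriding + parent-type references to child objects.
class Animal { def speak(): String = "..." }
class Dog extends Animal { override def speak(): String = "woof" }
class Cat extends Animal { override def speak(): String = "meow" }

object PolymorphismDemo {
  // Each element is referenced as Animal, but the call is dispatched
  // at runtime on the object's actual class.
  def speakAll(animals: Seq[Animal]): Seq[String] = animals.map(_.speak())
}
```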


3. Scala

3.1 Understanding of Scala traits?
A trait is roughly equivalent to a Java interface; the difference is that a trait can also define fields and method implementations (which makes it more powerful), and Scala achieves a form of multiple inheritance through trait mixins. A subclass that mixes in a trait must implement its unimplemented members. In this sense a Scala trait is actually closer to a Java abstract class.
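A short sketch of these points (hypothetical trait names): a trait carrying a concrete method, a trait with an abstract member, and a class mixing in both:

```scala
// Traits can carry concrete members (unlike classic Java interfaces) and
// can be mixed in together, giving a form of multiple inheritance.
trait Logging {
  def log(msg: String): String = s"[LOG] $msg" // implemented method
}
trait Named {
  def name: String                             // abstract member
  def describe: String = s"name=$name"         // concrete method using it
}

// A class mixing in two traits: it must implement the abstract `name`.
class Service extends Logging with Named {
  val name: String = "billing"
}
```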
3.2 What is the difference between scala and java?
Scala has no interfaces and uses traits instead; Scala provides companion objects to implement singletons; Scala supports methods that take functions as parameters, which Java (before Java 8 lambdas) did not support directly.
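Two of these differences in a minimal sketch (illustrative names): an object as a language-level singleton, and a method taking a function as a parameter:

```scala
// A Scala `object` is a singleton: exactly one instance exists.
object Counter {
  private var n = 0
  def next(): Int = { n += 1; n }
}

// Functions are values: a method can take a function parameter directly.
object HigherOrder {
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
}
```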

4. Algorithm

4.1 Describe the basic idea of dynamic programming?
Dynamic programming is similar to the divide-and-conquer method: both solve the original problem by combining solutions to subproblems. The difference is that dynamic programming targets overlapping subproblems: each subproblem's result is recorded in a table so it is computed only once, trading space for time.
Typical dynamic programming problems: the knapsack problems, e.g. the 0/1 knapsack problem and the complete (unbounded) knapsack problem.
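A sketch of the 0/1 knapsack with a one-dimensional DP table; iterating capacity downwards ensures each item is used at most once, which is the "0/1" constraint:

```scala
// 0/1 knapsack via dynamic programming: dp(w) holds the best value achievable
// with capacity w. For each item, capacities are scanned downwards so the
// item cannot be counted twice within one pass.
object Knapsack {
  def solve(weights: Array[Int], values: Array[Int], capacity: Int): Int = {
    val dp = Array.fill(capacity + 1)(0) // table of subproblem results
    for (i <- weights.indices; w <- capacity to weights(i) by -1)
      dp(w) = math.max(dp(w), dp(w - weights(i)) + values(i))
    dp(capacity)
  }
}
```

Scanning capacities upwards instead would allow reusing an item, which turns the same code into the complete (unbounded) knapsack.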
4.2 What data structures do you know? Briefly describe linked lists, red-black trees, etc.
A linked list stores elements in nodes, each holding a value and a reference to the next node; insertion and deletion at a known position are O(1), but lookup is O(n). A red-black tree is a self-balancing binary search tree whose coloring rules keep its height O(log n), so search, insertion, and deletion are all O(log n).
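A minimal immutable singly linked list sketch (illustrative names; Scala's built-in List is already such a structure):

```scala
// Minimal singly linked list: each node holds a value and a reference to the
// next node; prepending at the head is O(1), traversal is O(n).
object LinkedListDemo {
  final case class Node(value: Int, next: Option[Node])

  def prepend(head: Option[Node], value: Int): Option[Node] =
    Some(Node(value, head))

  def toList(head: Option[Node]): List[Int] = head match {
    case Some(n) => n.value :: toList(n.next)
    case None    => Nil
  }
}
```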

5. Other

5.1 Common components and functions of big data?
HDFS is a distributed storage system; YARN is a distributed resource management system; MapReduce is Hadoop's computation framework; ZooKeeper is a distributed coordination service; HBase is the distributed database in the Hadoop ecosystem; Hive is a distributed data warehouse and data analysis tool whose lower layer uses MapReduce; Sqoop is a tool for importing traditional relational data into HDFS or HBase; Spark is a memory-based distributed processing framework; Flink is a distributed stream processing framework.


6. Linux

6.1 Which commands do you use to troubleshoot big data faults on Linux?
View port occupancy: (1) lsof -i:<port> (2) netstat -tunlp | grep <port>
View memory usage: (1) top for real-time usage (2) free -m