Interview ask you regular principles? You just blow up with him

Posted May 25, 202010 min read

Regular Expression-A topic that programmers can't get around, an artifact that loves and hates, a point of knowledge that is often asked during interviews, a pain point that you seem to understand Yunshan Mist Hood.

After reading this article, if the interviewer asks you again in the future:"Do you understand the regularity? Tell me about its matching principle and mechanism."
Just blow him up!

Do you know regular expressions

Let me list some regular expressions first, to see if you can understand what each regular expression matches? Can the smallest component of the expression be disassembled?

let regex1 =/keith /
let regex2 =/tall | fat | skinny | low /
let regex3 =/say(hello) world /
let regex4 =/say(?:hello) world /
let regex5 =/born(? = \ d) /
let regex6 =/say [hello]world /
let regex7 = /^Target:(.*)/
let regex8 = /".*"/
let regex9 = /".*?"/
let regex10 = /(\.\d\d\[1-9]?)\d+/

If some students do n t know clearly, they should focus on()/(? :)/(? =)/\ * ?. Since our focus is on the principle of regular matching, I will only explain it grammatically and focus on the following text.

  • []:First of all, in the regular expression, the [] symbol matches only one character, that is, no matter how much content is in the brackets of [], the final matching result length is 1. So after clarifying this point, do you know what exactly the result of regex6 matches? Can it match 'sayhelloworld'?

  • ()/(? :):To understand the meaning of (? :) and other grammars, you must first understand the role of () in regular expressions.
    () Means capture in regular expression. The content in parentheses will be stored for later use. The $1/$2 ... that you see in some other places is used to call the captured element. The content matching the first parenthesis will be placed in In $1, the content matched by the second bracket will not be placed in $2, and so on.
    (? :) means non-capturing parentheses, also called non-getting matches. See the meaning of the name, that is to say, the content in parentheses will not be stored, so you can not use $1/$2 to call.

  • And (? =) And the same series of (? <=)/(? <!) Are used to match the position in the string. (? =) Means non-obtaining matches, called sequential look around. The name is a bit twisted, and I will explain it straightforwardly for you.
    First of all, it is a non-fetch match, so the result in the () will not be stored. So what does order mean? (? = pattern) means that at the beginning of the string to be matched with pattern, stand up and look backwards to check whether the match can be successful pattern. This check process does not consume the string.
    Order:Check backward at the position of the matching string.
    Look around:Just take a look, the inspection process does not consume strings.
    Believe that by seeing this, you already seem to understand the meaning of it dimly, only a little vague about the so-called non-consuming string. Congratulations, you caught the point of the (? =) Syntax, which is not to consume strings. I will explain to you by comparing with (? :).
    What both have in common is, of course, that the matching result will not be stored. The difference is that (?:Pattern) matches pattern, and(? = Pattern) matches pattern. why? Because (? =) Does not consume characters, that is, after the element is matched, it will continue to match from the position before the pre-check.

Forward preview does not consume characters.png
After reading this picture, you should understand what it means to not consume strings and also understand the usage of (? =).

  • *?:* Means match the previous character 0 or unlimited times, and match as many as possible. as much as possible? It sounds like this symbol is greedy, it is indeed called greed. Followed by a ? Means that it has been reformed, and on the basis of ensuring a successful match, as few characters as possible are swallowed. This is the non-greedy mode. for example:

      let reg1 = /a.*b/;
      let reg2 = /a.*?b/;
      'aabab'.match(reg1); //aabab
      'aabab'.match(reg1); //aab

Greedy mode:I want to eat as many characters as possible while ensuring a successful match.

Non-greedy mode:I want to reduce the intake as much as possible while ensuring a successful match.

Okay, so much is said to ensure that you have a simple regular basis, so that the understanding of the regular matching principle to be explained below will not be confused. At least when faced with a regular expression, you have to understand what it is to match, what is the length of the match, and know these to continue to learn the principle of matching. If you haven't understood it yet, please study the basic grammar first and then continue to explore the principles.

Thinking questions

Learning and reading with thinking and questions will make you do more with less. Let me list three thinking questions first, please keep reading down sensitively to the principles involved.

  1. Use /.* / to match the name "McDonald`s" is said "makudonarudo" in Japanese. What is the result? How to match only the content in the first pair of quotation marks?
  2. Is there a difference between /^ Target:(. *) / And /^ Target:(. *). * /?
  3. Requirement:Obtain 2 or 3 decimal places. /(\.\d\d[1-9]?)\d*/|| /(\.\d\d[1-9]?)\d+/|| /(\.\d\d [1-9]??) \ d +/What do these three expressions match, can they be replaced with each other?

String composition

We have to understand that the character string is composed of two parts:character and position.
Why emphasize the position in the string? It is because the position can also be matched, and the position is crucial in the process of regular matching.

When you use an expression like /Oct / to match, it matches characters rather than positions, and the matched Oct will be saved in the final result. This is the matching character. It is called possession character with the concept that we do not understand.

When you use a match similar to /Oct(? = Upus) /, the final result is still Oct, where(? = Upus) is actually in match position 3. In other words, (? = Upus) does not take up space, so it is called a zero-width or zero-width assertion.

Test & Drive

After clarifying the composition of the string, we started to enter the stage of discussing the principle of regular matching. The basis of regular matching:use the transmission mechanism to test. So what is a test and what is a drive?

To put it bluntly, the test is to start from the position of the current pointer, and use each sub-expression in the regular expression to sequentially check back to see whether it can be successfully matched. The transmission is that when the test at the current position is unsuccessful, move the pointer back one bit and continue to test the match until the match is successful or the entire string is consumed.

For a chestnut, use /ren / to match the process of " reference ":

The starting position of the pointer points to the first letter r, and the child elements of the regular expression are tested in sequence

  1. The matching test of r and r is successful; the matching test of e and e is successful; the matching test of n and f is unsuccessful, and the pointer moves one bit backward.
  2. The match test between r and e is unsuccessful, and the pointer moves one position backward.
  3. ...
  4. Until moving to the green position, r and r match test is successful, e and e match test is successful, n and n match test is successful, regular expression match is successful, and the match ends.

Regular Transmission.png

Regular expressions are like molds on the assembly line in a workshop, moving in sequence and operating in sequence. Is the word transmission vivid? A rolling belt appeared in front of my eyes. It is also very similar to the concept of sliding windows.

Quantifier && Match priority/Ignore priority

Quantifier

We give three examples of quantifiers, namely:

  • *:The previous expression matches 0 or unlimited times
  • +:The previous expression matches 1 or unlimited times
  • ?:The previous expression matches 0 or 1 time

In regular expressions, quantifier matching takes precedence by default.

What is priority? You translate to the translator.

Quantifier matching priority means that when matching, as long as the matching criteria are met, the quantifier will eat as many characters in the string as possible until the subexpression behind the regular expression cannot be matched due to too much eating. Only one character will be thrown out, and a character will be spit out to see if it can meet the matching conditions of the subsequent sub-expressions. If the match cannot be successful, the character will be thrown again and tested again until the end of the match(success or failure).

Simple traceback.png
Ignoring the priority is the non-greedy mode mentioned above. In this mode, the quantifier appears to be quite humble. I hope that I will eat as few characters as possible to complete the match. Its matching process is:

  1. The matching pointer is at position I, and the "" `in the regular expression is tested with the first character of the string" ", and the test is successful.
  2. The control is given to the sub-expression . *?, And the matching pointer is moved to position II. At the character o, the non-greedy mode when matching or mismatching can be selected, the mismatch is preferentially selected, ie Ignore the sub-expression . *? And give control to the sub-expression " .
  3. The sub-expression " begins to test the character o, the test fails, and control is returned to the sub-expression . *?.
  4. The sub-expression . *? Begins to test the character o after eating the next character. The test is successful, and the matching pointer moves one position to the position III. Repeat steps 2-4.
  5. Repeat the above process until position VIII.
  6. When matching the character s, the matching pointer moves back to the position IX, repeat steps 2 & 3, when the sub-expression " character" `is tested, the test is successful and the matching ends successfully.

Ignore priority.png

(Here CUE think about the question, you must have not forgotten to read it with the question)

Backtracking

The so-called [return control to the subexpression . *?]Mentioned in the non-greedy mode above actually has a special word to describe it, called backtracking. Let's talk about backtracking in detail.

Regular expressions will encounter side-by-side selection during the matching process, that is to say, there are multiple choices at the same position(for example, when using . * Matching, you can choose to match multiple or you can choose to match 0) At this time, the regular engine will choose one of them according to the regulations, and record the other several in its own small book, clearly reminding itself that there are other forks where you can walk. When the current road is not accessible , You can read the previous archive and choose another way to match again. Such a process is backtracking.

In the process of the matching process, there may be a plurality of reminder signs in different positions, and the regular engine will preferentially select the mark position closest to its current position for backtracking.

This logic is also easy to understand. I will give you an analogy that makes it clear to everyone. Suppose you are playing an adventure puzzle game, and you have a good habit of saving files when you encounter a fork. When you hit level 9 in one go and find that this path is not working, you will read your archive at the beginning of level 9 and try another path again. If it still doesn't work this way, you will go to the 8th level of the file, and then choose a different path to try from the previous time when you passed the 8th level. And so on, until the entire regular expression match succeeds or fails.
Backtracking selection.png

No picture D, we still look at the picture and speak.

I use /love \ s ?? b(?:Erry | anana) / to match " I love banana! ", And its flow is shown in the figure.

  1. The character love test is successful, continue to match backwards.
  2. Encountered the first traceable bifurcation \ s ??, because it is a non-greedy mode, preferentially do not match \ s to take the green path, and record the path matching \ s.
  3. Encounter the second traceable bifurcation erry | anana, take the red path, and record the path of anana.
  4. The matching test of erry is unsuccessful, go back to node 2 and go to the blue path to test ananan. The test was also unsuccessful, backtracking to node 1, matching the test \ s.
  5. Successfully tested b, encountered the third one which can be traced back to the branch erry | anana, take the yellow path, and record the path of anana.
  6. The erry matching test was unsuccessful, back to node 3, and the pink path was used to test ananan`. The match is successful.

In this way, you should understand the regular backtracking mechanism.

Thinking puzzles

Let's review the three thinking questions at the beginning of the text again.

  1. Use /.* / to match the name "McDonald`s" is said "makudonarudo" in Japanese. What is the result? How to match only the content in the first pair of quotation marks?
  2. Is there a difference between /^ Target:(. *) / And /^ Target:(. *). * /?
  3. Requirement:Obtain 2 or 3 decimal places. /(\.\d\d[1-9]?)\d*/|| /(\.\d\d[1-9]?)\d+/|| /(\.\d\d [1-9]??) \ d +/What do these three expressions match, can they be replaced with each other?

After thinking over and reading the explanation above, is it clear at a glance?

1 => Since * is a greedy pattern matching priority, it will match as many characters as possible, so the final match result is "" McDonalds "is said" makudonarudo "", and want All that needs to be done to match the content in the first pair of quotation marks is to convert it to a non-greedy mode, ie /.*?/`.

2 => There is no difference. (. *), Which is also in greedy mode, has eaten all the characters successfully matched, and the subsequent . * Has no effect at all.

3 => To understand this question, we need to make it clear that the content we can finally get is the content in ().

Let's take an example to match a string with three decimal places " 96.124 " for comparison:

  • Expression 1 vs Expression 2

    The difference is that the quantifier followed by \ d that matches the number at the end is different,*needs to match 0 or more, and+needs to match 1 or more. Expression 1 is also a greedy pattern match. In the previous \ d \ d [1-9]?, All the decimal places are swallowed, and then in the latter \ d *, it can also match 0 digits. Success, so the content captured in () is the whole decimal place after 124 decimal places. In expression 2, \ d \ d [1-9]? After swallowing all decimals, the subsequent \ d + must match at least 1 digit, and at this time it cannot meet its requirements, then \ d \ d [1-9]? will spit out one more decimal point to make it match successfully, and the content captured in the final() becomes 12`, unable to capture the third decimal point.

  • Expression 3

    The \ d \ d [1-9]?? in Expression 3 is a non-greedy mode, which prefers not to eat characters. After \ d + after swallowing all the two decimal places first, in order to make \ d [1-9]?? The match was successful, and a character was spit out. The match was successful. In the end, the content captured in() was still only two decimal places 12.

Classroom test

After reading through it, I believe you have a holistic understanding and understanding of the simple matching principle of regular expressions. You are already trying to test your learning achievements. Okay, let me ask the last question to test your mastery of the regular matching process.

Used to match strings:" 2 \ "x3 \" likeness "

/"(\\. | [^ \\ "]) *"/&/"([^ \\."]| \\.) * "/ How many times each of the two regular expressions was tested ? How many times did it go back?

answer:

Expression 1 => Test 32 times, backtrack 14 times

Expression 2 => Test 22 times, backtrack 4 times

Are you very surprised? Compared with Expression 1, Expression 2 reduces the number of tests by 31.25%! The number of backtrackings is even scarier, a reduction of 71%? ? ! What caused this was simply to reverse the order of two parallel subexpressions. This shows that it is a technical job to improve the efficiency of regular expressions! Scientific enough expressions will greatly improve the matching performance, and these, we will continue to analyze in the next article!

Thanks for the technical support here: Philip ?

Scan the code and pay attention, from time to time to bring you a variety of concise and clear analysis in the way of speaking people.

Reply to the "regular principle test" in the public account for detailed analysis of the matching process.
qrcode.jpg

Yi Qixiu Engineer-Keith