OWASP Podcast/Transcripts/056

Interview with Adar Weidman (OWASP Podcast 56)
Jim Manico
 * We have with us today Adar Weidman. Adar will be speaking with us about Regular Expression Denial of Service.

Adar Weidman
 * Hello, I'm Adar Weidman. I'm a senior software engineer from Checkmarx.  I've been dealing with software for many years, security only a few years, and I enjoy it very much.

Jim Manico
 * So Adar, can you start by telling us about the performance of regular expressions?

Adar Weidman
 * Yes, regular expressions are usually very fast. They are an extremely useful tool for verification of input and text.  They can be applied to exceptionally large inputs to get real-time results.  This excellent performance and ease of use is the main reason why we use these tools in so many places.  I, for myself, use them a lot.  Even today, I wrote some code using regular expressions.

Jim Manico
 * So does the performance of a regular expression change significantly as the complexity of the regular expression changes, or even as the complexity of the input changes?

Adar Weidman
 * Generally, regular expression performance does not change significantly as the input changes. It might change if the regular expression itself changes, and it takes longer to pass over a longer input, but there shouldn't be any performance problems unless you are talking about really long inputs of millions of bytes.  In extreme cases, however, a regular expression is not only slow, but gets slower and slower as the input length grows.  On rare occasions, an input of only 20-30 bytes might cause your computer to hang with a very, very simple regular expression…This is where people often get confused: How come this is so slow?  Why all of a sudden…And this is the area where hackers can start to penetrate the system.

Jim Manico
 * So Adar, what are some of the common ways that someone could attack a regular expression engine?

Adar Weidman
 * Well, a regular expression engine can be attacked by forcing it to try all possible paths through a regular expression until the engine is just exhausted. I will try to demonstrate this problem with an example.  It may not be very simple to follow, but please bear with me.  Let us look at a very simple regex, (a+)+b.  Now, let's try to run it on an input containing only 30 a's and then a c, not a b.  The regular expression engine will try all the combinations of a's related to the first plus, then the second plus, before it decides that there is no match.  This process is called backtracking, and because there are so many combinations, the engine will continue to try and try again for a very long time.  This is called ReDoS, Regular Expression Denial of Service, because we have a denial of service situation caused by a regular expression: backtracking causes ReDoS.  A regex is called evil if it can be…on specially crafted input.  A typical evil regex pattern will contain a grouping construct with repetition, and inside this repeated group another repetition.  There are other types of patterns, but this is the main one.  For any evil regex pattern, there are specially crafted inputs that can be used.
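
The backtracking described above can be modelled in a few lines of Python. This is only a sketch and not something from the interview: it assumes the regex being described is (a+)+b (a repeated group with a repetition inside it, followed by b), and the function name `match_nested_plus` is made up for illustration. It is a hand-written model of a naive backtracking matcher, not the algorithm of any particular regex engine, and it counts how many ways of splitting the a's the matcher tries before giving up.

```python
from typing import Tuple


def match_nested_plus(text: str) -> Tuple[bool, int]:
    """Naive backtracking match of (a+)+b at the start of `text`.

    Returns (matched, attempts), where `attempts` counts every length
    tried for an inner a+ group. Hypothetical helper for illustration.
    """
    attempts = 0

    def outer(pos: int) -> bool:
        nonlocal attempts
        # Find how far the run of 'a' characters extends from pos.
        run_end = pos
        while run_end < len(text) and text[run_end] == "a":
            run_end += 1
        # Greedy first: try the longest a+ group, then shorter ones.
        for end in range(run_end, pos, -1):
            attempts += 1
            # Option 1: another a+ group follows this one.
            if outer(end):
                return True
            # Option 2: the closing 'b' follows this group.
            if end < len(text) and text[end] == "b":
                return True
        return False

    matched = outer(0)
    return matched, attempts


if __name__ == "__main__":
    # For n a's followed by 'c', this model makes 2**n - 1 attempts.
    for n in (5, 10, 15):
        print(n, match_nested_plus("a" * n + "c"))
```

In this model every extra a doubles the work on a non-matching input, which is why 30 a's followed by a c are already enough to keep a matcher of this kind busy for a very long time.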

Jim Manico
 * So regular expressions have been around for quite some time. Is this a long known class of vulnerability?  Is there anything new about this research?

Adar Weidman
 * The problem is long known, yes, but if you browse the existing lists of vulnerabilities available online, you will not be able to locate this one. This is because most people never look at it as a vulnerability, but rather as a singular problem, a bug in the system that should be resolved.  We need to change this perspective and understand it's a standard vulnerability.  We can build attack vectors.  We can look for vulnerable points in the system.  We can also implement countermeasures.  None of this was done before, at least not in an extensive way, and we think this is the time for it.

Jim Manico
 * So we use regular expressions everywhere, especially in security, so what do you think the overall industry impact is for regular expression based Denial of Service?

Adar Weidman
 * Regex is all over the web: on the client side in browsers, cell phones, other devices, and on the server side…Actually, it is in every piece of code for text manipulation, so it is a potential danger everywhere, and attackers can easily look for vulnerable code.  Since the use of regex is only growing from day to day, these vulnerabilities will continue to appear.  When the client side is attacked, one can close the application, or turn the machine or cell phone off and on.  It is quite unpleasant.  But when the server side is attacked, it is a serious denial of service, which today is not given enough attention, and it really exists.

Jim Manico
 * So a lot of security products depend upon blacklisting and regular expression based validation. Are we going to see these kinds of issues in WAFs and similar technologies?

Adar Weidman
 * Yes, of course. WAFs (web application firewalls), intrusion detection systems, proxies, even databases, all of them are vulnerable to attack.  In general, some of these products are secure systems, but in this case, their existence increases the attack space by including regexes that anyone can rewrite and add.  Web application firewall experts will probably not write evil regexes, but simple users like you and me will just write a complex regex, hoping it will defend us, when in reality we might give the attacker everything he wants: an evil regex.  I will show you now some attack scenarios just for you to understand the idea.  The attacker first will look for a vulnerable system.  This can be done by looking for evil patterns in Google code search and using…regexes to find potentially ReDoS'ed applications (ReDoS is for Regular Expression Denial of Service).  I recently did a short search myself and found over a dozen such examples.  It's amazing.  Now the attacker knows the regex, and it is an open attack, not a blind attack.  He can just look at it and craft specific input for it.  If the code is not open source, the attacker can still attack by checking the validated user input fields.  The attacker can then try to find an input that is vulnerable to regex injection by submitting an invalid escape sequence, such as backslash M, for example.  If a message like "invalid escape sequence" is generated, then there is a regex injection and bingo, the attacker can submit an evil regex.  Another example is when one uses the same validation on the client side and the server side.  We all know it is good practice to validate on both the client and the server, but this point is known also to the attacker, so using the same validation will expose the regex in the server to the attacker.  The attacker can then build a well-crafted input until it freezes the engine.
Lastly, another possible attack can be achieved by writing a script with an evil regex and tempting the victim to surf to a link to it, resulting in the victim's browser, or even cell phone or other device, getting stuck.  Unfortunately, most of the browsers we checked do not have any defense, and with some of them, after a few minutes, you get a message of "I am stuck, do you want to kill me?"  Internet browsers do spend much effort trying to prevent denial of service on them.  Issues that browsers do prevent are, for example, infinite loops, long…statements, endless…but not regular expressions.  Try to run an evil regex with a problematic input on any version of Internet Explorer, for example.  You might not like the result.  There are regex engines that have efficient algorithms, for example…that can deal with most, maybe all, ReDoS attacks, but these algorithms are less common.  In most existing programming languages, unfortunately, for example .NET or Java, and in most of the browsers, as I said before, the simplest solution is implemented, which just helps hackers to attack.
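
The invalid escape sequence probe mentioned above can be illustrated with a short Python sketch. It is not from the interview: the handler names `search_catalog` and `search_catalog_safe` and the helper `looks_injectable` are hypothetical, and a real application would be probed over its input fields rather than by a direct function call. The sketch relies on the fact that Python's `re` module (3.7 and later) rejects an unknown escape such as `\m` with `re.error`.

```python
import re


def search_catalog(user_term, items):
    """Hypothetical vulnerable handler: user text is compiled as a regex."""
    pattern = re.compile(user_term)  # injection point
    return [item for item in items if pattern.search(item)]


def search_catalog_safe(user_term, items):
    """Same handler, but the user text is escaped and matched literally."""
    pattern = re.compile(re.escape(user_term))
    return [item for item in items if pattern.search(item)]


def looks_injectable(handler):
    """Submit an invalid escape sequence (backslash m) to a handler.

    A regex compilation error suggests that the input reaches the
    regex engine unescaped, so an evil pattern could be submitted too.
    """
    try:
        handler("\\m", ["sample"])
    except re.error:
        return True
    return False


if __name__ == "__main__":
    print(looks_injectable(search_catalog))       # error surfaces
    print(looks_injectable(search_catalog_safe))  # input treated as text
```

Escaping the input, as in the second handler, is one possible countermeasure; it treats whatever the user types as plain text rather than as a pattern.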

Jim Manico
 * Adar, I thought that a regular expression could be translated in a way that it could be efficiently solved for any input, logarithmic performance growth as the input gets more complex, which is fairly scalable. Is that even correct?

Adar Weidman
 * Well, that is mostly correct, yes Jim. Regexes can be translated to what is called finite automata, which can be efficiently solved.  I'm not sure about logarithmic, but efficiently solved with no need for any backtracking, any passing over all the possible combinations that I talked about.  But this is not the case when you use back references.  The problem is that most applications are not using plain regex nowadays.  Most applications add the option of back references to the regex engine.  Now this is a completely different story.  When using back references, one cannot write a solution that is always efficient.  Back references add memory of previous states of the regex, so the machine has to remember previous states.  This causes the automaton, or the engine, to remember, making it dependent on the input, its size and structure, and also on the regex structure itself.  A common use for back references is looking for repeated structures in the input, which might be very useful.  The problem is, of course, performance.  It's very important to say that the user doesn't have to use back references for this to matter; it is enough that the engine should be able to deal with them, and in order to do this the engine will have a less efficient algorithm, so the user has some problems.  I think that today most regular expression engines offer these options, although they are not plain regular expressions.  This is a problem because, as I said, there is no efficient algorithm (and this can be proved) that can solve regular expressions that contain back references.
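
To make the finite automata point concrete, here is a minimal Python sketch, not taken from the interview. It assumes the pattern under discussion is (a+)+b, without back references, and uses a hand-built nondeterministic automaton with made-up names (`STEP`, `EPSILON`, `nfa_fullmatch`). Instead of backtracking, it tracks the set of states the automaton could be in, so the work per input character is bounded by the number of states.

```python
# Hand-built automaton for (a+)+b (an assumption for illustration):
#   state 0: start of an a+ group
#   state 1: inside an a+ group, at least one 'a' consumed
#   state 2: accepting state, reached after the final 'b'
STEP = {
    (0, "a"): {1},
    (1, "a"): {1},  # inner plus: stay in the same group
    (1, "b"): {2},
}
EPSILON = {1: {0}}  # outer plus: start another group without input
ACCEPT = 2


def epsilon_closure(states):
    """All states reachable from `states` without consuming input."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in EPSILON.get(state, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def nfa_fullmatch(text):
    """Return (matched, steps) for a full match of `text`."""
    current = epsilon_closure({0})
    steps = 0
    for ch in text:
        following = set()
        for state in current:
            steps += 1
            following |= STEP.get((state, ch), set())
        current = epsilon_closure(following)
        if not current:
            break  # no live state left, the input cannot match
    return ACCEPT in current, steps


if __name__ == "__main__":
    print(nfa_fullmatch("a" * 30 + "c"))
    print(nfa_fullmatch("aaab"))
```

With 30 a's followed by a c, this simulation does a number of steps proportional to the input length rather than trying every way of splitting the a's. This set-of-states approach does not extend directly to back references, which is the limitation described above.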

Jim Manico
 * What can we do as programmers to prevent a regular expression denial of service attack?

Adar Weidman
 * Well, I remember reading a paper by some guys from Google lately that says that the best way of preventing ReDoS is removing regular expressions from any place that should be secured. Of course, they said it is not really possible, but it is a very good place to start, so you should not use regular expressions unless you really think you need them.  Many times, simple code can be used instead of a regex.  Another thing: you must be suspicious of any existing regex you get, from a repository, from a friend, or in existing code.  Trust nobody, and check each regex yourself for evil patterns; it is really simple.  In addition, you must never use regexes that are affected by user input, because users can inject evil patterns and use them on an input.  For instance, you can write a very simple program that accepts a username and password, and checks using a regex if the username is part of the password.  In this case, the user can enter an evil regex as a username and a problematic payload in the password field, and you are ReDoS'ed.  Last but not least, manufacturers will need to use superior regex implementations that either limit the runtime and the memory used, or use better algorithms that resort to backtracking only when there is a back reference in the regex, and use efficient algorithms elsewhere.  In most cases, people do not use back references, so it can be done efficiently.  As I said, some may already do it, but most of them don't.
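
The username and password example above might look like the following in Python. This is a sketch under assumptions, not code from the interview: all three function names are hypothetical. The first function shows the risky pattern, the other two show the "simple code instead of regex" advice and the option of escaping user input with `re.escape`.

```python
import re


def password_contains_username_unsafe(username, password):
    """Risky: the username is interpreted as a regex pattern."""
    return re.search(username, password) is not None


def password_contains_username(username, password):
    """Plain substring test; no regex engine is involved."""
    return username in password


def password_contains_username_escaped(username, password):
    """If a regex must be used, escape the user-supplied part first."""
    return re.search(re.escape(username), password) is not None


if __name__ == "__main__":
    # The unsafe version treats '.' as a wildcard, the others do not.
    print(password_contains_username_unsafe("a.c", "abc"))
    print(password_contains_username("a.c", "abc"))
    # An evil pattern as username is harmless when taken literally.
    print(password_contains_username("(a+)+b", "a" * 30 + "c"))
```

The unsafe version is deliberately not called with an evil username here, since that is exactly the input that can keep a backtracking engine busy.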

Jim Manico
 * So Adar, can you tell us about what manual techniques we want to use to discover ReDoS in code review?

Adar Weidman
 * Manually, one has to look for evil patterns in strings that might be used for a regular expression search or validation. If the source code is not available, one can try to enter potentially dangerous payloads into input fields, and if, at a certain point at least, the response time increases for every additional character added to the input, you are ReDoS'ed.  I guess good pen testers will be able to look for this vulnerability and successfully locate it most of the time.  I currently do not know, unfortunately, of automatic tools to check for ReDoS.  The static code analysis tools that I know do not have the ability to find these problems.  I am working on this, by the way, this very moment in our system.  I guess I'm not familiar with all the tools, so I might be missing something.  There are also security tools that strip any potentially dangerous code from JavaScript, and this also includes regular expressions, but this eliminates all regexes rather than finding the evil ones, which, as I have said before and the Google people have said before, is not a solution…so I guess we still have a lot of work here.  Well, I am very glad to be on the show.  I just have to say that as much as I say don't use regular expressions, just today I used them about three or four, I think five or ten times myself, but at least in checking them, I make sure that there are no evil patterns in them.  Just remember, be careful with regular expressions.  They are dangerous.  Thank you.  Goodbye.
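
The manual search for evil patterns can be partly supported by a small script. The following Python sketch is not from the interview and is not an existing tool: `has_nested_quantifier` is a hypothetical helper that flags only the main pattern described earlier, a group that contains a `+` or `*` and is itself followed by a `+` or `*`. It is a rough heuristic, assuming conventional regex syntax; it ignores `{m,n}` repetition and overlapping alternation, so it can miss evil regexes and can also flag harmless ones.

```python
QUANTIFIERS = ("+", "*")


def has_nested_quantifier(pattern):
    """Heuristically flag a repeated group that contains a repetition."""
    open_groups = []  # one flag per open group: quantifier seen inside?
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            # Skip a character class; a ']' right after '[' is literal.
            close = pattern.find("]", i + 2)
            i = len(pattern) if close == -1 else close + 1
            continue
        if ch == "(":
            open_groups.append(False)
        elif ch == ")" and open_groups:
            inner_repeats = open_groups.pop()
            next_ch = pattern[i + 1:i + 2]
            if inner_repeats and next_ch in QUANTIFIERS:
                return True
            if inner_repeats and open_groups:
                # A repetition in a subgroup also counts for the parent.
                open_groups[-1] = True
        elif ch in QUANTIFIERS and open_groups:
            open_groups[-1] = True
        i += 1
    return False


if __name__ == "__main__":
    for candidate in ("(a+)+b", "a+b", "([a-z]+)*", "(a|b)+c"):
        print(candidate, has_nested_quantifier(candidate))
```

A check like this only narrows down which regexes deserve a closer look; the timing test described above, where the response time grows with every extra input character, remains the way to confirm a suspicion.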