User talk:Abraham Kang




=Output Encoding=

Output encoding is the process by which characters that could be interpreted as directives or instructions within an output context are converted to their literal string equivalents. For example, basic output encoding in an HTML context requires "<", ">", the double quote, the single quote, and "&" to be converted to their HTML entity equivalents. Here are some examples:
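As a minimal sketch of the HTML case, the following Python snippet uses the standard library's `html.escape` as a stand-in for an HTML encoder. The `html_encode` name and the sample string are assumptions made for illustration; they are not part of any particular encoding library.

```python
import html


def html_encode(untrusted: str) -> str:
    """Convert <, >, &, double quote and single quote to HTML entities."""
    # quote=True makes html.escape encode both quote characters as well.
    return html.escape(untrusted, quote=True)


sample = "<a href=\"x\">Tom & Jerry's</a>"
encoded = html_encode(sample)
print(encoded)
```

After encoding, the browser's HTML parser renders the value as text instead of treating it as markup.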

In a SQL context, single quote characters may be escaped to "''" (two single quotes) or "\'" (backslash, single quote). In a URL context, characters which could have special meaning in a URL (such as ?, &, =, and #) are converted to their percent-encoded form, a "%" followed by the two hexadecimal digits of the character's value (e.g. a space becomes %20 or "+"). In all cases, characters which have special meaning in their output context are converted to a format that the interpreter/parser in the output context treats as plain text rather than as executable instructions or directives.

Output encoding is one of the most effective ways to mitigate cross-site scripting and other attacks. However, output encoding is a topic which has been oversimplified. Today many Web 2.0/AJAX applications output data in non-trivial contexts. A non-trivial context is one which is composed of multiple contexts, or of a different context than the one the data appeared to be placed in. Output encoding contexts can be categorized as follows:

## Simple Context
## Mutating (Chameleon) Contexts
## Multi-Context Contexts
## Encoding Cleansing Contexts
## Reverse Encoding Contexts
## Lower-level Contexts

Simple Context
A simple context is one which does not change. The XSS (Cross Site Scripting) Prevention Cheat Sheet deals with these types (HTML, HTML attribute, URL, CSS, and JavaScript contexts). A simple context is clearly visible and does not have any other contexts mixed in. For example, in an HTML page:

If the encoding is done correctly and a single character set is explicitly specified and used consistently, an attacker should not be able to break out of this context.
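Because a simple context never changes, each one can be paired with exactly one encoder. The sketch below illustrates that pairing in Python for three of the simple contexts. The `ENCODERS` table, `encode_for`, and `js_encode` are hypothetical names introduced here, and the JavaScript encoder (hex-escaping everything except ASCII letters and digits) is one possible approach rather than a reference implementation.

```python
import html
from urllib.parse import quote


def js_encode(text: str) -> str:
    """Escape every character except ASCII letters and digits for a JS string."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ord(ch) < 0x100:
            out.append(f"\\x{ord(ch):02X}")
        else:
            # Emit UTF-16 code units so characters outside the BMP also work.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append(f"\\u{units[i]:02X}{units[i + 1]:02X}")
    return "".join(out)


# One encoder per simple context (hypothetical table for illustration).
ENCODERS = {
    "html": lambda s: html.escape(s, quote=True),
    "url": lambda s: quote(s, safe=""),
    "javascript": js_encode,
}


def encode_for(context: str, untrusted: str) -> str:
    return ENCODERS[context](untrusted)


value = "a&b=<c>"
for context in ENCODERS:
    print(context, encode_for(context, value))
```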

Non-trivial Contexts
Simple contexts were common in request/response applications; however, most modern applications (Web 2.0/AJAX) have more complicated contexts. These contexts arose from the need to execute JavaScript/VBScript in response to user-generated events.

With the birth of Web 2.0/AJAX applications came the proliferation of client-side JavaScript/VBScript and the vulnerabilities associated with these scripting languages. JavaScript and VBScript are powerful languages with tight integration with HTML. HTML can execute script both from tags and from HTML tag attributes. As a result, data placed in script-executing attributes can end up in a different context from the one it appeared to be placed in. One example of this is the Mutating (Chameleon) context.

Mutating (Chameleon) Context
The Mutating (Chameleon) context is a context which looks like one context but really is another. It is best understood by looking at some examples:

Because of the src attribute, some developers might get confused and think the data is being output in a URL context. In this case, URL encoding "*userData*" accomplishes nothing, because the browser implicitly reverses (decodes) the URL encoding before the script runs; we are encoding for the wrong context. In the example above, "*userData*" should have been JavaScript encoded because it is in a JavaScript context.
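The effect can be sketched in Python under one stated assumption: the browser percent-decodes the body of a "javascript:" URL before handing it to the script engine, which is simulated here with `unquote`. The payload and the `js_encode` helper are hypothetical and exist only for illustration.

```python
from urllib.parse import quote, unquote


def js_encode(text: str) -> str:
    """Hex-escape everything except ASCII letters and digits (\\xHH form)."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ord(ch) < 0x100:
            out.append(f"\\x{ord(ch):02X}")
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append(f"\\u{units[i]:02X}{units[i + 1]:02X}")
    return "".join(out)


# Hypothetical attacker input meant to close a JS string and inject a call.
payload = "');alert(1);//"

# Wrong encoding for the context: the simulated browser undoes it completely.
url_encoded = quote(payload, safe="")
seen_by_script_engine = unquote(url_encoded)

# Encoding for the real (JavaScript) context survives the URL decoding step.
js_encoded = js_encode(payload)
still_encoded = unquote(js_encoded)

print(url_encoded, "->", seen_by_script_engine)
print(js_encoded, "->", still_encoded)
```

In this simulation the URL-encoded payload arrives at the script engine intact, while the JavaScript-encoded form contains no "%" sequences to reverse and no raw quote to close the string.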

Another example is the following:

The *STYLE* attribute seems to indicate that *userData2* is placed in a CSS context, but the *expression* method is utilized to call JavaScript. Using CSS encoding here would not stop XSS from occurring.

In the examples above, utilizing JavaScript encoding would mitigate XSS in the immediate context in which the output data was placed, but it would not mitigate DOM-based XSS. If userData is passed to a vulnerable JavaScript/VBScript function within the processing logic of the script methods being called, the encoded data could still become exploitable. Because data which is passed into JavaScript/VBScript functions could theoretically flow to multiple vulnerable contexts, the next context type is called a multi context.

Multi Context
A multi context primarily occurs when user data is passed as a parameter to a script function from a URL, event handler method, CSS expression method, or script context. In order to use the appropriate output encoding, the developer needs to trace the data flow of the untrusted data and make sure that it does not flow to any vulnerable contexts (an HTML context using document.write or innerHTML, a URL context using window.location and the "javascript:" protocol, a CSS context setting background URL attributes, a JavaScript context using the eval method, etc.). In some cases, if the parameter is only ever output in an HTML context using document.writeln, you can HTML encode the data and then JavaScript encode it (in that order) before placing it in the page.

Let’s look at an example to solidify this concept:

In order to stop JavaScript inlining in the "*Run Report*" link, the "reportName" request parameter needs to be JavaScript encoded. However, this may not be enough.

If the doCalculations function looked like the following, then JavaScript encoding would not be sufficient to mitigate XSS.

If this was the only dangerous usage of "reportName" then the appropriate output encoding to use would be HTML encoding followed by JavaScript encoding. The JavaScript encoding would stop any attempts to inline code in the anchor href attribute and the HTML encoding would stop any XSS attempts in the document.writeln call of the doCalculations function.
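The importance of the order can be sketched in Python. The sketch assumes a hypothetical flow in which the JavaScript engine first decodes the string literal and the result is then written into the page as HTML; `js_encode` and `js_decode` are illustrative helpers that simulate the encoder and the script engine, not functions from ESAPI or any other library.

```python
import html
import re


def js_encode(text: str) -> str:
    """Hex-escape everything except ASCII letters and digits."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ord(ch) < 0x100:
            out.append(f"\\x{ord(ch):02X}")
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append(f"\\u{units[i]:02X}{units[i + 1]:02X}")
    return "".join(out)


def js_decode(literal: str) -> str:
    """Simulate the script engine resolving \\xHH and \\uHHHH escapes."""
    return re.sub(
        r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1) or m.group(2), 16)),
        literal,
    )


payload = "<script>alert('xss')</script>"  # hypothetical reportName value

# Correct order: HTML encode first, then JavaScript encode.
safe_for_link = js_encode(html.escape(payload, quote=True))
written_to_page = js_decode(safe_for_link)  # what document.writeln receives

# Reversed order: the HTML encoder finds nothing left to encode.
wrong_order = html.escape(js_encode(payload), quote=True)
written_when_wrong = js_decode(wrong_order)

print(written_to_page)
print(written_when_wrong)
```

With the correct order the value written to the page is still HTML encoded; with the reversed order the raw script tag reaches the HTML context.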

But multi contexts can get even more complicated when user data passed in to a script function may be output to multiple contexts (URL and HTML, for example). In this case, trying to use multiple nested encodings (URL encoding, then HTML encoding, then JavaScript encoding, as an example) would likely break the application. Instead, we would have to fall back on JavaScript encoding to prevent inlining of code in the anchor href attribute, and then rely on additional JavaScript encoding methods to perform the proper output encoding in script code, where the context is known. ESAPI4JS is a JavaScript library which provides JavaScript encoding methods to address DOM-based XSS.

In the new doCalculations function above, the proper encoding takes place in a location where the context is readily discernible. This allows application developers to focus only on the primary context while the JavaScript developers deal with the contexts which they create. Ideally, using a JavaScript library to do output encoding makes sense. Although the example above may seem a bit contrived, there are other contexts where a JavaScript encoding library would be required. The Encoding Cleansing Context is one of them.

Encoding Cleansing Context
The encoding cleansing context occurs when data is stored in the value attribute of an HTML element. Web 2.0/AJAX applications often store commonly used data or initialization data in the value attributes of hidden fields, in textarea elements hidden with CSS, or directly in JavaScript variables. The problem is that data placed in the value attribute of an HTML element loses its HTML encoding when the value is later retrieved by DOM methods.

Given the following HTML, you are not directly vulnerable to XSS:

The value in the source is

However, if a DOM method were used to retrieve the value and write it out using document.writeln, you would have exploitable XSS.

The above line would cause a pop up with "123" as the string message.
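The loss of encoding can be simulated in Python with the standard library's `html.parser.HTMLParser`, which, like a browser's DOM, resolves character references in attribute values. This is a sketch under stated assumptions: the hidden field markup, its `id`, and the `ValueReader` class are hypothetical, and the parser only stands in for a DOM lookup of the field's value.

```python
from html import escape
from html.parser import HTMLParser


class ValueReader(HTMLParser):
    """Collect the decoded 'value' attribute of every start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "value":
                self.values.append(value)


untrusted = "<script>alert(123)</script>"

# Server side: the value is HTML encoded before being placed in the page.
page = f'<input type="hidden" id="seed" value="{escape(untrusted, quote=True)}">'

# Client side (simulated): reading the attribute returns the decoded value.
reader = ValueReader()
reader.feed(page)
dom_value = reader.values[0]

# Encoding again at the point of output restores the protection.
re_encoded = escape(dom_value, quote=True)

print(page)
print(dom_value)
print(re_encoded)
```

The page source contains no raw script tag, yet the value read back from the attribute does, which is why the encoding has to be reapplied on the client just before the value is written out.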

Server-side output encoding does not work here. The only choice is to utilize a JavaScript encoding library like ESAPI4JS and change the line above to the following:

In a similar vein to contexts which cleanse encodings, there are other contexts which will reverse encode certain encodings.

Reverse Encoding Contexts
Reverse encoding contexts are contexts which reverse (decode) specific encodings. They are important for understanding the risks of using the wrong encoding for a certain context. In a good number of cases, using the wrong encoding may mitigate certain risks with the added side effect of breaking the app. For example, HTML encoding all of the output data in between tags may mitigate XSS attacks but could break your JavaScript/VBScript code (depending on which characters are encoded), because the JavaScript/VBScript parser does not understand HTML encoding. In other cases, developers have the mistaken belief that HTML encoding will stop attacks in any context. Finally, some WAFs are configured to assume that HTML encoded data is safe, allowing attackers to bypass the WAF by placing HTML and URL encoded content in contexts which automatically reverse the encoding before executing the data.

The locations where reverse encoding occurs depend on the browser, but with IE6 a tag’s src attribute will reverse HTML encoding. In all other browsers, the JavaScript event handler methods of HTML tags (on methods like onClick, onMouseOver, etc.) will reverse HTML encoding. CSS style expression attributes reverse CSS encoding. URL attributes ("src", "href", "window.location", etc.) will reverse HTML and URL encoding of data after the "javascript:" protocol indicator.

Here are some examples which are exploitable:
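As a rough sketch of the URL attribute case, the Python snippet below simulates the two decoding passes described above with `html.unescape` and `urllib.parse.unquote`. The attribute value and the `naive_filter_allows` check are hypothetical; the check stands in for any filter that inspects the raw markup and assumes encoded data is harmless.

```python
import html
from urllib.parse import unquote


def naive_filter_allows(markup: str) -> bool:
    """Hypothetical filter that only inspects the raw, still-encoded text."""
    lowered = markup.lower()
    return "alert(" not in lowered and "<script" not in lowered


# Hypothetical href value mixing HTML and URL encoding after "javascript:".
attribute_value = "javascript:&#97;lert%281%29"

after_html_decoding = html.unescape(attribute_value)  # attribute parsing
after_url_decoding = unquote(after_html_decoding)     # javascript: URL handling

print(naive_filter_allows(attribute_value))
print(after_url_decoding)
```

The filter sees nothing suspicious in the raw value, but the text that finally reaches the script engine in this simulation is an ordinary function call.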

Trying to utilize the wrong encoding for a given context is a hit-or-miss proposition. In addition, using the wrong encoding has a tendency to break your application. To output encode properly, use the correct encoding for the context where the data is placed and where it will eventually be output. If you think this is confusing, you haven’t seen the worst. When implementing proper output encoding, a developer also needs to consider how bytes are converted into characters given the character set of the output context.

Lower-level Contexts
Lower-level contexts deal with how a raw byte stream is converted to characters in the output context receiving the untrusted data. Problems typically occur when there is a mismatch of character sets. An attacker can take advantage of a character set mismatch, or of a missing character set declaration, to bypass your encoding routines.

Let’s look at an example of output encoding for SQL injection given a character set mismatch. Some libraries escape single quotes by prepending a backslash to the single quote. Normally this isn’t a problem, but it becomes one when there is a character set mismatch and a multi-byte character in the output character set ends with the same byte that is used to escape dangerous characters.

The English language is pretty compact because its letters and digits fit in 62 values. Together with punctuation and control characters, everything needed for English fits into 128 values (ASCII); if you add the western European languages, all of the letters used fit into 256 values (ISO-8859-1). Because all of the English and western European letters fit within an eight-bit representation, each character is represented by two hexadecimal digits; 5c, for example, is a backslash. Asian character sets comprise thousands of symbols and characters. In order to address all of them, each Asian character is represented by a 16-bit value (four hexadecimal digits). Most of these character sets keep backward support for the single-byte ASCII values as single bytes. One example of an Asian character set which exhibits this behavior is GBK. In the GBK character set *0xbf5c* represents a valid single Asian character (*縗*) and *27* (hex) by itself represents a single quote.

If the application server is running an 8-bit character set then, for simplicity’s sake, let’s assume that the application server looks at two hexadecimal digits at a time and converts them to one character. Once all of the incoming data has been parsed, the application code sends the data to your encoding routines. If an attacker sends *0xbf27…* to your application running the 8-bit character set, the byte *bf* will be treated as a character in its own right (*¿* in ISO-8859-1). This character is not recognized as dangerous and passes through the output encoding routine untouched, but when the routine sees the single quote character (*27* in hex) it prepends a "\" (*5c* in hex) before the *27*. This results in *0xbf5c27…*. When passed as a last name to a database running the GBK character set, the query will look like the following:

`Select * from Users where last_name = '縗' or 1=1`
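The byte-level steps that lead to this query can be reproduced with a short Python sketch. It assumes the application side works in ISO-8859-1 and the database side in GBK, as in the scenario described; `naive_escape` is a hypothetical escaping routine, and the query is built by string concatenation purely to show what the database would see.

```python
def naive_escape(raw: bytes) -> bytes:
    """Hypothetical escaper: backslash-escape quotes, working in ISO-8859-1."""
    text = raw.decode("iso-8859-1")
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return escaped.encode("iso-8859-1")


# Attacker input: the byte 0xbf followed by a single quote (0x27).
attack = b"\xbf\x27 or 1=1"

escaped = naive_escape(attack)
print(escaped.hex())  # the 5c backslash now sits between bf and 27

# The database decodes the same bytes as GBK: bf 5c fuse into one character.
as_seen_by_database = escaped.decode("gbk")
query = "Select * from Users where last_name = '" + as_seen_by_database + "'"
print(query)
```

Once the bytes are read as GBK, the backslash inserted by the escaper no longer exists as a separate character, so the single quote that follows it is unescaped.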

The reason this occurred is that *0xbf5c* is the character *縗*. In the GBK character set ASCII characters are preserved as single-byte representations, so the *27* was interpreted as a *single quote*. To summarize, the encoding routines were looking at the characters using one character set and the database interpreted the bytes using a different character set.

In the case where a character set is not explicitly defined, an attacker can pass in bytes which look benign in the encoding routine’s character set but get converted to dangerous characters in the output context. For example, let’s take the case where the application does not specify a character set and the page looks like the following:

The output tag in the page encodes the "<", ">", "&", double quote, and single quote characters. Normally that is sufficient to stop XSS in an HTML context. However, when the response page specifies no character set, some browsers will try to guess the encoding from the characters in the page. What would happen if the attacker were to send in the following as "*userData*"?

This would get past the tag’s encoding routine, and the browser would execute the UTF-7 encoded script tags. In general, to avoid this you will need to enforce a unified character set across all of your application components and tiers.
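A minimal Python sketch of this bypass follows. It assumes a hypothetical UTF-7 payload and uses `html.escape` as a stand-in for the page's HTML encoder; decoding the bytes with Python's `utf-7` codec simulates a browser that has guessed UTF-7 for a response with no declared character set.

```python
import html

# Hypothetical input: a script tag written with UTF-7 shift sequences.
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# The HTML encoder finds none of the five characters it looks for.
encoded = html.escape(payload, quote=True)

# A browser that sniffs the response as UTF-7 decodes the bytes like this.
sniffed = encoded.encode("ascii").decode("utf-7")

print(encoded == payload)
print(sniffed)
```

Declaring the character set explicitly in the response (for example, a Content-Type header with an explicit charset) removes the guesswork that makes this reinterpretation possible.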

Using the Right Tools for Encoding
There are many encoding libraries out there in the wild and some are more secure than others. When evaluating an encoding library you will need to understand several criteria to determine if the encoding library is secure.

What Characters Does It Encode and What Approach Does It Take
Ideally you are looking to see which characters are encoded for each context. If the encoding libraries use a blacklist approach, compare them to see which library encodes more characters. An encoding library which takes a whitelist approach (encodes all characters except numbers and letters) is preferable to one which uses a blacklist approach (only encodes a specified set of characters). ESAPI and ESAPI4JS are two well-written implementations which can be used for comparison purposes.
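The difference between the two approaches can be sketched in Python. Both encoders below are hypothetical illustrations, not the ESAPI implementations, and the payload assumes a scenario where the value is written into an unquoted HTML attribute, so a space is enough to start a new attribute.

```python
# Blacklist: only a fixed set of characters is encoded.
BLACKLIST = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#x27;"}


def blacklist_encode(text: str) -> str:
    return "".join(BLACKLIST.get(ch, ch) for ch in text)


def whitelist_encode(text: str) -> str:
    """Whitelist: everything except ASCII letters and digits is encoded."""
    return "".join(
        ch if ch.isascii() and ch.isalnum() else f"&#x{ord(ch):X};"
        for ch in text
    )


# Hypothetical payload for an unquoted attribute such as <input value=...>.
payload = "x onmouseover=alert(1)"

print(blacklist_encode(payload))
print(whitelist_encode(payload))
```

The blacklist encoder returns the payload unchanged because it contains none of the listed characters, while the whitelist encoder neutralizes the space, the equals sign, and the parentheses.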

Standardize the Encoding Library Across the Enterprise
Once an encoding library has been selected, ensure that all applications developed in your organization use it consistently. This will reduce the mistakes made by developers who do not understand all of the implications of improper output encoding.

Where to Utilize Output Encoding
Some developers get confused and try to output encode all incoming request parameters, including data before it is stored in the database. This will corrupt the data in the database if it is later used in a context other than a browser. The best practice is to output encode as close as possible to the context you are trying to address. Ideally, JavaScript developers would output encode using ESAPI4JS just before calling document.writeln or setting the innerHTML attribute. Server-side developers would then only need to output encode for the primary context of the output data instead of using multiple encodings.
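The corruption caused by encoding too early can be shown with a small Python sketch. The stored name and the two render functions are hypothetical; they stand in for a browser-facing view and a non-browser consumer such as a plain-text report.

```python
import html

raw_name = "O'Brien & Sons"

# Anti-pattern: encoding the value before it is stored.
stored_encoded = html.escape(raw_name, quote=True)


def render_html(name: str) -> str:
    """Encode at the point of output, for the HTML context only."""
    return f"<td>{html.escape(name, quote=True)}</td>"


def render_text_report(name: str) -> str:
    """A non-browser context: no HTML encoding is wanted here."""
    return f"Customer: {name}"


print(render_text_report(raw_name))        # correct
print(render_text_report(stored_encoded))  # entities leak into the report
print(render_html(raw_name))               # encoded once, at output
print(render_html(stored_encoded))         # encoded twice
```

Storing the raw value and encoding only where it is output keeps both consumers correct; storing the encoded value corrupts the report and double encodes the HTML.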

Summary
Output encoding is not a simple endeavor. Getting it correct requires a developer to understand all of the moving pieces to make sure that the correct encoding is being applied to the appropriate context. Remember that you want to understand where untrusted data is being output/used, how your encoding is working, and how your environment processes characters. You also want to make sure your selected encoding library is used consistently. Good luck and email me (abraham.kang@owasp.org) with any questions.