GSoC2013 Ideas/OWASP ZAP CMS SCANNER

Status: work in progress (24-07-2013). WebApp Scanner on Git: https://github.com/abdelhadi-azouni/zap-cmss-extension

Introduction
Recent statistics show that CMS usage has grown considerably over the last five years: WordPress and Joomla alone power more than 6% of the top 1 million sites. Usage has grown in both corporate and personal sites, and with the chaotic development of plugins and components in these CMSs, the risk of vulnerabilities and flaws keeps increasing.



OWASP ZAP CMS SCANNER (ZAP CMSS) is a scanner with more specialized search methods.

Functionalities
- Enumerating plugins, components and themes in the CMS (passive and aggressive search methods)
  - Enumerating from page content
  - Enumerating from lists (or a database)

- Version fingerprinting (with multiple methods) - labor-intensive to add signatures
  - Manually locating the version in files, or building regexes for headers
  - Built-in options to remove identifiers (e.g. the meta generator tag)

- Enumerating vulnerable plugins, themes and components
  - Enumerating from a well-known list
  - Enumerating from web search
  - Enumerating using the ZAP API

- Enumerating security measures (firewalls, security plugins, ...)



Matches
Matches are made with:

- Text strings (case-sensitive)

- Regular expressions

- Google Hack Database queries (limited set of keywords)

- MD5 hashes

- URL recognition

- HTML tag patterns

- Custom Java code for passive and aggressive operations

Features
- Control the trade-off between speed/stealth and reliability

- Plugins include example URLs

- Performance tuning: control how many websites are scanned concurrently

- Result certainty awareness

- Fast

- Low resource usage

- Accurate (Low FP/FN)

- Resistant to hardening/banner removal

- Super easy to support new versions/apps

ZAP CMSS modules
ZAP CMS scanner extension consists of four main modules:

1- CMS detection module
Aims to identify which CMS the application at a given URL uses; two methods are used by the CMS detector:

a- Passive search
based on page content analysis: 1 - using text strings (case-sensitive); 2 - using regexes

these two methods are used to recognize HTML tags (e.g. the meta generator tag) or to extract text from the page revealing the tool with which the application was built

3 - using Google hacks (from a list of predefined keywords)

b- Aggressive search
Consists of trying known, unique URLs from a predefined list; the presence of these paths indicates the CMS used with certainty. The problem with this method is that it becomes less effective when several CMSs must be supported, because of the absence of a single telltale file in that case.
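The aggressive check above can be sketched as follows; this is a minimal illustration of the idea (probing a known path and reading the HTTP status), not the actual extension code, and all class and method names are ours:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of the aggressive search idea: probe a known,
// CMS-specific path and treat a 200 response as a strong indicator.
// Illustrative only; not the actual ZAP extension code.
public class AggressiveProbe {

    // Decision logic kept separate so it is easy to extend
    // (e.g. to handle redirects or soft-404 pages later).
    static boolean indicatesPresence(int statusCode) {
        return statusCode == 200;
    }

    // Probes baseUrl + path with a HEAD request to avoid downloading the body.
    static boolean pathExists(String baseUrl, String path) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(baseUrl + path).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            return indicatesPresence(conn.getResponseCode());
        } catch (Exception e) {
            return false; // unreachable host counts as "not found"
        }
    }

    public static void main(String[] args) {
        // e.g. the classic WordPress indicator file: /wp-login.php
        System.out.println(indicatesPresence(200)); // prints true
        System.out.println(indicatesPresence(404)); // prints false
    }
}
```

A real scan would iterate over the predefined path list and lower the result certainty when several candidate CMSs share a path.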

2- Plugin enumerating module
The purpose of this module is to list the plugins used by the application at the given URL; both passive and aggressive methods are possible.

a- Passive search: analyzing the page content using regex

b- Aggressive search: using a list of plugin names, which are compared against names found at specific URLs; for example, WordPress plugins under url + /wp-content/plugins/

3- Fingerprinting module
Used to identify - with controllable accuracy, depending on the chosen method and the allowed processing time - the version of the CMS and/or the plugins used. A combined passive/aggressive search is applied, using:

- Content analysis of specific files; example: a readme file containing the information "package to Version 3.0.x"

- A list / database containing unique file links / paths that determine the version of the CMS / plugin

- A list / database containing the correspondence plugin version / file path / hash digest: after verifying the presence of a file with a unique name but not unique content, its digest is compared with the one in the database, and a match indicates the component version. Here is an example of a WordPress versions database:



4- Vulnerabilities Enumeration module
Used to produce the list of vulnerabilities affecting a given plugin; this module is called after the CMS detection and plugin enumeration steps. This module uses:

1- a database containing the correspondence between plugin name/version and vulnerability list

2- a web search based on a list of links to useful sites

Transition to WebApp Scanner
after pushing the research on fingerprinting techniques further and advancing into detailed design, it became apparent that it is more appropriate and useful to go wider and work on web applications in general, not only CMSs. I spoke with Simon about it, and he was pleased with the proposal, so I started implementing the web application fingerprinter's core methods based on existing tools such as BlindElephant and Wappalyzer.

Passive search
consists of looking in the target application (HTML, HTTP header content, ...) for patterns that determine, as likely or definite, the name and version of the technology used to build the application; passive search changes nothing in the requests or in the content.

Used techniques
web page content analysis: looking for patterns in the HTML document. The scan tool downloads the web page and performs pattern searches using regexes, usually from predefined lists that are kept updated. Several patterns may indicate the technology used to build the application. Here are the indicators:

- The HTML content itself: by applying specific regex patterns to the page content, we can determine with high probability the technology used to build the application; for example, the (Wappalyzer-style) pattern: "Powered by (?:<a href=[^>]+cs-cart\.com|CS-Cart)".

- The meta generator tag: for a given app, if the regex pattern "WordPress( [\d.]+)?\;version:\1" (in Wappalyzer's notation, where "\;version:\1" marks the capture group holding the version) matches the "content" attribute of the meta tag named "generator", then the application is, with high probability, built with the indicated version of WordPress.

most WordPress websites can be identified by this meta HTML tag, e.g. <meta name="generator" content="WordPress 3.5" />, but a minority of WordPress websites remove this identifying tag; this does not thwart our scanner, as other tests are implemented to reveal a WordPress application.

- The script tag: its contents may contain the name of the implementation technology; the search is again done with regex patterns.

This method is suitable for most applications and gives good results in very acceptable time, but it is sensitive to changes in the page's code, as it is always possible to remove these indicators.
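As a concrete illustration of the meta generator check described above, here is a small sketch in Java; the regex is a simplified, illustrative form of the Wappalyzer-style pattern quoted earlier, not our production signature:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the passive meta-generator check: inspect the "content"
// attribute of the generator meta tag. The pattern is an example
// signature, deliberately simplified.
public class GeneratorCheck {

    private static final Pattern WP = Pattern.compile("WordPress ?([\\d.]+)?");

    // Returns the detected version, "" for WordPress without a version,
    // or null if the generator content does not look like WordPress.
    static String wordpressVersion(String generatorContent) {
        Matcher m = WP.matcher(generatorContent);
        if (!m.find()) return null;
        return m.group(1) == null ? "" : m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(wordpressVersion("WordPress 3.5.1")); // prints 3.5.1
        System.out.println(wordpressVersion("Joomla! 1.5"));     // prints null
    }
}
```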

existing tools
among the most efficient and best-known tools that use passive search is Wappalyzer; it was originally a Firefox and Chrome extension written in JavaScript, but it has since been rewritten in several languages such as Python. Wappalyzer uses a long list of regexes, structured by application type and pattern type (tag name, headers, ...), in a JSON file.

home page : http://wappalyzer.com/ GIT repo: https://github.com/ElbertF/Wappalyzer

WhatWeb is another webapp fingerprinting tool which also uses passive search, though not as deeply as Wappalyzer.

home page: http://www.morningstarsecurity.com/research/whatweb GIT repo: https://github.com/urbanadventurer/WhatWeb Here is a brief comparison to other fingerprinting tools: https://github.com/urbanadventurer/WhatWeb/wiki/Related-Projects

our implementation
we implemented code similar to Wappalyzer, using the same JSON list-of-regexes file. Our tool first connects to the target URL, then downloads a DOM document containing the HTML content of the page; the program then applies the regex patterns (by type), reading them once from the JSON file into an object of the JsonObject class.
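The matching loop can be sketched like this; in our real code the map below is populated from Wappalyzer's JSON file via json-simple and the page comes from jsoup, but here it is hard-coded so the sketch stays self-contained:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of the matching loop: apply each technology's HTML pattern
// to the downloaded page once and collect the names that match.
// The map is a hard-coded stand-in for the JSON-loaded entries.
public class PatternMatcher {

    static final Map<String, Pattern> HTML_PATTERNS = new LinkedHashMap<>();
    static {
        // illustrative signatures, not the full Wappalyzer list
        HTML_PATTERNS.put("WordPress", Pattern.compile("/wp-content/"));
        HTML_PATTERNS.put("Drupal", Pattern.compile("Drupal\\.settings"));
    }

    static List<String> detect(String html) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : HTML_PATTERNS.entrySet()) {
            if (e.getValue().matcher(html).find()) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "<link href='/wp-content/themes/x/style.css'>";
        System.out.println(detect(page)); // prints [WordPress]
    }
}
```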

perspectives
implement other methods of passive search, such as those present in WhatWeb: access to .htaccess

dependencies: json-simple-1.1.1 (JSON parser), jsoup-1.7.2 (HTML parser)

aggressive search
consists of brute-forcing the target host in search of indicator files; the presence of such a file reveals a certain technology, and sometimes its version as well. In most cases, however, the version is determined by comparing the MD5 digest of the content of the file found on the host with the one present in a pre-built database.

used techniques
File and folder presence (HTTP response codes): this approach does not download the page; instead it looks for obvious traces of an application by hitting URLs directly, building lists of found and not-found paths along the way. In the early days of the internet this was easy: download the headers, check whether the response is 200 OK or 404 Not Found, and you are done.



However, nowadays people put up custom 404 pages that actually send 200 OK when the page is not found. This complicates the effort, hence the new approach: download the default page (200 OK); download a file that is guaranteed not to exist and mark its response as a template for 404; then proceed with the detection logic. Based on this assumption and knowledge, this kind of tool looks for known files and folders on a website and tries to determine the exact application name and version. Examples of such indicators: wp-login.php => WordPress; /owa/ => Microsoft Outlook web frontend.

reference : http://anantshri.info/articles/web_app_finger_printing.html
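A minimal sketch of the soft-404 handling described above; the similarity measure (a crude common-prefix ratio) is a stand-in for the fuzzier comparisons real tools use, and all names are illustrative:

```java
// Sketch of soft-404 handling: fetch a page that cannot exist, keep
// its body as a "404 template", and treat any probe whose body closely
// matches the template as not found, regardless of the status code.
public class Soft404 {

    // Crude similarity: fraction of the shorter body that matches the
    // template from the start. Real tools use fuzzier comparisons.
    static double similarity(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int same = 0;
        while (same < n && a.charAt(same) == b.charAt(same)) same++;
        return n == 0 ? 1.0 : (double) same / n;
    }

    // A probe "exists" only if its body differs enough from the template.
    static boolean looksFound(String body, String notFoundTemplate) {
        return similarity(body, notFoundTemplate) < 0.9;
    }

    public static void main(String[] args) {
        String template = "<html>Oops, page not found</html>";
        System.out.println(looksFound(template, template));                      // prints false
        System.out.println(looksFound("<html>Welcome to the admin page</html>",
                                      template));                               // prints true
    }
}
```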

Checksum-based identification: this is a relatively new approach and currently the most accurate one. The technique works as follows: 1) create checksums for files locally and store them in a DB; 2) download a static file from the remote server; 3) compute its checksum; 4) compare it with the checksum stored in the DB to identify the version. One of the best implementations of this technique is BlindElephant.

reference: http://anantshri.info/articles/web_app_finger_printing.html
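The four steps above can be sketched as follows; the digests here are computed from made-up sample contents, not from real WordPress files:

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the checksum steps: hash a static file's bytes and look
// the digest up in a locally built version database.
public class ChecksumId {

    static String md5(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // digest -> version, as precomputed in step 1 of the pattern above
    static String identify(byte[] remoteFile, Map<String, String> db) throws Exception {
        return db.getOrDefault(md5(remoteFile), "unknown");
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> db = new HashMap<>();
        byte[] v35 = "readme for 3.5".getBytes("UTF-8"); // made-up file content
        db.put(md5(v35), "3.5");                 // step 1: build DB locally
        System.out.println(identify(v35, db));   // steps 2-4: prints 3.5
    }
}
```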

existing tools
BlindElephant: among the most robust implementations of checksum-based identification. It attempts to discover the version of a (known) web application by comparing static files at known locations against precomputed hashes for versions of those files in all available releases. The technique is fast, low-bandwidth, non-invasive, generic, and highly automatable. BlindElephant is entirely checksum-verification oriented. home page: http://blindelephant.sourceforge.net/ Sourceforge repo: http://sourceforge.net/projects/blindelephant/

our implementation
we wrote code similar to BlindElephant (which is in Python), using the same updated database (available on its Sourceforge repo). The database files are in PKL format, a Python-specific object-serialization format. To process this data in Java we needed to convert it to a universal format, so we chose to convert it into XML files using a small Python script. Every webapp has its own XML file, and each XML file has the following format:

the resulting output (name and version of the webapp or technology) is discovered according to the following schema:

As described by the author on its home page: "The Static File Fingerprinting Approach in One Picture"

dependencies: jdom-2.0.5 (XML parser)

Component enumeration and fingerprinting
the enumeration provides a list of all components, modules, plugins or themes related to the target application's technology. Fingerprinting detects the version of an already-detected module; the two processes (enumeration and fingerprinting) generally run in parallel: fingerprinting is applied to each detected component.

why enumerate and fingerprint plugins? Enumeration can detect the presence of vulnerable plugins and/or components, which allows an attacker (or a pen-tester) to directly target known vulnerabilities, or to fix them.

This process is especially effective in the case of CMSs. In fact, most CMSs are built on a system of plugins and extensions, and the vulnerable ones are generally well known and classified in vulnerability databases.

passive search
pending

existing tools
DPScan for Drupal (https://github.com/cervoise/DPScan)

aggressive search
aggressive search (applied to listing components and extensions) consists of brute-forcing component and extension paths under the target application's URL, using path lists. The result (presence or absence of a given component) is obtained by analyzing the returned HTTP code. After extracting the name of a component present in the application, we need to identify its version; there is no single general technique for this, so we use a variety of methods, such as reading the readme file. Example: if the README file is present and accessible in a WordPress component, we can simply open it and read the Stable tag entry, obtained by applying the regex pattern "Stable tag: (.+)" to this file.

used techniques
brute-force the site content using a list of paths of known components. This detects the plugin itself; to detect its version, various methods are possible, depending on the component and the webapp technology.
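The readme-based version extraction can be sketched like this; the pattern is a cleaned-up form of the "Stable tag" regex mentioned above, and the sample readme text is made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of version extraction from a WordPress plugin's readme.txt:
// pull the value of the "Stable tag" line. Illustrative only.
public class StableTag {

    private static final Pattern STABLE = Pattern.compile("Stable tag: *(\\S+)");

    static String version(String readme) {
        Matcher m = STABLE.matcher(readme);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String readme = "=== Sample Plugin ===\nStable tag: 2.5.7\nRequires at least: 3.0";
        System.out.println(version(readme)); // prints 2.5.7
    }
}
```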

existing tools
OWASP ODZ MultiCMS Scanner (https://www.owasp.org/index.php/OWASP_Odz_MultiCMSScanner) (https://github.com/islamoc/odz). The majority of enumeration tools are specific, e.g. OWASP Joomscan for the Joomla CMS (http://sourceforge.net/projects/joomscan/), WPScan for WordPress (http://wpscan.org/) and also Plecost for WordPress (https://code.google.com/p/plecost/). Some tools work on a combination of CMSs and/or webapps. E.g.:

- CMS Explorer: written in Perl, designed to reveal the specific modules, plugins, components and themes that various CMS-driven websites are running. CMS Explorer currently supports module/theme discovery for the following products: Drupal, WordPress, Joomla!, Mambo.

Project home on Google Code: https://code.google.com/p/cms-explorer/

- Web-Sorrow: a Perl-based tool for misconfiguration, version detection, enumeration, and server-information scanning. It is entirely focused on enumeration and collecting info about the target server. Web-Sorrow is a "safe to run" program, meaning it is not designed to be an exploit or to perform any harmful attacks.

Project home on Google Code: https://code.google.com/p/web-sorrow/

CMS-explorer vs WebSorrow

Concerning databases, they use the same list of plugins; CMS Explorer also uses a list of themes, while Web-Sorrow does not. Both handle Joomla, WordPress and Drupal, and Web-Sorrow performs more detection work. Both appear to use the "fuzzDB" database, which is kept updated and contains many other files for other types of audits. fuzzDB project home on Google Code: https://code.google.com/p/fuzzdb/

our implementation
passive search: not implemented yet

for aggressive search, we hesitated between using existing databases and creating our own. We decided to use the existing ones, combining files from various existing tools, and to implement a modular system: for each CMS or webapp we create a component-detection module, because we figured out that the effort of unifying the database formats would exceed that of creating a module per webapp, in addition to the webapp-specific treatment needed for version extraction.
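The modular system can be sketched with a small interface; all names here are ours and purely illustrative, not part of the ZAP API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the modular design: one component-detection module per
// webapp behind a shared interface, so each module keeps its own
// database format and version-extraction logic.
public class ModularScanner {

    interface ComponentModule {
        String appName();                      // which webapp this module handles
        List<String> enumerate(String targetUrl);
    }

    static class WordPressModule implements ComponentModule {
        public String appName() { return "WordPress"; }
        public List<String> enumerate(String targetUrl) {
            // real module: brute-force targetUrl + "/wp-content/plugins/<name>/"
            return new ArrayList<>();
        }
    }

    // pick the module matching the CMS found by the detection module
    static ComponentModule moduleFor(String app, List<ComponentModule> modules) {
        for (ComponentModule m : modules) {
            if (m.appName().equalsIgnoreCase(app)) return m;
        }
        return null; // no module: webapp not supported yet
    }

    public static void main(String[] args) {
        List<ComponentModule> mods = new ArrayList<>();
        mods.add(new WordPressModule());
        System.out.println(moduleFor("wordpress", mods).appName()); // prints WordPress
    }
}
```

Adding support for a new CMS then amounts to dropping in one new module, with no change to the shared scanning loop.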

other useful sources
https://github.com/urbanadventurer/WhatWeb/wiki/How-to-develop-WhatWeb-plugins http://resources.infosecinstitute.com/prototype-model-web-application-fingerprinting/

WebApp Scanner on GIT : https://github.com/abdelhadi-azouni/zap-cmss-extension