Sophos shares data and tech to advance the state of AI in cyber security

Including what it calls the first production-scale malware research dataset available to the general public

Chuck Martin

December 15, 2020

In an effort to promote the use of AI-based systems in cyber security, British security software vendor Sophos is sharing datasets, tools, and methodologies in four separate areas.

These include research, protection methods, malware detection, and signature generation tools.

“With SophosAI’s new initiative to open its research, we can help influence how AI is positioned and discussed in cyber security moving forward,” said Joe Levy, chief technology officer of the company.

Sharing is caring

For research purposes, Sophos is sharing SOREL-20M (Sophos-ReversingLabs – 20 million), a production-scale dataset with metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples. Sophos calls the dataset, developed in a joint project with ReversingLabs, “the first production-scale malware research dataset available to the general public.”

Sophos is also sharing a new AI-powered impersonation protection method (discussed at Defcon) for defending against email “spear phishing,” in which attackers mimic trusted colleagues. The AI-based system was trained on a sample of millions of known attack emails.

To address undetected malware, Sophos built a set of publicly available, epidemiology-inspired statistical models for estimating the total prevalence of malware infections, which improves the chances of discovering them.
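The article does not say which models Sophos published. As an illustration only of what an epidemiology-style prevalence estimate can look like, the sketch below uses capture-recapture, a classic technique from epidemiology and ecology, in its bias-corrected Chapman form. The scenario (two independent detectors scanning the same population of machines) and all function and parameter names are assumptions made for this example, not Sophos's method.

```python
def chapman_estimate(found_by_a: int, found_by_b: int, found_by_both: int) -> float:
    """Chapman's bias-corrected capture-recapture estimate of the total
    number of infections, given what two independent detectors found.

    Hypothetical setup: detector A flags `found_by_a` infections, detector B
    flags `found_by_b`, and `found_by_both` infections are flagged by both.
    """
    if min(found_by_a, found_by_b, found_by_both) < 0:
        raise ValueError("counts must be non-negative")
    if found_by_both > min(found_by_a, found_by_b):
        raise ValueError("overlap cannot exceed either detector's count")
    return (found_by_a + 1) * (found_by_b + 1) / (found_by_both + 1) - 1


def estimate_undetected(found_by_a: int, found_by_b: int, found_by_both: int) -> float:
    """Estimated infections that neither detector saw (never below zero)."""
    total = chapman_estimate(found_by_a, found_by_b, found_by_both)
    seen = found_by_a + found_by_b - found_by_both  # distinct infections observed
    return max(0.0, total - seen)


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(chapman_estimate(99, 49, 9))
    print(estimate_undetected(99, 49, 9))
```

The smaller the overlap between the two detectors relative to what each finds alone, the larger the estimated number of infections that both missed. The estimate is only as good as the assumption that the detectors are independent.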

Finally, the company has developed and shared YaraML, an open-source system for automatic signature generation, which “compiles” machine learning models of the kind used in commercial security products into signature languages.

“Today’s cacophony of opaque or guarded claims about the capabilities or efficacy of AI in solutions makes it difficult to impossible for buyers to understand or validate these claims. This leads to buyer skepticism, creating headwinds to future progress at the very moment we’re starting to see great breakthroughs,” Levy said.

“Correcting this through external mechanisms like standards or regulation won’t happen quickly enough. Instead, it requires a grassroots effort and self-policing within our community to produce a set of practices and language that will advance the industry in a disruptive, open and transparent manner.”
