Including what it calls the first production-scale malware research dataset available to the general public
In an effort to promote the use of AI-based systems in cyber security, British security software vendor Sophos is sharing datasets, tools, and methodologies in four separate areas.
These include research, protection methods, malware detection, and signature generation tools.
“With SophosAI’s new initiative to open its research, we can help influence how AI is positioned and discussed in cyber security moving forward,” said Joe Levy, chief technology officer of the company.
Sharing is caring
For research purposes, Sophos is sharing SOREL-20M (Sophos-ReversingLabs – 20 million), a production-scale dataset with metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples. Sophos calls the dataset, developed in a joint project with ReversingLabs, “the first production-scale malware research dataset available to the general public.”
A new AI-powered impersonation protection method, discussed at Defcon, is also being shared. It is designed to defend against email “spear phishing,” in which attackers mimic trusted colleagues. The AI-based system was trained on a sample of millions of known attack emails.
To account for malware that goes undetected, Sophos built a set of publicly available, epidemiology-inspired statistical models that estimate the total prevalence of malware infections, including those that existing tools miss, improving the chances of discovery.
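To give a sense of what an epidemiology-style prevalence estimate can look like, here is a minimal sketch using the Rogan-Gladen correction, a standard formula for adjusting an observed positive rate for an imperfect test. This is an assumption chosen for illustration, not a description of Sophos's models; the function name and all of the numbers below are hypothetical.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Estimate true prevalence from the positive rate of an imperfect detector.

    apparent_prevalence: fraction of scanned machines the detector flagged
    sensitivity: fraction of truly infected machines the detector flags
    specificity: fraction of clean machines the detector correctly passes
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("detector must perform better than chance")
    estimate = (apparent_prevalence + specificity - 1.0) / denom
    # Clamp to a valid proportion; sampling noise can push the raw value out of range.
    return min(1.0, max(0.0, estimate))


# Hypothetical figures: 30 of 1,000 machines flagged by a detector that
# catches 60% of infections and wrongly flags 0.1% of clean machines.
flagged, scanned = 30, 1000
estimated = rogan_gladen(flagged / scanned, sensitivity=0.60, specificity=0.999)
print(f"flagged: {flagged / scanned:.1%}, estimated infected: {estimated:.1%}")
```

With these made-up inputs, the estimated infection rate comes out at roughly 4.8 percent against a flagged rate of 3 percent, which suggests how a statistical correction can point to infections that a detector never saw.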
Finally, the company has developed and shared YaraML, an open-source system for automatic signature generation, which “compiles” machine learning models of the kind used in commercial security products into signature languages.
“Today’s cacophony of opaque or guarded claims about the capabilities or efficacy of AI in solutions makes it difficult to impossible for buyers to understand or validate these claims. This leads to buyer skepticism, creating headwinds to future progress at the very moment we’re starting to see great breakthroughs,” Levy said.
“Correcting this through external mechanisms like standards or regulation won’t happen quickly enough. Instead, it requires a grassroots effort and self-policing within our community to produce a set of practices and language that will advance the industry in a disruptive, open and transparent manner.”