Dive Brief:
- There's no such thing as perfect training data, but cybersecurity experts are gaining access to more benchmark datasets to develop malware detection machine learning models. A dataset launched by Endgame on Monday includes 1.1 million hash values of portable executable files scanned last year by VirusTotal as well as metadata from the files.
- Researchers can test machine learning models against a benchmark model trained on the dataset. About 900,000 of the samples are training samples split equally among benign, malicious and unlabeled programs (300,000 each), and the remaining 200,000 are test samples split evenly between malicious and benign programs.
- By evaluating models built on the training data against the held-out test data, researchers can see how well they detect previously unseen malware. But the dataset only serves as a starting point for future research and performance comparisons, not as a cybersecurity solution. For intellectual property reasons, the collection does not include the actual files, only SHA-256 hashes (256-bit cryptographic hash values that serve as a signature for a file).
Dive Insight:
Working together — whether through commercial agreements or simply making a dataset available for training and research purposes — can bolster a business community ravaged by the attacks of 2017. On Tuesday, for example, 34 companies including Microsoft, Oracle and Facebook signed the Cybersecurity Tech Accord, publicly committing to protect internet users, work together and improve resilience in the space.
But outside of large-scale initiatives, the basics, such as malware detection, have a long road ahead as the cyberattacks keep rolling in.
Advancements in AI and ML on the enterprise side are important for countering hackers, who are also using the technology to automate attacks. The most effective kind of malware is a strain that hits without a business ever knowing, but advanced detection capabilities harnessing AI and ML are steadily helping cybersecurity teams overcome the odds.
But without good data, good defensive and detection measures are hard to build.
As in cybersecurity, companies in other fields are working to build out massive datasets for image recognition ML models. Google is taking a decentralized approach, having users around the world help classify images across geographic and cultural boundaries, and researchers are using IBM's dataset of 1 million short video clips to help classify simple actions.
But most companies are struggling to train models with incomplete or insufficient datasets. AI experts have urged members of Congress to enact open data policies so that researchers can access the troves of government data, sitting unstructured and unused for years, to derive new insights.