This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.
Storage media images and files
BelkaCTF
Canterbury Corpus
The Canterbury Corpus is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.
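As a rough illustration of how such a corpus is typically used, the sketch below compares compression ratios (compressed size divided by original size) for three codecs from the Python standard library. The function name `compression_ratios` and the sample input are assumptions made for this example and are not part of the corpus or its website. In practice the input would be the bytes of a corpus file read from disk.

```python
import bz2
import lzma
import zlib


def compression_ratios(data: bytes) -> dict:
    """Return compressed_size / original_size for several stdlib codecs."""
    if not data:
        raise ValueError("data must not be empty")
    compressors = {
        "zlib": lambda d: zlib.compress(d, 9),  # level 9 = best compression
        "bz2": lambda d: bz2.compress(d, 9),
        "lzma": lzma.compress,
    }
    return {name: len(fn(data)) / len(data) for name, fn in compressors.items()}


if __name__ == "__main__":
    # Stand-in input; replace with the bytes of a corpus file read from disk.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 1000
    for name, ratio in compression_ratios(sample).items():
        print(f"{name}: {ratio:.4f}")
```

A lower ratio means better compression. Running the same function over every file in a corpus gives a simple per-file comparison between algorithms.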
Computer Forensic Reference Data Sets (CFReDS)
The Computer Forensic Reference Data Sets (CFReDS) project from NIST hosts sample cases that may be useful for examiners to practice with.
Digital corpora
Under an NSF grant, Kam Woods and Simson Garfinkel created a website for digital corpora. The site includes a complete training scenario, including disk images, packet captures and exercises.
Digital Forensic Research Workshop (DFRWS)
The Digital Forensic Research Workshop's rodeos and challenges.
Also see: DFRWS GitHub organization
Digital Forensics Tool Testing Images
Digital Forensics Tool Testing Images can be downloaded from Sourceforge.
ForensicsKB blog
Lance Mueller has created some disk images, which can be downloaded from his blog.
Honeynet Project
In 2001 the Honeynet Project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer.
The Honeynet Project: Challenges
Honeynet Project: Scans of the Month
The Honeynet Project provided network scans in the majority of its Scan of the Month challenges. Some of the challenges provided disk images instead. The Sleuth Kit's Wiki lists Brian Carrier's responses to those challenges.
Case Studies - Honeynet Challenges
Linux LEO
Barry Grundy created some disk images as parts of a Linux-based forensics tutorial.
Network Data Repository
UCI's Network Data Repository provides data sets of a diverse set of networks. Some of the networks are related to computers, some aren't.
Real Data Corpus
Between 1998 and 2006, Simson Garfinkel acquired more than 1,250 hard drives on the secondary market. Images of these drives have proven invaluable for a range of studies, such as developing new forensic techniques and examining the sanitization practices of computer users.
UMass Trace Repository
The UMass Trace Repository provides network, storage, and other traces to the research community for analysis. The UMass Trace Repository is supported by grant #CNS-323597 from the National Science Foundation.
Network Packets and Traces
DARPA Intrusion Detection Evaluation
In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.
- 1998 DARPA Intrusion Detection Evaluation
- 1999 DARPA Intrusion Detection Evaluation
- 2000 DARPA Intrusion Detection Scenario Specific
Wireshark
The open source Wireshark project (formerly known as Ethereal) has a website with many network packet sample captures.
NFS Packets
The Storage Networking Industry Association has a set of network file system traces available for download.
Other
GitHub user "markofu" has aggregated several other network captures into a Git repository.
Email messages
The Enron Corpus consists of email messages that were seized by the Federal Energy Regulatory Commission during its investigation of Enron.
The NIST Text REtrieval Conference (TREC) 2007 released a public spam corpus.
Email Messages Corpus Parsed from W3C Lists (for TRECENT 2005)
Text Files
Log files
CAIDA collects a wide variety of data.
DShield asks users to submit firewall logs.
Text for Text Retrieval
The Text REtrieval Conference (TREC) has made available a series of text collections.
American National Corpus
The American National Corpus (ANC) project is creating a collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.
British National Corpus
The British National Corpus (BNC) is a 100-million-word collection of written and spoken English from a variety of sources.
IEEE VAST Challenges
IEEE Visual Analytics Science & Technology (VAST) Challenges
Images
Object and Concept Recognition for Content-Based Image Retrieval UW Image Database. A set of freely redistributable images from all over the world, used for content-based image retrieval.
Voice
CALLFRIEND
CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.
TalkBank
TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.
Augmented Multi-Party Interaction Corpus
The AMI Meeting Corpus has 100 hours of meeting recordings.
Other Corpora
- Daily Blog #277: Sample Forensic Images
- California Cybersecurity Institute - 2019 Digital Forensics Downloads
- Circl.lu - Digital Forensic - Training Materials
- Forensic Focus - Test Images and Forensic Challenges
- ForGe – Computer Forensic Test Image Generator, by Hannu Visti, October 18, 2013
- Honeynet Project Challenges
- NullconCTF2014
- Digital Forensics Workbook - Data sets, by Michael Robinson, 2015
- Digital Forensic Challenge Images (Datasets), by Ali Hadi
- Linux Forensics - Workshops
- Second Look - Linux Memory Images
- Sony has made 60TB of Everquest 2 logs available to researchers.
- UT San Antonio Digital Corpora