We share a number of software tools and datasets with the research community. The items listed below have been provided by the contributing authors to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Please contact me if any of the links become unreachable.
BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information: 57,293 malware samples and 77,142 benign samples collected from 2019 to 2020 (581 families). Download Link. Related paper: [DLS'21].
We open-source the code for running DNS-over-HTTPS (DoH) performance measurements (resolution time) using the BrightData network. We also share the dataset collected in 2021. Download Link. Related paper: [IMC'21].
We release a measurement dataset on VirusTotal label dynamics: Download Link. The dataset contains daily snapshots of VirusTotal labels from 65 VirusTotal engines for more than 14,000 files (including a subset with manually verified ground truth), collected over a year. Related papers: [USENIX Security'20], [IMC'19].
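One simple way to look at label dynamics in daily snapshots like these is to count how often an engine's label for a file changes from one day to the next. The sketch below assumes a hypothetical in-memory layout (engine name mapped to a list of daily 0/1 labels for a single file); the engine names and values are made up for illustration and do not reflect the released file format.

```python
from typing import Dict, List


def count_label_changes(daily_labels: List[int]) -> int:
    """Count day-to-day changes in a sequence of 0/1 labels."""
    return sum(
        1 for prev, cur in zip(daily_labels, daily_labels[1:]) if prev != cur
    )


def changes_per_engine(snapshots: Dict[str, List[int]]) -> Dict[str, int]:
    """Map each engine to the number of label changes it made for one file."""
    return {
        engine: count_label_changes(labels)
        for engine, labels in snapshots.items()
    }


# Hypothetical snapshots for one file: 1 = flagged as malicious, 0 = not flagged.
snapshots = {
    "engine_a": [0, 0, 1, 1, 1],
    "engine_b": [0, 1, 0, 1, 1],
    "engine_c": [1, 1, 1, 1, 1],
}

changes = changes_per_engine(snapshots)
print(changes)
```

An engine with zero changes is stable for this file, while a high count suggests its label should not be trusted from a single snapshot.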
We open source PCI-checker, a lightweight scanner dedicated to checking the compliance of e-commerce websites with the Payment Card Industry Data Security Standard (PCI DSS). The code and labeled datasets are available here: Download Link. We also open source BuggyCart, a customizable testbed for assessing the performance of PCI vulnerability scanners. The code is available here: Download Link. The related paper is: [CCS'19].
We open source a tool called VIEM, which is designed to process unstructured information in public security vulnerability reports (e.g., those referenced by the CVE website or the NVD database). The tool can extract vulnerable software names and versions from unstructured reports. The code and labeled datasets are available here: Download Link. The related paper is: [USENIX Security'19].
We open source a tool called LEMNA, a high-fidelity explanation method designed for security applications (particularly those built on RNNs). Given an input data sample, LEMNA generates a small set of interpretable features to explain how the input sample is classified. The core idea is to approximate a local area of the complex deep learning decision boundary using a simple interpretable model. The local interpretable model is specially designed to (1) handle feature dependency to better work with security applications (e.g., binary code analysis); and (2) handle nonlinear local boundaries to boost explanation fidelity. The code is available here: Download Link. The related paper is: [CCS'18].
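The general idea of approximating a classifier locally with a simple interpretable model can be illustrated with a plain linear surrogate. The sketch below is not LEMNA: it fits an ordinary least-squares line to random perturbations around one input, and it does not model feature dependency or nonlinear local boundaries. The `black_box` function, the sample count, and the perturbation scale are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def local_linear_weights(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    n_samples: int = 500,
    scale: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Fit y ~ w . (z - x) + b on Gaussian perturbations z of x; return w."""
    rng = random.Random(seed)
    d = len(x)
    rows, ys = [], []
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in range(d)]
        z = [xi + di for xi, di in zip(x, delta)]
        rows.append(delta + [1.0])  # last column is the intercept term
        ys.append(model(z))
    k = d + 1
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(ata, aty)[:d]


def black_box(z: Sequence[float]) -> float:
    """Hypothetical stand-in for a model score; the third feature is ignored."""
    return z[0] * z[0] + 3.0 * z[1]


weights = local_linear_weights(black_box, [1.0, 2.0, 5.0])
# Rank features by the magnitude of their local weight.
ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
print(weights, ranked)
```

Around the point `[1.0, 2.0, 5.0]` the fitted weights should be close to the local gradient of `black_box`, so the unused third feature ranks last.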
We study the feasibility of adversarial attacks in the physical world. Our core idea is to use an image-to-image translation network to simulate the digital-to-physical transformation process for generating robust adversarial examples. To validate our method, we conduct a large-scale physical-domain experiment, which involves manually taking more than 3,000 physical-domain photos. The results show that our method outperforms existing ones by a large margin and demonstrates a high level of robustness and transferability. Please check out the [Project Website] for the dataset details. The code and sample data are available here: Download Link. The related paper is: [AAAI'19]. If you need the full dataset, please drop me an email.
To study the reproducibility of crowd-reported security vulnerabilities, we collected and analyzed 368 memory corruption vulnerabilities discovered from 2001 to 2017. To facilitate future research, we share our full dataset with the research community. The dataset includes 291 vulnerabilities with CVE-IDs and 77 vulnerabilities without CVE-IDs. For each vulnerability, we have filled in the missing pieces of information, annotated the issues we encountered during the reproduction, and created the appropriate Dockerfiles. Each vulnerability report contains structured information fields (in HTML and JSON), detailed instructions on how to reproduce the vulnerability, and fully tested PoC exploits. The repository also includes pre-configured virtual machines with the appropriate environments. To the best of our knowledge, this is the largest public ground-truth dataset of real-world vulnerabilities that were manually reproduced and verified. The dataset is available here: Download Link. The related paper is: [USENIX Security'18].
To study the security threat of leaked passwords from data breaches, we have collected 107 password datasets leaked between 2008 and 2016 (e.g., LinkedIn, MySpace, Adobe, Ashley Madison). We linked users across different datasets to study password reuse and modification patterns. In total, the dataset covers 28.8 million users and their 61.5 million passwords over 8 years. Please find more details at the [Project Website]. The related paper is: [CODASPY'18].
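As a rough illustration of what distinguishing "reuse" from "modification" can mean for a linked user, the sketch below labels a pair of passwords as reused (identical), modified (within a small edit distance), or different. The edit-distance criterion and the `max_edits` threshold are assumptions made for this example, not the rules used in the paper, and the strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = cur
    return prev[-1]


def classify_pair(p1: str, p2: str, max_edits: int = 3) -> str:
    """Label a pair of passwords belonging to the same linked user."""
    if p1 == p2:
        return "reused"
    if edit_distance(p1, p2) <= max_edits:
        return "modified"
    return "different"


# Made-up examples, one pair per hypothetical linked user.
pairs = [
    ("sunshine1", "sunshine1"),
    ("sunshine1", "sunshine12"),
    ("sunshine1", "Tr0ub4dor"),
]
labels = [classify_pair(p1, p2) for p1, p2 in pairs]
print(labels)
```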
A dataset collected from a social livestreaming service (Twitter's Periscope) in 2015. The dataset contains 13,894,852 broadcasts with a total of 416,207,256 comments and 6,101,042,415 hearts, along with other detailed interaction metadata. Please check out the [Project Website] for the dataset details. The related paper is: [IMC'16].
In this project, we build an unsupervised system to capture dominating user behaviors from clickstream data (traces of users' click events), and visualize the detected behaviors in an intuitive manner. The system identifies "clusters" of similar users by partitioning a similarity graph (nodes are users; edges are weighted by clickstream similarity). The partitioning process leverages iterative feature pruning to capture the natural hierarchy within user clusters and produce intuitive features for visualizing and understanding captured user behaviors. The code and sample data are available at the [Project Website]. Related papers are: [CHI'16], [TWEB'17], [USENIX Security'13].
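The similarity-graph idea described above can be sketched in a few lines. The example below is a simplified stand-in, not the project's algorithm: it scores user pairs by the Jaccard similarity of their click-event bigrams and then groups users with connected components over edges above a threshold, rather than using the iterative feature pruning the project relies on. The user names, events, and the `0.3` threshold are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def bigrams(events: List[str]) -> Set[Tuple[str, str]]:
    """Set of consecutive click-event pairs in one user's clickstream."""
    return set(zip(events, events[1:]))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(
    clickstreams: Dict[str, List[str]],
) -> Dict[Tuple[str, str], float]:
    """Weighted edges between every pair of users."""
    feats = {user: bigrams(events) for user, events in clickstreams.items()}
    return {
        (u, v): jaccard(feats[u], feats[v])
        for u, v in combinations(sorted(feats), 2)
    }


def threshold_clusters(
    users: List[str], edges: Dict[Tuple[str, str], float], threshold: float
) -> List[Set[str]]:
    """Connected components over edges whose weight is >= threshold."""
    parent = {u: u for u in users}

    def find(u: str) -> str:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (u, v), weight in edges.items():
        if weight >= threshold:
            parent[find(u)] = find(v)
    groups: Dict[str, Set[str]] = {}
    for u in users:
        groups.setdefault(find(u), set()).add(u)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical clickstreams for four users.
clickstreams = {
    "alice": ["login", "browse", "buy", "logout"],
    "bob": ["login", "browse", "buy", "browse", "logout"],
    "carol": ["login", "post", "comment", "post"],
    "dave": ["login", "post", "comment", "logout"],
}

edges = similarity_graph(clickstreams)
clusters = threshold_clusters(list(clickstreams), edges, threshold=0.3)
print(clusters)
```

Here the two shoppers and the two posters end up in separate clusters because they share no event bigrams across groups.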
To evaluate the performance of de-anonymization algorithms on real-world datasets, we re-implemented 7 de-anonymization algorithms published in the last 10 years: POIS [WWW'16], ME [AIHC'16], HIST [TIFS'16], WYCI [WOSN'14], MSQ [TON'13], HMM [IEEE SP'11], and NFLX [IEEE SP'08]. We also introduced 3 new algorithms into this collection. The code and sample data are available at [Github]. The related paper is: [NDSS'18].