Data Access
The Allen Institute for Neural Dynamics (AIND) is committed to FAIR, Open, and Reproducible science. We therefore share all of the data we collect publicly, with rich metadata, as near to the time of collection as possible. We share data at all stages of the data lifecycle, including preliminary data collected during methods development, processed data that we are actively improving, and highly curated data used in publications.
aind-open-data
In addition to sharing curated datasets with modality-specific NIH data archives like DANDI and BIL, we are also excited to share all of our data in one public S3 bucket generously hosted by the Registry of Open Data on AWS.
All data is stored here: s3://aind-open-data
Bucket Organization
AIND’s mission to share data early with full reproducibility has significant implications for how data and metadata are organized and represented. Our organizational principles include:
- Immutability: once collected, data should not be modified, to ensure reproducibility.
- Metadata Accessibility: metadata must be trivial to find and read for humans and machines.
- Cloud Compatibility: data should be stored in cloud-friendly formats.
Based on these principles, s3://aind-open-data is organized as a flat list of data assets, where a data asset is simply a logical collection of files. Derived data assets are stored separately from their source assets, i.e. not in the same path or directory.
Inspired by BIDS and the HCA metadata schema, metadata describing a data asset is stored as sidecar JSON files that live adjacent to the data they describe. These JSON files conform to the schemas defined in aind-data-schema.
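Because metadata lives in sidecar JSON files at the root of each asset, reading it requires nothing beyond the standard library. The sketch below builds a toy asset folder and reads a hypothetical `subject.json`; the field names shown are illustrative only, and real files conform to the schemas in aind-data-schema with many more fields.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical sidecar content for illustration; real subject.json files
# follow the aind-data-schema Subject schema.
sample_subject = {
    "subject_id": "123456",
    "species": "Mus musculus",
    "sex": "Male",
}

with TemporaryDirectory() as tmp:
    asset_dir = Path(tmp) / "ecephys_123456_2022-12-12_05-06-07"
    asset_dir.mkdir()
    (asset_dir / "subject.json").write_text(json.dumps(sample_subject))

    # Metadata is adjacent to the data, so reading it is just opening
    # the sidecar file in the asset's root folder.
    subject = json.loads((asset_dir / "subject.json").read_text())
    print(subject["subject_id"])  # prints 123456
```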
As soon as possible, data in this bucket are stored in cloud-friendly formats, including NWB-Zarr for physiology and OME-Zarr for imaging. We aspire to produce data in these formats at the time of acquisition.
Naming Conventions
Raw data assets are named:
<modality>_<subject-id>_<acquisition-date>_<acquisition-time>
Derived data assets are named:
<source-data-asset-name>_<label>_<processing-date>_<processing-time>
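The naming convention above can be parsed mechanically. The helper below is a hypothetical sketch, not an AIND tool: it assumes the `YYYY-MM-DD_HH-MM-SS` date/time pattern seen in the examples, and that the second date/time pair in a derived name records when the processing ran.

```python
import re

# Hypothetical parser illustrating the naming convention; the date/time
# pattern (YYYY-MM-DD_HH-MM-SS) is assumed from the examples below.
ASSET_NAME = re.compile(
    r"^(?P<modality>[a-zA-Z0-9-]+)_(?P<subject_id>\d+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2}-\d{2})"
    r"(?:_(?P<label>[a-zA-Z0-9-]+)_"
    r"(?P<proc_date>\d{4}-\d{2}-\d{2})_(?P<proc_time>\d{2}-\d{2}-\d{2}))?$"
)

def parse_asset_name(name: str) -> dict:
    """Split an asset name into components; derived assets carry an
    extra label plus a second date/time pair."""
    m = ASSET_NAME.match(name)
    if m is None:
        raise ValueError(f"not a recognized asset name: {name}")
    return {k: v for k, v in m.groupdict().items() if v is not None}

print(parse_asset_name("ecephys_123456_2022-12-12_05-06-07"))
print(parse_asset_name(
    "ecephys_123456_2022-12-12_05-06-07_sorted-ks25_2022-12-12_06-07-08"))
```

A raw asset yields only the modality, subject, and acquisition fields; a derived asset additionally yields the label and second date/time pair.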
A raw extracellular ephys asset would look like this:
ecephys_123456_2022-12-12_05-06-07/
    ecephys/
        <ephys data>
    behavior/
        <video data>
    data_description.json
    subject.json
    procedures.json
    processing.json
    rig.json
    acquisition.json
A spike-sorting result asset would look like this:
ecephys_123456_2022-12-12_05-06-07_sorted-ks25_2022-12-12_06-07-08/
    sorted/
        <sorted ephys data>
    data_description.json
    subject.json
    procedures.json
    processing.json
    rig.json
    acquisition.json
Benchmark Data
aind-benchmark-data/ephys-compression
Extracellular electrophysiology data is growing at a remarkable pace. This data, collected with Neuropixels probes by the Allen Institute for Neural Dynamics (AIND) and the International Brain Lab (IBL), can be used to benchmark the throughput and compression ratios of various data compression algorithms.
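To make the benchmark concrete, here is a minimal sketch of the kind of measurement this data supports: compressing an int16 trace and reporting the compression ratio and throughput. The synthetic Gaussian trace and the choice of zlib are stand-ins for illustration, not AIND's actual benchmark pipeline; a real benchmark would read traces from the bucket and sweep over codecs.

```python
import time
import zlib
import numpy as np

# Synthetic stand-in for an int16 ephys trace; a real benchmark would
# read traces from s3://aind-benchmark-data/ephys-compression instead.
rng = np.random.default_rng(0)
traces = rng.normal(scale=20, size=1_000_000).astype(np.int16)
raw_bytes = traces.tobytes()

start = time.perf_counter()
compressed = zlib.compress(raw_bytes, level=5)
elapsed = time.perf_counter() - start

# Storage ratio (original size / compressed size) and throughput (MB/s).
ratio = len(raw_bytes) / len(compressed)
throughput = len(raw_bytes) / elapsed / 1e6
print(f"compression ratio: {ratio:.2f}, throughput: {throughput:.1f} MB/s")
```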
All data is available within the aind-benchmark-data bucket at s3://aind-benchmark-data/ephys-compression.
The bucket is organized in subfolders containing datasets from different sources.
Experimental data
The following folders include experimental data sources:
- aind-np2: Neuropixels 2.0 data collected by AIND (8 sessions)
- aind-np1: Neuropixels 1.0 data collected by AIND (4 sessions)
- ibl-np1: Neuropixels 1.0 data collected by IBL (4 sessions)
Each session (e.g. aind-np2/612962_2022-04-13_19-18-04_ProbeB) contains the raw traces in binary format, as saved by SpikeInterface.
To load a session, you can use the spikeinterface library (see its installation instructions):
import spikeinterface as si
recording = si.load_extractor("{local-path-to-aind-benchmark-data/ephys-compression}/aind-np2/612962_2022-04-13_19-18-04_ProbeB")
recording is a spikeinterface.BaseRecording object.
Simulated data
The mearec subfolder contains two datasets simulated with the MEArec simulator.
You can load a simulated session using spikeinterface (after installing MEArec):
import spikeinterface.extractors as se
recording_gt, sorting_gt = se.read_mearec("{local-path-to-aind-benchmark-data/ephys-compression}/mearec/mearec_NP1.h5")
recording_gt is also a spikeinterface.BaseRecording object. sorting_gt is a spikeinterface.BaseSorting containing the ground-truth spiking activity.