Meeting Notes

Agenda

@rcpeene will provide an in-depth introduction to our currently available dataset on S3.

He will walk through our imaging and electrophysiological datasets available today and explain the derived assets that have been created and are available.

Meeting Recording

Meeting Notes

Data Structure and Processing Modalities: Carter, Alessio, and Arielle provided a comprehensive overview of the data structure, processing pipelines, and modalities (Neuropixels, Mesoscope, Slap2), detailing how raw and processed assets are organized and accessed, with input from Jerome and questions from Farzaneh and Marcel.

Modality Processing Pipelines: Carter explained that each modality—Neuropixels, Mesoscope, and Slap2—has a distinct processing pipeline, starting from raw assets uploaded to Code Ocean. Neuropixels sessions undergo spike sorting with Alessio's pipeline, followed by NWB creation and upload to DANDI. Mesoscope sessions involve three processed assets per raw session, each corresponding to different sub-modalities, which are then combined into a completed NWB. Slap2 produces a processed asset and completed NWB, but its pipeline is still under development.

Session Identification and Asset Naming: Each session is identified by a mouse ID and date-time, forming the asset name on S3. Carter clarified that for Mesoscope, four processed assets are expected per session, while Ephys sessions have two. Slap2 uses a newer schema, resulting in different naming conventions.
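The mouse-ID-plus-date-time naming described above can be sketched as a small parser. This is a minimal illustration, not the project's actual tooling, and the exact pattern (`modality_mouseID_YYYY-MM-DD_HH-MM-SS`) is an assumption about the naming scheme; check real asset names on S3 before relying on it.

```python
import re
from datetime import datetime

# Assumed asset-name scheme for illustration: modality_mouseID_date_time,
# e.g. "multiplane-ophys_123456_2024-01-01_10-00-00". The real convention
# may differ (the notes mention Slap2 uses a newer schema).
ASSET_RE = re.compile(
    r"^(?P<modality>[\w-]+)_(?P<mouse_id>\d+)_"
    r"(?P<dt>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})$"
)

def parse_asset_name(name: str):
    """Split an asset name into (modality, mouse ID, session datetime)."""
    m = ASSET_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized asset name: {name!r}")
    dt = datetime.strptime(m["dt"], "%Y-%m-%d_%H-%M-%S")
    return m["modality"], m["mouse_id"], dt

modality, mouse_id, session_dt = parse_asset_name(
    "multiplane-ophys_123456_2024-01-01_10-00-00"
)
```

A parser like this is mainly useful for grouping the multiple processed assets (four for Mesoscope, two for Ephys) back to their originating session.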

Behavior Processing for Modalities: In response to Farzaneh's question, Carter explained that behavior processing for Ephys and Slap2 is performed during NWB generation, mainly syncing behavioral and recording information, including running speed and stimulus tables.

Quality Control and Data Status: Carter noted that quality control is ongoing for all assets, with some files potentially being reprocessed or excluded if they fail internal QC. Some Neuropixels sessions are still undergoing brain imaging, and Mesoscope files may have truncated running data.

Data Access via S3 and Code Ocean: Carter and Jerome described how the team can access data assets through Amazon S3, with public availability via the Quilt tool, and discussed the potential for Code Ocean account setup for deeper access to processing pipelines and compute resources, addressing questions from Farzaneh and others.

S3 Bucket Structure and Access: Carter explained that data is hosted on Amazon S3, accessible through public tools like Quilt, and that each asset corresponds to a session ID from the tracking spreadsheet. Users can search for session IDs in the S3 bucket to locate raw and processed assets.
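The session-ID search described above can be sketched as a filter over S3 object keys. The listing itself can be done through the Quilt browser or programmatically (e.g., boto3 supports anonymous access to public buckets); the key layout below is an assumption for illustration, not the bucket's actual structure.

```python
# Hypothetical sketch: locate a session's raw and processed assets among
# S3 object keys by matching the session ID against the top-level prefix.
# The key layout shown here is assumed, not the real bucket structure.

def find_session_assets(keys, session_id):
    """Return sorted keys whose top-level prefix contains the session ID."""
    return sorted(k for k in keys if session_id in k.split("/")[0])

keys = [
    "ecephys_123456_2024-01-01_10-00-00/behavior/sync.h5",
    "ecephys_123456_2024-01-01_10-00-00_processed/session.nwb",
    "ecephys_999999_2024-02-02_09-00-00/behavior/sync.h5",
]
assets = find_session_assets(keys, "123456_2024-01-01")
```

With a filter like this, both the raw asset and its processed counterparts for one session surface together, mirroring the workflow of looking up a session ID from the tracking spreadsheet.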

Code Ocean Functionality: Code Ocean is described as a platform for hosting data, code, and compute environments, with GitHub integration. While blanket access is not possible, Carter and Jerome offered to set up accounts for interested team members, enabling access to internal processing pipelines and code capsules.

Public vs. Internal Access: Farzaneh clarified confusion about needing an AWS account, and Carter confirmed that public access to the S3 bucket does not require AWS credentials. Internal compute or code access would require Code Ocean accounts.

Navigating S3 and Quilt: Carter and Jerome guided Farzaneh through the process of searching for session IDs within the S3 bucket using Quilt, clarifying the difference between bucket-level and global searches and ensuring users can locate relevant data assets.

Metadata and File Structure: Carter detailed the metadata JSON files accompanying each asset, describing their contents and differences between raw and processed assets, and explained the structure of behavior, recording, and processed data folders, with additional clarification from Alessio and Arielle.

Metadata JSON Contents: Each asset contains metadata JSON files describing rig configuration, subject details, procedures, processing steps, data description, and session information, including reward consumption, session times, and stimulus details.
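Loading the per-asset metadata JSONs described above is straightforward with the standard library. The file name (`session.json`) and fields used below are fabricated for illustration; consult an actual asset for the real schema.

```python
import json
import tempfile
from pathlib import Path

# Sketch: read every top-level *.json metadata file in an asset directory
# into a dict keyed by file name (rig, subject, procedures, session, ...).

def load_metadata(asset_dir):
    """Load all top-level JSON metadata files in an asset directory."""
    return {p.stem: json.loads(p.read_text()) for p in Path(asset_dir).glob("*.json")}

# Demonstration with a fabricated session.json; real field names may differ.
with tempfile.TemporaryDirectory() as d:
    Path(d, "session.json").write_text(json.dumps({
        "reward_consumed_total": 0.7,           # assumed field name
        "session_start_time": "2024-01-01T10:00:00",
    }))
    meta = load_metadata(d)
```

Reading all the JSONs up front gives one place to check rig configuration, subject details, and session timing before opening the larger data files.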

Behavior and Recording Folders: Raw session assets include folders for behavior videos (MP4s of mouse activity), behavior data (stimulus tables, synchronization), and modality-specific recordings (e.g., Ephys compressed and clipped files). Processed assets contain NWB files and additional metadata.

Processed Asset Structure: Processed assets for Mesoscope and Ephys include NWB files with partial data, quality control folders (drift maps, firing rates), visualization outputs, and links for interactive exploration. Slap2 assets are simpler, mainly containing raw imaging data and metadata.

NWB Files and Data Analysis Tools: Carter, Alessio, and David discussed the contents and accessibility of NWB files, including spike times, LFP, and behavioral data, and provided guidance on streaming, downloading, and analyzing NWB files, responding to questions from Marcel and Farzaneh.

NWB File Contents: NWB files contain processed data such as spike times, local field potentials (LFP), and quality metrics. Alessio confirmed that all columns and metrics visible in the portal are also saved in NWB files.

Streaming and Downloading NWB Data: Alessio explained that streaming NWB files allows users to access only the required data (e.g., spike times) without downloading the entire file, which is efficient for large datasets. Carter's script facilitates this process.

Partial Data Access and File Size: David noted that tools exist to read only parts of NWB files, and Alessio clarified that NWB files are downsampled and typically around 250 MB, containing spike times and LFP but not raw data, which resides in separate assets.

DANDI Integration Plans: Marcel asked about DANDI access, and Carter confirmed that the goal is to eventually release complete NWB files on DANDI with fixed DOIs, but for now, most sessions are only available on S3.

Data Labeling, Curation, and Classification: Alessio answered questions from Sarah and Jerome about data labeling, curation, and classification criteria, describing automated and manual curation tools, classifier training, and the meaning of labels such as SUA, MUA, and PSUA.

Label Definitions and Curation: Alessio explained that 'noise' indicates non-neural activity, 'MUA' is multi-unit activity, 'SUA' is single unit activity, and 'PSUA' is putative single unit activity. Automated labeling is performed, and manual curation tools are being developed for external use.
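The unit-quality labels defined above translate naturally into a lookup table for filtering sorted units. The label strings are as given in the notes; the helper function below is a hypothetical convenience, not part of the curation tooling.

```python
# Unit-quality labels from the automated curation described above.
UNIT_LABELS = {
    "noise": "non-neural activity",
    "MUA": "multi-unit activity",
    "SUA": "single unit activity",
    "PSUA": "putative single unit activity",
}

def is_single_unit(label: str) -> bool:
    """Treat SUA and PSUA as (putative) single units when filtering units.

    Whether PSUA should count as a single unit depends on the analysis;
    this inclusive choice is just one reasonable default.
    """
    return label in {"SUA", "PSUA"}
```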

Classifier Training: A pre-trained classifier distinguishes noise from neural activity and further classifies single versus multi-unit activity, based on curated data from the Allen Institute.

Example Scripts and Documentation: Carter and Jerome responded to requests from Sarah and Farzaneh for example scripts and documentation, agreeing to share notebooks and file structure guides to facilitate data access and analysis.

Notebook and Script Availability: Carter mentioned existing notebooks for DFF calculation and NWB analysis, and offered to create new examples for multi-plane ophys and Slap2 data, posting links in the chat and forum.

Documentation Sharing: Jerome confirmed that documentation on file structure will be shared to help users navigate directories and understand data organization.

Future Plans and Data Release Paper: Jerome proposed that the next meeting focus on organizing efforts for the data release paper, encouraging the team to explore the dataset and prepare first-order figures, with agreement from Carter and Karim.

Data Release Paper Preparation: Jerome suggested that the team use their access to the dataset to begin drafting figures and content for the data release paper, planning to discuss organization and next steps in the following meeting.

Forum Support and Questions: Carter encouraged team members to post questions about the data on the forum, promising attentive support to help users become familiar with the dataset.