exploratory-data-analysis

davila7/exploratory-data-analysis

Data & Analytics

19,892

About

SKILL.md

Perform comprehensive exploratory data analysis on scientific data files across 200+ file formats...

Exploratory Data Analysis

Overview

Perform comprehensive exploratory data analysis (EDA) on scientific data files across multiple domains. This skill provides automated file type detection, format-specific analysis, data quality assessment, and generates detailed markdown reports suitable for documentation and downstream analysis planning.

Key Capabilities:

Automatic detection and analysis of 200+ scientific file formats
Comprehensive format-specific metadata extraction
Data quality and integrity assessment
Statistical summaries and distributions
Visualization recommendations
Downstream analysis suggestions
Markdown report generation

When to Use This Skill

Use this skill when:

User provides a path to a scientific data file for analysis
User asks to "explore", "analyze", or "summarize" a data file
User wants to understand the structure and content of scientific data
User needs a comprehensive report of a dataset before analysis
User wants to assess data quality or completeness
User asks what type of analysis is appropriate for a file

Supported File Categories

The skill has comprehensive coverage of scientific file formats organized into six major categories:

1. Chemistry and Molecular Formats (60+ extensions)

Structure files, computational chemistry outputs, molecular dynamics trajectories, and chemical databases.

File types include: .pdb, .cif, .mol, .mol2, .sdf, .xyz, .smi, .gro, .log, .fchk, .cube, .dcd, .xtc, .trr, .prmtop, .psf, and more.

Reference file: references/chemistry_molecular_formats.md

2. Bioinformatics and Genomics Formats (50+ extensions)

Sequence data, alignments, annotations, variants, and expression data.

File types include: .fasta, .fastq, .sam, .bam, .vcf, .bed, .gff, .gtf, .bigwig, .h5ad, .loom, .counts, .mtx, and more.

Reference file: references/bioinformatics_genomics_formats.md

3. Microscopy and Imaging Formats (45+ extensions)

Microscopy images, medical imaging, whole slide imaging, and electron microscopy.

File types include: .tif, .nd2, .lif, .czi, .ims, .dcm, .nii, .mrc, .dm3, .vsi, .svs, .ome.tiff, and more.

Reference file: references/microscopy_imaging_formats.md

4. Spectroscopy and Analytical Chemistry Formats (35+ extensions)

NMR, mass spectrometry, IR/Raman, UV-Vis, X-ray, chromatography, and other analytical techniques.

File types include: .fid, .mzML, .mzXML, .raw, .mgf, .spc, .jdx, .xy, .cif (crystallography), .wdf, and more.

Reference file: references/spectroscopy_analytical_formats.md

5. Proteomics and Metabolomics Formats (30+ extensions)

Mass spec proteomics, metabolomics, lipidomics, and multi-omics data.

File types include: .mzML, .pepXML, .protXML, .mzid, .mzTab, .sky, .mgf, .msp, .h5ad, and more.

Reference file: references/proteomics_metabolomics_formats.md

6. General Scientific Data Formats (30+ extensions)

Arrays, tables, hierarchical data, compressed archives, and common scientific formats.

File types include: .npy, .npz, .csv, .xlsx, .json, .hdf5, .zarr, .parquet, .mat, .fits, .nc, .xml, and more.

Reference file: references/general_scientific_formats.md

Workflow

Step 1: File Type Detection

When a user provides a file path, first identify the file type:

Extract the file extension
Look up the extension in the appropriate reference file
Identify the file category and format description
Load format-specific information

Example:

User: "Analyze data.fastq"
→ Extension: .fastq
→ Category: bioinformatics_genomics
→ Format: FASTQ Format (sequence data with quality scores)
→ Reference: references/bioinformatics_genomics_formats.md
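
The detection steps above can be sketched as a small lookup. This is only an illustration: the `EXTENSION_CATEGORIES` table and the `detect_categories` helper are hypothetical, the table holds a handful of extensions taken from this document, and the category names `chemistry_molecular` and `spectroscopy_analytical` are inferred from the reference file names rather than confirmed.

```python
from pathlib import Path

# Hypothetical excerpt of an extension-to-category table; the real mapping
# lives in the reference files and covers far more extensions.
EXTENSION_CATEGORIES = {
    ".fastq": ["bioinformatics_genomics"],
    ".pdb": ["chemistry_molecular"],
    ".csv": ["general_scientific"],
    ".nd2": ["microscopy_imaging"],
    ".ome.tiff": ["microscopy_imaging"],
    # Some extensions are listed under more than one category.
    ".cif": ["chemistry_molecular", "spectroscopy_analytical"],
}


def detect_categories(filepath):
    """Return (extension, categories) for a path, preferring compound suffixes."""
    suffixes = [s.lower() for s in Path(filepath).suffixes]
    # Try the longest suffix first so "image.ome.tiff" matches ".ome.tiff".
    for start in range(len(suffixes)):
        ext = "".join(suffixes[start:])
        if ext in EXTENSION_CATEGORIES:
            return ext, EXTENSION_CATEGORIES[ext]
    # Unknown extension: return it with no categories so the caller can fall back.
    return (suffixes[-1] if suffixes else ""), []
```

When an extension such as `.cif` maps to several categories, the caller would need to inspect the file content or ask the user which domain applies.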

Step 2: Load Format-Specific Information

Based on the file type, read the corresponding reference file to understand:

Typical Data: What kind of data this format contains
Use Cases: Common applications for this format
Python Libraries: How to read the file in Python
EDA Approach: What analyses are appropriate for this data type

Search the reference file for the specific extension (e.g., search for "### .fastq" in bioinformatics_genomics_formats.md).

Step 3: Perform Data Analysis

Use the scripts/eda_analyzer.py script OR implement custom analysis:

Option A: Use the analyzer script

# The script automatically:
# 1. Detects file type
# 2. Loads reference information
# 3. Performs format-specific analysis
# 4. Generates markdown report

python scripts/eda_analyzer.py <filepath> [output.md]

Option B: Custom analysis in the conversation

Based on the format information from the reference file, perform appropriate analysis:

For tabular data (CSV, TSV, Excel):

Load with pandas
Check dimensions, data types
Analyze missing values
Calculate summary statistics
Identify outliers
Check for duplicates
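
As a rough illustration of the tabular checks listed above, here is a dependency-free sketch using only the standard library. The `profile_table` helper and the `MISSING_MARKERS` set are assumptions made for this example. With pandas, the same checks are usually written as `df.shape`, `df.dtypes`, `df.isna().sum()`, `df.describe()`, and `df.duplicated().sum()`.

```python
import csv
import io
import statistics

# Strings treated as missing values (an assumption; adjust per dataset).
MISSING_MARKERS = {"", "na", "nan", "null"}


def profile_table(csv_text):
    """Summarize a small CSV: shape, missing cells, duplicates, numeric stats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []

    def is_missing(value):
        return (value or "").strip().lower() in MISSING_MARKERS

    missing = {col: sum(is_missing(r[col]) for r in rows) for col in columns}

    # A duplicate is a row whose values all repeat an earlier row.
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(r[col] for col in columns)
        duplicates += key in seen
        seen.add(key)

    # Numeric summary for columns where every non-missing value parses as float.
    numeric = {}
    for col in columns:
        values = [r[col] for r in rows if not is_missing(r[col])]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            continue
        if numbers:
            numeric[col] = {
                "mean": statistics.fmean(numbers),
                "min": min(numbers),
                "max": max(numbers),
            }
    return {
        "rows": len(rows),
        "columns": columns,
        "missing": missing,
        "duplicates": duplicates,
        "numeric": numeric,
    }
```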

For sequence data (FASTA, FASTQ):

Count sequences
Analyze length distributions
Calculate GC content
Assess quality scores (FASTQ)
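
The sequence checks above can be sketched without any third-party library. This example assumes plain four-line FASTQ records (no wrapped sequences) and Phred+33 quality encoding; `fastq_stats` is a hypothetical helper, not part of the skill's scripts.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Stream 4-line FASTQ records and accumulate simple summary statistics."""
    reads = bases = gc = qual_sum = 0
    min_len = max_len = None
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record near read {reads + 1}")
        reads += 1
        bases += len(seq)
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = len(seq) if max_len is None else max(max_len, len(seq))
        gc += sum(seq.upper().count(base) for base in "GC")
        qual_sum += sum(ord(char) - phred_offset for char in qual)
    return {
        "reads": reads,
        "min_length": min_len or 0,
        "max_length": max_len or 0,
        "gc_fraction": gc / bases if bases else 0.0,
        "mean_quality": qual_sum / bases if bases else 0.0,
    }
```

Because the function reads one record at a time, it can be pointed at an open file handle for a large file without loading every read into memory.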

For images (TIFF, ND2, CZI):

Check dimensions (X, Y, Z, C, T)
Analyze bit depth and value range
Extract metadata (channels, timestamps, spatial calibration)
Calculate intensity statistics

For arrays (NPY, HDF5):

Check shape and dimensions
Analyze data type
Calculate statistical summaries
Check for missing/invalid values

Step 4: Generate Comprehensive Report

Create a markdown report with the following sections:

Required Sections:

Title and Metadata
- Filename and timestamp
- File size and location
Basic Information
- File properties
- Format identification
File Type Details
- Format description from reference
- Typical data content
- Common use cases
- Python libraries for reading
Data Analysis
- Structure and dimensions
- Statistical summaries
- Quality assessment
- Data characteristics
Key Findings
- Notable patterns
- Potential issues
- Quality metrics
Recommendations
- Preprocessing steps
- Appropriate analyses
- Tools and methods
- Visualization approaches

Template Location

Use assets/report_template.md as a guide for report structure.
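
One possible way to assemble the required sections into markdown is sketched below. The section titles come from the list above; the `render_report` helper, the heading levels, and the placeholder text are assumptions for illustration and may differ from what assets/report_template.md actually contains.

```python
from datetime import datetime, timezone

# Section titles follow the "Required Sections" list; the layout is illustrative.
REQUIRED_SECTIONS = [
    "Title and Metadata",
    "Basic Information",
    "File Type Details",
    "Data Analysis",
    "Key Findings",
    "Recommendations",
]


def render_report(filename, sections):
    """Render a markdown report; sections maps a section title to a list of bullets."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [f"# EDA Report: {filename}", "", f"Generated: {timestamp}", ""]
    for title in REQUIRED_SECTIONS:
        lines.append(f"## {title}")
        lines.append("")
        # Sections without content get a visible placeholder instead of vanishing.
        bullets = sections.get(title) or ["(not yet analyzed)"]
        lines.extend(f"- {item}" for item in bullets)
        lines.append("")
    return "\n".join(lines)
```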

Step 5: Save Report

Save the markdown report with a descriptive filename:

Pattern: {original_filename_without_extension}_eda_report.md
Example: experiment_data.fastq → experiment_data_eda_report.md
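
The naming pattern can be derived with `pathlib`, as in this minimal sketch. `report_path` and its `output_dir` parameter are hypothetical names introduced here.

```python
from pathlib import Path


def report_path(data_file, output_dir=None):
    """Build '<stem>_eda_report.md' next to the data file or in output_dir."""
    data_file = Path(data_file)
    target_dir = Path(output_dir) if output_dir is not None else data_file.parent
    return target_dir / f"{data_file.stem}_eda_report.md"
```

Note that `Path.stem` drops only the last suffix, so a compound extension such as `image.ome.tiff` would yield `image.ome_eda_report.md` unless it is handled separately.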

Detailed Format References

Each reference file contains comprehensive information for dozens of file types. To find information about a specific format:

Identify the category from the extension
Read the appropriate reference file
Search for the section heading matching the extension (e.g., "### .pdb")
Extract the format information

Reference File Structure

Each format entry includes:

Description: What the format is
Typical Data: What it contains
Use Cases: Common applications
Python Libraries: How to read it (with code examples)
EDA Approach: Specific analyses to perform

Example lookup:

### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure
**Use Cases:** Protein structure analysis, molecular visualization, docking
**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles)
- B-factor distribution
- Missing residues detection
- Ramachandran plots
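
If the entry layout shown in the example lookup holds across the reference files, an extracted section can be turned into a dictionary for easier reuse in reports. This is a sketch under that assumption; `parse_format_entry` is a hypothetical helper.

```python
import re

# Matches lines such as "**Description:** text" or "**Python Libraries:**".
FIELD = re.compile(r"^\*\*(?P<name>[^*]+?):\*\*\s*(?P<value>.*)$")


def parse_format_entry(section):
    """Turn one '### .ext' section into a dict of fields (strings or bullet lists)."""
    entry, current = {}, None
    for raw in section.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            entry["heading"] = line[4:]
        elif (m := FIELD.match(line)):
            current = m["name"]
            # An empty value means the field continues as a bullet list.
            entry[current] = m["value"] if m["value"] else []
        elif line.startswith("- ") and isinstance(entry.get(current), list):
            entry[current].append(line[2:])
    return entry
```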

Best Practices

Reading Reference Files

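Reference files are large (10,000+ words each). To use them efficiently:

Search by extension: Use grep or a regular expression to find the specific format

import re

# Read the reference file, then pull out only the section for one extension.
with open('references/chemistry_molecular_formats.md', 'r', encoding='utf-8') as f:
    content = f.read()

# Match from the "### .pdb" heading up to the next "### " heading or end of file.
pattern = r'^### \.pdb\b.*?(?=^### |\Z)'
match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
section = match.group(0) if match else None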

Extract relevant sections: Don't load entire reference files into context unnecessarily
Cache format info: If analyzing multiple files of the same type, reuse the format information
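
One way to reuse format information across several files of the same type is to memoize the lookup, for example with `functools.lru_cache`. The sketch below assumes each format entry starts with a `### <extension>` heading at the beginning of a line; `load_format_section` is a hypothetical helper.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def load_format_section(reference_path, extension):
    """Return the '### <extension>' section of a reference file, or None."""
    with open(reference_path, "r", encoding="utf-8") as f:
        content = f.read()
    # Capture from the matching heading up to the next "### " heading or EOF.
    pattern = rf"^### {re.escape(extension)}\b.*?(?=^### |\Z)"
    match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
    return match.group(0).strip() if match else None
```

Repeated calls with the same path and extension return the cached string without rereading the file. The cache does not notice edits to the reference file, so `load_format_section.cache_clear()` would be needed after changing it.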

Data Analysis

Sample large files: For files with millions of records, analyze a representative sample
Handle errors gracefully: Many scientific formats require specific libraries; provide clear installation instructions
Validate metadata: Cross-check metadata consistency (e.g., stated dimensions vs actual data)
Consider data provenance: Note instrument, software versions, processing steps

Report Generation

Be comprehensive: Include all relevant information for downstream analysis
Be specific: Provide concrete recommendations based on the file type
Be actionable: Suggest specific next steps and tools
Include code examples: Show how to load and work with the data

Examples

Example 1: Analyzing a FASTQ file

# User provides: "Analyze reads.fastq"

# 1. Detect file type
extension = '.fastq'
category = 'bioinformatics_genomics'

# 2. Read reference info
# Search references/bioinformatics_genomics_formats.md for "### .fastq"

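# 3. Perform analysis (SeqIO.parse yields records lazily, so iterate instead of loading all reads)
from Bio import SeqIO
lengths = [len(record) for record in SeqIO.parse('reads.fastq', 'fastq')]
read_count = len(lengths)
# Also calculate: length distribution, quality scores, GC content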

# 4. Generate report
# Include: format description, analysis results, QC recommendations

# 5. Save as: reads_eda_report.md

Example 2: Analyzing a CSV dataset

# User provides: "Explore experiment_results.csv"

# 1. Detect: .csv → general_scientific

# 2. Load reference for CSV format

# 3. Analyze
import pandas as pd
df = pd.read_csv('experiment_results.csv')
# Dimensions, dtypes, missing values, statistics, correlations

# 4. Generate report with:
# - Data structure
# - Missing value patterns
# - Statistical summaries
# - Correlation matrix
# - Outlier detection results

# 5. Save report

Example 3: Analyzing microscopy data

# User provides: "Analyze cells.nd2"

# 1. Detect: .nd2 → microscopy_imaging (Nikon format)

# 2. Read reference for ND2 format
# Learn: multi-dimensional (XYZCT), requires nd2reader

# 3. Analyze
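from nd2reader import ND2Reader
with ND2Reader('cells.nd2') as images:
    # Extract: dimensions, channels, timepoints, metadata
    sizes = images.sizes        # axis sizes, e.g. x, y, c, t, z
    metadata = images.metadata  # channels, pixel size, acquisition info
    # Calculate: intensity statistics, frame info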

# 4. Generate report with:
# - Image dimensions (XY, Z-stacks, time, channels)
# - Channel wavelengths
# - Pixel size and calibration
# - Recommendations for image analysis

# 5. Save report

Troubleshooting

Missing Libraries

Many scientific formats require specialized libraries:

Problem: Import error when trying to read a file

Solution: Provide clear installation instructions

try:
    from Bio import SeqIO
except ImportError:
    print("Install Biopython: uv pip install biopython")

Common requirements by category:

Bioinformatics: biopython, pysam, pyBigWig
Chemistry: rdkit, mdanalysis, cclib
Microscopy: tifffile, nd2reader, aicsimageio, pydicom
Spectroscopy: nmrglue, pymzml, pyteomics
General: pandas, numpy, h5py, scipy
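
Checking several requirements up front avoids hitting import errors one at a time. The sketch below uses `importlib.util.find_spec` from the standard library; the `REQUIREMENTS` table is a hypothetical excerpt, and the import name can differ from the install name (for example `Bio` versus `biopython`).

```python
import importlib.util

# Import name -> install name; the two differ for some packages.
REQUIREMENTS = {
    "Bio": "biopython",
    "pysam": "pysam",
    "h5py": "h5py",
}


def missing_packages(requirements):
    """Return install names for top-level modules that cannot be found."""
    return [
        install_name
        for import_name, install_name in requirements.items()
        if importlib.util.find_spec(import_name) is None
    ]


def install_hint(requirements):
    """Build a one-line install command, or an empty string if nothing is missing."""
    missing = missing_packages(requirements)
    return f"uv pip install {' '.join(missing)}" if missing else ""
```

Calling `install_hint(REQUIREMENTS)` before analysis gives a single message to show the user instead of a traceback.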

Unknown File Types

If a file extension is not in the references:

Ask the user about the file format
Check if it's a vendor-specific variant
Attempt generic analysis based on file structure (text vs binary)
Provide general recommendations
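
For the generic text-versus-binary check, a common heuristic is to inspect the first few kilobytes for NUL bytes and for valid UTF-8. This is a heuristic sketch, not a format detector: `sniff_file` is a hypothetical helper, and UTF-16 text would be reported as binary because it contains NUL bytes.

```python
def sniff_file(path, sample_size=8192):
    """Classify a file as 'empty', 'text', or 'binary' from its first bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return "empty"
    # NUL bytes almost never occur in text files, so treat them as a binary marker.
    if b"\x00" in sample:
        return "binary"
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError as err:
        # A multi-byte character cut off at the sample boundary is not evidence
        # of binary data; any other decoding error is.
        truncated = len(sample) == sample_size
        if not (truncated and err.start >= len(sample) - 3):
            return "binary"
    return "text"
```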

Large Files

For very large files:

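Use sampling strategies (first N records, or a random sample when record order may bias the result)
Use memory-mapped or lazy access (np.load with mmap_mode for NPY; slicing datasets with h5py for HDF5)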
Process in chunks (for CSV, FASTQ)
Provide estimates based on samples
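
When the first N records may not be representative (sorted files, for instance), reservoir sampling draws a uniform random sample in a single pass without knowing the record count in advance. The sketch below implements the classic Algorithm R; `reservoir_sample` and its `seed` parameter are names chosen for this example.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)
        else:
            # Replace an existing item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample
```

The function accepts any iterable, so it can wrap a streaming reader for CSV rows or FASTQ records while memory use stays bounded by `k`.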

Script Usage

The scripts/eda_analyzer.py script can be run directly:

# Basic usage
python scripts/eda_analyzer.py data.csv

# Specify output file
python scripts/eda_analyzer.py data.csv output_report.md

# The script will:
# 1. Auto-detect file type
# 2. Load format references
# 3. Perform appropriate analysis
# 4. Generate markdown report

The script supports automatic analysis for many common formats, but custom analysis in the conversation provides more flexibility and domain-specific insights.

Advanced Usage

Multi-File Analysis

When analyzing multiple related files:

Perform individual EDA on each file
Create a summary comparison report
Identify relationships and dependencies
Suggest integration strategies

Quality Control

For data quality assessment:

Check format compliance
Validate metadata consistency
Assess completeness
Identify outliers and anomalies
Compare to expected ranges/distributions
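
The source does not prescribe an outlier method. One common choice is the interquartile-range (IQR) rule, sketched here with the standard library; `iqr_outliers` and the default factor of 1.5 are assumptions for illustration, and domain-specific thresholds may be more appropriate.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    if len(values) < 4:
        return []  # too few points for meaningful quartiles
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]
```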

Preprocessing Recommendations

Based on data characteristics, recommend:

Normalization strategies
Missing value imputation
Outlier handling
Batch correction
Format conversions

Resources

scripts/

eda_analyzer.py: Comprehensive analysis script that can be run directly or imported

references/

chemistry_molecular_formats.md: 60+ chemistry/molecular file formats
bioinformatics_genomics_formats.md: 50+ bioinformatics formats
microscopy_imaging_formats.md: 45+ imaging formats
spectroscopy_analytical_formats.md: 35+ spectroscopy formats
proteomics_metabolomics_formats.md: 30+ omics formats
general_scientific_formats.md: 30+ general formats

assets/

report_template.md: Comprehensive markdown template for EDA reports
