Add PubMed mining script

Add Nextflow pipeline designer
Add DESeq2 workflow
2026-06-10 17:31:12 +00:00 · 2026-06-10 17:31:11 +00:00 · 2026-06-10 17:31:09 +00:00 · 2026-06-10 17:31:08 +00:00 · 2026-06-10 17:31:07 +00:00 · 2026-06-10 17:31:05 +00:00
6 changed files with 363 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,3 +1,10 @@
-# bioinformatics-code-prompts
+# Bioinformatics Code Prompts

-Code generation prompts for Python bioinformatics (Biopython, Scanpy, RDKit) and R/Bioconductor.
+Code generation and explanation prompts for Python bioinformatics
+(Biopython, Scanpy, RDKit) and R/Bioconductor workflows.
+
+## Source Repositories
+- [awesome-genomic-skills](https://github.com/GoekeLab/awesome-genomic-skills)
+- [awesome-computational-biology](https://github.com/inoue0426/awesome-computational-biology)
+- [Awesome_BigData_AI_DrugDiscovery](https://github.com/Bin-Chen-Lab/Awesome_BigData_AI_DrugDiscovery)
+- [scientific-agent-skills](https://github.com/K-Dense-AI/scientific-agent-skills)
--- a/databases/pubmed-literature-mining.md
+++ b/databases/pubmed-literature-mining.md
@@ -0,0 +1,75 @@
+---
+title: "PubMed Literature Mining Script"
+domain: bioinformatics
+persona: "Bioinformatician"
+persona_background: >
+  Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+persona_style: "code-first, reproducibility-focused, cites tools and versions"
+models: [gpt-4, claude-3-5]
+keywords: [PubMed, literature-mining, Entrez, Biopython, NLP]
+task: "Generate Python code to mine PubMed for structured biological information."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/GoekeLab/awesome-genomic-skills
+  - https://github.com/inoue0426/awesome-computational-biology
+---
+
+# PubMed Literature Mining Script
+
+## Persona
+
+> You are a **Bioinformatician**: a senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+> Your communication style is code-first and reproducibility-focused, and you cite tools and versions.
+
+## Task
+
+Generate Python code to mine PubMed for structured biological information.
+
+## Prompt
+
+```
+You are a bioinformatician building an automated literature mining pipeline.
+
+Generate Python code to:
+1. Query PubMed for: {search_query}
+   - Date range: {date_range}
+   - Maximum results: {max_results}
+   - Filters: {filters}
+
+2. For each paper extract:
+   - Title, authors, journal, year, PMID, DOI
+   - Abstract
+   - MeSH terms
+   - Chemical/gene mentions (using {ner_approach})
+
+3. Structure results as:
+   - pandas DataFrame with all fields
+   - JSON export with full metadata
+   - TSV for downstream analysis
+
+4. Generate summary statistics:
+   - Publication trend by year
+   - Top journals
+   - Co-occurrence network of key terms
+
+5. De-duplicate by DOI and title similarity
+
+Use Biopython Entrez with email={your_email}, and rate-limit requests to at most 3 per second (NCBI's limit without an API key).
+```
+
+## Notes
+
+References: SRAgent (Arc Institute) for SRA database querying patterns; GoekeLab/awesome-genomic-skills.
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`PubMed` `literature-mining` `Entrez` `Biopython` `NLP`
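Step 5 of the prompt above asks for de-duplication by DOI and title similarity. The following is a minimal sketch of one way generated code might approach it, using only the Python standard library. The record layout (dicts with `doi` and `title` keys), the function names, and the 0.9 similarity threshold are assumptions made for illustration; none of them come from the prompt.

```python
from difflib import SequenceMatcher


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation and whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def dedupe_records(records, threshold=0.9):
    """Keep the first record per DOI, then drop near-identical titles.

    `records` is a hypothetical list of dicts with optional "doi" and
    "title" keys; `threshold` is an assumed similarity cut-off.
    """
    seen_dois = set()
    kept = []
    kept_titles = []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        if doi and doi in seen_dois:
            continue  # exact DOI duplicate (case-insensitive)
        title = normalise_title(rec.get("title") or "")
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= threshold
            for other in kept_titles
        ):
            continue  # near-duplicate title
        if doi:
            seen_dois.add(doi)
        kept.append(rec)
        if title:
            kept_titles.append(title)
    return kept


# Hypothetical records, invented for illustration only.
papers = [
    {"doi": "10.1000/a", "title": "Single-cell atlas of the human lung"},
    {"doi": "10.1000/A", "title": "Single-cell atlas of the human lung"},
    {"doi": None, "title": "Single cell atlas of the human lung."},
    {"doi": "10.1000/b", "title": "Deep learning for variant calling"},
]
unique = dedupe_records(papers)
print(len(unique))
```

The pairwise title comparison is quadratic in the number of records, which is usually acceptable for a few thousand abstracts. Larger result sets would likely need blocking, for example comparing titles only within the same publication year.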
--- a/pipelines/nextflow-design.md
+++ b/pipelines/nextflow-design.md
@@ -0,0 +1,71 @@
+---
+title: "Nextflow Pipeline Designer"
+domain: bioinformatics
+persona: "Bioinformatician"
+persona_background: >
+  Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+persona_style: "code-first, reproducibility-focused, cites tools and versions"
+models: [gpt-4, claude-3-5]
+keywords: [Nextflow, pipeline, workflow, DSL2, containerisation]
+task: "Design and generate a Nextflow DSL2 bioinformatics pipeline."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/GoekeLab/awesome-genomic-skills
+---
+
+# Nextflow Pipeline Designer
+
+## Persona
+
+> You are a **Bioinformatician**: a senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+> Your communication style is code-first and reproducibility-focused, and you cite tools and versions.
+
+## Task
+
+Design and generate a Nextflow DSL2 bioinformatics pipeline.
+
+## Prompt
+
+```
+You are a pipeline engineer expert in Nextflow DSL2 and bioinformatics workflow design.
+
+Design a Nextflow DSL2 pipeline for:
+- Analysis type: {analysis_type}
+- Input: {input_description}
+- Tools required: {tools}
+- Reference files: {references}
+- HPC/cloud: {compute_environment}
+
+Generate:
+1. main.nf with workflow definition
+2. modules/ structure (one process per tool)
+3. nextflow.config with resource profiles
+4. params.yml template
+5. Docker/Singularity container specifications
+
+Each process should include:
+- Tag directives for logging
+- Error strategy (retry/ignore)
+- Resource labels (small/medium/large)
+- Input/output type declarations
+- publishDir for results
+
+Include a workflow diagram in Mermaid format.
+```
+
+## Notes
+
+Follows nf-core pipeline standards. Reference: GoekeLab/awesome-genomic-skills (BioAgent Bench pipeline tasks).
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`Nextflow` `pipeline` `workflow` `DSL2` `containerisation`
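The prompt above leaves its inputs as `{placeholder}` fields. The following is a minimal sketch of how such a template might be filled before it is sent to a model, assuming the braces are plain Python `str.format` fields. The helper names and the example values are hypothetical, and only an excerpt of the prompt is reproduced.

```python
from string import Formatter

# Excerpt of the Nextflow prompt; the remaining lines are omitted for brevity.
TEMPLATE = (
    "Design a Nextflow DSL2 pipeline for:\n"
    "- Analysis type: {analysis_type}\n"
    "- Input: {input_description}\n"
    "- Tools required: {tools}\n"
    "- Reference files: {references}\n"
    "- HPC/cloud: {compute_environment}\n"
)


def placeholder_names(template: str) -> list[str]:
    """Return the distinct {field} names in order of first appearance."""
    names = []
    for _, field, _, _ in Formatter().parse(template):
        if field and field not in names:
            names.append(field)
    return names


def fill_prompt(template: str, values: dict) -> str:
    """Fill a template, failing loudly if a placeholder has no value."""
    missing = [n for n in placeholder_names(template) if n not in values]
    if missing:
        raise KeyError(f"missing placeholder values: {missing}")
    return template.format(**values)


# Hypothetical values, chosen only for illustration.
values = {
    "analysis_type": "bulk RNA-seq quantification",
    "input_description": "paired-end FASTQ files",
    "tools": "FastQC, Salmon",
    "references": "transcriptome FASTA",
    "compute_environment": "SLURM cluster",
}
prompt = fill_prompt(TEMPLATE, values)
print(prompt)
```

Checking for missing fields up front avoids sending a prompt that still contains a literal `{tools}`, which a model may silently interpret in unpredictable ways.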
--- a/python/rdkit/molecular-fingerprinting.md
+++ b/python/rdkit/molecular-fingerprinting.md
@@ -0,0 +1,67 @@
+---
+title: "RDKit Molecular Property Calculator"
+domain: bioinformatics
+persona: "Computational Chemist"
+persona_background: >
+  Computational chemist expert in molecular docking, QSAR modelling, and virtual screening.
+persona_style: "quantitative, references docking scores and force fields"
+models: [gpt-4, claude-3-5]
+keywords: [RDKit, cheminformatics, molecular-properties, SMILES, fingerprints]
+task: "Generate Python code for molecular property calculation and filtering using RDKit."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/K-Dense-AI/scientific-agent-skills
+  - https://github.com/Bin-Chen-Lab/Awesome_BigData_AI_DrugDiscovery
+---
+
+# RDKit Molecular Property Calculator
+
+## Persona
+
+> You are a **Computational Chemist**: an expert in molecular docking, QSAR modelling, and virtual screening.
+> Your communication style is quantitative, and you reference docking scores and force fields.
+
+## Task
+
+Generate Python code for molecular property calculation and filtering using RDKit.
+
+## Prompt
+
+```
+You are a cheminformatics expert using RDKit for drug-like property analysis.
+
+Generate Python code to:
+1. Load molecules from: {input_format} (SMILES list / SDF / CSV)
+2. Calculate Lipinski Ro5 properties (MW, LogP, HBD, HBA)
+3. Calculate additional drug-likeness metrics: {additional_metrics}
+4. Apply filters: {filters}
+5. Generate Morgan fingerprints (radius={radius}, nbits={nbits})
+6. Calculate Tanimoto similarity to reference: {reference_smiles}
+7. Visualise molecules failing filters
+8. Export passing compounds to {output_format}
+
+Include:
+- Proper error handling for invalid SMILES
+- Progress bar for large datasets
+- Summary statistics table
+- Scatter plot of MW vs LogP with Ro5 boundaries
+
+Use pandas, matplotlib, and rdkit.Chem standard practices.
+```
+
+## Notes
+
+References: ChemDescriptor and RDKit tutorials; K-Dense-AI/scientific-agent-skills (cheminformatics skills); Bin-Chen-Lab/Awesome_BigData_AI_DrugDiscovery.
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`RDKit` `cheminformatics` `molecular-properties` `SMILES` `fingerprints`
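Steps 2, 4 and 6 of the prompt above rest on two small pieces of arithmetic: counting Lipinski rule-of-five violations and computing Tanimoto similarity between fingerprints. The sketch below shows both in plain Python so the logic can be checked without RDKit installed. It is a stand-in, not RDKit code: in generated code the properties would come from RDKit descriptors and the similarity from `DataStructs.TanimotoSimilarity`. The compound names, property values, bit sets, and the "at most one violation" pass rule are assumptions made for illustration.

```python
# Lipinski rule-of-five upper bounds for each property.
RO5_LIMITS = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def ro5_violations(props: dict) -> int:
    """Count how many rule-of-five limits a compound exceeds."""
    return sum(1 for key, limit in RO5_LIMITS.items() if props[key] > limit)


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(bits_a & bits_b) / len(union)


# Hypothetical compounds with invented property values.
compounds = {
    "cmpd_A": {"mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    "cmpd_B": {"mw": 712.9, "logp": 6.3, "hbd": 4, "hba": 12},
}

# Assumed pass rule: tolerate at most one violation.
passing = [name for name, p in compounds.items() if ro5_violations(p) <= 1]
print(passing)

# Hypothetical on-bit indices standing in for Morgan fingerprint bits.
reference_bits = {1, 5, 9, 42}
query_bits = {1, 5, 77}
print(tanimoto(reference_bits, query_bits))
```

Representing a fingerprint as a set of on-bit indices keeps the similarity formula visible: shared bits divided by all bits set in either fingerprint.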
--- a/python/scanpy/scrna-seq-pipeline.md
+++ b/python/scanpy/scrna-seq-pipeline.md
@@ -0,0 +1,71 @@
+---
+title: "scRNA-seq Analysis Pipeline Generator"
+domain: bioinformatics
+persona: "Bioinformatician"
+persona_background: >
+  Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+persona_style: "code-first, reproducibility-focused, cites tools and versions"
+models: [gpt-4, claude-3-5]
+keywords: [scRNA-seq, Scanpy, single-cell, clustering, UMAP, Seurat]
+task: "Generate a complete single-cell RNA-seq analysis pipeline in Python using Scanpy."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/inoue0426/awesome-computational-biology
+  - https://github.com/GoekeLab/awesome-genomic-skills
+---
+
+# scRNA-seq Analysis Pipeline Generator
+
+## Persona
+
+> You are a **Bioinformatician**: a senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+> Your communication style is code-first and reproducibility-focused, and you cite tools and versions.
+
+## Task
+
+Generate a complete single-cell RNA-seq analysis pipeline in Python using Scanpy.
+
+## Prompt
+
+```
+You are a senior bioinformatician specialising in single-cell genomics.
+
+Generate a complete, runnable Scanpy pipeline for:
+- Data: {data_description}
+- Input format: {input_format} (10x/h5ad/loom)
+- Organism: {organism}
+- Expected cell types: {expected_cell_types}
+- Analysis goals: {goals}
+
+Include:
+1. Data loading and quality control (mitochondrial %, doublet detection)
+2. Normalisation and log-transformation
+3. Highly variable gene selection
+4. PCA and batch correction (if applicable: {batch_correction_method})
+5. Neighbourhood graph and UMAP
+6. Leiden clustering (resolution: {resolution})
+7. Marker gene identification (Wilcoxon rank-sum)
+8. Cell type annotation
+9. Differential expression between conditions: {conditions}
+10. Visualisation code (UMAP, dotplot, violin)
+
+Add comments explaining biological rationale for each step.
+Include error handling for common issues (empty droplets, batch effects).
+```
+
+## Notes
+
+References: scGPT and scFoundation foundation models for annotation validation; awesome-computational-biology (inoue0426).
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`scRNA-seq` `Scanpy` `single-cell` `clustering` `UMAP` `Seurat`
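Step 1 of the prompt above asks for quality control on mitochondrial percentage. The sketch below shows the underlying calculation in plain Python so it can be read without Scanpy installed. It is a stand-in for what Scanpy's QC metrics provide, not Scanpy code. The per-cell count dicts, the `MT-` gene prefix convention, and the 20% cut-off are assumptions made for illustration.

```python
def percent_mito(cell_counts: dict, prefix: str = "MT-") -> float:
    """Percentage of a cell's counts that come from mitochondrial genes.

    Gene names are upper-cased before matching, so mouse-style "mt-"
    names are matched by the same prefix.
    """
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0  # empty droplet: no counts at all
    mito = sum(
        count
        for gene, count in cell_counts.items()
        if gene.upper().startswith(prefix)
    )
    return 100.0 * mito / total


# Hypothetical raw counts per cell, invented for illustration.
cells = {
    "cell_1": {"MT-CO1": 5, "ACTB": 60, "GAPDH": 35},
    "cell_2": {"MT-CO1": 40, "MT-ND1": 20, "ACTB": 40},
    "cell_3": {},
}

MAX_PCT_MITO = 20.0  # assumed threshold; tune per tissue and protocol

kept = [
    name
    for name, counts in cells.items()
    if sum(counts.values()) > 0 and percent_mito(counts) < MAX_PCT_MITO
]
print(kept)
```

Cells with no counts are dropped explicitly rather than passed through with a mitochondrial percentage of zero, which is one way to cover the "empty droplets" case the prompt mentions.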
--- a/r/bioconductor/deseq2-analysis.md
+++ b/r/bioconductor/deseq2-analysis.md
@@ -0,0 +1,70 @@
+---
+title: "DESeq2 Differential Expression Workflow (R)"
+domain: bioinformatics
+persona: "Bioinformatician"
+persona_background: >
+  Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+persona_style: "code-first, reproducibility-focused, cites tools and versions"
+models: [gpt-4, claude-3-5]
+keywords: [DESeq2, RNA-seq, differential-expression, R, Bioconductor]
+task: "Generate a complete DESeq2 differential expression analysis in R."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/inoue0426/awesome-computational-biology
+---
+
+# DESeq2 Differential Expression Workflow (R)
+
+## Persona
+
+> You are a **Bioinformatician**: a senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+> Your communication style is code-first and reproducibility-focused, and you cite tools and versions.
+
+## Task
+
+Generate a complete DESeq2 differential expression analysis in R.
+
+## Prompt
+
+```
+You are a bioinformatician expert in R/Bioconductor RNA-seq analysis.
+
+Generate a complete DESeq2 workflow for:
+- Count matrix: {count_matrix_description}
+- Metadata: {metadata_description}
+- Design formula: {design_formula}
+- Contrast: {contrast}
+- Organism: {organism} (for annotation)
+
+Include:
+1. Data loading and colData creation
+2. DESeqDataSet construction with design
+3. Pre-filtering (low count removal)
+4. DESeq() size-factor normalisation, dispersion estimation, and model fitting
+5. Results extraction with {padj_threshold} FDR threshold
+6. Independent filtering plot
+7. MA plot and volcano plot (ggplot2)
+8. Heatmap of top 50 DE genes (pheatmap)
+9. PCA plot coloured by condition
+10. GO/KEGG enrichment with clusterProfiler
+11. Results export to CSV
+
+Add statistical QC notes for each step.
+```
+
+## Notes
+
+References: DESeq2 paper (Love et al. 2014) best practices; awesome-computational-biology (inoue0426).
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`DESeq2` `RNA-seq` `differential-expression` `R` `Bioconductor`
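Step 5 of the prompt above filters results at an FDR threshold on adjusted p-values. DESeq2 adjusts p-values with the Benjamini-Hochberg procedure by default, and the sketch below shows that arithmetic in isolation. The prompt itself targets R; Python is used here only to illustrate the calculation, and the sketch omits DESeq2's independent filtering. The function name, the p-values, and the 0.05 threshold are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
padj = benjamini_hochberg(raw)

PADJ_THRESHOLD = 0.05  # assumed value for the {padj_threshold} placeholder
significant = [i for i, p in enumerate(padj) if p < PADJ_THRESHOLD]
print(padj)
print(significant)
```

Note that a gene with a raw p-value of 0.04 does not pass a 0.05 FDR threshold here: after adjustment only the smallest p-value remains below the cut-off.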
Author	SHA1	Message	Date
promptadmin	7bde4091f5	Add PubMed mining script	2026-06-10 17:31:12 +00:00
promptadmin	b829de903d	Add Nextflow pipeline designer	2026-06-10 17:31:11 +00:00
promptadmin	f69738c96b	Add DESeq2 workflow	2026-06-10 17:31:09 +00:00
promptadmin	ea44e46f00	Add RDKit property calculator	2026-06-10 17:31:08 +00:00
promptadmin	f7532b0ea4	Add scRNA-seq pipeline generator	2026-06-10 17:31:07 +00:00
promptadmin	c10216df36	Add README	2026-06-10 17:31:05 +00:00