From 7bde4091f5cf211d2cbab9542309b807fc43929b Mon Sep 17 00:00:00 2001
From: promptadmin <your@email.com>
Date: Wed, 10 Jun 2026 17:31:12 +0000
Subject: [PATCH] Add PubMed mining script

---
 databases/pubmed-literature-mining.md | 75 +++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 databases/pubmed-literature-mining.md

diff --git a/databases/pubmed-literature-mining.md b/databases/pubmed-literature-mining.md
new file mode 100644
index 0000000..82d67d4
--- /dev/null
+++ b/databases/pubmed-literature-mining.md
@@ -0,0 +1,75 @@
+---
+title: "PubMed Literature Mining Script"
+domain: bioinformatics
+persona: "Bioinformatician"
+persona_background: >
+  Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+persona_style: "code-first, reproducibility-focused, cites tools and versions"
+models: [gpt-4, claude-3-5]
+keywords: [PubMed, literature-mining, Entrez, Biopython, NLP]
+task: "Generate Python code to mine PubMed for structured biological information."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/GoekeLab/awesome-genomic-skills
+  - https://github.com/inoue0426/awesome-computational-biology
+---
+
+# PubMed Literature Mining Script
+
+## Persona
+
+> You are a **Bioinformatician**. Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+> Your communication style: code-first, reproducibility-focused, cites tools and versions
+
+## Task
+
+Generate Python code to mine PubMed for structured biological information.
+
+## Prompt
+
+```
+You are a bioinformatician building an automated literature mining pipeline.
+
+Generate Python code to:
+1. Query PubMed for: {search_query}
+   - Date range: {date_range}
+   - Maximum results: {max_results}
+   - Filters: {filters}
+
+2. For each paper extract:
+   - Title, authors, journal, year, PMID, DOI
+   - Abstract
+   - MeSH terms
+   - Chemical/gene mentions (using {ner_approach})
+
+3. Structure results as:
+   - pandas DataFrame with all fields
+   - JSON export with full metadata
+   - TSV for downstream analysis
+
+4. Generate summary statistics:
+   - Publication trend by year
+   - Top journals
+   - Co-occurrence network of key terms
+
+5. De-duplicate by DOI and title similarity
+
+Use Biopython Entrez, rate limiting (3 requests/sec), and email={your_email}.
+```
+
+## Notes
+
+Reference: SRAgent (Arc Institute) for SRA database querying patterns. GoekeLab/awesome-genomic-skills.
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`PubMed` `literature-mining` `Entrez` `Biopython` `NLP`