Add PubMed mining script

2026-06-10 17:31:12 +00:00 · 2026-06-10 17:31:12 +00:00 · 7bde4091f5
parent b829de903d
commit 7bde4091f5
1 changed file with 75 additions and 0 deletions
--- a/databases/pubmed-literature-mining.md
+++ b/databases/pubmed-literature-mining.md
@@ -0,0 +1,75 @@
+---
+title: "PubMed Literature Mining Script"
+domain: bioinformatics
+persona: "Bioinformatician"
+persona_background: >
+  Senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+persona_style: "code-first, reproducibility-focused, cites tools and versions"
+models: [gpt-4, claude-3-5]
+keywords: [PubMed, literature-mining, Entrez, Biopython, NLP]
+task: "Generate Python code to mine PubMed for structured biological information."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/GoekeLab/awesome-genomic-skills
+  - https://github.com/inoue0426/awesome-computational-biology
+---
+
+# PubMed Literature Mining Script
+
+## Persona
+
+> You are a **Bioinformatician**: a senior bioinformatician with expertise in NGS pipelines, single-cell analysis, and workflow management (Nextflow/Snakemake).
+> Your communication style: code-first, reproducibility-focused, cites tools and versions.
+
+## Task
+
+Generate Python code to mine PubMed for structured biological information.
+
+## Prompt
+
+```
+You are a bioinformatician building an automated literature mining pipeline.
+
+Generate Python code to:
+1. Query PubMed for: {search_query}
+   - Date range: {date_range}
+   - Maximum results: {max_results}
+   - Filters: {filters}
+
+2. For each paper extract:
+   - Title, authors, journal, year, PMID, DOI
+   - Abstract
+   - MeSH terms
+   - Chemical/gene mentions (using {ner_approach})
+
+3. Structure results as:
+   - pandas DataFrame with all fields
+   - JSON export with full metadata
+   - TSV for downstream analysis
+
+4. Generate summary statistics:
+   - Publication trend by year
+   - Top journals
+   - Co-occurrence network of key terms
+
+5. De-duplicate by DOI and title similarity
+
+Use Biopython's Bio.Entrez module, limit requests to at most 3 per second (the NCBI limit without an API key), and set Entrez.email={your_email}.
+```
+
+## Notes
+
+References: SRAgent (Arc Institute) for SRA database querying patterns; GoekeLab/awesome-genomic-skills (https://github.com/GoekeLab/awesome-genomic-skills).
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`PubMed` `literature-mining` `Entrez` `Biopython` `NLP`
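
Two requirements in the prompt are easy to get subtly wrong in generated code: the 3 requests/sec rate limit and step 5, de-duplication by DOI and title similarity. The sketch below shows one way the output might handle them using only the standard library. It is an illustration, not part of the committed file: `Throttle`, `normalize_title`, `deduplicate`, the `doi`/`title` record keys, and the 0.95 similarity threshold are all assumptions chosen for this example.

```python
import re
import time
from difflib import SequenceMatcher


class Throttle:
    """Space successive calls at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def normalize_title(title):
    """Lowercase and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records, title_threshold=0.95):
    """Return records with duplicates removed, keeping first occurrences.

    A record is dropped if its DOI (case-insensitive) was already seen, or
    if its normalized title is at least `title_threshold` similar to the
    title of a record that was kept. The threshold is an assumed default.
    """
    kept = []
    seen_dois = set()
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        if doi and doi in seen_dois:
            continue
        title = normalize_title(rec.get("title") or "")
        is_duplicate_title = bool(title) and any(
            SequenceMatcher(
                None, title, normalize_title(k.get("title") or "")
            ).ratio() >= title_threshold
            for k in kept
        )
        if is_duplicate_title:
            continue
        if doi:
            seen_dois.add(doi)
        kept.append(rec)
    return kept
```

In a real pipeline, `Throttle().wait()` would be called immediately before each Entrez request, and `deduplicate` would run on the list of parsed records before building the DataFrame. Note two design consequences of this sketch: the title check also merges records that have different DOIs but near-identical titles (for example a preprint and its published version), and the pairwise comparison is quadratic in the number of records, so large result sets may need blocking by year or first author first.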