Automated Lead Identification, Enrichment, and Ranking for 3D In-Vitro Model Adoption
The Business Development Intelligence System is an end-to-end data pipeline and dashboard designed to help business development teams identify, enrich, and prioritize high-probability leads for 3D in-vitro models used in drug discovery and therapy design.
The system automates what is traditionally a manual and error-prone workflow involving PubMed reviews, conference scanning, and funding research, and replaces it with a repeatable, explainable, and configurable intelligence pipeline.
The final output is a ranked, searchable, and exportable dashboard showing decision-makers most likely to engage, based on scientific intent, organizational readiness, and budget signals.
Business development in biotech faces three core challenges:
- Low signal-to-noise: Thousands of scientists exist, but only a small subset are relevant, funded, and ready to engage.
- Manual research overhead: LinkedIn searches, PubMed reviews, and funding checks are slow and inconsistent.
- Poor prioritization: Teams lack a systematic way to decide who to contact first.
This system addresses these challenges by turning publicly available scientific, professional, and business data into a prioritized lead list with an explainable propensity-to-buy score.
The system is organized into three sequential stages, followed by a presentation layer:
→ Stage 1: Identification
→ Stage 2: Enrichment
→ Stage 3: Ranking
→ Lead Generation Dashboard
Each stage is independently runnable and testable, but also orchestrated through a single pipeline runner.
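The orchestration described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual runner: `run_pipeline`, `make_stage`, and `build_parser` are hypothetical names, and the stages are stand-in callables. The flag names mirror the CLI invocations documented later in this README; their types and defaults are assumptions.

```python
import argparse
from typing import Callable, Dict, List

# A stage takes the shared context and returns an updated copy (assumed shape).
Stage = Callable[[Dict], Dict]


def make_stage(name: str) -> Stage:
    """Build a stand-in stage that only records that it ran."""
    def stage(context: Dict) -> Dict:
        return {**context, "completed": context.get("completed", []) + [name]}
    return stage


def run_pipeline(stages: List[Stage], context: Dict) -> Dict:
    """Run each stage in order, feeding the output of one into the next."""
    for stage in stages:
        context = stage(context)
    return context


def build_parser() -> argparse.ArgumentParser:
    """CLI flags as shown in this README; defaults here are assumptions."""
    parser = argparse.ArgumentParser(prog="pipeline_run_all")
    parser.add_argument("--mode", choices=["dummy", "actual"], default="dummy")
    parser.add_argument("--max-linkedin-leads", type=int, default=None)
    parser.add_argument("--max-pubmed-leads", type=int, default=None)
    parser.add_argument("--max-conference-leads", type=int, default=None)
    return parser


# Example: parse a dummy-mode invocation and run three placeholder stages.
args = build_parser().parse_args(["--mode", "dummy"])
result = run_pipeline(
    [make_stage("identification"), make_stage("enrichment"), make_stage("ranking")],
    {"mode": args.mode},
)
```

Keeping each stage as a plain callable is one way to let the same function be invoked by the runner or on its own during debugging.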
Objective: Maximize recall by identifying all potentially relevant individuals before any scoring or filtering.
Data Sources:
- LinkedIn / Sales Navigator results provided as a pre-generated CSV export (e.g., from Clay) located in src/data/input/
- Scientific publication databases (PubMed)
- Biomedical conferences (SOT, AACR, ISSX, ACT)
What is identified:
- Scientists and decision-makers in toxicology, safety, hepatic biology, and preclinical research
- Authors of recent (last 24 months) relevant publications
- Conference speakers, poster presenters, and exhibitors
Output:
stage1_candidates.csv
- Each candidate retains source evidence (LinkedIn / PubMed / Conference) without scoring
This stage intentionally allows noise and duplicates, prioritizing coverage over precision to avoid missing high-value leads.
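As an illustration of the recall-first approach, the sketch below concatenates records from each source and tags every row with its origin, without deduplicating or scoring. The function name `collect_candidates`, the record fields, and the sample rows are all hypothetical; the real Stage 1 code may be organized differently.

```python
from typing import Dict, Iterable, List


def collect_candidates(
    sources: Dict[str, Iterable[Dict[str, str]]]
) -> List[Dict[str, str]]:
    """Merge records from all sources, keeping duplicates and source evidence."""
    candidates: List[Dict[str, str]] = []
    for source_name, records in sources.items():
        for record in records:
            # Tag the row with where it came from; no filtering is applied.
            candidates.append({**record, "source": source_name})
    return candidates


# Made-up sample rows: the same person appears in two sources on purpose.
sample_sources = {
    "LinkedIn": [{"name": "Sample Person", "company": "Example Bio"}],
    "PubMed": [{"name": "Sample Person", "company": "Example Bio"}],
    "Conference": [{"name": "Other Person", "company": "Example Pharma"}],
}
stage1_rows = collect_candidates(sample_sources)
```

Here the duplicate is kept as two rows, one per source, so later stages can see every piece of evidence for the same person.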
Objective: Convert identified candidates into BD-ready profiles by adding contactability, geographic clarity, and budget context.
Enrichment Signals:
- Business email and phone (Apollo, with mock fallback)
- Job title normalization
- Employment tenure (new-hire signal)
- Person location (remote vs office)
- Company HQ location
- Funding stage (Seed, Series A, Series B, Public, Grant-funded)
- Active researcher flag
- Publication count
- NIH grant detection (NIH RePORTER)
Output:
stage2_enriched.csv
At this stage, no ranking is applied. The goal is to gather all evidence required for scoring.
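The "Apollo, with mock fallback" behavior listed above can be pictured with a small sketch. It does not call the real Apollo API: `lookup` is a hypothetical callable standing in for whatever provider client the project uses, and `mock_contact`, `enrich_contact`, and the field names are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical provider interface: (name, company) -> contact details or None.
ContactLookup = Callable[[str, str], Optional[Dict[str, str]]]


def mock_contact(name: str, company: str) -> Dict[str, str]:
    """Build deterministic placeholder contact details for dummy mode."""
    local = ".".join(name.lower().split())
    domain = "".join(ch for ch in company.lower() if ch.isalnum()) or "unknown"
    # The reserved .example TLD makes it obvious the address is not real.
    return {"email": f"{local}@{domain}.example", "phone": "", "contact_source": "mock"}


def enrich_contact(
    candidate: Dict[str, str], lookup: Optional[ContactLookup] = None
) -> Dict[str, str]:
    """Use the provider when available; otherwise fall back to mock data."""
    found = None
    if lookup is not None:
        try:
            found = lookup(candidate["name"], candidate["company"])
        except Exception:
            found = None  # treat provider errors as a miss and fall back
    if found:
        details = {**found, "contact_source": "provider"}
    else:
        details = mock_contact(candidate["name"], candidate["company"])
    return {**candidate, **details}


enriched = enrich_contact({"name": "Jane Doe", "company": "Acme Bio"})
```

Recording `contact_source` on every row is one way to keep mock data from being mistaken for verified contact details downstream.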
Objective: Assign an explainable 0–100 probability score representing the likelihood that a person wants to work with 3D in-vitro models.
The model is rule-based, deterministic, and fully explainable, aligned exactly with the assignment rubric.
| Signal Category | Criteria | Weight |
|---|---|---|
| Role Fit | Toxicology, Safety, Hepatic, 3D roles | +30 |
| Scientific Intent | Recent publication on liver toxicity / DILI | +40 |
| Company Intent | Series A / B funding | +20 |
| Technographic | Uses or is open to advanced models (proxy) | +15 |
| Location | Biotech hubs (Boston, Cambridge, Basel, etc.) | +10 |
| Tenure | New hire (<2.5 years) | +5 |
All weights, keywords, and hub locations are externalized into YAML configuration files.
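A rule-based scorer using the weights in the table might look like the sketch below. The signal keys and the `score_lead` function are hypothetical names. Because the listed weights sum to 120 while the score is described as 0–100, the sketch caps the total at 100; that cap is an assumption, not something this README specifies.

```python
from typing import Dict, Tuple

# Weights copied from the table above; key names are illustrative.
WEIGHTS: Dict[str, int] = {
    "role_fit": 30,
    "scientific_intent": 40,
    "company_intent": 20,
    "technographic": 15,
    "location": 10,
    "tenure": 5,
}
MAX_SCORE = 100  # assumption: cap the total, since the weights sum to 120


def score_lead(
    signals: Dict[str, bool], weights: Dict[str, int] = WEIGHTS
) -> Tuple[int, Dict[str, int]]:
    """Return (score, breakdown); each matched signal contributes its weight."""
    breakdown = {
        name: (weight if signals.get(name, False) else 0)
        for name, weight in weights.items()
    }
    total = min(sum(breakdown.values()), MAX_SCORE)
    return total, breakdown


# A lead with a matching role, a biotech-hub location, and nothing else.
score, breakdown = score_lead({"role_fit": True, "location": True})
```

In this example the lead scores 40 (30 for role fit plus 10 for location), and the breakdown dictionary shows which signals contributed, which is what makes the score explainable.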
Output:
stage3_ranked_leads.csv
- Includes score breakdown for transparency
All business logic is configurable without code changes:
- keywords.yaml – role and scientific keywords
- hubs.yaml – geographic hubs
- scoring_weights.yaml – scoring weights
This allows:
- Rapid tuning by Business Development teams
- Easy regional or therapeutic focus changes
- Auditability of scoring logic
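Since non-developers may edit these files, a small validation step before scoring can catch mistakes early. The sketch below assumes the YAML has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`) and that the weight keys match the names used here; both the key names and the `validate_weights` helper are hypothetical.

```python
from typing import Dict, List

# Assumed signal names; the real keys in scoring_weights.yaml may differ.
EXPECTED_SIGNALS = (
    "role_fit",
    "scientific_intent",
    "company_intent",
    "technographic",
    "location",
    "tenure",
)


def validate_weights(weights: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; empty means the config is usable."""
    problems: List[str] = []
    for name in EXPECTED_SIGNALS:
        if name not in weights:
            problems.append(f"missing weight: {name}")
    for name, value in weights.items():
        if name not in EXPECTED_SIGNALS:
            problems.append(f"unknown signal: {name}")
        elif isinstance(value, bool) or not isinstance(value, (int, float)) or value < 0:
            # bool is excluded explicitly because it is a subclass of int.
            problems.append(f"invalid weight for {name}: {value!r}")
    return problems


# A config with one misspelled key and one negative weight.
issues = validate_weights(
    {"role_fit": 30, "scientific_intent": 40, "company_intent": 20,
     "technographic": 15, "location": -10, "tenur": 5}
)
```

Returning every problem at once, rather than raising on the first, gives the person editing the file a complete list to fix in one pass.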
src/
├── config/
│ ├── keywords.yaml
│ ├── hubs.yaml
│ └── scoring_weights.yaml
│
├── ingestion/ # Stage 1
├── enrichment/ # Stage 2
├── scoring/ # Stage 3
├── pipeline_run_all.py # Orchestration
├── dashboard/ # Streamlit UI
└── data/ # CSV outputs
The system provides a single CLI entry point that runs all stages in order.
- Dummy mode: Uses mock enrichment (no paid APIs)
- Actual mode: Uses real APIs when keys are provided
python -m src.pipeline_run_all --mode dummy
python -m src.pipeline_run_all --mode actual --max-linkedin-leads <value> --max-pubmed-leads <value> --max-conference-leads <value>

Each stage can also be run independently for debugging or development.
python -m src.ingestion.stage1_pipeline
python -m src.enrichment.stage2_pipeline
python -m src.scoring.stage3_pipeline

The final output is a Streamlit web dashboard designed for business users.
- Ranked lead table
- Free-text search (e.g. “Boston”, “Oncology”)
- Clear split between person location and company HQ
- Action column for outreach prioritization
- CSV and Excel export (strict schema)
- Rank
- Probability
- Name
- Title
- Company
- Location
- HQ
- Action
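The free-text search and strict-schema export could work along the lines of the sketch below. It uses only the standard library rather than Streamlit, and it assumes the eight columns listed above are the export schema; `search_leads` and `export_csv` are hypothetical helper names, and the sample row is made up.

```python
import csv
import io
from typing import Dict, List

# Assumed export schema, taken from the column list above.
EXPORT_COLUMNS = ["Rank", "Probability", "Name", "Title", "Company", "Location", "HQ", "Action"]


def search_leads(leads: List[Dict[str, object]], query: str) -> List[Dict[str, object]]:
    """Case-insensitive substring match across every field of each lead."""
    needle = query.strip().lower()
    if not needle:
        return list(leads)
    return [
        lead for lead in leads
        if any(needle in str(value).lower() for value in lead.values())
    ]


def export_csv(leads: List[Dict[str, object]]) -> str:
    """Write only the schema columns, in fixed order; missing values become empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPORT_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for lead in leads:
        writer.writerow({column: lead.get(column, "") for column in EXPORT_COLUMNS})
    return buffer.getvalue()


# Made-up row with an extra internal field that must not leak into the export.
sample_leads = [
    {"Rank": 1, "Probability": 85, "Name": "Sample Lead",
     "Location": "Boston", "internal_note": "x"},
]
matches = search_leads(sample_leads, "boston")
exported = export_csv(matches)
```

Building each output row from `EXPORT_COLUMNS` alone keeps the column order fixed and drops any field that is not part of the schema.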
streamlit run src/dashboard/app.py

- No direct LinkedIn scraping
- Professional data accessed via compliant tools (Clay, Apollo)
- Scientific data sourced from public APIs (PubMed, NIH RePORTER)
- Dummy mode enables safe demos without API keys
- Machine-learning based scoring layer
- CRM integrations (HubSpot / Salesforce)
- Conference site scrapers
- Multi-region hub configuration
- Lead status tracking