Summary
Currently, PangaeaDataset.search_studies() triggers full dataset loading (metadata + data) for every study returned. This leads to excessive and unnecessary network calls, significantly increasing latency.
This issue proposes separating metadata loading (eager) from data loading (lazy) to improve performance while preserving functionality for summary and metadata-related methods.
Current Workflow (Problem Context)
When a user runs:
ds = PangaeaDataset()
df = ds.search_studies(investigators=["Khider, D"])
The following happens:
search_studies()
  → build query (query_builder)
  → execute search (PanQuery)
  → for each result:
      → create PangaeaStudy
          → instantiate PanDataSet
              → setMetadata() ← network call
              → setData() ← network call
👉 This results in:
- Multiple HTTP requests per study (metadata + data)
- Total runtime dominated by network + SSL overhead
- Data being downloaded even when not needed
Why This Is a Problem
- search_studies() is primarily a discovery step, but it behaves like a full data fetch
- Most downstream methods (get_summary, get_publications, get_geo) only require metadata, not data tables
- Large queries scale poorly due to unnecessary data downloads
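To make the scaling argument concrete, here is a rough request-count sketch. It assumes exactly one HTTP request for metadata and one for data per study, which is a simplification (the real number per study may be higher), and it ignores the search query itself. The function name `request_counts` is hypothetical and is not part of the library.

```python
def request_counts(n_studies: int, n_data_requested: int) -> dict:
    """Compare per-study request counts for eager vs. lazy data loading.

    Assumption: one request for metadata and one for data per study.
    """
    if not 0 <= n_data_requested <= n_studies:
        raise ValueError("n_data_requested must be between 0 and n_studies")
    eager = 2 * n_studies                # metadata + data for every result
    lazy = n_studies + n_data_requested  # metadata for all, data on demand
    return {"eager": eager, "lazy": lazy, "saved": eager - lazy}


# A search returning 200 studies, of which the user opens the data for 5.
print(request_counts(200, 5))  # {'eager': 400, 'lazy': 205, 'saved': 195}
```

Under these assumptions the saving grows linearly with the number of results whose data is never opened.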
Key Observation
From PanDataSet behavior:
- setMetadata() → fetches authors, events, citation, etc. (needed for summaries)
- setData() → fetches tabular dataset (only needed for get_data())
👉 These should not be coupled during initialization
Proposed Workflow (Improved Design)
search_studies()
  → build query
  → execute search
  → for each result:
      → create PangaeaStudy
          → PanDataSet(include_data=False)
              → setMetadata() only
Then:
get_data()
  → if data not loaded:
      → call setData()
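The proposed workflow can be illustrated without touching the network. The sketch below uses `FakePanDataSet`, a hypothetical stand-in that only logs which calls would have gone out; it is not the real PanDataSet, and `Study` is a simplified placeholder for PangaeaStudy.

```python
class FakePanDataSet:
    """Hypothetical stand-in for PanDataSet; logs simulated network calls."""

    calls = []  # shared log of (kind, study_id) tuples

    def __init__(self, id, include_data=True):
        self.id = id
        self.data = None
        self.setMetadata()
        if include_data:
            self.setData()

    def setMetadata(self):
        FakePanDataSet.calls.append(("metadata", self.id))
        self.title = f"Study {self.id}"

    def setData(self):
        FakePanDataSet.calls.append(("data", self.id))
        self.data = [[1, 2], [3, 4]]  # placeholder for the tabular data


class Study:
    """Simplified placeholder for PangaeaStudy with lazy data access."""

    def __init__(self, study_id):
        # Metadata is loaded eagerly; data is deferred.
        self._panobj = FakePanDataSet(id=study_id, include_data=False)

    def get_summary(self):
        return self._panobj.title  # metadata only, no data call

    def get_data(self):
        if self._panobj.data is None:  # data not loaded yet
            self._panobj.setData()
        return self._panobj.data


studies = [Study(i) for i in (101, 102, 103)]
summaries = [s.get_summary() for s in studies]
table = studies[0].get_data()
studies[0].get_data()  # second call reuses the already loaded data

print(FakePanDataSet.calls)
# [('metadata', 101), ('metadata', 102), ('metadata', 103), ('data', 101)]
```

Only the one study whose data was requested triggers a data call, and it does so once.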
Proposed Implementation
In PangaeaStudy.__init__:
self._panobj = PanDataSet(
    id=study_id,
    include_data=False
)
In get_data():
if self._panobj.data.empty:
    self._panobj.setData()
return self._panobj.data
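One possible refinement, offered as a sketch rather than a required change: checking `data.empty` cannot tell "not loaded yet" apart from "loaded, but the table has no rows", so a dataset with an empty table would be refetched on every call. An explicit flag avoids that. The `_data_loaded` attribute and both classes below are hypothetical illustrations, not part of the existing code.

```python
class EmptyTableSource:
    """Hypothetical stand-in for a dataset whose data table has no rows."""

    def __init__(self):
        self.data = []  # stand-in for an empty table
        self.set_data_calls = 0

    def setData(self):
        self.set_data_calls += 1
        self.data = []  # the fetch succeeds but returns no rows


class LazyStudy:
    """Tracks loading with a flag instead of inspecting the table."""

    def __init__(self, panobj):
        self._panobj = panobj
        self._data_loaded = False  # hypothetical flag

    def get_data(self):
        if not self._data_loaded:
            self._panobj.setData()
            self._data_loaded = True
        return self._panobj.data


source = EmptyTableSource()
study = LazyStudy(source)
study.get_data()
study.get_data()
print(source.set_data_calls)  # 1
```

With the flag, the second call returns the cached (empty) result instead of issuing another request.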
Benefits
- 🚀 Faster search_studies() (no data download)
- 📉 Fewer network calls
- ✅ No loss of functionality for metadata-based methods
- 🧠 Clear separation of concerns:
  - search = discovery
  - data access = explicit
Additional Notes
- This approach aligns with typical data API patterns (lazy data access)
- Further improvements could include:
  - fetch_data=True flag for eager users
Conclusion
This change ensures that search_studies() remains lightweight and scalable, while data retrieval happens only when explicitly requested.
This will significantly improve performance and user experience without breaking existing metadata workflows.