Separate Metadata and Data Loading in PangaeaStudy to Avoid Excessive Network Calls #53

@doswal

Summary

Currently, PangaeaDataset.search_studies() triggers full dataset loading (metadata + data) for every study returned. This leads to excessive and unnecessary network calls, significantly increasing latency.

This issue proposes separating metadata loading (eager) from data loading (lazy) to improve performance while preserving functionality for summary and metadata-related methods.


Current Workflow (Problem Context)

When a user runs:

ds = PangaeaDataset()
df = ds.search_studies(investigators=["Khider, D"])

The following happens:

search_studies()
    → build query (query_builder)
    → execute search (PanQuery)
    → for each result:
        → create PangaeaStudy
            → instantiate PanDataSet
                → setMetadata()   ← network call
                → setData()       ← network call

👉 This results in:

  • Multiple HTTP requests per study (metadata + data)
  • Total runtime dominated by network + SSL overhead
  • Data being downloaded even when not needed
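
The request count in the flow above can be modelled with a small counter. This is only a sketch: FakePanDataSet and count_calls are hypothetical stand-ins (not part of pangaeapy), and it assumes setMetadata() and setData() each cost exactly one request, as drawn in the diagram.

```python
class FakePanDataSet:
    """Hypothetical stand-in for PanDataSet that counts simulated network calls."""

    calls = 0  # shared counter across all instances

    def __init__(self, id, include_data=True):
        self.id = id
        self.setMetadata()
        if include_data:
            self.setData()

    def setMetadata(self):
        # Assumption: one request per metadata fetch.
        FakePanDataSet.calls += 1

    def setData(self):
        # Assumption: one request per data-table fetch.
        FakePanDataSet.calls += 1


def count_calls(n_studies, include_data):
    """Return the number of simulated requests for n_studies search results."""
    FakePanDataSet.calls = 0
    for study_id in range(n_studies):
        FakePanDataSet(study_id, include_data=include_data)
    return FakePanDataSet.calls


print(count_calls(50, include_data=True))   # 100
print(count_calls(50, include_data=False))  # 50
```

Under that assumption, skipping setData() during search halves the request count. The real saving depends on how many requests each method actually issues and on the size of the data tables.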

Why This Is a Problem

  • search_studies() is primarily a discovery step, but it behaves like a full data fetch
  • Most downstream methods (get_summary, get_publications, get_geo) only require metadata, not data tables
  • Large queries scale poorly due to unnecessary data downloads

Key Observation

From PanDataSet behavior:

  • setMetadata() → fetches authors, events, citation, etc. (needed for summaries)
  • setData() → fetches tabular dataset (only needed for get_data())

👉 These should not be coupled during initialization


Proposed Workflow (Improved Design)

search_studies()
    → build query
    → execute search
    → for each result:
        → create PangaeaStudy
            → PanDataSet(include_data=False)
                → setMetadata() only

Then:

get_data()
    → if data not loaded:
        → call setData()

Proposed Implementation

In PangaeaStudy.__init__:

self._panobj = PanDataSet(
    id=study_id,
    include_data=False
)

In get_data():

if self._panobj.data.empty:
    self._panobj.setData()
return self._panobj.data
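
One caveat with the data.empty check: a study whose table is legitimately empty would call setData() again on every get_data(). A possible variant tracks loading with an explicit flag instead. The sketch below is not the project's code: StubPanDataSet is a hypothetical offline stand-in for PanDataSet, the _data_loaded attribute and panobj_cls parameter are assumptions, and a plain list replaces the DataFrame to keep the example dependency-free.

```python
class StubPanDataSet:
    """Hypothetical offline stand-in for PanDataSet."""

    def __init__(self, id, include_data=True):
        self.id = id
        self.data = []          # stands in for the pandas DataFrame
        self.data_fetches = 0   # counts simulated data downloads
        if include_data:
            self.setData()

    def setData(self):
        # A real implementation would download the table here. The stub
        # leaves it empty to model a dataset with no tabular content.
        self.data_fetches += 1


class PangaeaStudy:
    def __init__(self, study_id, panobj_cls=StubPanDataSet):
        # Metadata only at construction time; data is deferred.
        self._panobj = panobj_cls(id=study_id, include_data=False)
        self._data_loaded = False

    def get_data(self):
        # Fetch at most once, even if the resulting table is empty.
        if not self._data_loaded:
            self._panobj.setData()
            self._data_loaded = True
        return self._panobj.data


study = PangaeaStudy(1)
study.get_data()
study.get_data()
print(study._panobj.data_fetches)  # 1
```

The flag costs one extra attribute but makes repeated get_data() calls free regardless of what the download returned.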

Benefits

  • 🚀 Faster search_studies() (no data download)

  • 📉 Fewer network calls

  • ✅ No loss of functionality for metadata-based methods

  • 🧠 Clear separation of concerns:

    • search = discovery
    • data access = explicit

Additional Notes

  • This approach aligns with typical data API patterns (lazy data access)

  • Further improvements could include:

    • parallel metadata loading
    • optional fetch_data=True flag for eager users
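
Both follow-up ideas could be combined roughly as follows. This is a sketch under stated assumptions: load_study and load_studies are hypothetical helpers, the dictionary stands in for a PangaeaStudy, and the max_workers default is arbitrary. Threads are used because the work is network-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def load_study(study_id, fetch_data=False):
    """Hypothetical loader; a real one would construct a PangaeaStudy here."""
    return {"id": study_id, "metadata_loaded": True, "data_loaded": fetch_data}


def load_studies(study_ids, fetch_data=False, max_workers=8):
    """Load studies concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so rows stay aligned with the search results.
        return list(pool.map(lambda sid: load_study(sid, fetch_data), study_ids))


studies = load_studies([3, 1, 2])
print([s["id"] for s in studies])  # [3, 1, 2]
```

Keeping fetch_data=False as the default preserves the lightweight search path, while users who want everything up front can opt in. A small max_workers value is probably wise so the server is not flooded with concurrent requests.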

Conclusion

This change ensures that:

search_studies() remains lightweight and scalable, while data retrieval happens only when explicitly requested.

This removes one data download per search result, which should noticeably shorten search_studies() on large queries without breaking existing metadata workflows.

Labels: enhancement
Project status: Backlog