Separate Metadata and Data Loading in `PangaeaStudy` to Avoid Excessive Network Calls #53


### Summary

Currently, `PangaeaDataset.search_studies()` loads the full dataset (metadata and data) for every study it returns. Each result therefore costs extra network calls that most callers never need, which increases latency.

This issue proposes separating **metadata loading (eager)** from **data loading (lazy)** to improve performance while preserving functionality for summary and metadata-related methods.

---

### Current Workflow (Problem Context)

When a user runs:

```python
ds = PangaeaDataset()
df = ds.search_studies(investigators=["Khider, D"])
```

The following happens:

```text
search_studies()
    → build query (query_builder)
    → execute search (PanQuery)
    → for each result:
        → create PangaeaStudy
            → instantiate PanDataSet
                → setMetadata()   ← network call
                → setData()       ← network call
```

👉 This results in:

* Multiple HTTP requests per study (metadata + data)
* Total runtime dominated by network + SSL overhead
* Data being downloaded even when not needed

---

### Why This Is a Problem

* `search_studies()` is primarily a **discovery step**, but it behaves like a **full data fetch**
* Most downstream methods (`get_summary`, `get_publications`, `get_geo`) only require **metadata**, not data tables
* Large queries scale poorly due to unnecessary data downloads

---

### Key Observation

From `PanDataSet` behavior:

* `setMetadata()` → fetches authors, events, citation, etc. (needed for summaries)
* `setData()` → fetches tabular dataset (only needed for `get_data()`)

👉 These two calls should not be coupled during initialization.

---

### Proposed Workflow (Improved Design)

```text
search_studies()
    → build query
    → execute search
    → for each result:
        → create PangaeaStudy
            → PanDataSet(include_data=False)
                → setMetadata() only
```

Then:

```text
get_data()
    → if data not loaded:
        → call setData()
```

---

### Proposed Implementation

In `PangaeaStudy.__init__`:

```python
self._panobj = PanDataSet(
    id=study_id,
    include_data=False
)
```

In `get_data()`:

```python
# Fetch the data table on first access only; later calls reuse it.
if self._panobj.data.empty:
    self._panobj.setData()
return self._panobj.data
```
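The lazy pattern above can be tried without network access. The following is a minimal sketch, not the actual implementation: `StubPanDataSet` is a hypothetical stand-in that records which loaders ran instead of calling PANGAEA, and `LazyStudy` is a hypothetical wrapper in the role of `PangaeaStudy`. It also assumes an explicit `_data_loaded` flag rather than the `.empty` check, so that a study whose table is genuinely empty is not fetched again on every call.

```python
class StubPanDataSet:
    """Hypothetical stand-in for PanDataSet; records calls instead of using the network."""

    def __init__(self, id, include_data=True):
        self.id = id
        self.calls = []   # which loaders have run, in order
        self.data = []    # stands in for the data table
        self.setMetadata()
        if include_data:
            self.setData()

    def setMetadata(self):
        self.calls.append("metadata")
        self.title = f"study {self.id}"

    def setData(self):
        self.calls.append("data")
        self.data = [{"depth": 1.0, "value": 2.5}]


class LazyStudy:
    """Sketch of a PangaeaStudy-like wrapper that defers the data fetch."""

    def __init__(self, study_id, dataset_cls=StubPanDataSet):
        # Metadata is loaded eagerly; the data table is skipped.
        self._panobj = dataset_cls(id=study_id, include_data=False)
        # Explicit flag (an assumption): an empty table is a valid result.
        self._data_loaded = False

    def get_data(self):
        if not self._data_loaded:
            self._panobj.setData()
            self._data_loaded = True
        return self._panobj.data


study = LazyStudy(12345)
calls_after_init = list(study._panobj.calls)
first = study.get_data()
second = study.get_data()
calls_after_get = list(study._panobj.calls)
```

In this sketch, construction triggers only the metadata loader, and repeated `get_data()` calls trigger the data loader once.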

---

### Benefits

* 🚀 Faster `search_studies()` (no data download)
* 📉 Fewer network calls
* ✅ No loss of functionality for metadata-based methods
* 🧠 Clear separation of concerns:

  * search = discovery
  * data access = explicit

---

### Additional Notes

* This approach aligns with typical data API patterns (lazy data access)
* Further improvements could include:

  * parallel metadata loading
  * optional `fetch_data=True` flag for eager users
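As a rough illustration of these two follow-ups, the sketch below loads several studies concurrently with a thread pool and exposes an opt-in `fetch_data` flag. Everything here is an assumption for illustration: `load_study` and `load_studies` are hypothetical helpers, and the metadata and data values are placeholders rather than real PANGAEA responses.

```python
from concurrent.futures import ThreadPoolExecutor


def load_study(study_id, fetch_data=False):
    """Hypothetical loader: metadata always, data only when requested."""
    record = {
        "id": study_id,
        "metadata": f"metadata for {study_id}",  # placeholder for the metadata request
        "data": None,
    }
    if fetch_data:
        record["data"] = f"data for {study_id}"  # placeholder for the data request
    return record


def load_studies(study_ids, fetch_data=False, max_workers=4):
    """Load studies concurrently; results keep the order of study_ids."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda sid: load_study(sid, fetch_data), study_ids))


studies = load_studies([101, 102, 103])
eager = load_studies([101], fetch_data=True)
```

Threads suit this case because the work is network-bound; `max_workers` would need tuning against whatever rate limits the remote service applies.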

---

### Conclusion

This change ensures that:

> `search_studies()` remains lightweight and scalable, while data retrieval happens only when explicitly requested.

This should noticeably reduce search latency and improve the user experience without breaking existing metadata workflows.

