diff --git a/.gitignore b/.gitignore
index 5a383df4..6cd3acfa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -15,9 +15,9 @@
 .serena
 .windsurf
 .zed-ai
-AGENTS.md
-CLAUDE.md
-GEMINI.md
+AGENTS.local.md
+CLAUDE.local.md
+GEMINI.local.md
 
 # Cache
 __pycache__
diff --git a/.rules.md b/.rules.md
new file mode 100644
index 00000000..baca98fd
--- /dev/null
+++ b/.rules.md
@@ -0,0 +1,116 @@
+# Coding guidelines
+
+This file provides guidance to programming agents when working with code in this repository.
+
+## Project Overview
+
+The Apify SDK for Python (`apify` package on PyPI) is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides Actor lifecycle management, storage access (datasets, key-value stores, request queues), event handling, proxy configuration, and pay-per-event charging. It builds on top of the [Crawlee](https://crawlee.dev/python) web scraping framework and the [Apify API Client](https://docs.apify.com/api/client/python). Supports Python 3.10–3.14. Build system: hatchling.
+
+## Common Commands
+
+```bash
+# Install dependencies (including dev)
+uv sync --all-extras
+
+# Install dev dependencies + pre-commit hooks
+uv run poe install-dev
+
+# Format code (also auto-fixes lint issues via ruff check --fix)
+uv run poe format
+
+# Lint (format check + ruff check)
+uv run poe lint
+
+# Type check
+uv run poe type-check
+
+# Run all checks (lint + type-check + unit tests)
+uv run poe check-code
+
+# Unit tests (no API token needed)
+uv run poe unit-tests
+
+# Run a single test file
+uv run pytest tests/unit/actor/test_actor_lifecycle.py
+
+# Run a single test by name
+uv run pytest tests/unit/actor/test_actor_lifecycle.py -k "test_name"
+
+# Integration tests (needs APIFY_TEST_USER_API_TOKEN)
+uv run poe integration-tests
+
+# E2E tests (needs APIFY_TEST_USER_API_TOKEN, builds/deploys Actors on platform)
+uv run poe e2e-tests
+```
+
+## Code Style
+
+- **Formatter/Linter**: Ruff (line length 120, single quotes for inline, double quotes for docstrings)
+- **Type checker**: ty (targets Python 3.10)
+- **All ruff rules enabled** with specific ignores — see `pyproject.toml` `[tool.ruff.lint]` for the full ignore list
+- Tests are exempt from docstring rules (`D`), assert warnings (`S101`), and private member access (`SLF001`)
+- Unused imports are allowed in `__init__.py` files (re-exports)
+- **Pre-commit hooks**: lint check + type check run automatically on commit
+
+## Architecture
+
+### Core (`src/apify/`)
+
+- **`_actor.py`** — The `_ActorType` class is the central API. `Actor` is a lazy-object-proxy (`lazy-object-proxy.Proxy`) wrapping `_ActorType` — it acts as both a class (e.g. `Actor.is_at_home()`) and an instance-like context manager (`async with Actor:`). On `__aenter__`, the proxy's `__wrapped__` is replaced with the active `_ActorType` instance. It manages the full Actor lifecycle (`init`, `exit`, `fail`), provides access to storages (`open_dataset`, `open_key_value_store`, `open_request_queue`), handles events, proxy configuration, charging, and platform API operations (`start`, `call`, `metamorph`, `reboot`).
+
+- **`_configuration.py`** — `Configuration` extends Crawlee's `Configuration` with Apify-specific settings (API URL, token, Actor run metadata, proxy settings, charging config). Configuration is populated from environment variables (`APIFY_*`).
+
+- **`_charging.py`** — Pay-per-event billing system. `ChargingManager` / `ChargingManagerImplementation` handle charging events against pricing info fetched from the API.
+
+- **`_proxy_configuration.py`** — `ProxyConfiguration` manages Apify proxy setup (residential, datacenter, groups, country targeting).
+
+- **`_models.py`** — Pydantic models for API data structures (Actor runs, webhooks, pricing info, etc.).
+
+### Storage Clients (`src/apify/storage_clients/`)
+
+Four storage client implementations, all implementing Crawlee's abstract storage client interface:
+
+- **`_apify/`** — `ApifyStorageClient`: talks to the Apify API for dataset, key-value store, and request queue operations (separate sub-clients for single vs. shared request queues). Used when running on the Apify platform.
+- **`_file_system/`** — `FileSystemStorageClient` (alias `ApifyFileSystemStorageClient`): extends Crawlee's file system client with Apify-specific key-value store behavior.
+- **`_smart_apify/`** — `SmartApifyStorageClient`: hybrid client that writes to both API and local file system for resilience.
+- **`MemoryStorageClient`** — re-exported from Crawlee for in-memory storage.
+
+### Storages (`src/apify/storages/`)
+
+Re-exports Crawlee's `Dataset`, `KeyValueStore`, and `RequestQueue` classes.
+
+### Events (`src/apify/events/`)
+
+- **`_apify_event_manager.py`** — `ApifyEventManager` extends Crawlee's event system with platform-specific events received via WebSocket connection.
+
+### Request Loaders (`src/apify/request_loaders/`)
+
+- **`_apify_request_list.py`** — `ApifyRequestList` creates request lists from Actor input URLs (supports both direct URLs and "requests from URL" sources).
+
+### Scrapy Integration (`src/apify/scrapy/`)
+
+Optional integration (`apify[scrapy]` extra) providing Scrapy scheduler, middlewares, pipelines, and extensions for running Scrapy spiders as Apify Actors.
+
+### Key Dependencies
+
+- **`crawlee`** — Base framework providing storage abstractions, event system, configuration, service locator pattern
+- **`apify-client`** — HTTP client for the Apify API (`ApifyClientAsync`)
+- **`apify-shared`** — Shared constants and utilities (`ApifyEnvVars`, `ActorEnvVars`, etc.)
+
+## Testing
+
+Three test levels in `tests/`:
+
+- **`unit/`** — Fast tests with no external dependencies. Use mocked API clients (`ApifyClientAsyncPatcher` fixture). Run with `uv run poe unit-tests`.
+- **`integration/`** — Tests making real Apify API calls but not deploying Actors. Requires `APIFY_TEST_USER_API_TOKEN`. Run with `uv run poe integration-tests`.
+- **`e2e/`** — Full end-to-end tests that build and deploy Actors on the platform. Slowest. Requires `APIFY_TEST_USER_API_TOKEN`. Use `make_actor` and `run_actor` fixtures. Run with `uv run poe e2e-tests`.
+
+All test levels use `pytest-asyncio` with `asyncio_mode = "auto"` (no need for `@pytest.mark.asyncio`). Tests run in parallel via `pytest-xdist` (`--numprocesses`). Each test gets isolated state via the autouse `_isolate_test_environment` fixture which resets `Actor`, `service_locator`, and `AliasResolver` state. Conftest files live in each subdirectory (`tests/unit/conftest.py`, etc.) — there is no top-level `tests/conftest.py`.
+
+### Key Test Fixtures
+
+- **`apify_client_async_patcher`** (unit) — `ApifyClientAsyncPatcher` instance for mocking `ApifyClientAsync` methods. Patch by `method`/`submethod`, tracks call history in `.calls`.
+- **`make_httpserver`/`httpserver`** (unit) — session-scoped `HTTPServer` via `pytest-httpserver` for HTTP interception.
+- **`apify_client_async`** (integration/e2e) — real `ApifyClientAsync` using `APIFY_TEST_USER_API_TOKEN`.
+- **`make_actor`** (e2e) — creates a temporary Actor on the platform from a function, `main_py` string, or source files dict; cleans up after the session.
+- **`run_actor`** (e2e) — calls an Actor and waits up to 10 minutes for completion.
diff --git a/AGENTS.md b/AGENTS.md
new file mode 120000
index 00000000..45ace44f
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1 @@
+.rules.md
\ No newline at end of file
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 00000000..45ace44f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+.rules.md
\ No newline at end of file
diff --git a/GEMINI.md b/GEMINI.md
new file mode 120000
index 00000000..45ace44f
--- /dev/null
+++ b/GEMINI.md
@@ -0,0 +1 @@
+.rules.md
\ No newline at end of file