diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index d0bcb14..0ad4e93 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -1,85 +1,83 @@ -Purpose -------- -This file gives concise, actionable guidance for AI coding agents working on the `webinfo` Go module. - -What this project does ----------------------- -Extracts metadata (title, description, canonical, image, etc.) from web pages and provides utilities -to fetch and save representative images and create thumbnails. - -Quick entry points ------------------- -- **Primary package**: `webinfo` — key files: - - `fetch.go` (core `Fetch` function and encoding handling) - - `webinfo.go` (`Webinfo` type, `DownloadImage`, and `DownloadThumbnail`) - - `errs.go` (error sentinel values) - - `fetch_test.go` (behavioral tests and examples) -- **Go module**: `go 1.25` (see `go.mod`). - -Developer workflows -------------------- -- Run full CI/test workflow using the Taskfile (recommended if `task` is installed): - - `task test` — runs `go mod verify`, `go test -shuffle on ./...`, `govulncheck`, and `golangci-lint-v2` as configured in `Taskfile.yml`. -- Quick test: `go test ./...` (useful during fast iteration). -- Prepare module: `go mod tidy -v -go=1.25` (mirrors `prepare` in `Taskfile.yml`). - -Project-specific conventions and patterns ----------------------------------------- -- Error handling: uses `github.com/goark/errs`. Prefer `errs.Wrap(err, errs.WithContext("key", val))` for context-rich errors and `errs.Join` when combining close errors in `defer`. -- HTTP fetching: uses `github.com/goark/fetch`. Typical pattern: - - Parse URL with `fetch.URL(...)`. - - Use `fetch.New(...).GetWithContext(ctx, parsed, fetch.WithRequestHeaderSet("User-Agent", ua))`. -- Default User-Agent: `getUserAgent("")` returns a dummy UA string. Functions accept a `userAgent` param but fall back to this default. 
-- Encoding: `Fetch` peeks the first 1024 bytes and uses `charset.DetermineEncoding` and `encoding.GetEncoding(name)` to decode response bodies before HTML parsing — preserve this approach when touching parsing logic. -- HTML parsing: `goquery` is used to select head elements and meta tags. Extraction precedence is explicit in `fetch.go` (title → `twitter:title`/`og:title`, description → `twitter:description`/`og:description`, image → `twitter:image`/`og:image`). Follow this precedence in code changes or tests. - -Image download and thumbnail notes ---------------------------------- -- `DownloadImage` (in `webinfo.go`) downloads `w.ImageURL` and saves it to disk. It determines the output file extension using this order: - 1) extension from the URL path, - 2) extensions inferred from the response `Content-Type` header, - 3) sniffing the first up to 512 bytes via `http.DetectContentType`, - 4) fallback to `.img` if none found. - When sniffing, the read bytes are prepended back into the response body with `io.MultiReader` so the full image is written. -- `DownloadThumbnail` (added to `webinfo.go`) downloads the original image (via `DownloadImage`), resizes it to a requested width (preserving aspect ratio) and writes a thumbnail. Implementation notes: - - The code currently uses a local nearest-neighbor scaler (no external `x/image/draw` dependency) to avoid adding module requirements. - - The method accepts `width` (default 150 when <= 0), `destDir`, and `temporary` flags. When `destDir` is empty the method forces creation of a temporary file. - - When `temporary` is false, the thumbnail filename is derived from the original image basename with `-thumb` appended before the extension. - -I/O and cleanup ----------------- -- Response bodies and files are closed; close errors are wrapped/joined with any existing error. 
-- Errors encountered while parsing the URL, fetching, reading, sniffing, creating directories/files, or copying data are wrapped with contextual information (e.g. `"url"`, `"path"`, `"dir"`, `"file"`) using the `errs` package. - -Tests and examples ------------------- -- Tests use `net/http/httptest` for deterministic responses (encoding tests use `golang.org/x/text/encoding/japanese`). Inspect `fetch_test.go` for examples of: - - Redirect handling and validation of `Location`. - - Encoding tests for Shift_JIS and ISO-2022-JP. - - Verifying `User-Agent` header usage. -- Example usage patterns to follow when adding code or tests: - - Fetch: `info, err := Fetch(ctx, "https://example.com", "")` — empty UA uses the default. - - Download image: `outPath, err := w.DownloadImage(ctx, "images", true)` - - Download thumbnail: `thumbPath, err := w.DownloadThumbnail(ctx, "thumbnails", 150, false)` - -External dependencies & integration points ----------------------------------------- -- Key dependencies in `go.mod`: `github.com/goark/fetch`, `github.com/goark/errs`, `github.com/PuerkitoBio/goquery`, `golang.org/x/text` (encodings). -- The repository intentionally avoids adding `golang.org/x/image/draw` as a dependency; if you need higher-quality scaling consider adding it and updating `go.mod` and tests. -- The `Taskfile.yml` runs additional tools: `govulncheck`, `golangci-lint-v2`, and (optionally) `nancy` via `depm` — keep CI tool invocations in sync when adding dependencies. - -When modifying public APIs -------------------------- -- Maintain existing error-wrapping conventions (`errs.Wrap`, `errs.WithContext`). -- Preserve encoding detection behavior and the 1024-byte peek in `Fetch` unless a clear, tested performance reason exists. -- Preserve `DownloadImage`'s extension-detection order and the behavior of `temporary` vs permanent files. When adding `DownloadThumbnail` behavior or changing file-naming semantics, update tests accordingly. 
- -Where to look next (high-value files) -------------------------------------- -- `fetch.go` — how pages are fetched, decoded and parsed. -- `webinfo.go` — `Webinfo` type, `DownloadImage`, and `DownloadThumbnail` implementations. -- `fetch_test.go` — canonical tests and examples you should mirror for new behaviors. -- `errs.go` and `go.mod` — error constants and dependency hints. -- `Taskfile.yml` — canonical developer/test/lint workflow. - -If anything above is unclear or you want small patches, test templates, or a CI-safe refactor suggestion, tell me which area to expand and I will iterate. +# Copilot Instructions for `goark/webinfo` + +## Project purpose + +`webinfo` extracts metadata from web pages and provides helpers for +image download and thumbnail generation. + +## Design principles + +- Keep public APIs small and explicit. +- Preserve metadata extraction precedence and deterministic behavior. +- Keep context-based fetch operations as the default path. +- Preserve compatibility of exported symbols when possible. + +## Error handling + +- Use `github.com/goark/errs` for internal error handling. +- Prefer `errs.Wrap`, `errs.Join`, and `errs.WithContext`. +- Keep `errors.Is` compatibility for callers. +- Keep sentinel errors stable (`ErrInvalidURL`, `ErrNoImageURL`, `ErrNullPointer`). +- Include useful context keys such as `url`, `path`, and `dir`. + +## Fetch and parsing behavior + +- Use `github.com/goark/fetch` for HTTP operations. +- Keep the default User-Agent fallback behavior. +- Preserve encoding detection flow in `Fetch` (1024-byte peek + charset detection). 
+- Keep extraction precedence unchanged: + - title: `title` -> `twitter:title` -> `og:title` + - description: `meta[name=description]` -> `twitter:description` -> `og:description` + - image: `twitter:image` -> `og:image` + +## Image and thumbnail behavior + +- Keep `DownloadImage` extension detection order: + 1) URL path extension + 2) `Content-Type` based extension + 3) content sniffing (`http.DetectContentType`) + 4) fallback `.img` +- Keep the sniffed bytes prepended back to the body reader. +- Keep temporary/permanent file behavior stable. +- Keep thumbnail default width behavior (`width <= 0` -> `150`). + +## Coding style + +- Write idiomatic Go with straightforward control flow. +- Avoid unnecessary dependencies. +- Keep comments concise and in English. + +## Testing and validation + +- Add or update tests for behavior changes. +- Prefer local validation with Taskfile targets: + - `task test` + - `task govulncheck` + +## Documentation + +- Keep `README.md` aligned with public API behavior. +- Keep examples concise and runnable. + +## Release process + +- Create release tags from `main`. +- Use semantic versioning tags in `vMAJOR.MINOR.PATCH` format. +- Ensure repository is clean and synced before tagging. + +Release steps: + +1. Ensure `main` is up to date. +2. Create annotated tag: + - `git tag -a vX.Y.Z -m "Release vX.Y.Z"` +3. Push tag: + - `git push origin vX.Y.Z` +4. 
Create GitHub release with autogenerated notes: + - `gh release create vX.Y.Z --generate-notes` + +Verification steps: + +- Check tag exists: + - `git tag -l "vX.Y.Z"` +- Check release exists: + - `gh release view vX.Y.Z` diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..43e5071 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,44 @@ +name: ci + +on: + push: + branches: + - main + pull_request: + +permissions: + contents: read + +jobs: + test-and-lint: + name: lint and test + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - uses: actions/setup-go@v6 + with: + go-version-file: go.mod + cache-dependency-path: go.sum + + - name: golangci-lint + uses: golangci/golangci-lint-action@v9 + with: + version: latest + args: --enable gosec + + - name: Test module + run: go test -shuffle on ./... + + govulncheck: + name: govulncheck + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - name: Run govulncheck + uses: golang/govulncheck-action@v1 + with: + go-version-file: go.mod + go-package: ./... 
+ repo-checkout: false diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml new file mode 100644 index 0000000..9db6ba2 --- /dev/null +++ b/.github/workflows/codeql.yml @@ -0,0 +1,35 @@ +name: CodeQL + +on: + push: + branches: + - main + pull_request: + branches: + - main + schedule: + - cron: "0 20 * * 0" + +permissions: + actions: read + contents: read + security-events: write + +jobs: + analyze: + name: Analyze + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v6 + + - name: Initialize CodeQL + uses: github/codeql-action/init@v3 + with: + languages: go + + - name: Autobuild + uses: github/codeql-action/autobuild@v3 + + - name: Perform CodeQL analysis + uses: github/codeql-action/analyze@v3 diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml deleted file mode 100644 index b0c9ba2..0000000 --- a/.github/workflows/lint.yml +++ /dev/null @@ -1,50 +0,0 @@ -name: lint -on: - push: - branches: - - main - pull_request: - -permissions: - contents: read - # Optional: allow read access to pull request. Use with `only-new-issues` option. - # pull-requests: read -jobs: - golangci: - name: lint - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v5 - - uses: actions/setup-go@v6 - with: - go-version-file: 'go.mod' - - name: golangci-lint - uses: golangci/golangci-lint-action@v9 - with: - # Optional: version of golangci-lint to use in form of v1.2 or v1.2.3 or `latest` to use the latest version - version: latest - - # Optional: working directory, useful for monorepos - # working-directory: somedir - - # Optional: golangci-lint command line arguments. - args: --enable gosec - - # Optional: show only new issues if it's a pull request. The default value is `false`. - # only-new-issues: true - - # Optional: if set to true then the all caching functionality will be complete disabled, - # takes precedence over all other caching options. 
- # skip-cache: true - - # Optional: if set to true then the action don't cache or restore ~/go/pkg. - # skip-pkg-cache: true - - # Optional: if set to true then the action don't cache or restore ~/.cache/go-build. - # skip-build-cache: true - - name: testing - run: go test -shuffle on ./... - - name: install govulncheck - run: go install golang.org/x/vuln/cmd/govulncheck@latest - - name: running govulncheck - run: govulncheck ./... diff --git a/README.md b/README.md index 7186c97..739741a 100644 --- a/README.md +++ b/README.md @@ -1,179 +1,109 @@ -# [webinfo] -- Extract metadata and structured information from web pages +# [webinfo] -- Extract metadata from web pages -[](https://github.com/goark/webinfo/actions) -[](https://raw.githubusercontent.com/goark/webinfo/master/LICENSE) -[](https://github.com/goark/webinfo/releases/latest) +[](https://github.com/goark/webinfo/actions) +[](https://github.com/goark/webinfo/actions) +[](https://raw.githubusercontent.com/goark/webinfo/main/LICENSE) +[](https://github.com/goark/webinfo/releases/latest) [](https://pkg.go.dev/github.com/goark/webinfo) -[`webinfo`][webinfo] is a small Go module that extracts common metadata from web pages and provides utilities -to download representative images and create thumbnails. +`webinfo` extracts common metadata (title, description, canonical, image, etc.) +from web pages and provides helpers to download images and generate thumbnails. -## Quick overview +## Design goals -- **Package**: `webinfo` -- **Repository**: `github.com/goark/webinfo` -- **Purpose**: fetch page metadata (title, description, canonical, image, etc.) and download images +- Keep metadata extraction simple and deterministic. +- Use clear precedence rules for HTML/meta parsing. +- Provide practical image utilities with minimal API surface. +- Keep context-aware network operations as the default style. -## Features +## Development -- Fetch page metadata with `Fetch` (handles encodings and meta tag precedence). 
-- Download an image referenced by `Webinfo.ImageURL` using `(*Webinfo).DownloadImage`. -- Create a thumbnail from the referenced image using `(*Webinfo).DownloadThumbnail`. +### Requirements -## Install +- Go 1.25.10 or later +- [Task](https://taskfile.dev/) command (local tool for this repository) -Use Go modules (Go 1.25+ as used by the project): +### Local validation -```bash -go get github.com/goark/webinfo@latest +```text +task test +task govulncheck ``` -## Basic usage - -Example showing fetch and download thumbnail (error handling omitted for brevity): +Run all maintenance tasks: -```go -package main - -import ( - "context" - "fmt" - - "github.com/goark/webinfo" -) - -func main() { - ctx := context.Background() - // Fetch metadata for a page (empty UA uses default) - info, err := webinfo.Fetch(ctx, "https://text.baldanders.info/", "") - if err != nil { - fmt.Printf("error detail:\n%+v\n", err) - return - } - - // Download thumbnail: width 150, to directory "thumbnails", permanent file - thumbPath, err := info.DownloadThumbnail(ctx, "thumbnails", 150, false) - if err != nil { - fmt.Printf("error detail:\n%+v\n", err) - return - } - fmt.Println("thumbnail saved:", thumbPath) -} +```text +task ``` -### API notes - -- `Fetch(ctx, url, userAgent)` — Parse and extract metadata. Pass an empty userAgent to use the module default. -- `(*Webinfo).DownloadImage(ctx, destDir, temporary)` — Download the image in `Webinfo.ImageURL` and save it. If - `temporary` is true (or `destDir` is empty), a temporary file is created. -- `(*Webinfo).DownloadThumbnail(ctx, destDir, width, temporary)` — Download the referenced image and produce a - thumbnail resized to `width` pixels (height is preserved by aspect ratio). If `destDir` is empty the method - creates a temporary file; when `temporary` is false the thumbnail file is named based on the original image - name with `-thumb` appended before the extension. 
+## CI Workflows -Note on defaults and test hooks: +- `ci`: lint (`golangci-lint` with `gosec`), tests, and `govulncheck` +- `CodeQL`: scheduled and push/PR static analysis -- **Default width**: If `width <= 0` is passed to `DownloadThumbnail`, the method uses a default width of 150 pixels. -- **Extension detection**: `DownloadImage` determines an output extension from the URL path, the response - `Content-Type` (via `mime.ExtensionsByType`), or by sniffing up to the first 512 bytes with `http.DetectContentType`. -- **Test hooks / injection points**: For easier testing the package exposes a few package-level variables that - tests can override: - - `createFile`: used to create temporary or permanent files (wraps `os.CreateTemp` / `os.Create`). Override to - simulate file-creation failures. - - `decodeImage`: wrapper around `image.Decode` used by `DownloadThumbnail` — override to simulate decode results - (for example, to return a zero-dimension image). - - `outputImage`: encoder that writes the thumbnail image to disk (wraps `jpeg.Encode`, `png.Encode`, etc.). - Override to simulate encoder failures. +## Usage -These hooks are intended for tests and let callers reproduce rare I/O or encoding failures without changing -production behavior. +### Install and import -- **HTTP client timeout**: `DownloadImage` uses an HTTP client with a default 30-second `Timeout` for the whole - request; tests can override this by replacing the `newHTTPClient` package variable. - -## Test examples - -Below are short examples showing how to override the package-level hooks from a test to simulate failures. -These snippets are intended for `*_test.go` files and assume the usual `testing` and `net/http/httptest` helpers. 
- -1) Simulate thumbnail temporary-file creation failure (override `createFile`): +```bash +go get github.com/goark/webinfo@latest +``` ```go -// in your test function -orig := createFile -defer func() { createFile = orig }() -createFile = func(temp bool, dir, pattern string) (*os.File, error) { - // fail only for thumbnail temp pattern - if temp && strings.Contains(pattern, "webinfo-thumb-") { - return nil, errors.New("simulated thumbnail temp create failure") - } - return orig(temp, dir, pattern) -} - -// then call the method under test -_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true) -// assert err != nil +import "github.com/goark/webinfo" ``` -2) Simulate a zero-dimension decoded image (override `decodeImage`): +### Fetch metadata ```go -origDecode := decodeImage -defer func() { decodeImage = origDecode }() -decodeImage = func(r io.Reader) (image.Image, string, error) { - // return an image with zero width to hit the origW==0 error path - return image.NewRGBA(image.Rect(0, 0, 0, 10)), "png", nil +ctx := context.Background() +info, err := webinfo.Fetch(ctx, "https://example.com", "") +if err != nil { + return err } - -_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true) -// assert err != nil +fmt.Println(info.Title, info.Description) ``` -3) Simulate encoder failure when writing thumbnails (override `outputImage`): +### Download image and thumbnail ```go -origOut := outputImage -defer func() { outputImage = origOut }() -outputImage = func(dst *os.File, src *image.RGBA, format string) error { - return errors.New("simulated encode failure") +imgPath, err := info.DownloadImage(ctx, "images", true) +if err != nil { + return err } -_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true) -// assert err != nil +thumbPath, err := info.DownloadThumbnail(ctx, "thumbnails", 150, false) +if err != nil { + return err +} ``` -Notes: -- Ensure your test imports include `errors`, `io`, `image`, and `strings` as needed. 
-- Restore the original variables with `defer` to avoid cross-test interference. -- These examples are intentionally minimal — adapt them to your test fixtures (httptest servers, temp dirs, etc.). +### Public API -4) Simulate HTTP client timeout by overriding `newHTTPClient`: - -```go -origClient := newHTTPClient -defer func() { newHTTPClient = origClient }() -newHTTPClient = func() *http.Client { - // short timeout for test - return &http.Client{Timeout: 50 * time.Millisecond} -} - -// then call DownloadImage which uses newHTTPClient() -_, err := info.DownloadImage(ctx, t.TempDir(), true) -// assert err != nil (expect timeout) -``` +- `Fetch(ctx, rawURL, userAgent)` extracts metadata from a page. +- `(*Webinfo).DownloadImage(ctx, destDir, temporary)` downloads `Webinfo.ImageURL`. +- `(*Webinfo).DownloadThumbnail(ctx, destDir, width, temporary)` creates a resized thumbnail. -### Error handling +## Behavior notes -The package uses `github.com/goark/errs` for wrapping errors with contextual keys (e.g. `url`, `path`, `dir`). -Callers should inspect returned errors accordingly. +- `Fetch` uses explicit precedence for metadata extraction: + - title: `title` -> `twitter:title` -> `og:title` + - description: `meta[name=description]` -> `twitter:description` -> `og:description` + - image: `twitter:image` -> `og:image` +- `DownloadImage` resolves extension in this order: + 1. URL path extension + 2. response `Content-Type` + 3. sniff first 512 bytes (`http.DetectContentType`) + 4. fallback `.img` +- `DownloadThumbnail` uses width `150` when `width <= 0`. -### Tests & development +## Error handling -- Run all tests: `go test ./...` -- The repository includes `Taskfile.yml` tasks for common workflows; see that file for CI/test commands. +This package wraps errors with `github.com/goark/errs` and attaches context +values such as `url`, `path`, and `dir`. 
## Modules Requirement Graph [](./dependency.png) -[webinfo]: https://github.com/goark/webinfo "goark/webinfo: Extract metadata and structured information from web pages" +[webinfo]: https://github.com/goark/webinfo "goark/webinfo" diff --git a/Taskfile.yml b/Taskfile.yml index 4731e6a..1ac0e09 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -5,36 +5,29 @@ tasks: cmds: - task: prepare - task: test - # - task: nancy + - task: govulncheck - task: graph - build-all: - desc: Build executable binary with GoReleaser. - cmds: - - goreleaser --snapshot --skip=publish --clean - test: desc: Test and lint. cmds: - go mod verify - - go test -shuffle on ./... -coverprofile=coverage.out -cover - - go tool cover -func=coverage.out - - govulncheck ./... - - golangci-lint-v2 run --enable gosec --timeout 10m0s ./... + - go test -shuffle on ./... + - golangci-lint-v2 run --enable gosec --timeout 3m0s ./... sources: - ./go.mod - '**/*.go' - nancy: - desc: Check vulnerability of external packages with Nancy. + govulncheck: + desc: Check reachable vulnerabilities with latest govulncheck. cmds: - - depm list -j | nancy sleuth -n + - go run golang.org/x/vuln/cmd/govulncheck@latest ./... sources: - ./go.mod - '**/*.go' prepare: - - go mod tidy -v -go=1.25 + - go mod tidy -v -go=1.25.10 clean: desc: Initialize module and build cache, and remake go.sum file. @@ -52,3 +45,4 @@ tasks: - '**/*.go' generates: - ./dependency.png + diff --git a/errs.go b/errs.go index 1b5cb02..92867d5 100644 --- a/errs.go +++ b/errs.go @@ -8,7 +8,7 @@ var ( ErrInvalidURL = errors.New("invalid URL") ) -/* Copyright 2025 Spiegel +/* Copyright 2025-2026 Spiegel * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
diff --git a/fetch.go b/fetch.go
index 8c59d45..0ec7812 100644
--- a/fetch.go
+++ b/fetch.go
@@ -15,39 +15,18 @@ import (
 	"golang.org/x/net/html/charset"
 )
 
-// Fetch retrieves metadata from the web page at urlStr and returns it as a *Webinfo.
+// Fetch retrieves metadata from a web page and returns it as Webinfo.
 //
-// Behavior:
-// - Parses urlStr and performs an HTTP GET using the provided context (ctx).
-// - If userAgent is empty, a default dummy User-Agent string is used.
-// - Uses an HTTP client and sets the User-Agent request header.
-// - Reads up to the first 1024 bytes of the response to detect the page character
-// encoding via charset.DetermineEncoding (also considers the response Content-Type).
-// If an encoding is detected or inferred by name, the response body is decoded
-// accordingly before HTML parsing.
+// It fetches the page with the given context and User-Agent (or a default one when
+// empty), peeks up to 1024 bytes to determine encoding, then parses the head
+// section with goquery.
 //
-// Parsing and extracted fields:
-// - Parses the document head with goquery and extracts:
-// - Title: from