diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index d0bcb14..0ad4e93 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -1,85 +1,83 @@ -Purpose -------- -This file gives concise, actionable guidance for AI coding agents working on the `webinfo` Go module. - -What this project does ----------------------- -Extracts metadata (title, description, canonical, image, etc.) from web pages and provides utilities -to fetch and save representative images and create thumbnails. - -Quick entry points ------------------- -- **Primary package**: `webinfo` — key files: - - `fetch.go` (core `Fetch` function and encoding handling) - - `webinfo.go` (`Webinfo` type, `DownloadImage`, and `DownloadThumbnail`) - - `errs.go` (error sentinel values) - - `fetch_test.go` (behavioral tests and examples) -- **Go module**: `go 1.25` (see `go.mod`). - -Developer workflows -------------------- -- Run full CI/test workflow using the Taskfile (recommended if `task` is installed): - - `task test` — runs `go mod verify`, `go test -shuffle on ./...`, `govulncheck`, and `golangci-lint-v2` as configured in `Taskfile.yml`. -- Quick test: `go test ./...` (useful during fast iteration). -- Prepare module: `go mod tidy -v -go=1.25` (mirrors `prepare` in `Taskfile.yml`). - -Project-specific conventions and patterns ----------------------------------------- -- Error handling: uses `github.com/goark/errs`. Prefer `errs.Wrap(err, errs.WithContext("key", val))` for context-rich errors and `errs.Join` when combining close errors in `defer`. -- HTTP fetching: uses `github.com/goark/fetch`. Typical pattern: - - Parse URL with `fetch.URL(...)`. - - Use `fetch.New(...).GetWithContext(ctx, parsed, fetch.WithRequestHeaderSet("User-Agent", ua))`. -- Default User-Agent: `getUserAgent("")` returns a dummy UA string. Functions accept a `userAgent` param but fall back to this default. 
-- Encoding: `Fetch` peeks the first 1024 bytes and uses `charset.DetermineEncoding` and `encoding.GetEncoding(name)` to decode response bodies before HTML parsing — preserve this approach when touching parsing logic. -- HTML parsing: `goquery` is used to select head elements and meta tags. Extraction precedence is explicit in `fetch.go` (title → `twitter:title`/`og:title`, description → `twitter:description`/`og:description`, image → `twitter:image`/`og:image`). Follow this precedence in code changes or tests. - -Image download and thumbnail notes ---------------------------------- -- `DownloadImage` (in `webinfo.go`) downloads `w.ImageURL` and saves it to disk. It determines the output file extension using this order: - 1) extension from the URL path, - 2) extensions inferred from the response `Content-Type` header, - 3) sniffing the first up to 512 bytes via `http.DetectContentType`, - 4) fallback to `.img` if none found. - When sniffing, the read bytes are prepended back into the response body with `io.MultiReader` so the full image is written. -- `DownloadThumbnail` (added to `webinfo.go`) downloads the original image (via `DownloadImage`), resizes it to a requested width (preserving aspect ratio) and writes a thumbnail. Implementation notes: - - The code currently uses a local nearest-neighbor scaler (no external `x/image/draw` dependency) to avoid adding module requirements. - - The method accepts `width` (default 150 when <= 0), `destDir`, and `temporary` flags. When `destDir` is empty the method forces creation of a temporary file. - - When `temporary` is false, the thumbnail filename is derived from the original image basename with `-thumb` appended before the extension. - -I/O and cleanup ----------------- -- Response bodies and files are closed; close errors are wrapped/joined with any existing error. 
-- Errors encountered while parsing the URL, fetching, reading, sniffing, creating directories/files, or copying data are wrapped with contextual information (e.g. `"url"`, `"path"`, `"dir"`, `"file"`) using the `errs` package. - -Tests and examples ------------------- -- Tests use `net/http/httptest` for deterministic responses (encoding tests use `golang.org/x/text/encoding/japanese`). Inspect `fetch_test.go` for examples of: - - Redirect handling and validation of `Location`. - - Encoding tests for Shift_JIS and ISO-2022-JP. - - Verifying `User-Agent` header usage. -- Example usage patterns to follow when adding code or tests: - - Fetch: `info, err := Fetch(ctx, "https://example.com", "")` — empty UA uses the default. - - Download image: `outPath, err := w.DownloadImage(ctx, "images", true)` - - Download thumbnail: `thumbPath, err := w.DownloadThumbnail(ctx, "thumbnails", 150, false)` - -External dependencies & integration points ----------------------------------------- -- Key dependencies in `go.mod`: `github.com/goark/fetch`, `github.com/goark/errs`, `github.com/PuerkitoBio/goquery`, `golang.org/x/text` (encodings). -- The repository intentionally avoids adding `golang.org/x/image/draw` as a dependency; if you need higher-quality scaling consider adding it and updating `go.mod` and tests. -- The `Taskfile.yml` runs additional tools: `govulncheck`, `golangci-lint-v2`, and (optionally) `nancy` via `depm` — keep CI tool invocations in sync when adding dependencies. - -When modifying public APIs -------------------------- -- Maintain existing error-wrapping conventions (`errs.Wrap`, `errs.WithContext`). -- Preserve encoding detection behavior and the 1024-byte peek in `Fetch` unless a clear, tested performance reason exists. -- Preserve `DownloadImage`'s extension-detection order and the behavior of `temporary` vs permanent files. When adding `DownloadThumbnail` behavior or changing file-naming semantics, update tests accordingly. 
- -Where to look next (high-value files) -------------------------------------- -- `fetch.go` — how pages are fetched, decoded and parsed. -- `webinfo.go` — `Webinfo` type, `DownloadImage`, and `DownloadThumbnail` implementations. -- `fetch_test.go` — canonical tests and examples you should mirror for new behaviors. -- `errs.go` and `go.mod` — error constants and dependency hints. -- `Taskfile.yml` — canonical developer/test/lint workflow. - -If anything above is unclear or you want small patches, test templates, or a CI-safe refactor suggestion, tell me which area to expand and I will iterate. +# Copilot Instructions for `goark/webinfo` + +## Project purpose + +`webinfo` extracts metadata from web pages and provides helpers for +image download and thumbnail generation. + +## Design principles + +- Keep public APIs small and explicit. +- Preserve metadata extraction precedence and deterministic behavior. +- Keep context-based fetch operations as the default path. +- Preserve compatibility of exported symbols when possible. + +## Error handling + +- Use `github.com/goark/errs` for internal error handling. +- Prefer `errs.Wrap`, `errs.Join`, and `errs.WithContext`. +- Keep `errors.Is` compatibility for callers. +- Keep sentinel errors stable (`ErrInvalidURL`, `ErrNoImageURL`, `ErrNullPointer`). +- Include useful context keys such as `url`, `path`, and `dir`. + +## Fetch and parsing behavior + +- Use `github.com/goark/fetch` for HTTP operations. +- Keep the default User-Agent fallback behavior. +- Preserve encoding detection flow in `Fetch` (1024-byte peek + charset detection). 
+- Keep extraction precedence unchanged: + - title: `title` -> `twitter:title` -> `og:title` + - description: `meta[name=description]` -> `twitter:description` -> `og:description` + - image: `twitter:image` -> `og:image` + +## Image and thumbnail behavior + +- Keep `DownloadImage` extension detection order: + 1) URL path extension + 2) `Content-Type` based extension + 3) content sniffing (`http.DetectContentType`) + 4) fallback `.img` +- Keep the sniffed bytes prepended back to the body reader. +- Keep temporary/permanent file behavior stable. +- Keep thumbnail default width behavior (`width <= 0` -> `150`). + +## Coding style + +- Write idiomatic Go with straightforward control flow. +- Avoid unnecessary dependencies. +- Keep comments concise and in English. + +## Testing and validation + +- Add or update tests for behavior changes. +- Prefer local validation with Taskfile targets: + - `task test` + - `task govulncheck` + +## Documentation + +- Keep `README.md` aligned with public API behavior. +- Keep examples concise and runnable. + +## Release process + +- Create release tags from `main`. +- Use semantic versioning tags in `vMAJOR.MINOR.PATCH` format. +- Ensure repository is clean and synced before tagging. + +Release steps: + +1. Ensure `main` is up to date. +2. Create annotated tag: + - `git tag -a vX.Y.Z -m "Release vX.Y.Z"` +3. Push tag: + - `git push origin vX.Y.Z` +4. 
Create GitHub release with autogenerated notes: + - `gh release create vX.Y.Z --generate-notes` + +Verification steps: + +- Check tag exists: + - `git tag -l "vX.Y.Z"` +- Check release exists: + - `gh release view vX.Y.Z` diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..43e5071 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,44 @@ +name: ci + +on: + push: + branches: + - main + pull_request: + +permissions: + contents: read + +jobs: + test-and-lint: + name: lint and test + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - uses: actions/setup-go@v6 + with: + go-version-file: go.mod + cache-dependency-path: go.sum + + - name: golangci-lint + uses: golangci/golangci-lint-action@v9 + with: + version: latest + args: --enable gosec + + - name: Test module + run: go test -shuffle on ./... + + govulncheck: + name: govulncheck + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - name: Run govulncheck + uses: golang/govulncheck-action@v1 + with: + go-version-file: go.mod + go-package: ./... 
+ repo-checkout: false diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml new file mode 100644 index 0000000..9db6ba2 --- /dev/null +++ b/.github/workflows/codeql.yml @@ -0,0 +1,35 @@ +name: CodeQL + +on: + push: + branches: + - main + pull_request: + branches: + - main + schedule: + - cron: "0 20 * * 0" + +permissions: + actions: read + contents: read + security-events: write + +jobs: + analyze: + name: Analyze + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v6 + + - name: Initialize CodeQL + uses: github/codeql-action/init@v3 + with: + languages: go + + - name: Autobuild + uses: github/codeql-action/autobuild@v3 + + - name: Perform CodeQL analysis + uses: github/codeql-action/analyze@v3 diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml deleted file mode 100644 index b0c9ba2..0000000 --- a/.github/workflows/lint.yml +++ /dev/null @@ -1,50 +0,0 @@ -name: lint -on: - push: - branches: - - main - pull_request: - -permissions: - contents: read - # Optional: allow read access to pull request. Use with `only-new-issues` option. - # pull-requests: read -jobs: - golangci: - name: lint - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v5 - - uses: actions/setup-go@v6 - with: - go-version-file: 'go.mod' - - name: golangci-lint - uses: golangci/golangci-lint-action@v9 - with: - # Optional: version of golangci-lint to use in form of v1.2 or v1.2.3 or `latest` to use the latest version - version: latest - - # Optional: working directory, useful for monorepos - # working-directory: somedir - - # Optional: golangci-lint command line arguments. - args: --enable gosec - - # Optional: show only new issues if it's a pull request. The default value is `false`. - # only-new-issues: true - - # Optional: if set to true then the all caching functionality will be complete disabled, - # takes precedence over all other caching options. 
- # skip-cache: true - - # Optional: if set to true then the action don't cache or restore ~/go/pkg. - # skip-pkg-cache: true - - # Optional: if set to true then the action don't cache or restore ~/.cache/go-build. - # skip-build-cache: true - - name: testing - run: go test -shuffle on ./... - - name: install govulncheck - run: go install golang.org/x/vuln/cmd/govulncheck@latest - - name: running govulncheck - run: govulncheck ./... diff --git a/README.md b/README.md index 7186c97..739741a 100644 --- a/README.md +++ b/README.md @@ -1,179 +1,109 @@ -# [webinfo] -- Extract metadata and structured information from web pages +# [webinfo] -- Extract metadata from web pages -[![lint status](https://github.com/goark/webinfo/workflows/lint/badge.svg)](https://github.com/goark/webinfo/actions) -[![GitHub license](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://raw.githubusercontent.com/goark/webinfo/master/LICENSE) -[![GitHub release](http://img.shields.io/github/release/goark/webinfo.svg)](https://github.com/goark/webinfo/releases/latest) +[![ci status](https://github.com/goark/webinfo/workflows/ci/badge.svg)](https://github.com/goark/webinfo/actions) +[![codeql status](https://github.com/goark/webinfo/workflows/CodeQL/badge.svg)](https://github.com/goark/webinfo/actions) +[![GitHub license](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://raw.githubusercontent.com/goark/webinfo/main/LICENSE) +[![GitHub release](https://img.shields.io/github/release/goark/webinfo.svg)](https://github.com/goark/webinfo/releases/latest) [![Go reference](https://pkg.go.dev/badge/github.com/goark/webinfo.svg)](https://pkg.go.dev/github.com/goark/webinfo) -[`webinfo`][webinfo] is a small Go module that extracts common metadata from web pages and provides utilities -to download representative images and create thumbnails. +`webinfo` extracts common metadata (title, description, canonical, image, etc.) 
+from web pages and provides helpers to download images and generate thumbnails. -## Quick overview +## Design goals -- **Package**: `webinfo` -- **Repository**: `github.com/goark/webinfo` -- **Purpose**: fetch page metadata (title, description, canonical, image, etc.) and download images +- Keep metadata extraction simple and deterministic. +- Use clear precedence rules for HTML/meta parsing. +- Provide practical image utilities with minimal API surface. +- Keep context-aware network operations as the default style. -## Features +## Development -- Fetch page metadata with `Fetch` (handles encodings and meta tag precedence). -- Download an image referenced by `Webinfo.ImageURL` using `(*Webinfo).DownloadImage`. -- Create a thumbnail from the referenced image using `(*Webinfo).DownloadThumbnail`. +### Requirements -## Install +- Go 1.25.10 or later +- [Task](https://taskfile.dev/) command (local tool for this repository) -Use Go modules (Go 1.25+ as used by the project): +### Local validation -```bash -go get github.com/goark/webinfo@latest +```text +task test +task govulncheck ``` -## Basic usage - -Example showing fetch and download thumbnail (error handling omitted for brevity): +Run all maintenance tasks: -```go -package main - -import ( - "context" - "fmt" - - "github.com/goark/webinfo" -) - -func main() { - ctx := context.Background() - // Fetch metadata for a page (empty UA uses default) - info, err := webinfo.Fetch(ctx, "https://text.baldanders.info/", "") - if err != nil { - fmt.Printf("error detail:\n%+v\n", err) - return - } - - // Download thumbnail: width 150, to directory "thumbnails", permanent file - thumbPath, err := info.DownloadThumbnail(ctx, "thumbnails", 150, false) - if err != nil { - fmt.Printf("error detail:\n%+v\n", err) - return - } - fmt.Println("thumbnail saved:", thumbPath) -} +```text +task ``` -### API notes - -- `Fetch(ctx, url, userAgent)` — Parse and extract metadata. Pass an empty userAgent to use the module default. 
-- `(*Webinfo).DownloadImage(ctx, destDir, temporary)` — Download the image in `Webinfo.ImageURL` and save it. If - `temporary` is true (or `destDir` is empty), a temporary file is created. -- `(*Webinfo).DownloadThumbnail(ctx, destDir, width, temporary)` — Download the referenced image and produce a - thumbnail resized to `width` pixels (height is preserved by aspect ratio). If `destDir` is empty the method - creates a temporary file; when `temporary` is false the thumbnail file is named based on the original image - name with `-thumb` appended before the extension. +## CI Workflows -Note on defaults and test hooks: +- `ci`: lint (`golangci-lint` with `gosec`), tests, and `govulncheck` +- `CodeQL`: scheduled and push/PR static analysis -- **Default width**: If `width <= 0` is passed to `DownloadThumbnail`, the method uses a default width of 150 pixels. -- **Extension detection**: `DownloadImage` determines an output extension from the URL path, the response - `Content-Type` (via `mime.ExtensionsByType`), or by sniffing up to the first 512 bytes with `http.DetectContentType`. -- **Test hooks / injection points**: For easier testing the package exposes a few package-level variables that - tests can override: - - `createFile`: used to create temporary or permanent files (wraps `os.CreateTemp` / `os.Create`). Override to - simulate file-creation failures. - - `decodeImage`: wrapper around `image.Decode` used by `DownloadThumbnail` — override to simulate decode results - (for example, to return a zero-dimension image). - - `outputImage`: encoder that writes the thumbnail image to disk (wraps `jpeg.Encode`, `png.Encode`, etc.). - Override to simulate encoder failures. +## Usage -These hooks are intended for tests and let callers reproduce rare I/O or encoding failures without changing -production behavior. 
+### Install and import -- **HTTP client timeout**: `DownloadImage` uses an HTTP client with a default 30-second `Timeout` for the whole - request; tests can override this by replacing the `newHTTPClient` package variable. - -## Test examples - -Below are short examples showing how to override the package-level hooks from a test to simulate failures. -These snippets are intended for `*_test.go` files and assume the usual `testing` and `net/http/httptest` helpers. - -1) Simulate thumbnail temporary-file creation failure (override `createFile`): +```bash +go get github.com/goark/webinfo@latest +``` ```go -// in your test function -orig := createFile -defer func() { createFile = orig }() -createFile = func(temp bool, dir, pattern string) (*os.File, error) { - // fail only for thumbnail temp pattern - if temp && strings.Contains(pattern, "webinfo-thumb-") { - return nil, errors.New("simulated thumbnail temp create failure") - } - return orig(temp, dir, pattern) -} - -// then call the method under test -_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true) -// assert err != nil +import "github.com/goark/webinfo" ``` -2) Simulate a zero-dimension decoded image (override `decodeImage`): +### Fetch metadata ```go -origDecode := decodeImage -defer func() { decodeImage = origDecode }() -decodeImage = func(r io.Reader) (image.Image, string, error) { - // return an image with zero width to hit the origW==0 error path - return image.NewRGBA(image.Rect(0, 0, 0, 10)), "png", nil +ctx := context.Background() +info, err := webinfo.Fetch(ctx, "https://example.com", "") +if err != nil { + return err } - -_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true) -// assert err != nil +fmt.Println(info.Title, info.Description) ``` -3) Simulate encoder failure when writing thumbnails (override `outputImage`): +### Download image and thumbnail ```go -origOut := outputImage -defer func() { outputImage = origOut }() -outputImage = func(dst *os.File, src *image.RGBA, format 
string) error { - return errors.New("simulated encode failure") +imgPath, err := info.DownloadImage(ctx, "images", true) +if err != nil { + return err } -_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true) -// assert err != nil +thumbPath, err := info.DownloadThumbnail(ctx, "thumbnails", 150, false) +if err != nil { + return err +} ``` -Notes: -- Ensure your test imports include `errors`, `io`, `image`, and `strings` as needed. -- Restore the original variables with `defer` to avoid cross-test interference. -- These examples are intentionally minimal — adapt them to your test fixtures (httptest servers, temp dirs, etc.). +### Public API -4) Simulate HTTP client timeout by overriding `newHTTPClient`: - -```go -origClient := newHTTPClient -defer func() { newHTTPClient = origClient }() -newHTTPClient = func() *http.Client { - // short timeout for test - return &http.Client{Timeout: 50 * time.Millisecond} -} - -// then call DownloadImage which uses newHTTPClient() -_, err := info.DownloadImage(ctx, t.TempDir(), true) -// assert err != nil (expect timeout) -``` +- `Fetch(ctx, rawURL, userAgent)` extracts metadata from a page. +- `(*Webinfo).DownloadImage(ctx, destDir, temporary)` downloads `Webinfo.ImageURL`. +- `(*Webinfo).DownloadThumbnail(ctx, destDir, width, temporary)` creates a resized thumbnail. -### Error handling +## Behavior notes -The package uses `github.com/goark/errs` for wrapping errors with contextual keys (e.g. `url`, `path`, `dir`). -Callers should inspect returned errors accordingly. +- `Fetch` uses explicit precedence for metadata extraction: + - title: `title` -> `twitter:title` -> `og:title` + - description: `meta[name=description]` -> `twitter:description` -> `og:description` + - image: `twitter:image` -> `og:image` +- `DownloadImage` resolves extension in this order: + 1. URL path extension + 2. response `Content-Type` + 3. sniff first 512 bytes (`http.DetectContentType`) + 4. 
fallback `.img` +- `DownloadThumbnail` uses width `150` when `width <= 0`. -### Tests & development +## Error handling -- Run all tests: `go test ./...` -- The repository includes `Taskfile.yml` tasks for common workflows; see that file for CI/test commands. +This package wraps errors with `github.com/goark/errs` and attaches context +values such as `url`, `path`, and `dir`. ## Modules Requirement Graph [![dependency.png](./dependency.png)](./dependency.png) -[webinfo]: https://github.com/goark/webinfo "goark/webinfo: Extract metadata and structured information from web pages" +[webinfo]: https://github.com/goark/webinfo "goark/webinfo" diff --git a/Taskfile.yml b/Taskfile.yml index 4731e6a..1ac0e09 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -5,36 +5,29 @@ tasks: cmds: - task: prepare - task: test - # - task: nancy + - task: govulncheck - task: graph - build-all: - desc: Build executable binary with GoReleaser. - cmds: - - goreleaser --snapshot --skip=publish --clean - test: desc: Test and lint. cmds: - go mod verify - - go test -shuffle on ./... -coverprofile=coverage.out -cover - - go tool cover -func=coverage.out - - govulncheck ./... - - golangci-lint-v2 run --enable gosec --timeout 10m0s ./... + - go test -shuffle on ./... + - golangci-lint-v2 run --enable gosec --timeout 3m0s ./... sources: - ./go.mod - '**/*.go' - nancy: - desc: Check vulnerability of external packages with Nancy. + govulncheck: + desc: Check reachable vulnerabilities with latest govulncheck. cmds: - - depm list -j | nancy sleuth -n + - go run golang.org/x/vuln/cmd/govulncheck@latest ./... sources: - ./go.mod - '**/*.go' prepare: - - go mod tidy -v -go=1.25 + - go mod tidy -v -go=1.25.10 clean: desc: Initialize module and build cache, and remake go.sum file. 
@@ -52,3 +45,4 @@ tasks: - '**/*.go' generates: - ./dependency.png + diff --git a/errs.go b/errs.go index 1b5cb02..92867d5 100644 --- a/errs.go +++ b/errs.go @@ -8,7 +8,7 @@ var ( ErrInvalidURL = errors.New("invalid URL") ) -/* Copyright 2025 Spiegel +/* Copyright 2025-2026 Spiegel * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/fetch.go b/fetch.go index 8c59d45..0ec7812 100644 --- a/fetch.go +++ b/fetch.go @@ -15,39 +15,18 @@ import ( "golang.org/x/net/html/charset" ) -// Fetch retrieves metadata from the web page at urlStr and returns it as a *Webinfo. +// Fetch retrieves metadata from a web page and returns it as Webinfo. // -// Behavior: -// - Parses urlStr and performs an HTTP GET using the provided context (ctx). -// - If userAgent is empty, a default dummy User-Agent string is used. -// - Uses an HTTP client and sets the User-Agent request header. -// - Reads up to the first 1024 bytes of the response to detect the page character -// encoding via charset.DetermineEncoding (also considers the response Content-Type). -// If an encoding is detected or inferred by name, the response body is decoded -// accordingly before HTML parsing. +// It fetches the page with the given context and User-Agent (or a default one when +// empty), peeks up to 1024 bytes to determine encoding, then parses the head +// section with goquery. // -// Parsing and extracted fields: -// - Parses the document head with goquery and extracts: -// - Title: from <title>, then overridden by meta[property="twitter:title"] or meta[property="og:title"] if present. -// - Description: from meta[name="description"], then overridden by meta[property="twitter:description"] or meta[property="og:description"]. -// - ImageURL: from meta[property="twitter:image"] or meta[property="og:image"]. -// - Canonical: from link[rel="canonical"]. 
+// Extraction precedence is kept explicit: +// title: title -> twitter:title -> og:title +// description: meta[name=description] -> twitter:description -> og:description +// image: twitter:image -> og:image // -// - The returned Webinfo contains at least: -// - URL: the original urlStr (string form). -// - Location: the final request URL (after redirects) from the response. -// - UserAgent: the User-Agent actually used. -// -// Error handling and resource cleanup: -// - Network, URL parsing, encoding detection, and HTML parsing errors are wrapped with contextual information (including the URL). -// - The response body is closed in a deferred function; any close error is joined with the returned error. -// - On error, Fetch returns a nil *Webinfo and a non-nil error. -// -// Notes and guarantees: -// - The first 1024 bytes are peeked (without advancing the reader) to determine encoding. -// - DetermineEncoding's boolean return value is ignored (some encodings like Shift_JIS may be reported inconsistently); the detected encoding or a named encoding (via encoding.GetEncoding) is preferred. -// - The function honors context cancellation for the HTTP request. -// - Caller should assume that a non-nil *Webinfo is returned only on success; otherwise, info is nil. +// Returned errors are wrapped with context. Response close errors are joined. func Fetch(ctx context.Context, urlStr, userAgent string) (info *Webinfo, err error) { // check arguments parsed, uerr := fetch.URL(strings.TrimSpace(urlStr)) @@ -158,10 +137,8 @@ func Fetch(ctx context.Context, urlStr, userAgent string) (info *Webinfo, err er return } -// getUserAgent returns a user-agent string to use for HTTP requests. -// It trims whitespace from the provided ua parameter; if the trimmed value is empty, -// it returns a default (dummy) User-Agent string ("Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0"). -// Otherwise, it returns the supplied ua unchanged. 
+// getUserAgent returns ua if non-empty after trimming; otherwise it returns +// the package default User-Agent string. func getUserAgent(ua string) string { if len(strings.TrimSpace(ua)) == 0 { return "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0" //dummy user-agent string @@ -169,7 +146,7 @@ func getUserAgent(ua string) string { return ua } -/* Copyright 2025 Spiegel +/* Copyright 2025-2026 Spiegel * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/fetch_test.go b/fetch_test.go index 2e061b5..a0c4c5c 100644 --- a/fetch_test.go +++ b/fetch_test.go @@ -277,7 +277,7 @@ func TestFetch_ISO2022JP_Encoding(t *testing.T) { } } -/* Copyright 2025 Spiegel +/* Copyright 2025-2026 Spiegel * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/go.mod b/go.mod index 8f3cfa3..8ebe869 100644 --- a/go.mod +++ b/go.mod @@ -1,15 +1,17 @@ module github.com/goark/webinfo -go 1.25 +go 1.25.10 + +toolchain go1.26.3 require ( - github.com/PuerkitoBio/goquery v1.11.0 - github.com/goark/errs v1.3.2 - github.com/goark/fetch v0.5.0 + github.com/PuerkitoBio/goquery v1.12.0 + github.com/goark/errs v1.3.4 + github.com/goark/fetch v0.5.3 github.com/mattn/go-encoding v0.0.2 - golang.org/x/image v0.33.0 - golang.org/x/net v0.47.0 - golang.org/x/text v0.31.0 + golang.org/x/image v0.40.0 + golang.org/x/net v0.54.0 + golang.org/x/text v0.37.0 ) require github.com/andybalholm/cascadia v1.3.3 // indirect diff --git a/go.sum b/go.sum index d9ec9e8..f65ed5a 100644 --- a/go.sum +++ b/go.sum @@ -1,11 +1,11 @@ -github.com/PuerkitoBio/goquery v1.11.0 h1:jZ7pwMQXIITcUXNH83LLk+txlaEy6NVOfTuP43xxfqw= -github.com/PuerkitoBio/goquery v1.11.0/go.mod h1:wQHgxUOU3JGuj3oD/QFfxUdlzW6xPHfqyHre6VMY4DQ= +github.com/PuerkitoBio/goquery v1.12.0 h1:pAcL4g3WRXekcB9AU/y1mbKez2dbY2AajVhtkO8RIBo= 
+github.com/PuerkitoBio/goquery v1.12.0/go.mod h1:802ej+gV2y7bbIhOIoPY5sT183ZW0YFofScC4q/hIpQ=
 github.com/andybalholm/cascadia v1.3.3 h1:AG2YHrzJIm4BZ19iwJ/DAua6Btl3IwJX+VI4kktS1LM=
 github.com/andybalholm/cascadia v1.3.3/go.mod h1:xNd9bqTn98Ln4DwST8/nG+H0yuB8Hmgu1YHNnWw0GeA=
-github.com/goark/errs v1.3.2 h1:ifccNe1aK7Xezt4XVYwHUqalmnfhuphnEvh3FshCReQ=
-github.com/goark/errs v1.3.2/go.mod h1:ZsQucxaDFVfSB8I99j4bxkDRfNOrlKINwg72QMuRWKw=
-github.com/goark/fetch v0.5.0 h1:mZM4Gd3DfLXwrCjw/2rbUBnifW/vqihjV3HkGN3xKXI=
-github.com/goark/fetch v0.5.0/go.mod h1:hv29ebMJTGgOL5hdZ05xxEyfjChqCEXRHH+PNOKh6IE=
+github.com/goark/errs v1.3.4 h1:/+/xwF3UwXGxGGLurzBTaMMoryTBeaPfheJ1aW9cglA=
+github.com/goark/errs v1.3.4/go.mod h1:4xM7rorwYQlqh9kUhfKpC5P7VAJW2KfvuQpYnTaU0ek=
+github.com/goark/fetch v0.5.3 h1:ZwT5N04BSiPw2tF2gG5MXsmoSr+A/sxE52KJWy/aWzw=
+github.com/goark/fetch v0.5.3/go.mod h1:jgu+bn1HN8AfEks+ENqiPJVF99Cvs7cb2dqSujhjsOE=
 github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
 github.com/mattn/go-encoding v0.0.2 h1:OC1L+QXLJge9n7yIE3R5Os/UNasUeFvK3Sa4NjbDi6c=
 github.com/mattn/go-encoding v0.0.2/go.mod h1:WUNsdPQLK4JYRzkn8IAdmYKFYGGJ4/9YPxdPoMumPgY=
@@ -16,8 +16,8 @@ golang.org/x/crypto v0.13.0/go.mod h1:y6Z2r+Rw4iayiXXAIxJIDAJ1zMW4yaTpebo8fPOliY
 golang.org/x/crypto v0.19.0/go.mod h1:Iy9bg/ha4yyC70EfRS8jz+B6ybOBKMaSxLj6P6oBDfU=
 golang.org/x/crypto v0.23.0/go.mod h1:CKFgDieR+mRhux2Lsu27y0fO304Db0wZe70UKqHu0v8=
 golang.org/x/crypto v0.31.0/go.mod h1:kDsLvtWBEx7MV9tJOj9bnXsPbxwJQ6csT/x4KIN4Ssk=
-golang.org/x/image v0.33.0 h1:LXRZRnv1+zGd5XBUVRFmYEphyyKJjQjCRiOuAP3sZfQ=
-golang.org/x/image v0.33.0/go.mod h1:DD3OsTYT9chzuzTQt+zMcOlBHgfoKQb1gry8p76Y1sc=
+golang.org/x/image v0.40.0 h1:Tw4GyDXMo+daZN1znreBRC3VayR1aLFUyUEOLUdW1a8=
+golang.org/x/image v0.40.0/go.mod h1:uIc348UZMSvS5Z65CVZ7iDPaNobNFEPeJ4kbqTOszmA=
 golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4=
 golang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=
 golang.org/x/mod v0.12.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=
@@ -32,8 +32,8 @@ golang.org/x/net v0.15.0/go.mod h1:idbUs1IY1+zTqbi8yxTbhexhEEk5ur9LInksu6HrEpk=
 golang.org/x/net v0.21.0/go.mod h1:bIjVDfnllIU7BJ2DNgfnXvpSvtn8VRwhlsaeUTyUS44=
 golang.org/x/net v0.25.0/go.mod h1:JkAGAh7GEvH74S6FOH42FLoXpXbE/aqXSrIQjXgsiwM=
 golang.org/x/net v0.33.0/go.mod h1:HXLR5J+9DxmrqMwG9qjGCxZ+zKXxBru04zlTvWlWuN4=
-golang.org/x/net v0.47.0 h1:Mx+4dIFzqraBXUugkia1OOvlD6LemFo1ALMHjrXDOhY=
-golang.org/x/net v0.47.0/go.mod h1:/jNxtkgq5yWUGYkaZGqo27cfGZ1c5Nen03aYrrKpVRU=
+golang.org/x/net v0.54.0 h1:2zJIZAxAHV/OHCDTCOHAYehQzLfSXuf/5SoL/Dv6w/w=
+golang.org/x/net v0.54.0/go.mod h1:Sj4oj8jK6XmHpBZU/zWHw3BV3abl4Kvi+Ut7cQcY+cQ=
 golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
 golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
 golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
@@ -70,8 +70,8 @@ golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
 golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
 golang.org/x/text v0.15.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
 golang.org/x/text v0.21.0/go.mod h1:4IBbMaMmOPCJ8SecivzSH54+73PCFmPWxNTLm+vZkEQ=
-golang.org/x/text v0.31.0 h1:aC8ghyu4JhP8VojJ2lEHBnochRno1sgL6nEi9WGFGMM=
-golang.org/x/text v0.31.0/go.mod h1:tKRAlv61yKIjGGHX/4tP1LTbc13YSec1pxVEWXzfoeM=
+golang.org/x/text v0.37.0 h1:Cqjiwd9eSg8e0QAkyCaQTNHFIIzWtidPahFWR83rTrc=
+golang.org/x/text v0.37.0/go.mod h1:a5sjxXGs9hsn/AJVwuElvCAo9v8QYLzvavO5z2PiM38=
 golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
 golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
 golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc=
diff --git a/webinfo.go b/webinfo.go
index 8efecf9..ba9feae 100644
--- a/webinfo.go
+++ b/webinfo.go
@@ -22,18 +22,8 @@ import (
 	"golang.org/x/image/draw"
 )
 
-// Webinfo holds common metadata extracted from a web page.
-// It captures information useful for previews or link metadata:
-//
-// - URL: the original page URL.
-// - Location: the location declared by the page (if any).
-// - Canonical: the canonical URL declared by the page (if any).
-// - Title: the page title.
-// - Description: a short summary or meta description for the page.
-// - ImageURL: a representative image URL suitable for previews.
-// - UserAgent: the User-Agent string used to fetch the page.
-//
-// Fields may be empty or nil when the corresponding metadata is not present.
+// Webinfo stores metadata extracted from a web page and values used for
+// follow-up image download operations.
 type Webinfo struct {
 	URL string `json:"url,omitempty"` // Original page URL
 	Location string `json:"location,omitempty"` // Location
@@ -44,51 +34,18 @@ type Webinfo struct {
 	UserAgent string `json:"user_agent,omitempty"` // User-Agent used to fetch the page
 }
 
-// DownloadImage downloads the image pointed to by w.ImageURL and saves it to destDir,
-// returning the path of the saved file (outPath) or an error.
-//
-// Behavior:
-// - The method is a receiver on *Webinfo and will return an error if w is nil or if
-// ImageURL is empty.
-// - ctx is used to control/cancel the underlying HTTP request.
-// - destDir is cleaned with filepath.Clean. If it is non-empty, the directory (and any
-// required parents) will be created with mode 0750. If destDir is empty, file
-// creation uses the system/default behavior for temporary or current directories.
-// - If `temporary` is true, the image is written to a temporary file (created via
-// the package-level `createFile` helper which wraps `os.CreateTemp`) and the
-// temporary file path is returned. If the URL path does not contain a filename,
-// `temporary` is forced true.
-// - If `temporary` is false, the image is written to `destDir` with the filename
-// taken from the URL path. If the URL filename has no extension, an extension is
-// appended (see extension resolution below). Existing files will be truncated by
-// the underlying `createFile`/`os.Create` behavior.
+// DownloadImage downloads w.ImageURL and writes it under destDir.
 //
-// HTTP download and content-type/extension resolution:
-// - The image is fetched using an HTTP GET performed with the provided context; the
-// request User-Agent is set via getUserAgent(w.UserAgent).
-// - Extension resolution order:
-// 1) Extension from the URL path (if present).
-// 2) Extension(s) derived from the Content-Type response header via mime.ExtensionsByType.
-// 3) If still unknown, the first up-to-512 bytes of the body are read and
-// http.DetectContentType is used to guess the content type, then mime.ExtensionsByType.
-// 4) If no extension can be determined, ".img" is used as a fallback.
-// - When bytes are sniffed from the body, they are prepended back to the reader so the
-// full image is written to disk. When multiple extensions are returned by
-// mime.ExtensionsByType the implementation picks the last returned extension.
-// - File creation is performed via the package-level `createFile` variable which tests
-// may override to simulate create failures.
+// If temporary is true, or if the URL path has no filename, a temporary file is created.
+// Otherwise the output file name is derived from the URL path.
 //
-// Resource management and errors:
-// - The response body and any created file are closed using deferred cleanup; any close
-// errors are joined into the returned error.
-// - I/O, network and OS errors are returned (wrapped with contextual information).
-// - On success, outPath contains the absolute/relative path to the saved image file;
-// on error, outPath will be empty and err will describe the failure.
+// Extension resolution order is:
+// 1. URL path extension
+// 2. response Content-Type
+// 3. sniffed content type from up to 512 bytes
+// 4. fallback ".img"
 //
-// Notes:
-// - The function may truncate an existing destination file with the same name.
-// - The exact behavior of temporary file placement when destDir is empty follows the
-// semantics of os.CreateTemp.
+// Returned errors are wrapped with context and include cleanup failures.
 func (w *Webinfo) DownloadImage(ctx context.Context, destDir string, temporary bool) (outPath string, err error) {
 	if w == nil {
 		err = errs.Wrap(ErrNullPointer)
@@ -212,50 +169,16 @@ func (w *Webinfo) DownloadImage(ctx context.Context, destDir string, temporary b
 	return
 }
 
-// DownloadThumbnail downloads the image referenced by the Webinfo receiver, scales it
-// to the requested width (preserving aspect ratio), and writes the resulting thumbnail
-// image to disk.
+// DownloadThumbnail downloads the source image, resizes it to width while keeping
+// aspect ratio, and writes the thumbnail to destDir.
 //
-// The method returns the path to the created thumbnail file or an error. Behavior details:
-// - If the receiver is nil, ErrNullPointer is returned.
-// - If width <= 0, a default width of 150 pixels is used.
-// - destDir is cleaned and, if non-empty, created with mode 0750 (os.MkdirAll).
-// - The original image is always downloaded to a temporary file via DownloadImage(..., true).
-// That temporary original file is removed when the function returns (even on error).
-// - The original image file is opened and decoded. If decoding fails, an error is returned.
-// - The thumbnail height is computed to preserve aspect ratio: newH = round(width * origH / origW).
-// newH is clamped to at least 1 pixel.
-// - The image is resized using a Catmull-Rom resampler into an RGBA image of size
-// width x newH.
-// - The output format/extension is chosen from the decoded format: jpeg/jpg → .jpg, png → .png,
-// gif → .gif. Unknown formats fall back to PNG.
-// - If `temporary` is true, the thumbnail file is created via the package-level
-// `createFile` helper (which wraps `os.CreateTemp`) in `destDir` using the
-// pattern "webinfo-thumb-*<ext>"; the temporary file path is returned.
-// - If `temporary` is false, the output filename is derived from the original image
-// URL basename (falling back to "webinfo-image") and named "<base>-thumb<ext>" in
-// `destDir`.
-// - The encoder used to write the thumbnail is the package-level `outputImage` function
-// variable; tests may replace this variable to simulate encoder failures. The image
-// decoding step uses the package-level `decodeImage` wrapper around `image.Decode`,
-// which tests may also override.
-// - Files are properly closed with deferred cleanup; any close/remove errors are joined into
-// the returned error using the errs package.
-// - All filesystem, download, and image-processing errors are wrapped with contextual
-// information (e.g., paths, URL) before being returned.
+// width defaults to 150 when width <= 0.
 //
-// Parameters:
-// - ctx: context for cancellation and timeouts passed to DownloadImage and other operations.
-// - destDir: destination directory for the thumbnail (cleaned). If empty, creation uses the
-// current directory semantics of os.Create/os.CreateTemp.
-// - width: desired thumbnail width in pixels (defaults to 150 if <= 0).
-// - temporary: if true, create a uniquely-named temporary file; otherwise create a stable
-// filename based on the original image basename.
+// The source image is downloaded to a temporary file first and removed on return.
+// Output uses a temporary name when temporary is true; otherwise it uses
+// "<base>-thumb<ext>" derived from the original image URL.
 //
-// Returns:
-// - outPath: filesystem path to the created thumbnail file (valid when err == nil).
-// - err: non-nil on failure; common failure reasons include download errors, decode errors,
-// filesystem errors, and invalid image dimensions (ErrNoImageURL).
+// Returned errors are wrapped with context and include cleanup failures.
 func (w *Webinfo) DownloadThumbnail(ctx context.Context, destDir string, width int, temporary bool) (outPath string, err error) {
 	if w == nil {
 		err = errs.Wrap(ErrNullPointer)
@@ -377,9 +300,8 @@ func (w *Webinfo) DownloadThumbnail(ctx context.Context, destDir string, width i
 	return
 }
 
-// outputImage encodes the provided *image.RGBA src and writes it to dst using
-// the encoder corresponding to the given format string. It is a variable so
-// tests can replace it to simulate encoder failures.
+// outputImage writes src to dst using format-specific encoders.
+// Tests may replace this variable.
 var outputImage = func(dst *os.File, src *image.RGBA, format string) error {
 	switch format {
 	case "jpeg", "jpg":
@@ -392,9 +314,8 @@ var outputImage = func(dst *os.File, src *image.RGBA, format string) error {
 	return png.Encode(dst, src) // default to PNG
 }
 
-// createFile is a package-level helper used to create files. It abstracts
-// the creation of temporary and permanent files so tests can replace it to
-// simulate failures during os.Create/os.CreateTemp.
+// createFile creates temporary or permanent files.
+// Tests may replace this variable.
 var createFile = func(temp bool, dir, pathOrPattern string) (*os.File, error) {
 	if temp {
 		return os.CreateTemp(dir, pathOrPattern)
@@ -402,21 +323,19 @@ var createFile = func(temp bool, dir, pathOrPattern string) (*os.File, error) {
 	return os.Create(filepath.Clean(pathOrPattern))
 }
 
-// decodeImage is a package-level wrapper around image.Decode so tests can
-// replace it to simulate decoding behaviors (e.g., returning zero-dimension
-// images) without modifying stdlib functions.
+// decodeImage wraps image.Decode.
+// Tests may replace this variable.
 var decodeImage = func(r io.Reader) (image.Image, string, error) {
 	return image.Decode(r)
 }
 
-// newHTTPClient returns the http.Client used for web requests. It is a package-level
-// variable so tests can override it. By default it sets a 30-second timeout for
-// the whole request (connect+read+write).
+// newHTTPClient returns the default HTTP client used by downloads.
+// Tests may replace this variable.
 var newHTTPClient = func() *http.Client {
 	return &http.Client{Timeout: 30 * time.Second}
 }
 
-/* Copyright 2025 Spiegel
+/* Copyright 2025-2026 Spiegel
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
diff --git a/webinfo_test.go b/webinfo_test.go
index 20f1b1c..241759b 100644
--- a/webinfo_test.go
+++ b/webinfo_test.go
@@ -187,7 +187,7 @@ func TestDownloadImage_TemporaryWhenNoFilenameAndContentType(t *testing.T) {
 		t.Fatalf("tmp file not created in dest: %s", out)
 	}
 	ext := filepath.Ext(out)
-	if ext != ".jpg" && ext != ".jpeg" && ext != ".img" {
+	if ext != ".jpg" && ext != ".jpeg" && ext != ".pjpeg" && ext != ".img" {
 		t.Fatalf("unexpected extension %q", ext)
 	}
 	got := readFile(t, out)
@@ -1068,7 +1068,7 @@ func TestDownloadThumbnail_TemporaryDefaultDest(t *testing.T) {
 	}
 }
 
-/* Copyright 2025 Spiegel
+/* Copyright 2025-2026 Spiegel
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
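
Review note: the shortened `DownloadThumbnail` comment no longer spells out the sizing rule that the removed text described (`newH = round(width * origH / origW)`, clamped to at least 1 pixel, with `width` defaulting to 150 when `width <= 0`). The sketch below restates that rule as a standalone function. It is an illustration only: `thumbSize`, `defaultThumbWidth`, and the error returned for non-positive source dimensions are hypothetical names and choices, not identifiers from `webinfo.go`.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// defaultThumbWidth mirrors the documented default of 150 pixels.
// The constant name is an assumption made for this sketch.
const defaultThumbWidth = 150

// thumbSize is a hypothetical helper (not part of webinfo) that applies the
// documented rule: newH = round(width * origH / origW), clamped to >= 1.
func thumbSize(origW, origH, width int) (newW, newH int, err error) {
	if origW <= 0 || origH <= 0 {
		// The real package reports invalid dimensions with its own sentinel
		// error; a plain error is used here to keep the sketch self-contained.
		return 0, 0, errors.New("invalid image dimensions")
	}
	if width <= 0 {
		width = defaultThumbWidth
	}
	newH = int(math.Round(float64(width) * float64(origH) / float64(origW)))
	if newH < 1 {
		newH = 1 // very wide images would otherwise round down to zero height
	}
	return width, newH, nil
}

func main() {
	cases := [][3]int{
		{1200, 630, 0},  // default width applies
		{4000, 10, 100}, // height is clamped to 1
		{0, 10, 100},    // invalid source width
	}
	for _, c := range cases {
		w, h, err := thumbSize(c[0], c[1], c[2])
		fmt.Println(w, h, err)
	}
}
```

For a 1200x630 source with `width` 0, the rule gives 150x79 (78.75 rounded); for a 4000x10 source with `width` 100, the unclamped height would round to 0, so it becomes 1.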