add support for scraping with surf (tls impersonation) #6806

Open
feederbox826 wants to merge 7 commits into stashapp:develop from feederbox826:scrape-surf

Conversation

@feederbox826
Member

adds https://github.com/enetx/surf support

this will let us bypass basic bot detection tests and hopefully reduce the number of external Python dependencies. This will likely succeed even where cloudscraper fails

It is not documented yet; this is just a proof of concept, to see whether the concept/implementation needs work

can be tested with https://github.com/feederbox826/scrapers/blob/main/scrapers/tls-fprint.yml

add

driver:
  useSurf: true

you can look up the returned fingerprint at https://ja3.zone/

expected peetprint is 1d4ffe9b0e34acac0bd883fa7f79d7b5

Code was sloppily done by copying the above implementation and combining it with the surf readme. User-Agent is explicitly dropped since it would defeat our anti-fingerprinting efforts, and other headers are excluded (open for discussion)

@discourse-stashapp

This pull request has been mentioned on Stash Forum. There might be relevant details there:

https://discourse.stashapp.cc/t/jav-english-scraper-how-do-i-get-it-to-work/6669/8

@feederbox826 feederbox826 marked this pull request as draft April 11, 2026 04:51
@feederbox826
Member Author

needs golang version bump, will wait for other backend merges to hit, but would help greatly simplify scrapers

@feederbox826 feederbox826 marked this pull request as ready for review April 29, 2026 02:44
Collaborator

@Gykes Gykes left a comment


So, these are just thoughts from someone with minimal knowledge of Surf. Another thought: we should probably add some tests for this.

Comment thread pkg/scraper/url.go Outdated
Comment on lines +135 to +139
if resp.StatusCode >= 400 {
return nil, fmt.Errorf("http error %d:%s", resp.StatusCode, http.StatusText(resp.StatusCode))
}

defer resp.Body.Close()
Collaborator


defer resp.Body.Close() is below the check, so any 4xx/5xx response leaks the connection. I would move the defer to immediately after the client.Do(req) error check so it covers all return paths.

Collaborator


I would consider that a bug as well, tbh, but perhaps I have a knowledge gap. My initial thought was to move it above so that the >= 400 path isn't leaking the body. I don't think it's a blocker for the PR, but I would need to research whether it's actually a problem.

Member Author


When I moved it up, it immediately complained that I was closing the body before the err check, so I moved it below the error handling

Comment thread pkg/scraper/url.go
Comment thread pkg/scraper/url.go
@WithoutPants WithoutPants added this to the Version 0.32.0 milestone May 3, 2026
@WithoutPants WithoutPants added the feature Pull requests that add a new feature or functionality label May 3, 2026
Collaborator

@WithoutPants WithoutPants left a comment


Static review looks ok. I think it's worth adding a comment about explicitly omitting the user agent for reference. Should be ready for documentation.

@feederbox826
Member Author

Static review looks ok. I think it's worth adding a comment about explicitly omitting the user agent for reference. Should be ready for documentation.

Other headers are also not copied, should they be?

@WithoutPants
Collaborator

Other headers are also not copied, should they be?

I think so. If there's a reason not to include them, it should be similarly documented.

@feederbox826
Member Author

feederbox826 commented May 4, 2026

confirmed it does remove User-Agent and preserves other headers

scraper demo
name: TLS Print

sceneByURL: &byURLScraper
  - action: scrapeJson
    url:
      - https://tls.peet.ws/api/clean
      - https://5dc5bf77976e4a4873f4gu793chyyyyyr.oast.pro
    scraper: tlsScraper
groupByURL: *byURLScraper
galleryByURL: *byURLScraper
imageByURL: *byURLScraper

jsonScrapers:
  tlsScraper:
    common:
      $peetprint: peetprint_hash
      $akami: akamai_hash
    scene: &objResponse
      Title: $peetprint
      Studio:
        Name: $akami
    gallery: *objResponse
    image: *objResponse

driver:
  useSurf: true
  headers:
    - Key: PullRequest
      Value: 6806
    - Key: User-Agent
      Value: stashapp/scraper
# Last Updated April 7, 2026
request
GET / HTTP/2.0
Host: 5dc5bf77976e4a4873f4gu793chyyyyyr.oast.pro
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: en-US,en;q=0.9
Priority: u=0, i
Pullrequest: 6806
Sec-Ch-Ua: "Not:A-Brand";v="99", "Google Chrome";v="145", "Chromium";v="145"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36

@feederbox826 feederbox826 requested a review from DogmaDragon May 4, 2026 02:35
@feederbox826 feederbox826 added the missing documentation Feature or functionality lacks proper documentation label May 4, 2026
Collaborator

@DogmaDragon DogmaDragon left a comment


Documentation check passed.

@DogmaDragon DogmaDragon removed the missing documentation Feature or functionality lacks proper documentation label May 8, 2026

Labels

feature Pull requests that add a new feature or functionality

5 participants