add support for scraping with surf (tls impersonation) #6806
feederbox826 wants to merge 7 commits into stashapp:develop
This pull request has been mentioned on Stash Forum. There might be relevant details there: https://discourse.stashapp.cc/t/jav-english-scraper-how-do-i-get-it-to-work/6669/8
Needs a Go version bump; will wait for other backend merges to land, but this would greatly simplify scrapers.
Force-pushed from d680c5a to d8bee48.
Gykes
left a comment
So, these are just thoughts, given my minimal knowledge of Surf. Another thought: we should probably add some tests for this.
    if resp.StatusCode >= 400 {
        return nil, fmt.Errorf("http error %d:%s", resp.StatusCode, http.StatusText(resp.StatusCode))
    }

    defer resp.Body.Close()
defer resp.Body.Close() is below the check, so any 4xx/5xx response leaks the connection. I would move the defer to immediately after the client.Do(req) error check so it covers all return paths.
I would consider that a bug as well tbh, but perhaps I have a knowledge gap. My initial thought was to move it above so that a status >= 400 doesn't leak the body. I don't think it's a blocker for the PR, but I would need to research whether it's actually a problem.
When I moved it up, it immediately complained that I was closing before the err check, so I moved it below the error check.
WithoutPants
left a comment
Static review looks ok. I think it's worth adding a comment about explicitly omitting the user agent for reference. Should be ready for documentation.
Other headers are also not copied, should they be?
I think so. If there's a reason not to include them, it should be similarly documented.
confirmed it does remove User-Agent and preserved other headers

scraper demo

    name: TLS Print
    sceneByURL: &byURLScraper
      - action: scrapeJson
        url:
          - https://tls.peet.ws/api/clean
          - https://5dc5bf77976e4a4873f4gu793chyyyyyr.oast.pro
        scraper: tlsScraper
    groupByURL: *byURLScraper
    galleryByURL: *byURLScraper
    imageByURL: *byURLScraper
    jsonScrapers:
      tlsScraper:
        common:
          $peetprint: peetprint_hash
          $akami: akamai_hash
        scene: &objResponse
          Title: $peetprint
          Studio:
            Name: $akami
        gallery: *objResponse
        image: *objResponse
    driver:
      useSurf: true
      headers:
        - Key: PullRequest
          Value: 6806
        - Key: User-Agent
          Value: stashapp/scraper
    # Last Updated April 7, 2026

request

    GET / HTTP/2.0
    Host: 5dc5bf77976e4a4873f4gu793chyyyyyr.oast.pro
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
    Accept-Encoding: gzip, deflate, br, zstd
    Accept-Language: en-US,en;q=0.9
    Priority: u=0, i
    Pullrequest: 6806
    Sec-Ch-Ua: "Not:A-Brand";v="99", "Google Chrome";v="145", "Chromium";v="145"
    Sec-Ch-Ua-Mobile: ?0
    Sec-Ch-Ua-Platform: "Windows"
    Sec-Fetch-Dest: document
    Sec-Fetch-Mode: navigate
    Sec-Fetch-Site: none
    Sec-Fetch-User: ?1
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36
DogmaDragon
left a comment
Documentation check passed.
adds https://github.com/enetx/surf support
This will let us bypass basic bot-detection tests and hopefully reduce the number of external Python dependencies. It will likely succeed even where cloudscraper fails.
It is not documented; this is just a proof of concept to see whether the concept/implementation needs work.
can be tested with https://github.com/feederbox826/scrapers/blob/main/scrapers/tls-fprint.yml
add
you can look up the returned fingerprint at https://ja3.zone/
expected peetprint is 1d4ffe9b0e34acac0bd883fa7f79d7b5
Code was sloppily done by copying the above implementation and combining it with the surf README. User-Agent is explicitly dropped since it would defeat our anti-fingerprinting efforts, and other headers are excluded (open for discussion).
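To check the expected peetprint programmatically rather than by eye, something like the following could work. It is a rough sketch: `checkPeetprint` and `tlsFingerprint` are hypothetical names, and it assumes the response from https://tls.peet.ws/api/clean is a JSON object with top-level `peetprint_hash` and `akamai_hash` fields, which is inferred from the selectors in the test scraper rather than from the service's documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expectedPeetprint is the hash quoted in the PR description.
const expectedPeetprint = "1d4ffe9b0e34acac0bd883fa7f79d7b5"

// tlsFingerprint holds only the two fields the test scraper reads.
// The field names are an assumption based on the scraper's selectors.
type tlsFingerprint struct {
	Peetprint string `json:"peetprint_hash"`
	Akamai    string `json:"akamai_hash"`
}

// checkPeetprint decodes a fingerprint response body and reports an error
// if the peetprint hash differs from the expected one.
func checkPeetprint(body []byte) error {
	var fp tlsFingerprint
	if err := json.Unmarshal(body, &fp); err != nil {
		return fmt.Errorf("decode fingerprint: %w", err)
	}
	if fp.Peetprint != expectedPeetprint {
		return fmt.Errorf("peetprint mismatch: got %q, want %q", fp.Peetprint, expectedPeetprint)
	}
	return nil
}
```

The body passed in would be whatever the scraper's HTTP client returned for the fingerprint URL; a mismatch suggests the impersonation profile is not being applied.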