Inconsistent Google Referrer Spoof #171
Replies: 4 comments
-
This is an interesting behaviour. I have opened an issue for this discussion to remind me to look into it later here.
-
The testing here is thorough, and the conclusion makes sense. Fixing the visible Referer header is straightforward, but once the site is looking at headers like Sec-Fetch-Site, the browser security model becomes the real constraint rather than your implementation. I would probably document this as a known limit instead of treating it as a bug still waiting to be solved. That makes expectations clearer for users and avoids suggesting that a full spoof is possible when the browser does not actually allow it.
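To make that constraint concrete, here is a rough sketch of the kind of check a site could run on incoming headers. It is an assumption about what such a check might look like, not something taken from any particular site: the name `looks_spoofed` and the rule it applies are hypothetical.

```python
from urllib.parse import urlsplit


def looks_spoofed(headers: dict[str, str], request_host: str) -> bool:
    """Return True when Referer and Sec-Fetch-Site contradict each other.

    Hypothetical rule: a Referer pointing at a different host than the one
    being requested implies a cross-origin navigation, so Sec-Fetch-Site
    values of "none" or "same-origin" would not be expected next to it.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    referer = lowered.get("referer")
    fetch_site = lowered.get("sec-fetch-site")
    if not referer or not fetch_site:
        return False  # Nothing to compare.

    referer_host = urlsplit(referer).hostname
    if referer_host is None or referer_host == request_host.lower():
        return False  # Referer is unparsable or points at the same host.

    return fetch_site in ("none", "same-origin")
```

Under this rule, a request to httpbin.org carrying a Google Referer together with Sec-Fetch-Site: none would be flagged, while the same Referer with cross-site would pass.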
-
Spoofing Sec-Fetch-Site Header

I was able to do this by inlining HTML into the page with page.set_content() and then clicking the link. You can use page.route or page.set_extra_http_headers, but setting extra HTTP headers will affect all additional requests made by the page, so a route implementation would be best. (I have not looked at other CDP implementations to do this.)

Code

import asyncio
from patchright.async_api import async_playwright


async def spoof_cross_site():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            channel="chrome",
            proxy=None,
            headless=False,
        )
        context = await browser.new_context(no_viewport=True)
        page = await context.new_page()

        # Only requests to these exact URLs get the spoofed headers.
        state = set(['https://manytools.org/http-html-text/http-request-headers/','https://httpbin.org/headers'])

        async def handle_route(route):
            if route.request.url in state:
                # Keep the original headers and override the two spoofed ones.
                await route.continue_(headers={
                    **route.request.headers,
                    "Sec-Fetch-Site": "cross-site",
                    "Referer": "https://www.google.com/"
                })
            else:
                await route.continue_()

        await page.route("**/*", handle_route)

        url = "https://httpbin.org/headers"
        # Navigate by clicking a link in inline HTML instead of calling page.goto().
        await page.set_content(f'<a href="{url}" id="link">Go</a>')
        await page.click("#link")

        # Keep the browser open until Enter is pressed.
        await asyncio.to_thread(input, "Press Enter to continue: ")
        await browser.close()


asyncio.run(spoof_cross_site())

Result

The resulting request headers captured by httpbin:

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Host": "httpbin.org",
    "Priority": "u=0, i",
    "Referer": "https://www.google.com/",
    "Sec-Ch-Ua": "\"Not:A-Brand\";v=\"99\", \"Google Chrome\";v=\"145\", \"Chromium\";v=\"145\"",
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": "\"Windows\"",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36"
  }
}

I believe everything matches up here except for Google cookies. I have heard that some people visit Google first just to pick those up and make their scraper look more legitimate.
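One detail worth guarding against in the handler above: Playwright documents the names in request.headers as lower-cased, so spreading that dict and then adding "Referer" or "Sec-Fetch-Site" in mixed case can leave two keys for the same header. A small sketch of a case-insensitive merge follows. The helper name `merge_headers` and the `SPOOFED` constant are made up for illustration, and it assumes patchright behaves like Playwright here.

```python
def merge_headers(original: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Merge overrides into original, matching header names case-insensitively."""
    # Normalise every existing name to lowercase so each header has one key.
    merged = {name.lower(): value for name, value in original.items()}
    for name, value in overrides.items():
        merged[name.lower()] = value  # Replace in place or add a new entry.
    return merged


# Hypothetical constant holding the two headers spoofed in this thread.
SPOOFED = {
    "Sec-Fetch-Site": "cross-site",
    "Referer": "https://www.google.com/",
}

if __name__ == "__main__":
    # A request that already carries a lowercase referer keeps a single entry.
    print(merge_headers({"referer": "https://example.com/", "accept": "*/*"}, SPOOFED))
```

Inside the route handler this would be used as route.continue_(headers=merge_headers(route.request.headers, SPOOFED)).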
-
Enhancement

I could not stand the route("**/*") implementation, so I found out I could use route.fulfill on "https://www.google.com/spoof". All we have to do then is change the URL.

Code

import asyncio
# patchright here!
from patchright.async_api import async_playwright


async def spoof_cross_site():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            channel="chrome",
            proxy=None,
            headless=False,
        )
        context = await browser.new_context(no_viewport=True)
        page = await context.new_page()

        # Target URL; change this to point the link somewhere else.
        url = 'https://httpbin.org/headers'

        async def handle_route(route):
            # Answer the fake Google URL locally with a page holding one link.
            await route.fulfill(
                status=200,
                content_type="text/html",
                body=f"""
                <html>
                <body>
                <a id="link" href="{url}" referrerpolicy="origin">Go</a>
                </body>
                </html>
                """
            )

        # Only this single URL is intercepted; all other requests are untouched.
        await page.route("https://www.google.com/spoof", handle_route)

        await page.goto("https://www.google.com/spoof")
        await page.click('#link')

        # Keep the browser open until Enter is pressed.
        await asyncio.to_thread(input, "Press Enter to continue: ")
        await browser.close()


asyncio.run(spoof_cross_site())

Notice that referrerpolicy="origin" shortens "https://www.google.com/spoof" to "https://www.google.com/" in the headers, which is how Google actually does it.
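As a rough illustration of what the origin policy does to the referring URL, the following sketch reduces a URL to scheme, host, optional port, and a trailing slash. The function name `origin_only_referrer` is hypothetical, and the sketch skips details a browser handles, such as dropping default ports.

```python
from urllib.parse import urlsplit


def origin_only_referrer(url: str) -> str:
    """Approximate the Referer value sent under referrerpolicy="origin"."""
    parts = urlsplit(url)
    # Keep an explicit port if the URL has one; path and query are dropped.
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{parts.hostname}{port}/"


if __name__ == "__main__":
    print(origin_only_referrer("https://www.google.com/spoof"))
```

This is why the path of the fake page does not matter for the header: whatever comes after the host is stripped before the Referer is sent.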
-
I noticed two inconsistencies in Scrapling's spoofed Google Referer header.

Issue 1: Incorrect Referer header (easy fix)

Google doesn't include the query in the Referer header. The spoofed value should carry no query string; the updated spoof in the results below uses referer="https://www.google.com/".

Issue 2: Incorrect Sec-Fetch-Site header (complex)

Sec-Fetch-Site should be cross-site if the request is actually coming from Google. The current implementation sends none.

This one is much harder to fix because Sec-Fetch-Site is a forbidden request header: browsers set it automatically and block scripts from modifying it. I was unable to override it via page.set_extra_http_headers() or page.route().

Wanted to get your thoughts on this and whether it's even worth addressing.

What I Tried (all three approaches in the same script)

Results of no spoof, current spoof (with updated URL), and real Google link:

1. page.goto('https://manytools.org/...'): no referer set
2. page.goto('https://manytools.org/...', referer="https://www.google.com/")
3. Navigating directly from Google (expected behavior)