If you’ve spent any time coding in InfoSec, you’ve probably used a ton of curl to pull websites, check them for various issues or attributes, etc.

This follows redirects and sends a non-curl User-Agent:

curl -LA 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5' reddit.com

This used to work quite well, but now—not so much.

For one, curl doesn’t execute JavaScript, and that’s what most of the modern web is made of. But perhaps even worse, many companies are employing technologies to outright detect and block curl because it’s often used for scraping.
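Part of what gives it away is that, out of the box, curl announces itself in its User-Agent header, which is trivial for a server to match on. You can see this for yourself using the public httpbin.org echo endpoint:

curl -s https://httpbin.org/user-agent

That comes back as something like {"user-agent": "curl/8.5.0"}, depending on your version.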

Either way, if you use curl to pull a lot of sites en masse, you’re likely to have a massive failure rate in getting the HTML you’re looking for.

What we’ve needed for quite some time is something like curl, i.e., command-line and relatively simple, but that renders sites fully.

I’ve been using Chromium (the open-source browser project that Chrome is built on) to solve this problem for years, and I wanted to pass along the syntax for others.

I am usually doing things from Ubuntu, but you can get this to work on most UNIXy systems.
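If you don’t have it yet, installation on Ubuntu is usually a one-liner (the package name varies by distro, and on newer Ubuntu releases apt may hand you a snap instead):

sudo apt install chromium-browser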

cat domains.txt | xargs -I {} -P 4 sh -c 'timeout 25s chromium-browser --headless --no-sandbox --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36" --dump-dom https://{} 2> /dev/null > {}.html'

That’s a lot to unravel, so here’s what each piece is doing:

- cat domains.txt | xargs -I {} -P 4: takes each domain from the file and runs four jobs in parallel, substituting the domain wherever {} appears
- sh -c '…': wraps the whole thing so the redirects happen per domain instead of applying to xargs itself
- timeout 25s: kills any page that hangs longer than 25 seconds
- --headless: runs Chromium with no UI
- --no-sandbox: needed if you’re running as root (common in containers); drop it otherwise
- --user-agent="…": presents as a normal desktop Chrome rather than a headless one
- --dump-dom: prints the fully rendered DOM to stdout
- 2> /dev/null: throws away Chromium’s noisy stderr
- > {}.html: saves each result to a file named after the domain
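Before pointing it at a whole list, it’s worth running a single domain by hand to confirm your binary name and flags behave (example.com here is just a stand-in):

timeout 25s chromium-browser --headless --no-sandbox --dump-dom https://example.com > example.com.html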

What you basically end up with—assuming you have a decent machine to run this on—is hundreds of nicely rendered HTML files being created very quickly. Chromium is Chrome, so you’re getting the full rendering of the JavaScript and all the goodness that comes with that.
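Not every fetch will succeed, of course, so a quick way to spot the duds is to sort the results by size; as a rough heuristic, the tiny files are usually timeouts or block pages:

wc -c *.html | sort -n | head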

Anyway, I hope this helps someone who’s smashing their face on the desk because of curl.
