Anti DMCA bot/scraper

AZAnon · Oct 23, 2024

This topic has come up many times in various forms, but none of the threads I found gave me a satisfactory answer.
Are there any blacklists of DMCA bot IPs?
A lot of sites have a problem with them and there have been many forum threads about them, so I'm a bit surprised that no one has put together such a list.

Most of them work by searching for matching phrases on Google or another search engine.

But once they find a site that fits the pattern and set it as a target, they act like a regular scraping bot that checks newly published entries (the latest on the homepage, etc.) and collects data from them.

Do you have any ways to effectively detect and block such bots (scrapers that feed DMCA submissions)?

For example, the likes of Comeso, which even spam reports, often with false positives, reporting even non-existent pages (no content on them except the page template).

Hyperz · Oct 23, 2024

I'd imagine that either they use rotating proxies, making such a list rather pointless, or if they use a few static IPs it might just be a case of people keeping them private to have an advantage over the competition. After all, if you manage to block the bulk of DMCA bots while your main competitor is getting flooded by them, obviously that would be a good thing for your site. But IMHO even if someone was offering such a list, would you really go ahead and block those IPs not knowing if they are in fact DMCA bot IPs? Without validation you could end up blocking a bunch of random IPs, or worse, blocking Google/Bing/etc crawlers or CDN IPs.

With all that said, it shouldn't be too hard for anyone who knows some basic coding (which TBH you should if you're running a site and are serious about it) to figure out which requests are coming from DMCA bots and what IPs they are using.
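To make the validation point concrete, here is a minimal sketch of filtering a candidate blocklist against networks you never want to block. The name `filter_blocklist` and the `ALLOWLIST` contents are assumptions for illustration: the ranges shown are reserved documentation ranges standing in for whatever ranges your search engines and CDN actually publish.

```python
import ipaddress

# Hypothetical allowlist. These are documentation-only ranges used as
# placeholders; replace them with the ranges published by the crawlers
# and CDN you rely on.
ALLOWLIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowlisted(ip: str) -> bool:
    """Return True if the address falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    # A v4 address is never "in" a v6 network (and vice versa), so mixed
    # lists are safe to check this way.
    return any(addr in net for net in ALLOWLIST)


def filter_blocklist(candidates):
    """Keep only candidate IPs that are not inside an allowlisted network."""
    return [ip for ip in candidates if not is_allowlisted(ip)]


if __name__ == "__main__":
    suspects = ["192.0.2.10", "198.51.100.7", "2001:db8::1"]
    print(filter_blocklist(suspects))
```

Running a third-party list through a check like this before applying it at least keeps known-good ranges out of the block rules, though it says nothing about whether the remaining IPs really are DMCA bots.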

AZAnon · Oct 23, 2024

Perhaps I asked the question badly.

I mentioned blacklists because some exist with IP lists of “bad bots”.

Looking through the logs, I also caught a few bots by their behavior pattern, but in general DMCA bots are closely related to scraper bots.

They work in a similar way, so I wanted to ask about ready-made solutions for protecting against scrapers, including DMCA bots.

For example, has anyone confirmed that some kind of honeypot trap can effectively catch DMCA scrapers/bots?

Obviously excluding the simplest advice, like analyzing logs for specific behavior, blocking by headers, etc.

Hyperz · Oct 23, 2024

Yeah, just looking at HTTPD logs isn't going to help all that much with that. Too much noise, not enough signal. The best starting point would probably be the assumption that by default the bots used by the bigger copyright enforcement agencies are simple generic ones. Meaning that they don't tune them or write a whole new bot on a per-site basis, and they don't make any effort to simulate a real human interacting with a site. Another safe assumption is that off-the-shelf anti-bot solutions like captchas or Cloudflare/DDoS-Guard challenges aren't an issue for them, because they have either been whitelisted by those services or the challenges are cheap and easy to bypass (which is the case). And finally, they almost certainly render the page (JS).

So with that in mind you could do some basic user interaction tracking. Check where the clicks are happening. Check if there is mouse movement for desktop User-Agents. Implement a simple custom captcha-like system that a generic bot can't pass without adding extra code for your special snowflake solution. Play with invisible HTML honeypot links (but make sure not to serve them to normal crawlers or otherwise get them indexed). I think a lot of these bots don't even bother checking the actual video content, so try tracking IPs that visit a lot of pages but never watch anything (again, be careful not to shoot legit crawlers). And my personal favorite: get creative with the fact that they will happily render any JS you throw at them, so you can feed the bots some unlubed dildos.

Just a few ideas. If you're dedicated and creative enough, it's prolly not that hard to filter out the bulk of DMCA bots with enough patience. You'll never get rid of all of them though. And keeping them out will always be a cat and mouse game.
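Two of the ideas above, the hidden honeypot link and the "visits many pages but never watches anything" heuristic, can be sketched together in a few lines. Everything named here is an assumption for illustration: `HONEYPOT_PATH` is a made-up URL, `MIN_PAGES` is an arbitrary threshold, and how page views and plays get reported to the tracker depends entirely on your own site.

```python
from collections import defaultdict
from dataclasses import dataclass

HONEYPOT_PATH = "/trap-link"  # hypothetical hidden URL, never shown to humans
MIN_PAGES = 30                # assumed threshold; tune it on real traffic


@dataclass
class Stats:
    pages: int = 0
    plays: int = 0
    hit_honeypot: bool = False


class VisitTracker:
    """Per-IP counters for page views, video plays and honeypot hits."""

    def __init__(self):
        self.stats = defaultdict(Stats)

    def record_page(self, ip, path):
        entry = self.stats[ip]
        entry.pages += 1
        if path == HONEYPOT_PATH:
            entry.hit_honeypot = True

    def record_play(self, ip):
        self.stats[ip].plays += 1

    def suspects(self):
        """IPs that touched the honeypot, or browsed a lot without one play."""
        return sorted(
            ip
            for ip, s in self.stats.items()
            if s.hit_honeypot or (s.pages >= MIN_PAGES and s.plays == 0)
        )


if __name__ == "__main__":
    tracker = VisitTracker()
    for i in range(30):
        tracker.record_page("198.51.100.7", f"/video/{i}")
    tracker.record_page("192.0.2.44", HONEYPOT_PATH)
    print(tracker.suspects())
```

Verified search engine crawlers would need to be excluded before anything on the suspect list is blocked, since they also browse many pages without playing anything.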

AZAnon · Oct 24, 2024

This is exactly the kind of information I was looking for.

I wonder to what extent Cloudflare and the various captcha systems actually let such DMCA bots through, especially since Cloudflare doesn't have them on its “Verified Bots” list, and DMCA bots can be even worse than typical scraping bots because they don't even try to hide or limit their activity.
I'm also puzzled by the overall effectiveness of Cloudflare, as its anti-bot protection depends heavily on the paid plan you use.

I suspect the free anti-bot protection is on the level of iptables.

The Pro version is more effective, but supposedly only some Business plans have machine-learning-based bot recognition, or at least that's what I read on some forum.
The approach from your topic is interesting, but in the long run it has more disadvantages than benefits.

Nowadays the loading speed of a page (and all its elements) is very important. In addition, I noticed an interesting thing about the Comeso bots.
Generally, if you can successfully block the Comeso bots (or those of other specific companies that have picked up on your site), that's half the battle, because they are responsible for most of the reports and spam.

I set up a Cloudflare rate limit for specific pages and found a lot of bots in the logs.

One of them was the Comeso bot. I don't have anything that definitively links that IPv6 address to them, but it matches what they've been reporting regularly so far.
In short, despite being blocked by the rate limit, the same IP kept checking further addresses (it iterated the number at the end of the URL).

It could not have been a real user, because a user would sooner go back than move on to the next page. In addition, it persistently checked further pages despite the continued blocking, each one a second after the previous.

This theory is further supported by the fact that after I introduced this restriction they started reporting pages without content.

These pages don't return a 404, but there is nothing on them except the page template (navigation, footer, etc.).
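The behavior described above (one IP stepping through numbered URLs about a second apart, and carrying on even while it is being rate limited) can be turned into a simple check. This is only a sketch under assumptions: the `Hit` record, the URL shape with a trailing number, and the two-second gap are illustrative, not taken from any real log format.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    ts: float    # request time in seconds
    path: str    # requested path, e.g. "/post/101"
    status: int  # HTTP status the server returned


# Assumes the numeric ID is the last path segment (hypothetical URL scheme).
ID_RE = re.compile(r"(\d+)/?$")


def persistence_after_block(hits, max_gap=2.0):
    """Count requests that step to the next numeric ID right after a 429.

    A person who hits a rate limit rarely tries ID+1 a second later, so a
    high count for a single IP is a strong hint of an iterating bot.
    """
    count = 0
    ordered = sorted(hits, key=lambda h: h.ts)
    for prev, cur in zip(ordered, ordered[1:]):
        a, b = ID_RE.search(prev.path), ID_RE.search(cur.path)
        if not (a and b):
            continue
        stepped = int(b.group(1)) == int(a.group(1)) + 1
        quick = cur.ts - prev.ts <= max_gap
        if prev.status == 429 and stepped and quick:
            count += 1
    return count


if __name__ == "__main__":
    bot = [Hit(float(i), f"/post/{100 + i}", 429) for i in range(6)]
    print(persistence_after_block(bot))
```

The function expects the hits of a single IP, so it would be called once per address after grouping the log.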

In short, I will have to create some kind of private list of DMCA bot IPs and cut off their access to the website completely, because the typical blockers just lead to more false-positive submissions.
That now leaves a few questions.

1. How much is it worth spending the $20-25 on the Cloudflare Pro plan? I know it won't give full protection, but how much will it increase effectiveness, and is it worth it?

2. Are there any ready-made solutions like “Project Honey Pot” that would help block such bots?

I'm also curious how effective the aforementioned “Project Honey Pot” can be at detecting and blocking such bots. Or do they also intentionally let such bots through?

3. I plan to use some more detailed logging (maybe Matomo?) where I would be able to highlight IP addresses that fit my DMCA bot pattern.

Here, too, maybe someone has tips or advice on what to use and how to increase effectiveness.
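Independently of any analytics tool, a private candidate list can be started from plain web server logs. The sketch below assumes the common "combined" access log layout and ranks IPs by how often they were answered with 429. The function name `rank_blocked_ips` and the `min_blocked` threshold are made up for illustration.

```python
import re
from collections import Counter

# Assumes combined log format:
# IP - - [time] "METHOD PATH PROTO" STATUS SIZE "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) ')


def rank_blocked_ips(lines, min_blocked=3):
    """Return (ip, blocked_count, total_count), most rate-limited IPs first."""
    total, blocked = Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ip, _method, _path, status = match.groups()
        total[ip] += 1
        if status == "429":
            blocked[ip] += 1
    return [
        (ip, n, total[ip])
        for ip, n in blocked.most_common()
        if n >= min_blocked
    ]


if __name__ == "__main__":
    sample = [
        '2001:db8::5 - - [24/Oct/2024:10:00:0%d +0000] '
        '"GET /post/10%d HTTP/1.1" 429 0 "-" "Mozilla/5.0"' % (i, i)
        for i in range(4)
    ]
    for row in rank_blocked_ips(sample):
        print(row)
```

The output is only a list of candidates to review by hand. Each address still needs checking against known crawler and CDN ranges before it goes on a blocklist.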

Hyperz · Oct 24, 2024

Honestly, Cloudflare is completely useless for this. I've been writing bots for over 15 years and I've never not been able to scrape sites that use them as bot protection. And their so-called "AI-based" bot protection is pure marketing BS. It's basically just a blacklist of known "bad" TLS fingerprints and simple stuff like checking the request headers and checking if the requested path matches the other traffic. Like if you request "/movies/latest/" but the actual path is "/movies/latest", it'll show you a page that says your IP has been blocked. But that's only for that one request. It doesn't actually block anything site-wide.

Regarding the DDoC thing, I mentioned it as an example of how you can "attack" known DMCA bots rather than using it as a site-wide solution to protect/hide actual content from them. For example you mentioned you're showing some known bots a blank template. Why not serve them a page designed to put as much CPU/memory pressure on it as possible instead of letting them off the hook? Maybe even try engineering some JS that reliably freezes a Chromium/V8 process. Think outside of the box.

AZAnon · Oct 24, 2024

I know that Cloudflare is not suitable for catching DMCA bots and scrapers, but I'm curious about your opinion on whether a Pro plan there is worth buying in general.

For bots and scrapers, I'm considering implementing Project Honey Pot.

Besides, maybe Matomo has some extensions and the ability to filter logs to catch such bots, so I can build a private blacklist of IPs that most likely belong to DMCA bots.
As for the blank page:

It wasn't exactly created for catching bots. It's more of a template for a page where content is supposed to appear, but currently you can only reach it by typing in the link directly (iterating through the numbers), as nothing links to it on the site at the moment.

But after your suggestion, I think it might be worth trying to use it against bots.

In any case, do you have any information or personal opinion on the use of Project Honey Pot and Matomo for this purpose?

Perhaps any other solutions or approaches worth checking out besides what has been mentioned so far?

Hyperz · Oct 25, 2024

AZAnon said:
In any case, do you have any information or personal opinion on the use of Project Honey Pot and Matomo for this purpose?

I've never used them so I can't comment on them. Same for the paid Cloudflare plans. I haven't run a site in forever. But just thinking out loud: if something like Project Honey Pot also targeted DMCA bots, copyright enforcement agencies could frame that as being complicit in harboring/protecting piracy. So for that reason alone I'd be surprised if it would help you block DMCA bots from the bigger players.

AZAnon said:
Perhaps any other solutions or approaches worth checking out besides what has been mentioned so far?

You should have a solid starting point to start going after them with what's been said, provided you know how to code it. It might also be worth it to create your own simple bot that behaves like these DMCA bots and use it on your site to see if you can reliably catch your own bot. And if you can, go back to the bot and see if you can improve it to the point where it's undetected again or can bypass your solution. Then repeat this process until you have a good understanding of both how these kinds of bots work and how to go after them. Virtual war games, basically. You kinda need to know how to bot to know how to fight them properly. Same as if you want to really understand security you need to understand how to hack. If you want to do software DRM properly you need to know how to crack software. Same concept.
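One round of that war-games loop can be sketched offline before touching a live site. The toy below is an assumption-heavy illustration, not a model of any real DMCA bot: `simulate_bot` only produces request timestamps, and `too_regular` is a deliberately naive detector that flags traffic whose gaps between requests barely vary.

```python
import random
import statistics


def simulate_bot(n, base_gap=1.0, jitter=0.0, seed=0):
    """Return n request timestamps for a hypothetical bot.

    jitter adds a random extra delay of up to that many seconds per request.
    """
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        out.append(t)
        t += base_gap + rng.uniform(0.0, jitter)
    return out


def too_regular(timestamps, max_stdev=0.2):
    """Toy detector: flag traces whose gaps between requests barely vary."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return len(gaps) >= 2 and statistics.pstdev(gaps) <= max_stdev


if __name__ == "__main__":
    # Round 1: a fixed one-second crawl gets caught.
    print(too_regular(simulate_bot(20)))
    # Round 2: the "improved" bot adds jitter and slips past this detector,
    # which is the cue to strengthen the detector and repeat.
    print(too_regular(simulate_bot(20, jitter=5.0)))
```

The point of the exercise is the loop itself: every time the simulated bot evades the check, the detector needs another signal (honeypot hits, missing plays, ID iteration) and the bot gets another round.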
