Yeah, just looking at HTTPD logs isn't going to help all that much with that. Too much noise, not enough signal. The best starting point is probably the assumption that the bots used by the bigger copyright enforcement agencies are simple, generic ones by default. Meaning they don't tune them or write a whole new bot per site, and they make no effort to simulate a real human interacting with a site. Another safe assumption is that off-the-shelf anti-bot solutions like captchas or Cloudflare/DDoS-Guard challenges aren't an issue for them, either because they have been whitelisted by those services or because the challenges are cheap and easy to bypass (which is the case). And finally, they almost certainly render the page (JS).
So with that in mind you could do some basic user interaction tracking. Check where the clicks are happening. Check if there is mouse movement for desktop User-Agents. Implement a simple custom captcha-like system that a generic bot can't pass without adding extra code for your special snowflake solution. Play with invisible HTML honeypot links (but make sure not to serve them to normal crawlers or otherwise get them indexed). I think a lot of these bots don't even bother checking the actual video content, so try tracking IPs that visit a lot of pages but never watch anything (again, be careful not to shoot legit crawlers). And my personal favorite: get creative with the fact that they will happily render any JS you throw at them, so you can feed the bots some unlubed dildos.
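As a rough illustration of the "visits a lot of pages but never watches anything" idea, here is a minimal Python sketch. It assumes you can already get (IP, path, User-Agent) tuples out of your logs. The `/video/` and `/media/` prefixes, the crawler tokens, and the 50-page threshold are all made up for the example, so swap in whatever matches your site. The User-Agent check is only a crude stand-in for properly verifying legit crawlers.

```python
from collections import defaultdict

# Hypothetical path conventions; adjust to your own site layout.
PAGE_PREFIX = "/video/"    # HTML pages that embed a player
MEDIA_PREFIX = "/media/"   # actual video/stream requests
# Crude allowlist by User-Agent substring (an assumption for this sketch).
CRAWLER_TOKENS = ("googlebot", "bingbot")


def pages_without_playback(requests, min_pages=50):
    """Return IPs that opened at least `min_pages` distinct video pages
    but never requested any media.

    `requests` is an iterable of (ip, path, user_agent) tuples.
    """
    pages = defaultdict(set)   # ip -> distinct video pages opened
    watched = set()            # IPs that fetched at least one media file
    crawlers = set()           # IPs that identified as an allowlisted crawler
    for ip, path, user_agent in requests:
        if any(token in user_agent.lower() for token in CRAWLER_TOKENS):
            crawlers.add(ip)
        if path.startswith(MEDIA_PREFIX):
            watched.add(ip)
        elif path.startswith(PAGE_PREFIX):
            pages[ip].add(path)
    return sorted(
        ip for ip, seen in pages.items()
        if len(seen) >= min_pages and ip not in watched and ip not in crawlers
    )
```

Treat the result as a list of suspects to look at, not as an automatic block list. Someone opening lots of tabs with autoplay off would show up here too.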
Just a few ideas. If you're dedicated and creative enough, it's probably not that hard to filter out the bulk of DMCA bots. You'll never get rid of all of them though, and keeping them out will always be a cat and mouse game.
This is exactly the kind of information I was looking for.
I wonder to what extent Cloudflare and captcha systems (and which ones) actually let such DMCA bots through. Especially since Cloudflare doesn't have them on its “Verified Bots” list, and DMCA bots can be even worse than typical scraping bots, because they don't even try to hide or limit their actions.
I'm also puzzled by the overall effectiveness of Cloudflare, as their anti-bot protection depends heavily on the paid plan you use.
I suspect the free anti-bot protection is on the level of iptables.
The Pro version is more effective, but supposedly only some Business plans have machine-learning-type solutions for bot recognition, or at least that's what I read on some forum.
The approach from your topic is interesting, but it probably has more disadvantages than benefits in the long run.
Nowadays the loading speed of a page (and all its elements) is very important. In addition, I noticed an interesting thing about the Comeso bots.
Generally, if you could successfully block the Comeso bots (or other specific companies that have picked up on your site), that's half the success, because they are responsible for most of the reports and spam.
I set up Cloudflare rate limiting for specific pages and found a lot of bots in the logs.
One of them was the Comeso bot. I don't have anything that links that IPv6 address to them exactly, but it matches what they've regularly reported so far.
In short, despite being blocked by the rate limit, the same IP kept requesting further URLs (it incremented the number at the end of the URL).
It could not have been a real user, because a user would sooner go back than move on to the next page. On top of that, it persistently requested further pages despite the continued blocking, each one a second after the last.
This theory is further supported by the fact that after I introduced this restriction, they started reporting pages without content.
Those pages don't return a 404, but there is nothing on them except the page template (navigation, footer, etc.).
In short, I will have to build some kind of private list of DMCA bot IPs and cut off their access to the website completely, because the typical blockers just give them a reason to submit more false-positive reports.
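The pattern described above (one IP walking the number at the end of the URL and carrying on through rate-limit blocks) could be pulled out of the logs with something like the following Python sketch. It assumes chronologically ordered (IP, path, status) tuples and that the rate limit answers with HTTP 429. The function name and both thresholds are made up for illustration, and it ignores the one-second timing entirely.

```python
import re

# Number at the very end of the path, e.g. /page/123 or /page/123/
TRAILING_ID = re.compile(r"(\d+)/?$")


def sequential_scanners(entries, min_run=5, min_blocked=3):
    """Return IPs that requested at least `min_run` consecutive numeric IDs
    and received at least `min_blocked` 429 responses.

    `entries` is an iterable of (ip, path, status) tuples in time order.
    """
    last_id = {}    # ip -> last numeric ID requested
    run = {}        # ip -> length of the current consecutive run
    best_run = {}   # ip -> longest consecutive run seen
    blocked = {}    # ip -> number of 429 responses
    for ip, path, status in entries:
        if status == 429:
            blocked[ip] = blocked.get(ip, 0) + 1
        match = TRAILING_ID.search(path)
        if not match:
            continue
        current = int(match.group(1))
        if ip in last_id and current == last_id[ip] + 1:
            run[ip] += 1
        else:
            run[ip] = 1
        last_id[ip] = current
        best_run[ip] = max(best_run.get(ip, 0), run[ip])
    return sorted(
        ip for ip, longest in best_run.items()
        if longest >= min_run and blocked.get(ip, 0) >= min_blocked
    )
```

The output would only be a starting point for the private list. It's worth checking each hit against what actually gets reported before blocking anything.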
That leaves a few questions.
1. Is it worth spending the $20-25 on the Cloudflare Pro plan? I know it won't give full protection, but how much will it increase effectiveness?
2. Are there any pre-made solutions like “Project Honey Pot” that would help block such bots?
I'm also curious how effective the aforementioned “Project Honey Pot” can be at detecting and blocking such bots. Or does it also intentionally let such bots through?
3. I plan to use some more detailed logs (maybe Matomo?) where I could highlight IP addresses that fit my DMCA bot pattern.
Here, too, maybe someone has tips or advice on what to use and how to increase effectiveness.