For better or worse, we live in a world where data is power. This has been our reality for years, but the recent emergence of mainstream AI has made it more publicly apparent, as companies absorb as much data as they can to train their LLMs.
Debates have already been had about whether training these LLMs on your data is ethical. Still, I wanted to dive deeper and examine whether the tools that AI companies and many others use to collect information from the internet are themselves ethical.
I’ve built most of the foundational parts of my early career and knowledge by doing a lot of web scraping for contracts, personal projects, and internships. A bit about me: my most successful project by far is my TikTokAPI Python web scraper, with 1.5M+ downloads. When I was learning web scraping, there was no good single resource, so I ended up making everything-web-scraping to help others learn. All of this helped me land many web-scraping-related jobs at companies whose business models rely heavily on scraped data; check out my career page to learn more if you’re interested.
This is all to say I’ve been pretty entrenched in the web scraping world for over 5 years, even if I was primarily a student for most of that time.
My goal with this blog is to share what I’ve seen in this industry and my own opinions on web scraping after all these years. I’d also love to hear other people’s opinions on the industry.
First, we need to define what web scraping is. The best working definition I have for it is
Web scraping refers to extracting data that you don’t have control over from a third-party website via manual or automated methods.
For example, say you wanted to track the price of an Amazon product. You could use web scraping to programmatically extract the item’s current price, then send yourself a text notification if it’s below a certain threshold, or do anything else you wanted with that data.
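To make the price-tracking idea concrete, here is a minimal sketch. The HTML markup, element `id`, and price threshold are all hypothetical stand-ins; a real product page uses different (and frequently changing) structure, and a real tracker would fetch the page over HTTP rather than use an inline sample.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a fetched product page.
SAMPLE_PAGE = """
<html><body>
  <span id="price">$27.99</span>
</body></html>
"""

class PriceParser(HTMLParser):
    """Captures the text inside the first element whose id is 'price'."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price_text = None

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "price":
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price_text is None:
            self.price_text = data.strip()
            self.in_price = False

def extract_price(html: str) -> float:
    parser = PriceParser()
    parser.feed(html)
    return float(parser.price_text.lstrip("$"))

price = extract_price(SAMPLE_PAGE)
if price < 30.00:  # hypothetical alert threshold
    print(f"Price dropped to ${price:.2f}")  # swap in a text/email alert here
```

In practice you would fetch the live page on a schedule (and libraries like BeautifulSoup make the parsing step far less tedious), but the core loop is exactly this: fetch, extract, compare, notify.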
Seems pretty simple, right? It’s all based on the idea that anything you can see in your web browser, you can extract data from. Your browser has to get the data from somewhere to render it, and you can inspect all the files, data, and requests your browser makes to the server responsible for showing you a website’s content.
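This is the practical upshot: any request you find in your browser’s DevTools Network tab can be reproduced in code. A minimal sketch, with a placeholder URL and headers (real sites need whatever headers and cookies the browser actually sent):

```python
import urllib.request

# Hypothetical: after inspecting the Network tab, rebuild the same request.
req = urllib.request.Request(
    "https://example.com/api/products/123",  # placeholder endpoint
    headers={
        "User-Agent": "Mozilla/5.0",   # many sites reject requests without one
        "Accept": "application/json",
    },
)

# urllib.request.urlopen(req) would then fetch exactly what the browser receives.
print(req.full_url)
```

The server can’t fundamentally distinguish this request from one made by a browser, which is why anti-scraping measures are always a cat-and-mouse game rather than a hard wall.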
However, it does get fairly complex. As I mentioned earlier, data is power, and companies want to protect their data because it’s often what makes them money. What would happen to a social media site like Facebook or Twitter if all of its content and funny jokes were accessible from a competitor’s site? It would lose some of the competitive advantage that keeps users on its platform. To combat this, companies try to stop web scrapers through numerous methods, but there is no fully effective defense. Since the content has to be shown to end users, it can always be scraped; countermeasures can make scraping more difficult and more expensive, but never impossible.
Let’s first dive into some of the ethical parts of web scraping and the good it can do in the world.
One of the most ethical uses of web scraping is by researchers, either to collect data on trends happening online or to use what recommendation algorithms are suggesting to infer information about the underlying models themselves.
Most of the time, the large companies running algorithmic recommendation systems do not give researchers direct access, so researchers have to resort to web scraping to understand what these systems are doing and how they can influence the world around them. Often these systems are so complex that even their creators struggle to explain why they recommend what they do, and their effects on users may not be fully understood even internally.
One investigative firm, tracking.exposed, has used web scraping to collect data for reports on YouTube’s algorithm and political polarization, TikTok shadow bans around the Russian invasion of Ukraine, Facebook Ads in political elections, and much more.
Researchers from Yale, Northwestern, the United Nations (UNESCO), and more have even used my own TikTokAPI to investigate how TikTok shapes our identities and the environment we live in.
A critical part of researching these topics is using the exact same systems that real users interact with. As some of these researchers have told me, the API endpoints that companies develop specifically for researchers can return different results than requests made against the actual production site, potentially influencing a paper’s conclusions. I don’t have enough information to say confidently whether these differences are intentionally crafted to mislead researchers or are just slight discrepancies from the live endpoints. Either way, because research API data may be poisoned, additional data has to be gathered by web scraping the live endpoints.
Imagine this: you want to start your own music label and stay competitive with existing labels, so you want to make data-guided decisions about who to sign. What data sources might matter to you? Likely YouTube, Spotify, Instagram, Facebook, etc. If you’re just starting out, you have absolutely no leverage to ask these companies for data from their platforms, and likely no connections at those companies to help you out.
I believe that web scraping increases competition, especially in highly competitive, data-driven industries. Otherwise, large companies with the leverage to ask for this information would out-bid and out-compete entering businesses, and data access would become yet another barrier to entry.
However, let’s be honest: there are a lot of bad actors in the web scraping industry, with shady and, in my opinion, outright unethical business practices.
One valid concern about web scraping is that the data you collect might be about a specific individual. Since the right to be forgotten is gaining popularity and is already codified into law in some places, such as under the GDPR, this gets tricky. How can you, as a citizen, exercise your right to be forgotten when, even if you ask Facebook to remove your data, someone else may have already downloaded it and have it sitting somewhere? The answer is that you can’t.
Sites like the Internet Archive have an entire mission to download a wide variety of sites and store the data for anyone to look back at. To their credit, they do have a removal request form that lets you remove specific URLs, like an account you might own. But plenty of other sites won’t do that, and it’s incredibly inconvenient to track down all of these mirror sites and email them; it would be far easier to fill out forms only on sites where you explicitly signed up for an account. The number of times I’ve searched for something and found dozens of sites directly mirroring data from GitHub and Reddit is absurd, and finding your data on them all would take ages.
And this only covers people who decided to publish their scraped data as a mirror website, which is just a small fraction of why people web scrape. Most web-scraped information is sitting on someone’s hard drive or in a database somewhere!
How do you track down this information? You simply can’t. It’s impossible to ensure that all of your information has been deleted from everywhere, because websites can’t tell whether your data was web scraped or just viewed by an actual user. I know it gets said all the time, but even with additional data regulation, anything you put on the internet is truly out there forever. You can never know whether what you deleted is really gone.
As we become more aware of different scamming tactics, scammers switch to more effective, personalized techniques, and with web scraping this kind of personalization can be automated. Imagine you have a list of emails or full names of targets you want to scam. You could set up web scraping tools to find each victim’s social media accounts, check for people they follow with the same last name, and impersonate that person.
Obviously, all of this can be done manually, but automation can increase the effectiveness of these operations.
This is probably where web scraping is at its worst.
Data brokers aggregate, enrich, analyze, and then sell data.
Data brokers don’t just leverage web scraping; they often pay for data about your search results, the addresses you’ve lived at, your purchase history, etc. Once they have this information, they bundle your data into “profiles” of groups of people and sell it, claiming it’s anonymized. In practice, this data can often be de-anonymized fairly easily.
I could rant about this for quite a while, but I’ll try to keep it brief. They often use web scraping to pull information from social media profiles, like your name, emails, and job title. They then combine this with other sources they’ve either scraped or purchased, bundle it up, and sell it.
You might think this information doesn’t affect you, but US government entities including the NSA, ICE, the IRS, and the DEA, and likely more, buy data from data brokers.
What happens if these data brokers get things wrong? What if they draw the wrong conclusion about who you are based on the data they have? It might affect your credit score, health decisions, insurance pricing, and more.
I’ll stop here, but there’s a lot of material online about data brokers, as this topic has been gaining traction in the last few years. Here are a few interesting links:
The FTC has relatively recently started putting more pressure on data brokers. Still, I’m sure we’ve all seen the YouTube ads for tools that will, for a fee, remove you from data brokers’ lists; it seems a bit absurd to me that an entire industry has popped up around removing your own personal data.
Web scraping has some amazing benefits for the world, especially in letting researchers better understand the systems we interact with daily and how they shape the world around us. However, it has insidious downsides.
We need more clarity from regulators on web scraping, which has remained in a legal gray area for years. First, regulators should give a clear answer on the legality of web scraping and the boundaries of what types of information can be collected. Then they can regulate what can be done with web-scraped data.
These regulations will not stop people from using scraping for scams, or from simply downloading data onto their computers and thereby undermining your right to be forgotten. However, they would help prevent institutional web scrapers like data brokers from selling your data without your consent. Still, I would caution against making these regulations too restrictive: if the barrier to scraping legally is too high, it would hurt new companies trying to enter highly competitive, data-driven industries.