How web scraping can be a valuable data source
Web scraping offers another opportunity to gather public data relevant to your business and can work as an adjunct to other data sources.
Web scraping. It sounds like hard work, but it is more clever than arduous.
The technique exploits a simple truth: The front end of the web site, which you see, must talk to the back end to retrieve data and display it. A web crawler or bot can gather this information, and further work can organize it for analysis.
Digital marketers are forever seeking data to get a better sense of consumer preference and market trends. Web scraping is yet one more tool towards that end.
First crawl, then scrape
“In general, all web scraping programs accomplish the same two tasks: 1) loading data and 2) parsing data. Depending on the site, the first or second part can be more difficult or complex,” explained Ed McLaughlin, partner at Marquee Data, a web scraping services firm.
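The two tasks McLaughlin describes can be sketched with nothing but Python's standard library. This is a minimal illustration under stated assumptions, not any firm's actual tooling: the helper names `load_page` and `parse_links`, the user-agent string, and the sample HTML are all invented for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def load_page(url, timeout=10):
    """Task 1, loading: fetch the raw HTML for a URL (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Task 2, parsing: collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input; a real run would call load_page() on a live URL instead.
SAMPLE_HTML = '<ul><li><a href="/a">Alpha</a></li><li><a href="/b">Beta</a></li></ul>'
links = parse_links(SAMPLE_HTML)
```

On a static page like the sample, loading is trivial and parsing is the real work. On a JavaScript-heavy site the balance flips, which is where the headless browsers mentioned later come in.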
Web scraping bears some resemblance to an earlier technique: web crawling. Back in the 1990s, when the internet occupied less cyberspace, web crawling bots compiled lists of web sites. The technique is still used by Google to scrape for keywords to power its search engine, noted Himanshu Dhameliya, sales director at process automation and web scraping company Rentech Digital.
For Rentech, web scraping is just obtaining “structured data from a mix of different sources,” Dhameliya said. “We scrape news web sites, financial data, and location reports.”
“Web scraping data is collected on a smaller scale,” said George Tskaroveli, project manager at web scrapers Datamam, “still amounting to millions of data points, but also collecting on a daily or more frequent basis.”
“The defining features of modern web scraping are headless browsers, residential proxies, and the use of scalable cloud platforms,” said Ondra Urban, COO at scraping and data extraction firm Apify. “With a headless browser, you can create scrapers that behave exactly like humans, open any website and extract any data… [M]odern cloud platforms like AWS, GCP, or Apify allow you to instantly start hundreds or thousands of scrapers, based on the current demand for data.”
Which party data? And how to get it
There is a spectrum of data gathering, ranging from zero-party to third-party data, that marketers are forever picking through for the next insight. So where does web scraping fit into this continuum?
“Web scraped data is most closely related to third-party data,” said McLaughlin, as marketers can then join this data with existing data sets. “Web scraping can also provide a unique data source that’s not heavily used by competitors as may be the case with purchased lists,” he said.
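Joining scraped data with an existing data set can be as simple as matching rows on a shared key. The sketch below assumes both sets are lists of dictionaries keyed on a `domain` field; that field, the function name `join_on_key`, and the sample records are hypothetical.

```python
def join_on_key(existing_rows, scraped_rows, key):
    """Left join: keep every existing row, adding scraped fields where the key matches."""
    scraped_by_key = {row[key]: row for row in scraped_rows}
    joined = []
    for row in existing_rows:
        extra = scraped_by_key.get(row[key], {})
        # Existing values win if both sources define the same field.
        joined.append({**extra, **row})
    return joined


# Hypothetical first-party records and scraped records.
existing = [
    {"domain": "example.com", "segment": "retail"},
    {"domain": "example.org", "segment": "media"},
]
scraped = [{"domain": "example.com", "listed_products": 120}]

enriched = join_on_key(existing, scraped, "domain")
```

Letting the existing record win on conflicts is a design choice for this sketch; a team that trusts the scraped value more would reverse the merge order.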
“Ninety-five percent of the work we do is third-party [data],” said Dhameliya. Scraping aims for the data trafficked between the front-end and back-end of the web site. That may require an API crafted to tap this data stream, or using JavaScript with a Selenium driver, he explained.
Most of Rentech’s work is for enterprises seeking marketing intelligence and analysis. Bots are tasked with periodic visits to web sites, sometimes seeking product information, Dhameliya said. Some web sites limit the number of queries coming from a single source. To get around that, Rentech will use AWS Lambda to execute a bot that launches queries from multiple machines, Dhameliya explained.
It is not humanly possible to go through all the data to weed out “nulls and dupes,” Tskaroveli said. “Many clients collect data with their own devices or use freelancers. It’s a huge problem, not receiving clean data,” he said. Datamam relies on its own in-built algorithms to go through the “rows and columns,” automating quality assurance.
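Datamam's algorithms are not described, so the following is only a generic sketch of what weeding out “nulls and dupes” can look like in Python. The function name `clean_rows`, the choice of required fields, and the sample rows are assumptions made for illustration.

```python
def clean_rows(rows, required_fields):
    """Drop rows with missing required values, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Trim stray whitespace so "Acme " and "Acme" compare as equal.
        normalized = {
            field: value.strip() if isinstance(value, str) else value
            for field, value in row.items()
        }
        # "Nulls": skip rows where a required field is absent or empty.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # "Dupes": fingerprint the row (assumes values are hashable scalars).
        fingerprint = tuple(sorted(normalized.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


# Hypothetical scraped rows containing one duplicate and two incomplete records.
raw_rows = [
    {"name": "Acme ", "email": "info@acme.example"},
    {"name": "Acme", "email": "info@acme.example"},
    {"name": "", "email": "x@none.example"},
    {"name": "Beta", "email": None},
    {"name": "Gamma", "email": "hi@gamma.example"},
]
clean = clean_rows(raw_rows, required_fields=["name", "email"])
```

Exact-match deduplication is the simplest case. Real pipelines often also need fuzzy matching, for example treating “Acme Inc.” and “Acme, Inc” as one company, which this sketch does not attempt.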
“We write custom Python scripts to scrape websites. Usually, each one is customized to handle a specific website, and we can provide custom inputs, if needed,” said McLaughlin. “We do not use any AI or machine learning to automate the production of these scripts, but that technology could be used in the future.”
“Any data that can be manually copied and pasted can be automatically scraped,” McLaughlin added. “[I]f you find a website with a directory of a list of potential leads, web scraping can be used to easily convert that website into a spreadsheet of leads that can then be used for downstream marketing processes.”
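To make the directory-to-spreadsheet idea concrete, here is a minimal sketch that turns an HTML table into CSV text using only Python's standard library. It assumes the directory is published as a plain `<table>`; the names `TableParser` and `directory_to_csv` and the sample listing are invented for this example, and real directories usually need site-specific handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Ignore whitespace and text that sits outside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def directory_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical directory listing.
DIRECTORY_HTML = """
<table>
  <tr><th>Company</th><th>City</th></tr>
  <tr><td>Acme, Inc.</td><td>Austin</td></tr>
  <tr><td>Beta LLC</td><td>Boston</td></tr>
</table>
"""
csv_text = directory_to_csv(DIRECTORY_HTML)
```

The resulting text can be written to a `.csv` file and opened in any spreadsheet program. The `csv` module quotes fields such as “Acme, Inc.” so the embedded comma does not split the column.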
“Social media are a different beast. Their web and mobile applications are extremely complex, with hundreds of APIs and dynamic structures, and they also change very often thanks to regular updates and A/B tests,” Urban said. “[U]nless you can train and support a large in-house team, the best way to do it is to buy it as a service from experienced developers.”
“If [the client] is in ecommerce, you might get away with an AI-powered product scraper. You risk a lower quality of data, but you can easily deploy it over hundreds or thousands of websites,” Urban added.
(Once market data is flowing in, it needs to be managed. That’s discussed in depth here.)
Scrape the web, but use some common sense
There are limits, and opportunities, that come with web scraping. Just be aware that privacy considerations must temper the query. Web scraping is a selective, not a collective, dragnet.
Data privacy is one of those limits. “Never collect the opinions or political views or information about families, or personal data,” said Dhameliya. Evaluate the legal risk before scraping. Do not collect any data that is legally risky.
It’s important to understand that web scraping isn’t, and for legal reasons shouldn’t be, about collecting personally identifiable information. Indeed, web scraping of any data has been controversial, but has largely survived legal scrutiny, not least because it’s hard to draw a legal distinction between web browsers and web scrapers, both of which request data from websites and do things with it. This has been litigated recently.
Facebook, Instagram and LinkedIn do have rules governing which data can be scraped and which data is off-limits, Dhameliya said. For example, individual Facebook and Instagram accounts that are closed are private and therefore off-limits. Anything that feeds data to the public world is fair game, he added: the New York Times, Twitter, any space where users can post commentary or reviews.
“We don’t provide legal advice, so we encourage our clients to seek counsel on legal considerations in their jurisdiction,” McLaughlin said.
Web scraping is still a useful adjunct to other forms of data gathering.
For Datamam clients, web scraping is a form of lead generation, Tskaroveli said. It can generate new leads from multiple sources or can be used for data enrichment to allow marketers to gain a better understanding of their clients, he noted.
Another target for web-scraping bots is influencer marketing campaigns, noted Dhameliya. Here the goal is identifying influencers who fit the marketer’s profile.
“Start slow and add data sources incrementally. Even with our enterprise customers, we’re seeing huge enthusiasm to start with web scraping, as if it were some magic bullet, only to discontinue a portion of the scrapers later because they realize they never needed the data,” Urban said. “Start monitoring one competitor, and if it works for you, add a second one. Or start with influencers on Instagram and add TikTok later in the process. Treat the web scraped data diligently, like any other data source, and it will give you a competitive edge for sure.”
Opinions expressed in this article are those of the guest author and not necessarily MarTech.