A Winning Combination: Colly and Chrome Impersonation for Effective Web Scraping

Okan Özşahin
4 min read · Oct 25, 2023

Web scraping involves extracting data from websites, and it can be challenging due to factors like access restrictions and the need to mimic human-like behavior when making requests. Colly is a popular Go web scraping framework that provides a structured and organized way to navigate and extract data from web pages. It simplifies the scraping process by letting us define rules for how to interact with a webpage's HTML, making it easier to target specific data: we can set up event handlers to process elements, follow links, and more.

Many websites implement security measures to block or restrict access by automated bots, including scraping tools. By impersonating Chrome headers, we make our scraping requests look more like those of a legitimate web browser, which helps bypass access restrictions or blocks imposed by the web server; websites are far less likely to block requests that mimic the behavior of a common browser.

By combining Colly's structured scraping capabilities with Chrome impersonation from a library like req, we get the best of both worlds: Colly provides a framework for defining what data we want to scrape and how to navigate web pages, while the impersonated Chrome headers make our requests look like ordinary browser traffic and reduce the chances of getting blocked.
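
To make the Colly side concrete, here is a minimal sketch of a plain collector with no impersonation at all (the selectors and URL are placeholders, and the usual fmt, log, and strings imports are assumed). On protected sites, this is exactly the kind of request that tends to get blocked:

c := colly.NewCollector()

// Print the text of every <h1> element on the page.
c.OnHTML("h1", func(e *colly.HTMLElement) {
	fmt.Println(strings.TrimSpace(e.Text))
})

// Follow every link discovered on the page.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Request.AbsoluteURL(e.Attr("href")))
})

if err := c.Visit("https://example.com"); err != nil {
	log.Println(err)
}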

When you receive a 403 Forbidden error during web scraping, it means the web server has denied your access. Web servers often return this status to restrict or block bots and other automated scraping processes. In this situation, it's important to adjust the behavior of your HTTP client so the server doesn't reject the request. A configuration like req.DefaultClient().ImpersonateChrome() from github.com/imroc/req does exactly that: it makes your scraping requests look like they come from a real browser, reducing the likelihood of rejection and increasing the success rate of your scraping process.
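
Before wiring this into Colly, you can sanity-check the impersonated client on its own. This is just a sketch: the URL is a placeholder and the standard library logger is assumed:

// Send a single request with the impersonated client and inspect the status.
resp, err := req.DefaultClient().ImpersonateChrome().R().Get("https://example.com")
if err != nil {
	log.Fatal(err)
}
log.Println("status:", resp.StatusCode) // 200 is a good sign; 403 means we are still being blocked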

import (
	"net/http"

	"github.com/gocolly/colly/v2"
	"github.com/imroc/req/v3"
)

// Build an HTTP client that impersonates the Chrome browser.
fakeChrome := req.DefaultClient().ImpersonateChrome()

// Create a Colly collector that reuses Chrome's User-Agent.
// maxScrapeDepth is defined elsewhere in the program.
c := colly.NewCollector(
	colly.MaxDepth(maxScrapeDepth),
	colly.UserAgent(fakeChrome.Headers.Get("user-agent")),
)

// Route the collector's requests through the impersonated transport.
c.SetClient(&http.Client{
	Transport: fakeChrome.Transport,
})

In our code snippet, we create a fake Chrome client using req.DefaultClient().ImpersonateChrome(). This sets up an HTTP client configuration that mimics the behavior of the Chrome web browser, which can be helpful in bypassing restrictions or access blocks imposed by web servers. The fakeChrome client is configured to use the User-Agent header that's typical of a Chrome browser.

Next, we initialize a Colly collector c with certain settings. We specify the maximum depth for the scraping operation using colly.MaxDepth(maxScrapeDepth). The maximum depth controls how deep into a website's structure the scraper will navigate. We also set the User-Agent of the collector to the User-Agent string obtained from the fakeChrome client. This is essential for making our scraping requests appear as though they are coming from a Chrome browser.
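
The depth limit only takes effect when the collector actually follows links, which happens through a link-following callback. Here is a hypothetical one (the a[href] selector simply matches every anchor tag on the page):

// Follow each discovered link; Colly stops descending once MaxDepth is reached.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Request.AbsoluteURL(e.Attr("href")))
})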

Finally, we configure the HTTP client used by the Colly collector to use the transport and settings from the fakeChrome client. This ensures that the Colly collector uses the Chrome-like settings and headers when sending HTTP requests.
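
Note that only the User-Agent is copied into the collector's options; the transport takes care of the connection-level behavior. If you also want the other Chrome-style request headers the impersonated client carries (accept-language, sec-ch-ua, and so on), one option, sketched here as an extension rather than part of the original setup, is to copy them onto each outgoing request, assuming they are exposed on fakeChrome.Headers just like the User-Agent:

c.OnRequest(func(r *colly.Request) {
	// Copy the impersonated Chrome headers onto the request,
	// without overwriting anything Colly has already set.
	for key, values := range fakeChrome.Headers {
		if len(values) > 0 && r.Headers.Get(key) == "" {
			r.Headers.Set(key, values[0])
		}
	}
})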

c.OnHTML(".info-box-number", func(e *colly.HTMLElement) {
text := strings.TrimSpace(e.Text)
fmt.Println(text)
// or do some actions
})

c.OnRequest(func(r *colly.Request) {
log.Println("Visiting", r.URL)
})

err = c.Visit("https://example.com")
if err != nil {
return res, fmt.Errorf("failed to visit page https://example.com: %v", err)
}

This snippet continues the previous code: it registers the Colly callbacks that drive the scrape and then kicks it off. Let's go through it step by step:

  1. c.OnHTML(".info-box-number", func(e *colly.HTMLElement) { ... }): This part of the code defines an event handler using Colly. It specifies that when the HTML of the web page being scraped contains an element with the class "info-box-number," the provided function will be executed. Inside this function, the element's text content is trimmed and printed, or you can perform any desired actions on it.
  2. c.OnRequest(func(r *colly.Request) { ... }): Another event handler is defined here. This one is triggered whenever a request is made. It logs the URL that is about to be visited. This can be useful for tracking and debugging purposes.
  3. err = c.Visit("https://example.com"): This line initiates the web scraping process by instructing the Colly collector (c) to visit the URL "https://example.com." The c.Visit method sends an HTTP GET request to the specified URL and triggers the scraping process.
  4. if err != nil { ... }: After the c.Visit call, the code checks whether the scraping process returned an error. If it did, the function returns an error message indicating the failure to visit the page, along with the specific error information (see the error-callback sketch just below for catching HTTP-level failures such as a 403).
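
One more handler worth registering before c.Visit is the error callback referenced above. This minimal sketch logs the HTTP status code so you can tell whether a request was rejected with a 403 or failed for some other reason:

// Log the failing URL, its status code (e.g. 403) and the underlying error.
c.OnError(func(r *colly.Response, err error) {
	log.Println("Request to", r.Request.URL, "failed with status", r.StatusCode, ":", err)
})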

By combining Colly’s scraping capabilities with the ability to impersonate Chrome using req.DefaultClient().ImpersonateChrome(), we can create a powerful scraping setup that is more likely to bypass certain access restrictions, making it a valuable approach for web scraping tasks. However, as always, it's important to be mindful of the website's terms of use and respect any scraping policies or restrictions they have in place.
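
On that note, Colly ships with a built-in rate limiter that helps keep a scraper polite. A minimal sketch, assuming the delay values suit your target site and that "time" is imported:

// Space requests out, with a little random jitter, so we don't hammer the server.
err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Delay:       2 * time.Second,
	RandomDelay: 1 * time.Second,
})
if err != nil {
	log.Println("failed to set limit rule:", err)
}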

Okan Özşahin

Backend Developer at hop | Civil Engineer | MS Computer Engineering