Colly Web Scraping in Golang: A Practical Tutorial for Beginners

Okan Özşahin
4 min read · Oct 1, 2023


(Image source: https://webscraping.ai/blog/web-scraping-for-machine-learning)

Web scraping is a technique for extracting data from websites, and it can be a powerful tool for purposes such as data analysis, research, and automation. Go (often called Golang), a popular programming language known for its efficiency and concurrency support, has a popular web scraping library called Colly. In this guide, we will explore how to use Colly to build web scrapers in Golang.

Colly is a highly customizable and user-friendly scraping library for Golang. It simplifies the process of making HTTP requests, parsing HTML documents, and extracting data from websites. Colly provides a range of features that enable developers to navigate web pages, select and filter elements using CSS selectors, and handle different types of data extraction tasks.

Below is a step-by-step guide on how to get started with web scraping using Colly in Golang:

Step 1: Install Go and Set Up a Go Project

First, make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org/dl/

After installation, set up a new Go project by creating a directory for your project and initializing it using go mod:

mkdir my-scraper
cd my-scraper
go mod init my-scraper

Step 2: Install the Colly Library

You can install Colly using the go get command:

go get github.com/gocolly/colly/v2

Step 3: Create Your Web Scraper

Now, let’s create a simple web scraper using Colly to scrape data from a website. In this example, we’ll scrape quotes from http://quotes.toscrape.com.

Create a new Go file (e.g., main.go) in your project directory and add the following code:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Create a new Colly collector
	c := colly.NewCollector()

	// Define the URL you want to scrape
	url := "http://quotes.toscrape.com/page/1/"

	// Set up callbacks to handle scraping events
	c.OnHTML(".quote", func(e *colly.HTMLElement) {
		// Extract data from HTML elements
		quote := e.ChildText("span.text")
		author := e.ChildText("small.author")
		tags := e.ChildText("div.tags")

		// Clean up the extracted data
		quote = strings.TrimSpace(quote)
		author = strings.TrimSpace(author)
		tags = strings.TrimSpace(tags)

		// Print the scraped data
		fmt.Printf("Quote: %s\nAuthor: %s\nTags: %s\n\n", quote, author, tags)
	})

	// Visit the URL and start scraping
	err := c.Visit(url)
	if err != nil {
		log.Fatal(err)
	}
}

In this code:

With c := colly.NewCollector(), we create a new Colly collector named c. The collector is the core component of Colly, responsible for making HTTP requests and driving the scraping process.

url := "http://quotes.toscrape.com/page/1/" defines the URL you want to scrape. In this example, we’re using the URL of the “Quotes to Scrape” website, specifically page 1.

c.OnHTML(".quote", func(e *colly.HTMLElement) { ... }) sets up a callback function using OnHTML. The callback will be executed whenever an HTML element with the class "quote" is encountered on the page.

Inside the callback, we use the e.ChildText(selector) method to extract text content from specific HTML elements based on CSS selectors. In this case:

quote := e.ChildText("span.text") is assigned the text content of the <span class="text"> element within the "quote" element.

author := e.ChildText("small.author") is assigned the text content of the <small class="author"> element within the "quote" element.

tags := e.ChildText("div.tags") is assigned the text content of the <div class="tags"> element within the "quote" element.

After extracting the data, we use strings.TrimSpace() to remove leading and trailing whitespace from each of the extracted strings. This helps clean up the data.
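One caveat: ChildText("div.tags") returns the combined text of everything inside the tags <div>. If you want each tag as a separate string instead, the element’s ForEach method lets you iterate over matching child elements. Here is a minimal sketch you could drop into the OnHTML callback above (it assumes each tag lives in an <a class="tag"> element inside div.tags, as on quotes.toscrape.com):

// Collect each tag individually instead of the div's combined text
var tagList []string
e.ForEach("div.tags a.tag", func(_ int, tag *colly.HTMLElement) {
	tagList = append(tagList, strings.TrimSpace(tag.Text))
})
fmt.Println("Tags:", strings.Join(tagList, ", "))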

Finally, we call c.Visit(url) to start the scraping process by visiting the specified URL. Visit returns an error if the request cannot be completed (for example, a malformed URL or a network failure), in which case we use log.Fatal(err) to log the error and exit the program.

Step 4: Run Your Web Scraper

To run your web scraper, simply execute the following command in your project directory:

go run main.go

Your scraper will fetch data from the specified URL and display the scraped quotes, authors, and tags on the console.
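Assuming the page’s markup hasn’t changed, the first entry should look roughly like this (the "Tags:" label appears in the output because ChildText("div.tags") also picks up the label text inside the <div>; exact whitespace may differ):

Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: Tags: change deep-thoughts thinking world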

That’s it! You’ve created a basic web scraper using the Colly library in Golang. You can customize and expand this scraper to suit your specific web scraping needs.
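For example, a natural extension is to follow the site’s pagination so the scraper covers every page rather than just page 1. The sketch below (a minimal illustration, assuming the “Next” button keeps its li.next a markup on quotes.toscrape.com) registers a second OnHTML callback; add it before the call to c.Visit(url):

// Follow the "Next" pagination link, if one exists on the page
c.OnHTML("li.next a", func(e *colly.HTMLElement) {
	// Request.Visit resolves the relative href against the current page URL
	if err := e.Request.Visit(e.Attr("href")); err != nil {
		log.Println("could not follow pagination link:", err)
	}
})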

In the Colly web scraping library for Golang, a callback is a user-defined function that gets executed when specific events occur during the scraping process. Callbacks allow you to define custom logic to handle various aspects of web scraping, such as extracting data from specific HTML elements, handling requests, logging, and more.

Colly provides several types of callbacks, including:

  1. OnHTML(goquerySelector string, callback func(e *HTMLElement)): This callback is executed when an HTML element matching the given CSS selector is found on the page. You can extract and process data from the HTML element inside the callback.
  2. OnXML(xpathQuery string, callback func(e *XMLElement)): Similar to OnHTML, but elements are matched with an XPath query instead of a CSS selector.
  3. OnScraped(callback func(r *Response)): This callback is executed after a page has been fully scraped (after all OnHTML and OnXML callbacks have run), so you can perform post-processing tasks on the scraped data.
  4. OnError(callback func(r *Response, err error)): This callback is executed when an error occurs during the scraping process. You can use it to handle errors gracefully.
  5. OnRequest(callback func(r *Request)): This callback is executed before making an HTTP request. It can be used to modify request headers, log requests, or perform other actions before sending the request.
  6. OnResponse(callback func(r *Response)): This callback is executed when a response is received from an HTTP request. It can be used to inspect and process the HTTP response.
  7. OnHTMLDetach(goquerySelector string): Deregisters the OnHTML callback previously registered for the given selector, so it no longer fires.
  8. OnXMLDetach(xpathQuery string): Deregisters the OnXML callback previously registered for the given XPath query.

Callbacks are registered with the Colly collector by calling methods like OnHTML, OnRequest, etc., and passing the callback function as an argument. When the specified event occurs during the scraping process (e.g., when an HTML element matches the selector specified in OnHTML), the associated callback function is invoked, allowing you to define how to handle that event.
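To make the lifecycle concrete, here is a minimal sketch that registers several of these callbacks on one collector and prints when each fires (the numbering in the comments reflects the order Colly invokes them for a successful request):

package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// 1. Fires before each HTTP request is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting:", r.URL)
	})

	// 2. Fires when a response is received
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Got response:", r.StatusCode, "from", r.Request.URL)
	})

	// 3. Fires once per element matching the selector
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Page title:", e.Text)
	})

	// 4. Fires after all OnHTML/OnXML callbacks have run for the page
	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished scraping:", r.Request.URL)
	})

	// Fires instead of the above response callbacks if the request fails
	c.OnError(func(r *colly.Response, err error) {
		log.Println("Request failed:", err)
	})

	if err := c.Visit("http://quotes.toscrape.com/"); err != nil {
		log.Fatal(err)
	}
}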

