Web scraping with Go

To scrape web pages with Go, we can use the goquery library, which brings syntax and features similar to jQuery. We will also see how to use jsonscraper to perform some simple concurrent scraping using only JSON configuration files.

Goquery

To get started, first install it using the go get command.

$ go get github.com/PuerkitoBio/goquery

We will use the Hacker News website in this example.

This is how the structure of our program will look:

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    doc, err := goquery.NewDocument("https://news.ycombinator.com/")
    if err != nil {
        log.Fatalln(err)
    }

    // The snippets from the sections below go here. As a quick
    // smoke test, print the page title.
    fmt.Println(doc.Find("title").Text())
}

We imported the formatting and logging libraries, and goquery.
In main we use the NewDocument constructor, which fetches and parses the web page and returns a *goquery.Document and an error, if one occurred.
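
Note that newer goquery versions deprecate NewDocument in favor of making the HTTP request yourself and parsing the response body with NewDocumentFromReader. A minimal sketch of that variant (add "net/http" to the imports):

// Fetch the page with the standard library and check the response
res, err := http.Get("https://news.ycombinator.com/")
if err != nil {
    log.Fatalln(err)
}
defer res.Body.Close()
if res.StatusCode != 200 {
    log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
}

// Parse the response body into a goquery document
doc, err := goquery.NewDocumentFromReader(res.Body)
if err != nil {
    log.Fatalln(err)
}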

Html

To get the HTML of any selection, we simply use .Html(), which constructs the HTML and returns it as a string, along with an error.

// HTML
html, err := doc.Html()
if err != nil {
    log.Println(err)
} else {
    fmt.Println(html)
}
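
The same method works on any sub-selection, not just the whole document. Note that .Html() returns the inner HTML of the first matched element only:

// Inner HTML of the header element
header, err := doc.Find(".hnname").Html()
if err != nil {
    log.Println(err)
} else {
    fmt.Println(header)
}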


Text

To get the plain text of a selection, there is .Text().
So how do we get the Hacker News title? If we go to the page source and look closely, we will see that it is inside a b tag with the .hnname class.

// Text
hnname := doc.Find(".hnname").Text()
fmt.Println(hnname)


Attributes

That same b tag also has a link inside it. To get its href attribute, there are the methods .Attr() and .AttrOr(); the latter returns a default string if the attribute is not found.

// Link
if newsLink, exists := doc.Find(".hnname a").Attr("href"); exists {
    fmt.Println(newsLink)
}
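
With .AttrOr() the existence check folds into a fallback value; the "#" default below is just an arbitrary placeholder:

// Link, with a default value if the attribute is missing
newsLink := doc.Find(".hnname a").AttrOr("href", "#")
fmt.Println(newsLink)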


Multiple selections

Every submission has a link with the .storylink class. To iterate over each element of a selection, use .Each(), which accepts a callback function of type func(int, *goquery.Selection).

// Story titles and links
doc.Find(".storylink").Each(func(i int, sel *goquery.Selection) {
    fmt.Println(sel.Text())
    if attr, exists := sel.Attr("href"); exists {
        fmt.Println(attr)
    }
})
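
If we would rather collect the results into a slice than print inside the callback, goquery also provides .Map(), which returns one string per matched element:

// Story titles collected into a []string
titles := doc.Find(".storylink").Map(func(i int, sel *goquery.Selection) string {
    return sel.Text()
})
fmt.Println(titles)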



All code used in this section is available on Go Playground.


jsonscraper

To perform simple concurrent scrapes and get the results in a JSON file, we can use jsonscraper without writing any code. The program is written in Go and contributions are welcome.

We need configuration files, which are passed as input arguments to the program.

For example, here is a configuration for scraping data similar to the previous section:

{
    "urls": [
        "https://news.ycombinator.com/"
    ],
    "targets": [
        {
            "selector": ".storylink",
            "type": "text",
            "tag": "storyTitleText"
        },
        {
            "selector": ".title",
            "type": "html",
            "tag": "storyTitleHtml"
        },
        {
            "selector": ".storylink",
            "type": "text",
            "tag": "storyTitleWords",
            "submatch": "([a-zA-Z]+)+"
        },
        {
            "selector": ".storylink",
            "type": "attr:href",
            "tag": "storyTitleLinks"
        }
    ],
    "output": {
        "path": "output/$FILENAME"
    }
}

Once the program is invoked with jsonscraper configPath, the results will be saved to the file given by output.path.
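
For example, assuming the configuration above was saved as hackernews.json (the filename here is arbitrary):

$ jsonscraper hackernews.json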

Output example:

{
   "storyTitleHtml":[
      "<span class=\"rank\">1.</span>",
      "<a href=\"https://redditblog.com/2017/04/13/how-we-built-rplace/\" class=\"storylink\">How We Built r/Place</a><span class=\"sitebit comhead\"> (<a href=\"from?site=redditblog.com\"><span class=\"sitestr\">redditblog.com</span></a>)</span>",
      "<span class=\"rank\">2.</span>",
      ...
   ],
   "storyTitleLinks":[
      "https://redditblog.com/2017/04/13/how-we-built-rplace/",
      "http://www.bbc.com/news/science-environment-39592059",
      "https://stripe.com/blog/increment",
      ...
   ],
   "storyTitleText":[
      "How We Built r/Place",
      "Saturn moon 'able to support life'",
      "Introducing Increment",
      ...
   ],
   "storyTitleWords":[
      "How",
      "We",
      "Built",
      ...
   ]
}

For more details on how to write config files, check out the documentation.
