August 22, 2022 • 4 min read

Sssearch for books like a programmer

I like to read books, and I spend many hours searching for them. So, I automated this process with a Python script that would scrape books from the web.

Last Updated August 22, 2022

I like to read books. Different genres fill my bookshelf: economics, psychology, philosophy, biographies, nonfiction, and especially - science. A bonobo from Frans de Waal’s book cover looks at Einstein’s prominent face. Before I put anything there, I search. I can spend an hour before I click “buy” in an online bookstore. I look for recommendations everywhere: asking friends and strangers, browsing through Quora, Medium, and of course - Goodreads. Browsing the web is fun, but every hour spent searching takes time away from actual reading. So, like a real programmer, let’s spend other hours automating this process!

Before I write any code, I need to clarify - I’m not a Python developer. I’ve completed several Python and data science courses and read a book, but that’s it. This post is an opportunity to learn more. You can learn with me. A little knowledge of Python and the web is required to continue.

Plan

Let’s start with a plan. What do we want to achieve? We want a list of the best books on a particular subject or category. Goodreads has an enormous catalog of books, so we’ll use it. Choosing “the best” books will be intersubjective - I’ll use user ratings. The books on the list won’t be the best for everyone, but there’s a good chance we’ll like them. We will trust the wisdom of the crowd. Let’s think about the steps we need to take.

  1. Visit the Goodreads website.
  2. Request a list of books for a particular subject.
  3. Loop through multiple pages.
  4. Process response of every URL.
  5. Find DOM elements with an average rating and the number of ratings.
  6. Create a list of the best books.
  7. Set subjective principles for adding books to the list.
  8. Save the list to some file.

Environment

I’ll be using Ubuntu, so I have Python preinstalled. If you use another OS, here’s the download link. For this project, I’ll be using Python 3.8.10, along with some third-party modules. To avoid any trouble with versions, let’s create a virtual environment (on Windows, the steps will differ a bit).

SHELL
python3 -m venv best-books

It will contain specific versions of the Python interpreter and other modules. After creating a virtual environment, we need to activate it.

SHELL
source best-books/bin/activate

The command puts the virtual environment’s python and pip executables at the front of your shell’s PATH. To check that it’s working, you can type:

SHELL
which python

It should print a path inside the virtual environment:

SHELL
.../best-books/bin/python

Now, we’re ready to type sssome code. Ok, I’m not going to use this joke again. Especially considering that the name comes from the comedy show “Monty Python’s Flying Circus” and not from a snake.

Third-party modules

I’ve done some research, and there are many tools for scraping the web with Python. For example, I found Scrapy - a popular, powerful, and efficient framework for building web spiders that can crawl the web. It sounds cool, but for our simple script, it would be overkill, and the learning curve is supposedly steep. So, we’ll use two simple libraries instead. We’ll use requests for... well, making HTTP requests. The name isn’t exciting, but don’t be fooled - it’s one of the most popular libraries for Python. Let’s install it.

SHELL
pip install requests

After getting responses, we need a tool for extracting data from websites. BeautifulSoup will help us with just that. It’s a parser library - it can obtain data from XML and HTML files. And the name is fancier. A modern website can be a beautiful soup of JavaScript, HTML, and CSS, I guess.

SHELL
pip install beautifulsoup4

Python script for web scraping

First, we need to import the get function from the requests library.

PYTHON
from requests import get

Let’s request the Goodreads website for science books to check if it works.

PYTHON
from requests import get

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)

print(res.status_code) # 200
print(res.text) # Content of our website
SHELL
python3 scrape-books.py

If you printed the content, you saw a lot of stuff in the terminal. Now we need to import BeautifulSoup to parse it.

PYTHON
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")

print(soup.prettify()) # Formatted HTML

Now, our HTML output in the terminal is more readable. Books on this website are marked with a schema.org itemtype attribute. We can use it to grab them all.

PYTHON
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')

print(books) # List of books
print(len(books)) # 20

We need to find DOM elements containing information about ratings.

HTML
<span class="minirating">
  <span class="stars staticStars notranslate">
    <span class="staticStar p10" size="12x12"> </span>
    <span class="staticStar p10" size="12x12"> </span>
    <span class="staticStar p10" size="12x12"> </span>
    <span class="staticStar p10" size="12x12"> </span>
    <span class="staticStar p3" size="12x12"> </span>
  </span>
  4.13 avg rating — 37,488 ratings
</span>

The elements we are looking for have a minirating CSS class. We can use it to extract ratings from books.

PYTHON
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')

for book in books:
    ratings_string = book.select_one(".minirating").contents[-1]
    average_rating, *other, ratings, _ = ratings_string.split()
    print(float(average_rating)) # Float like 3.87
    print(int(ratings.replace(",", ""))) # Int like 84006

A lot is going on here, so let’s stop for a moment. For every book on a page, we want to extract the average rating and the number of ratings. First, we select the span with this information and grab its text (the last element). We split the string on whitespace and use list unpacking to take just the numbers. The average rating is first in the list, and the number of ratings is one before the last. We also need to convert them with float and int because they are strings. I know I could use a regex or iterate over the list to extract that info without relying on positions, but I wanted to keep it simple. Then I also extracted the title, author, and link.
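To see what the unpacking does in isolation, here’s the same pattern applied to a sample ratings string (the text matches the HTML snippet above):

```python
# Sample text from the .minirating span, as seen in the HTML above
ratings_string = "4.13 avg rating — 37,488 ratings"

# First item is the average; second-to-last is the ratings count;
# *other swallows the words in between, _ discards the trailing "ratings"
average_rating, *other, ratings, _ = ratings_string.split()

print(float(average_rating))          # 4.13
print(int(ratings.replace(",", "")))  # 37488
```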

PYTHON
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')

for book in books:
    ratings_string = book.select_one(".minirating").contents[-1]
    average_rating, *other, ratings, _ = ratings_string.split()
    title = book.select_one(".bookTitle").get_text(strip=True)
    link = book.select_one(".bookTitle")["href"]
    author = book.select_one(".authorName").get_text(strip=True)
    print(title, author) # Title and author of every book

Now, we can finally grab only the best books. Let’s be picky - to make our list, a title needs an average rating over 4.2 and more than 50,000 ratings. We can modify the variables average_rating_threshold and ratings_threshold to ease our conditions. Before the loop, I created the best_books list. A book is added only if both conditions are met.
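As a quick sanity check of the conditions, here’s the filter applied to a few made-up (average_rating, ratings) string pairs:

```python
average_rating_threshold = 4.2
ratings_threshold = 50000

# Made-up (average_rating, ratings) pairs, in the string form we scrape
candidates = [("4.28", "67,616"), ("3.95", "120,000"), ("4.50", "12,000")]

# Keep only pairs that pass both thresholds
best = [pair for pair in candidates
        if float(pair[0]) > average_rating_threshold
        and int(pair[1].replace(",", "")) > ratings_threshold]

print(best)  # [('4.28', '67,616')] - only the first pair passes both checks
```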

PYTHON
from requests import get
from bs4 import BeautifulSoup

base_url = "https://www.goodreads.com"
subject = "science"
url = f"{base_url}/search?utf8=%E2%9C%93&query={subject}"
average_rating_threshold = 4.2
ratings_threshold = 50000

res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')

best_books = []
for book in books:
    ratings_string = book.select_one(".minirating").contents[-1]
    average_rating, *other, ratings, _ = ratings_string.split()

    if float(average_rating) > average_rating_threshold and int(ratings.replace(",", "")) > ratings_threshold:
        title = book.select_one(".bookTitle").get_text(strip=True)
        link = book.select_one(".bookTitle")["href"]
        author = book.select_one(".authorName").get_text(strip=True)
        best_book = {"author": author, "title": title,
                     "average_rating": average_rating, "ratings": ratings, "link": f"{base_url}{link}"}

        best_books.append(best_book)

print(best_books) # List of books with over 4.2 average and over 50000 ratings

We successfully scraped books from one page. Let’s do something even better - let’s search ten pages!

PYTHON
from requests import get
from bs4 import BeautifulSoup

base_url = "https://www.goodreads.com"
subject = "science"
average_rating_threshold = 4.2
ratings_threshold = 50000
start_page = 1
stop_page = 11

best_books = []
for page in range(start_page, stop_page):
    url = f"{base_url}/search?page={page}&qid=E5tgn4SYZ5&query={subject}&tab=books&utf8=✓"
    res = get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    books = soup.select('[itemtype="http://schema.org/Book"]')

    for book in books:
        ratings_string = book.select_one(".minirating").contents[-1]
        average_rating, *other, ratings, _ = ratings_string.split()

        if float(average_rating) > average_rating_threshold and int(ratings.replace(",", "")) > ratings_threshold:
            title = book.select_one(".bookTitle").get_text(strip=True)
            link = book.select_one(".bookTitle")["href"]
            author = book.select_one(".authorName").get_text(strip=True)
            best_book = {"author": author, "title": title,
                         "average_rating": average_rating, "ratings": ratings, "link": f"{base_url}{link}"}

            best_books.append(best_book)

print(best_books) # List of best books from 10 pages

I modified the URL to include the current page and iterated over the pages with the range function. After scraping the books, only the last step is left - saving them to a file. Below the scraping loop, I added another snippet that writes the best books to a Markdown file.

PYTHON
...
from operator import itemgetter

...

subject = "science"

...

best_books = [] # Here are scraped books

...

file = open(f"best-books-{subject}.md", "w")
file.write(f"## Best books about {subject}\n")
for book in best_books:
    title, author, average_rating, ratings, link = itemgetter(
        "title", "author", "average_rating", "ratings", "link")(book)
    list_item = f'- [{title}]({link})<br />by {author} | <small title="Average rating">{average_rating}⭐</small> <small>{ratings} ratings</small>\n'
    file.write(list_item)

file.close()
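A quick aside on itemgetter from the snippet above: it builds a callable that pulls the given keys out of a dictionary in one go (the book dict below is a made-up example with the same keys our script uses):

```python
from operator import itemgetter

# A made-up book dict with the same keys as in the script
book = {"author": "Carl Sagan", "title": "Cosmos",
        "average_rating": "4.38", "ratings": "140,971",
        "link": "https://www.goodreads.com/..."}  # link shortened, illustrative only

# Equivalent to: title, author = book["title"], book["author"]
title, author = itemgetter("title", "author")(book)

print(title, author)  # Cosmos Carl Sagan
```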

I used Python’s built-in functions open, write, and close to save the best books into a file. The first line in the file is an h2 header with the subject. itemgetter is a function that neatly extracts data from a dictionary. Then, I used the data to create a list item for every book, with some custom HTML markup to make it prettier. And that’s it - we have the best books in a file. Here’s the output and how it is displayed:

MARKDOWN
## Best books about science

- [The Demon-Haunted World: Science as a Candle in the Dark](https://www.goodreads.com/book/show/17349.The_Demon_Haunted_World?from_search=true&from_srp=true&qid=E5tgn4SYZ5&rank=2)<br />by Carl Sagan | <small title="Average rating">4.28⭐</small> <small>67,616 ratings</small>
- [How to Change Your Mind: What the New Science of Psychedelics Teaches Us About Consciousness, Dying, Addiction, Depression, and Transcendence](https://www.goodreads.com/book/show/36613747-how-to-change-your-mind?from_search=true&from_srp=true&qid=E5tgn4SYZ5&rank=3)<br />by Michael Pollan | <small title="Average rating">4.24⭐</small> <small>62,544 ratings</small>
- [Why We Sleep: The New Science of Sleep and Dreams](https://www.goodreads.com/book/show/36303871-why-we-sleep?from_search=true&from_srp=true&qid=E5tgn4SYZ5&rank=129)<br />by Matthew Walker | <small title="Average rating">4.38⭐</small> <small>140,971 ratings</small>


Final script

The final version of our script looks like this:

PYTHON
from bs4 import BeautifulSoup
from operator import itemgetter
from requests import get

def scrape_books(subject="science", start_page=1, stop_page=11, average_rating_threshold=4.2, ratings_threshold=50000):
    if not (isinstance(subject, str) and isinstance(start_page, int) and isinstance(stop_page, int) and isinstance(average_rating_threshold, float) and isinstance(ratings_threshold, int)):
        raise TypeError("Incompatible types of arguments")

    if not (len(subject) > 0 and start_page > 0 and stop_page > 0 and stop_page > start_page and 5.0 >= average_rating_threshold >= 0.0 and ratings_threshold > 0):
        raise TypeError("Incompatible values of arguments")

    try:
        base_url = "https://www.goodreads.com"
        best_books = []
        for page in range(start_page, stop_page):
            url = f"{base_url}/search?page={page}&qid=E5tgn4SYZ5&query={subject}&tab=books&utf8=✓"
            res = get(url)
            res.raise_for_status()
            soup = BeautifulSoup(res.text, "html.parser")
            books = soup.select('[itemtype="http://schema.org/Book"]')

            for book in books:
                ratings_string = book.select_one(".minirating").contents[-1]
                average_rating, *_, ratings, _ = ratings_string.split()

                if float(average_rating) > average_rating_threshold and int(ratings.replace(",", "")) > ratings_threshold:
                    title = book.select_one(".bookTitle").get_text(strip=True)
                    link = book.select_one(".bookTitle")["href"]
                    author = book.select_one(".authorName").get_text(strip=True)
                    best_book = {"author": author, "title": title,
                                 "average_rating": average_rating, "ratings": ratings, "link": f"{base_url}{link}"}
                    best_books.append(best_book)

        return best_books

    except Exception as err:
        print(f"There was a problem during scraping books: {err}")

def save_books(book_list=[], subject="subject"):
    if not (isinstance(book_list, list) and isinstance(subject, str)):
        raise TypeError("Incompatible types of arguments")
    if len(book_list) > 0:
        file = open(f"best-books-{subject}.md", "w")
        try:
            file.write(f"## Best books about {subject}\n")
            for book in book_list:
                title, author, average_rating, ratings, link = itemgetter(
                    "title", "author", "average_rating", "ratings", "link")(book)
                list_item = f'- [{title}]({link})<br />by {author} | <small title="Average rating">{average_rating}⭐</small> <small>{ratings} ratings</small>\n'
                file.write(list_item)
        finally:
            file.close()

best_books = scrape_books()
save_books(best_books, "science")

I’ve refactored it a bit, putting the logic for scraping and saving books into separate functions. They have default parameters, so you don’t need to specify them all, and you can call them multiple times for different subjects. I’ve also added some basic error handling.

Bonus

I thought running the script directly from the terminal without modifying it would be cool. So, I used command-line arguments to provide the necessary data. We can capture info passed to the script with the argv list from the sys module. The first item is the script name, and the others are our custom arguments.
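One thing to remember: argv holds only strings, so every numeric argument needs an explicit conversion. Here’s the idea in isolation (I overwrite sys.argv to simulate a shell invocation instead of actually running from the terminal):

```python
import sys

# Simulate: python3 scrape-books.py science 1 11 4.2 50000
sys.argv = ["scrape-books.py", "science", "1", "11", "4.2", "50000"]

subject = sys.argv[1]                          # stays a string
start_page = int(sys.argv[2])                  # "1" -> 1
stop_page = int(sys.argv[3])                   # "11" -> 11
average_rating_threshold = float(sys.argv[4])  # "4.2" -> 4.2
ratings_threshold = int(sys.argv[5])           # "50000" -> 50000

print(subject, start_page, stop_page, average_rating_threshold, ratings_threshold)
```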

PYTHON
...
from sys import argv

subject = argv[1]
start_page = int(argv[2])
stop_page = int(argv[3])
average_rating_threshold = float(argv[4])
rating_threshold = int(argv[5])

...

best_books = scrape_books(subject, start_page, stop_page,
                          average_rating_threshold, rating_threshold)
save_books(best_books, subject)

Now we can scrape books by typing something like this into the terminal:

SHELL
python3 scrape-books.py science 1 11 4.2 50000