Sssearch for books like a programmer
I like to read books, and I spend many hours searching for them. So, I automated the process with a Python script that scrapes books from the web.
Last updated: August 22, 2022
I like to read books. Different genres fill my bookshelf: economics, psychology, philosophy, biographies, nonfiction, and especially - science. The bonobo from the cover of Frans de Waal's book looks at Einstein's prominent face. Before I put anything there, I search. I can spend an hour before I click “buy” in an online bookstore. I look for recommendations everywhere: asking friends and strangers, browsing through Quora, Medium, and of course - Goodreads. Browsing the web is fun, but every hour spent searching takes time away from actual reading. So, like a real programmer, let's spend many more hours automating the process!
Before I write any code, I need to clarify - I'm not a Python developer. I've completed several Python and data science courses and read a book, but that's it. This post is an opportunity to learn more. You can learn with me. A little knowledge of Python and the web is required to continue.
Plan
Let's start with a plan. What do we want to achieve? We want a list of the best books on a particular subject. Goodreads has an enormous catalog of books, so we'll use it. Choosing "the best" books will be intersubjective - I'll use user ratings. The books on the list won't be the best for everyone, but there is a high chance we'll like them. We will trust the wisdom of the crowd. Let's think about the steps we need to take.
- Visit the Goodreads website.
- Request a list of books for a particular subject.
- Loop through multiple pages.
- Process the response from every URL.
- Find DOM elements with an average rating and the number of ratings.
- Create a list of the best books.
- Set subjective principles for adding books to the list.
- Save the list to a file.
Environment
I'll be using Ubuntu, so I have Python preinstalled. If you use another OS, here's the download link. For this project, I'll be using Python 3.8.10, along with some third-party modules. To avoid any trouble with versions, let's create a virtual environment (on Windows, the steps differ a bit).
```bash
python3 -m venv best-books
```
It will contain a specific version of the Python interpreter and its own copies of other modules. After creating the virtual environment, we need to activate it.
```bash
source best-books/bin/activate
```
The command puts the python and pip executables into your shell's PATH.
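On Windows, the activate script lives in a Scripts directory instead of bin - a rough sketch of the equivalent step, assuming the same best-books environment name:
```bat
:: Command Prompt; in PowerShell use best-books\Scripts\Activate.ps1
best-books\Scripts\activate.bat
```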
To check if it's working, you can type:
```bash
which python
```
It should print a path like:
```
.../best-books/bin/python
```
Now, we're ready to type sssome code. OK, I'm not going to use that joke again, especially considering the name comes from the comedy show "Monty Python's Flying Circus," not from a snake.
Third-party modules
I've done some research, and there are many tools for scraping the web with Python. For example, I found Scrapy - a popular, powerful, and efficient framework for building web spiders that can crawl the web. It sounds cool, but for our simple script, it would be overkill, and the learning curve is supposedly steep. So, we'll use two simple libraries instead. We'll use requests for... well, making HTTP requests. The name isn't exciting, but don't be fooled - it's one of the most popular Python libraries. Let's install it.
```bash
pip install requests
```
After getting responses, we need a tool for extracting data from websites. BeautifulSoup will help us with just that. It's a parser library - it can obtain data from XML and HTML files. And the name is fancier. A modern website can be a beautiful soup of JavaScript, HTML, and CSS, I guess.
```bash
pip install beautifulsoup4
```
Python script for web scraping
First, we need to import the get function from the requests library.
```python
from requests import get
```
Let's request the Goodreads page for science books to check if it works.
```python
from requests import get

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
print(res.status_code)  # 200
print(res.text)  # Content of the page
```
Save the script as scrape-books.py and run it:
```bash
python3 scrape-books.py
```
If you printed the content, you saw a lot of stuff in the terminal.
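By the way, if the request fails, res.text won't contain what we expect. requests can raise an exception for a bad status code via raise_for_status - we'll use it in the final script, but here's a quick sketch:
```python
from requests import get

res = get("https://www.goodreads.com/search?utf8=%E2%9C%93&query=science")
res.raise_for_status()  # Raises requests.HTTPError for 4xx/5xx responses
print(res.status_code)  # 200 when everything went fine
```
Now we need to import BeautifulSoup to parse the content.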
```python
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
print(soup.prettify())  # Formatted HTML
```
Now, our HTML output in the terminal is much more readable. Every book on the page is marked with a schema.org itemtype attribute. We can use it to grab them all.
```python
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')
print(books)  # List of books
print(len(books))  # 20
```
We need to find the DOM elements containing the rating information.
```html
<span class="minirating">
  <span class="stars staticStars notranslate">
    <span class="staticStar p10" size="12x12"></span>
    <span class="staticStar p10" size="12x12"></span>
    <span class="staticStar p10" size="12x12"></span>
    <span class="staticStar p10" size="12x12"></span>
    <span class="staticStar p3" size="12x12"></span>
  </span>
  4.13 avg rating — 37,488 ratings
</span>
```
The elements we are looking for have a minirating CSS class. We can use it to extract the ratings.
```python
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')
for book in books:
    ratings_string = book.select_one(".minirating").contents[-1]
    average_rating, *other, ratings, _ = ratings_string.split()
    print(float(average_rating))  # Float like 3.87
    print(int(ratings.replace(",", "")))  # Int like 84006
```
A lot is going on here, so let's stop for a moment. For every book on the page, we want the average rating and the number of ratings. First, we select the span with this information and grab its text (the last element of contents). We split the string on spaces and use list unpacking to take just the numbers: the average rating comes first, and the number of ratings is second to last. We also convert the numbers to float and int because they arrive as strings. I know I could use a regex, or iterate over the list, to extract that info without relying on positions, but I wanted to keep it simple.
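For the curious, here's what the position-independent version could look like - a minimal sketch using Python's re module (the parse_minirating helper and its pattern are my own, based on the minirating text shown above):
```python
import re

# Matches strings like "4.13 avg rating — 37,488 ratings";
# \D* skips whatever non-digit separator sits between the numbers.
RATING_RE = re.compile(r"([\d.]+) avg rating\D*([\d,]+) ratings")

def parse_minirating(text):
    match = RATING_RE.search(text)
    if match is None:
        return None
    average_rating, ratings = match.groups()
    return float(average_rating), int(ratings.replace(",", ""))

print(parse_minirating(" 4.13 avg rating — 37,488 ratings "))  # (4.13, 37488)
```
Back in the main script, I also extracted the title, author, and link.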
```python
from requests import get
from bs4 import BeautifulSoup

url = "https://www.goodreads.com/search?utf8=%E2%9C%93&query=science"
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')
for book in books:
    ratings_string = book.select_one(".minirating").contents[-1]
    average_rating, *other, ratings, _ = ratings_string.split()
    title = book.select_one(".bookTitle").get_text(strip=True)
    link = book.select_one(".bookTitle")["href"]
    author = book.select_one(".authorName").get_text(strip=True)
    print(title, author)  # Title and author of every book
```
Now, we can finally grab only the best books. Let's be picky - to make the list, a title needs an average rating over 4.2 and more than 50,000 ratings. We can loosen these conditions by modifying the average_rating_threshold and ratings_threshold variables. Before the loop, I created the best_books list; information about a book is added to it only if both conditions are met.
```python
from requests import get
from bs4 import BeautifulSoup

base_url = "https://www.goodreads.com"
subject = "science"
url = f"{base_url}/search?utf8=%E2%9C%93&query={subject}"
average_rating_threshold = 4.2
ratings_threshold = 50000
res = get(url)
soup = BeautifulSoup(res.text, "html.parser")
books = soup.select('[itemtype="http://schema.org/Book"]')
best_books = []
for book in books:
    ratings_string = book.select_one(".minirating").contents[-1]
    average_rating, *other, ratings, _ = ratings_string.split()
    if float(average_rating) > average_rating_threshold and int(ratings.replace(",", "")) > ratings_threshold:
        title = book.select_one(".bookTitle").get_text(strip=True)
        link = book.select_one(".bookTitle")["href"]
        author = book.select_one(".authorName").get_text(strip=True)
        best_book = {"author": author, "title": title,
                     "average_rating": average_rating, "ratings": ratings,
                     "link": f"{base_url}{link}"}
        best_books.append(best_book)
print(best_books)  # Books with an average over 4.2 and over 50,000 ratings
```
We successfully scraped books from one page. Let's do something even better - let's search ten pages!
```python
from requests import get
from bs4 import BeautifulSoup

base_url = "https://www.goodreads.com"
subject = "science"
average_rating_threshold = 4.2
ratings_threshold = 50000
start_page = 1
stop_page = 11
best_books = []
for page in range(start_page, stop_page):
    url = f"{base_url}/search?page={page}&qid=E5tgn4SYZ5&query={subject}&tab=books&utf8=✓"
    res = get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    books = soup.select('[itemtype="http://schema.org/Book"]')
    for book in books:
        ratings_string = book.select_one(".minirating").contents[-1]
        average_rating, *other, ratings, _ = ratings_string.split()
        if float(average_rating) > average_rating_threshold and int(ratings.replace(",", "")) > ratings_threshold:
            title = book.select_one(".bookTitle").get_text(strip=True)
            link = book.select_one(".bookTitle")["href"]
            author = book.select_one(".authorName").get_text(strip=True)
            best_book = {"author": author, "title": title,
                         "average_rating": average_rating, "ratings": ratings,
                         "link": f"{base_url}{link}"}
            best_books.append(best_book)
print(best_books)  # The best books from 10 pages
```
I modified the URL to include the current page and iterated over the pages with the range function.
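One more thing worth considering when requesting many pages in a row: a short pause between requests, so we don't hammer the server. A minimal sketch with time.sleep (the one-second delay is my own choice):
```python
from time import sleep

start_page, stop_page = 1, 11
for page in range(start_page, stop_page):
    # ...request and parse the page as before...
    sleep(1)  # Pause for a second between pages
```
After scraping the books, one last step remains - saving them to a file. Below the scraping loop, I added another snippet that writes the best books to a Markdown file.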
```python
# ...
from operator import itemgetter
# ...
subject = "science"
# ...
best_books = []  # Filled by the scraping loop above
# ...
file = open(f"best-books-{subject}.md", "w")
file.write(f"## Best books about {subject}\n")
for book in best_books:
    title, author, average_rating, ratings, link = itemgetter(
        "title", "author", "average_rating", "ratings", "link")(book)
    list_item = f'- [{title}]({link})<br />by {author} | <small title="Average rating">{average_rating}⭐</small> <small>{ratings} ratings</small>\n'
    file.write(list_item)
file.close()
```
I used Python's built-in open, write, and close functions to save the best books into a file. The first line of the file is an h2 header with the subject. itemgetter is a function that neatly extracts values from a dictionary. Then I used the data to create a list item for every book, with some custom HTML markup to make it prettier.
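As a side note, the more idiomatic way to handle files in Python is a with statement, which closes the file automatically even if writing raises an exception. A simplified sketch of the same snippet:
```python
subject = "science"
best_books = []  # The list built by the scraping loop

# The file is closed automatically when the block ends
with open(f"best-books-{subject}.md", "w") as file:
    file.write(f"## Best books about {subject}\n")
    for book in best_books:
        file.write(f"- {book['title']}\n")  # Simplified list item
```
And that's it - we have the best books in a file. Here's the output...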
```markdown
## Best books about science
- [The Demon-Haunted World: Science as a Candle in the Dark](https://www.goodreads.com/book/show/17349.The_Demon_Haunted_World?from_search=true&from_srp=true&qid=E5tgn4SYZ5&rank=2)<br />by Carl Sagan | <small title="Average rating">4.28⭐</small> <small>67,616 ratings</small>
- [How to Change Your Mind: What the New Science of Psychedelics Teaches Us About Consciousness, Dying, Addiction, Depression, and Transcendence](https://www.goodreads.com/book/show/36613747-how-to-change-your-mind?from_search=true&from_srp=true&qid=E5tgn4SYZ5&rank=3)<br />by Michael Pollan | <small title="Average rating">4.24⭐</small> <small>62,544 ratings</small>
- [Why We Sleep: The New Science of Sleep and Dreams](https://www.goodreads.com/book/show/36303871-why-we-sleep?from_search=true&from_srp=true&qid=E5tgn4SYZ5&rank=129)<br />by Matthew Walker | <small title="Average rating">4.38⭐</small> <small>140,971 ratings</small>
```
...and how it is displayed:
Best books about science
- The Demon-Haunted World: Science as a Candle in the Dark
  by Carl Sagan | 4.28⭐ 67,616 ratings
- How to Change Your Mind: What the New Science of Psychedelics Teaches Us About Consciousness, Dying, Addiction, Depression, and Transcendence
  by Michael Pollan | 4.24⭐ 62,544 ratings
- Why We Sleep: The New Science of Sleep and Dreams
  by Matthew Walker | 4.38⭐ 140,971 ratings
Final script
The final version of our script looks like this:
```python
from operator import itemgetter

from bs4 import BeautifulSoup
from requests import get


def scrape_books(subject="science", start_page=1, stop_page=11,
                 average_rating_threshold=4.2, ratings_threshold=50000):
    if not (isinstance(subject, str) and isinstance(start_page, int)
            and isinstance(stop_page, int)
            and isinstance(average_rating_threshold, float)
            and isinstance(ratings_threshold, int)):
        raise TypeError("Incompatible types of arguments")
    if not (len(subject) > 0 and 0 < start_page < stop_page
            and 0.0 <= average_rating_threshold <= 5.0
            and ratings_threshold > 0):
        raise ValueError("Incompatible values of arguments")
    try:
        base_url = "https://www.goodreads.com"
        best_books = []
        for page in range(start_page, stop_page):
            url = f"{base_url}/search?page={page}&qid=E5tgn4SYZ5&query={subject}&tab=books&utf8=✓"
            res = get(url)
            res.raise_for_status()
            soup = BeautifulSoup(res.text, "html.parser")
            books = soup.select('[itemtype="http://schema.org/Book"]')
            for book in books:
                ratings_string = book.select_one(".minirating").contents[-1]
                average_rating, *_, ratings, _ = ratings_string.split()
                if float(average_rating) > average_rating_threshold and int(ratings.replace(",", "")) > ratings_threshold:
                    title = book.select_one(".bookTitle").get_text(strip=True)
                    link = book.select_one(".bookTitle")["href"]
                    author = book.select_one(".authorName").get_text(strip=True)
                    best_book = {"author": author, "title": title,
                                 "average_rating": average_rating,
                                 "ratings": ratings,
                                 "link": f"{base_url}{link}"}
                    best_books.append(best_book)
        return best_books
    except Exception as err:
        print(f"There was a problem while scraping books: {err}")
        return []  # Return an empty list so save_books still gets a list


def save_books(book_list=None, subject="subject"):
    if book_list is None:  # Avoid a mutable default argument
        book_list = []
    if not (isinstance(book_list, list) and isinstance(subject, str)):
        raise TypeError("Incompatible types of arguments")
    if len(book_list) > 0:
        # The with statement closes the file even if writing fails
        with open(f"best-books-{subject}.md", "w") as file:
            file.write(f"## Best books about {subject}\n")
            for book in book_list:
                title, author, average_rating, ratings, link = itemgetter(
                    "title", "author", "average_rating", "ratings", "link")(book)
                list_item = f'- [{title}]({link})<br />by {author} | <small title="Average rating">{average_rating}⭐</small> <small>{ratings} ratings</small>\n'
                file.write(list_item)


best_books = scrape_books()
save_books(best_books, "science")
```
I've refactored it a bit. The logic for scraping and saving books now lives in two separate functions. They have default parameters, so you don't need to specify them all, and you can call them multiple times for different subjects. I've also added some basic validation and error handling.
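For example, assuming the two functions above are in scope, you could build lists for several subjects in one run:
```python
for subject in ["science", "psychology", "philosophy"]:
    best_books = scrape_books(subject=subject)
    save_books(best_books, subject)
```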
Bonus
I thought it would be cool to run the script directly from the terminal without modifying it. So, I used command-line arguments to provide the necessary data. We can capture the arguments passed to the script with argv from the sys module: the first item is the script name, and the rest are our custom arguments.
```python
# ...
from sys import argv

subject = argv[1]
start_page = int(argv[2])
stop_page = int(argv[3])
average_rating_threshold = float(argv[4])
ratings_threshold = int(argv[5])
# ...
best_books = scrape_books(subject, start_page, stop_page,
                          average_rating_threshold, ratings_threshold)
save_books(best_books, subject)
```
Now we scrape books by typing something like this into the terminal:
```bash
python3 scrape-books.py science 1 11 4.2 50000
```
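sys.argv keeps things minimal, but the standard library also ships argparse, which adds type conversion, defaults, and help text for free. A sketch of the same interface (the flag names are my own, and the two functions from the final script are assumed to be in scope):
```python
from argparse import ArgumentParser

parser = ArgumentParser(description="Scrape the best books from Goodreads")
parser.add_argument("subject")
parser.add_argument("--start-page", type=int, default=1)
parser.add_argument("--stop-page", type=int, default=11)
parser.add_argument("--average-rating-threshold", type=float, default=4.2)
parser.add_argument("--ratings-threshold", type=int, default=50000)
args = parser.parse_args()

best_books = scrape_books(args.subject, args.start_page, args.stop_page,
                          args.average_rating_threshold, args.ratings_threshold)
save_books(best_books, args.subject)
```
With it, python3 scrape-books.py science would use the defaults, and python3 scrape-books.py --help would document them.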