Combining books with Python and SQL

Nadine Mukondiwa
May 4, 2021
2 min read

The COVID-19 pandemic led to new hobbies or going back to old hobbies for many people. I reignited my interest in reading (thanks to the OverDrive app, which allows you to access books from your local library). However, I also started learning how to code in Python through YouTube, FreeCodeCamp, DataCamp and many other resources.

The project

Sometimes, I don't know what books to read, so I use Google for suggestions. So, when I found the New York Times Books API, I decided to create a database of books that I could reference instead of clicking through multiple web pages. I thought this project would be a great way to keep working on my Python and SQL skills.

Getting the data

To start the project, I imported the pandas, datetime, requests python libraries. I wanted to get all of the books from January 2020 to May 2021. I used the datetime library to create a list of all of the Sundays between that period (the weekly NYT lists are published every Sunday). I then created two functions to make requests to the Books API and return a list of dataframes with the data like the publisher, book title and author.

Transforming the data

After combining the dataframes, I began the data cleaning and transformation process. I separated the data into smaller dataframes, such as the authors and publishers dataframes and then converted them into CSV files for database loading.

#Cleaning the data
df['publisher'] = df.publisher.str.strip()
df['title'] = df.title.str.capitalize().str.strip()
df['author'] = df.author.str.strip()
df['contributor'] = df.contributor.str.replace('by', '').str.strip()

Creating the database

In between the data transformation work, I created the database in MySQL. I used Lucidchart to create an entity-relationship diagram. The last table I made was the list_entries table because it contained foreign keys from other tables.

CREATE TABLE list_entries (
    list_entry_id INT PRIMARY KEY NOT NULL,
    date_published DATE NOT NULL,
    book_id INT NOT NULL,
    author_id INT NOT NULL,
    publisher_id INT NOT NULL,
    contributor VARCHAR(100), 
    list_id INT NOT NULL, 
    weeks_on_list INT,
    age_group VARCHAR(50));

Loading the database

To complete the database creation process, I loaded the CSV files into each table using the code below.

LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/list_entries.csv'
INTO TABLE list_entries
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;

Querying the database

When I finished creating the database, I started writing a few queries. I realized that the books and authors tables would have to be joined multiple times, so I created a view of the joined tables.

CREATE VIEW book_author AS 
	(SELECT
		a.author_id,
        b.book_id,
        b.title,
        a.author_name
    FROM authors a
	LEFT JOIN books b ON a.author_id = b.author_id);

I wanted to find out which authors had more than one book on the Bestseller lists, so I created this query.

# Authors with more than 1 'New York Times bestseller'
SELECT
    author_name,
    author_id,
	COUNT(book_id) AS 'Number of books'
FROM book_author
GROUP BY author_id, author_name
HAVING COUNT(book_id) > 1
ORDER BY COUNT(book_id) DESC;

Danielle Stelle had the highest with eight books! You can take a look at the visualization I created in Tableau for a few book suggestions.

I have enjoyed working on this project. It is still a work in progress because I will continue to expand on it as I learn more about data analysis and data engineering concepts. The notebooks are available on my Github page.