Combining books with Python and SQL
- Nadine Mukondiwa
- May 4, 2021
- 2 min read
The COVID-19 pandemic led to new hobbies or going back to old hobbies for many people. I reignited my interest in reading (thanks to the OverDrive app, which allows you to access books from your local library). However, I also started learning how to code in Python through YouTube, FreeCodeCamp, DataCamp and many other resources.
The project
Sometimes, I don't know what books to read, so I use Google for suggestions. So, when I found the New York Times Books API, I decided to create a database of books that I could reference instead of clicking through multiple web pages. I thought this project would be a great way to keep working on my Python and SQL skills.
Getting the data
To start the project, I imported the pandas, datetime, requests python libraries. I wanted to get all of the books from January 2020 to May 2021. I used the datetime library to create a list of all of the Sundays between that period (the weekly NYT lists are published every Sunday). I then created two functions to make requests to the Books API and return a list of dataframes with the data like the publisher, book title and author.
Transforming the data
After combining the dataframes, I began the data cleaning and transformation process. I separated the data into smaller dataframes, such as the authors and publishers dataframes and then converted them into CSV files for database loading.
#Cleaning the data
df['publisher'] = df.publisher.str.strip()
df['title'] = df.title.str.capitalize().str.strip()
df['author'] = df.author.str.strip()
df['contributor'] = df.contributor.str.replace('by', '').str.strip()
Creating the database
In between the data transformation work, I created the database in MySQL. I used Lucidchart to create an entity-relationship diagram. The last table I made was the list_entries table because it contained foreign keys from other tables.
CREATE TABLE list_entries (
list_entry_id INT PRIMARY KEY NOT NULL,
date_published DATE NOT NULL,
book_id INT NOT NULL,
author_id INT NOT NULL,
publisher_id INT NOT NULL,
contributor VARCHAR(100),
list_id INT NOT NULL,
weeks_on_list INT,
age_group VARCHAR(50));
Loading the database
To complete the database creation process, I loaded the CSV files into each table using the code below.
LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/list_entries.csv'
INTO TABLE list_entries
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;
Querying the database
When I finished creating the database, I started writing a few queries. I realized that the books and authors tables would have to be joined multiple times, so I created a view of the joined tables.
CREATE VIEW book_author AS
(SELECT
a.author_id,
b.book_id,
b.title,
a.author_name
FROM authors a
LEFT JOIN books b ON a.author_id = b.author_id);
I wanted to find out which authors had more than one book on the Bestseller lists, so I created this query.
# Authors with more than 1 'New York Times bestseller'
SELECT
author_name,
author_id,
COUNT(book_id) AS 'Number of books'
FROM book_author
GROUP BY author_id, author_name
HAVING COUNT(book_id) > 1
ORDER BY COUNT(book_id) DESC;
Danielle Stelle had the highest with eight books! You can take a look at the visualization I created in Tableau for a few book suggestions.
I have enjoyed working on this project. It is still a work in progress because I will continue to expand on it as I learn more about data analysis and data engineering concepts. The notebooks are available on my Github page.
댓글