GoodReads Data Analysis using Python: 2022 Books Recap

Analysing my reading statistics for the year 2022 and revealing my favourite book that I read this year!

Published in

The Grim Reader

8 min read · Dec 29, 2022

“Sometimes, you read a book and it fills you with this weird evangelical zeal, and you become convinced that the shattered world will never be put back together unless and until all living humans read the book.” — John Green

My relationship with reading has been much like what John Green describes in the quote above. Most books I read leave me with a zeal to go out and shout their contents, so that people understand the deepest secrets held between those pages. My friends know how much I love reading and how often I annoy them with my rants about certain passages. Every year, I recap the books I read on Goodreads and reminisce over the pages that felt close to me. Goodreads stats provide a good, comprehensive view of your year, but this time I wanted to play with the data myself and see what I could find out about my reading patterns. So, I decided to use my rudimentary Python skills to analyse the trends in my data. I read 54 books this year, so it was fun to explore the data with the Pandas library and visualise it. Let’s see what I did with the data. Do forgive my brute-force methods; I am an amateur at data analysis (the word comes from the Latin ‘amare’, to love), so bear with me.

The first thing I did was export my data from Goodreads, which was pretty easy to do with the Export option. Then, I imported four Python libraries (Pandas, NumPy, Matplotlib and Seaborn).

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

goodreads_file = pd.read_csv("goodreads_library_export.csv")

Next, I kept only the books read in 2022 and dropped the unnecessary columns. Then I compared the average rating I gave to the 54 books I read with the average rating Goodreads users gave to the same 54 books.

My average rating: 4.40. Goodreads average rating: 4.13.

As it turns out, I tend to give higher ratings than the average Goodreads user, which suggests I am generous with the books I read. I do tend to love books more if they evoke certain feelings in me (especially tears; looking at you, ‘A Little Life’). We will see more on ratings later on.
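The two averages quoted above could be computed along these lines. This is only a minimal sketch on a few made-up rows, not the actual export; it also assumes that Goodreads stores a 0 in 'My Rating' for books that were never rated, so those rows are left out of the mean.

```python
import pandas as pd

# Hypothetical sample rows standing in for the 2022 books;
# the column names match the Goodreads CSV export.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "My Rating": [5, 4, 0, 3],  # assumption: 0 means "not rated"
    "Average Rating": [4.20, 3.90, 4.10, 4.00],
})

# Keep only the books that actually received a rating
rated = sample[sample["My Rating"] > 0]

my_avg = rated["My Rating"].mean()
goodreads_avg = rated["Average Rating"].mean()

print(f"My average rating: {my_avg:.2f}")
print(f"Goodreads average rating: {goodreads_avg:.2f}")
```

Comparing both means over the same set of rated books keeps the comparison fair; an unrated book would otherwise drag only one of the two averages down.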

# Parse the Goodreads 'Date Read' column as datetimes (unread books become NaT)
date_read = pd.to_datetime(goodreads_file['Date Read'], format = "%Y/%m/%d")
# Keep only the books finished in 2022 (NaT rows never match the year)
books_2022 = goodreads_file[date_read.dt.year == 2022]
books_2022 = books_2022.reset_index(drop = True)
#dropping the columns we don't need
books_2022 = books_2022.drop(['Publisher','Binding','Book Id','Author l-f','Additional Authors','ISBN','ISBN13' ,'Bookshelves','Bookshelves with positions', 'Exclusive Shelf', 'My Review','Spoiler','Private Notes','Read Count', 'Owned Copies', 'Year Published'], axis = 1,inplace = False)
#Dropping columns for cumulative graph
cb = books_2022.drop(['Title','Author', 'My Rating', 'Average Rating','Original Publication Year','Date Added'], axis = 1, inplace = False)
#dropping columns for ratings graph
r = books_2022.drop(['Author','Number of Pages', 'Original Publication Year', 'Date Read', 'Date Added'], axis = 1, inplace = False)

With the columns ready, I plotted the first graph: the number of books I read, grouped by their original publication year.

#grouping the books by the year they were published in
books_grouped = books_2022.groupby('Original Publication Year').count()

#Bar graph of books published in different years
plt.figure(figsize=(12, 6))
plt.bar(books_grouped.index, books_grouped['Title'], width = 1.0, color=(0.2, 0.4, 0.6, 0.6))
plt.title('Number of Books read from particular year')
plt.xlabel('Publication Year')
plt.ylabel('Number of Books read')
plt.xticks(np.arange(1902,2023,6), rotation = 90)
plt.show()

As we can see, most of the books I read this year were written in the last two decades (especially the last one). I read very few books from the 1920s, which I deem to be the golden era. I need to change that, and I hope to read a more diverse range of books next year. The interesting part, though, is the cluster between 1950 and 1965; I am fairly sure all of those books relate to existentialism and absurdism. My obsession with this philosophy is certainly reflected in what I read.
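To put numbers on a claim like "most books come from the last two decades", the publication years can be bucketed into decades. The following is a small sketch with invented years, not my actual list; on the real data the same two lines would run on the 'Original Publication Year' column (after dropping missing values).

```python
import pandas as pd

# Hypothetical publication years, used for illustration only
years = pd.Series(
    [1925, 1953, 1955, 1962, 2005, 2014, 2018, 2021, 2022],
    name="Original Publication Year",
)

# Integer-divide by 10 and multiply back to get the decade start (1953 -> 1950)
decades = (years // 10) * 10

# Count books per decade, ordered chronologically
books_per_decade = decades.value_counts().sort_index()
print(books_per_decade)
```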

Next, I decided to look at the number of pages read in each month (Goodreads provides this statistic, but I wanted to see something else).

#setting the chart style
sns.set_style("darkgrid")
sns.set_context("talk")

# Convert the Date Read column to a datetime data type
cb["Date Read"] = pd.to_datetime(cb["Date Read"], format = '%Y/%m/%d')
# Creating a new column with the two-digit month number ("01" to "12")
cb["month"] = cb["Date Read"].dt.strftime("%m")
# Sorting by month
cb_sorted = cb.sort_values(by="month")
#dropping the date read
cb_sorted = cb_sorted.drop('Date Read', axis = 1)
# Groupby and sum functions 
cb_sorted_combined = cb_sorted.groupby("month")["Number of Pages"].sum()
cb_sorted_combined = cb_sorted_combined.reset_index()

cb_sorted_combined.plot(kind='bar', x='month', y='Number of Pages', 
        figsize=(10, 8), legend=False, color='powderblue', rot=0);
plt.title("Number of pages read in each month", y=1.01, fontsize=20)
plt.ylabel("Number of pages read", labelpad=15)
plt.xlabel("Months", labelpad=15)
plt.xticks(rotation = 90)

Looking at this graph, one could try to guess what I was doing during different months of the year. It would seem fair to deduce that the months when I read the most were the less busy ones, but that turns out to be wrong: those were my busiest months in terms of work and college. My reading habits depend heavily on my mental health, which tends to waver when I have too much free time on my hands.
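One caveat with a monthly groupby is that a month in which no book was finished simply does not appear in the result. A possible way around this, sketched here on made-up rows rather than my real export, is to group by the numeric month and reindex to all twelve months so that empty months show up as zero.

```python
import pandas as pd

# Hypothetical finish dates and page counts, for illustration only
sample = pd.DataFrame({
    "Date Read": ["2022/01/15", "2022/01/28", "2022/03/02", "2022/07/19"],
    "Number of Pages": [320, 210, 450, 180],
})

sample["Date Read"] = pd.to_datetime(sample["Date Read"], format="%Y/%m/%d")

# Sum pages by calendar month number (1 to 12)
pages_per_month = sample.groupby(sample["Date Read"].dt.month)["Number of Pages"].sum()

# Reindex so months without any finished book appear with 0 pages
pages_per_month = pages_per_month.reindex(range(1, 13), fill_value=0)
print(pages_per_month)
```

With all twelve months present, the bars and any month labels stay aligned even in a year with a quiet month.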

#cumsum graph
cb_sorted_combined['pages_total'] = cb_sorted_combined['Number of Pages'].cumsum()

cb_sorted_combined.plot(x='month', y='pages_total', kind='line', figsize=(15, 10), legend=False, style='yo-')
plt.title("Running total of number of pages read", y=1.01, fontsize=20)
plt.ylabel("Number of pages", labelpad=15)
plt.xlabel("Months", labelpad=10)
month_names = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
# Label each tick from the month number stored in the 'month' column ("01" -> January)
plt.xticks(cb_sorted_combined.index, [month_names[int(m) - 1] for m in cb_sorted_combined['month']], rotation = 90)

The cumulative number of pages read over the span of the year.

I wanted to see the cumulative number of pages rather than only single-month statistics. This shows how I swing between highs and lows of reading frenzy: after each intensive reading period, I give up for a month or two and then start another. Certainly not a linear function. I really enjoy this visual, as it shows how my reading streaks and reading blocks work.
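The flat stretches in a running total are exactly the months with zero pages. Here is a tiny sketch of that idea, using invented monthly totals for the first half of a year; the variable names are only illustrative.

```python
import calendar
import pandas as pd

# Hypothetical monthly page counts (index = month number), not my real data
pages = pd.Series([530, 0, 450, 0, 0, 620], index=range(1, 7))

# Running total: each entry is the sum of all months up to and including it
running_total = pages.cumsum()

# Months where the running total did not move are the "reading block" months
flat_months = [calendar.month_name[int(m)] for m in pages[pages == 0].index]

print(running_total.tolist())
print(flat_months)
```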

Combining my love for reading and statistics was a fun venture indeed, especially with the next visualisation. I decided to see how much my ratings differ from an average user for all my books. The only good metric for this was to see the difference. So, I decided to do this:

Difference = My rating - Goodreads average rating

sns.set_style("darkgrid")
sns.set_context("talk")
#subtracting Average ratings from my ratings
r = r.assign(ratings=lambda x: x['My Rating'] - x['Average Rating'])
r_sorted = r.sort_values('ratings')
r_sorted.plot.barh(y='ratings', x= 'Title', width = 1, figsize = (10,18), legend = False)
plt.xlabel("Difference between my rating and the average Goodreads rating", labelpad=15)
plt.ylabel("Title", labelpad=10)

This certainly gives a lot to think about. Most of the time, my rating has been higher than the Goodreads average, but never by more than 1 point. Whenever I rated a book below the average, however, the gap was larger than 1 point, and for some books it went beyond -2. My inference is that when I hate a book, my opinion departs from the general one far more than when I love it (strong opinions).
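The asymmetry described above can also be checked numerically rather than by eye. This is a rough sketch on four invented books, not my real ratings: it splits the differences into positive and negative ones and reports the largest gap on each side.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "My Rating": [5, 5, 4, 2],
    "Average Rating": [4.3, 4.1, 3.8, 4.2],
})

# Same metric as in the chart: my rating minus the Goodreads average
diff = sample["My Rating"] - sample["Average Rating"]

above = diff[diff > 0]  # books I liked more than the average reader
below = diff[diff < 0]  # books I liked less than the average reader

print(f"Rated above average: {len(above)} books, largest gap {above.max():+.2f}")
print(f"Rated below average: {len(below)} books, largest gap {below.min():+.2f}")
```

Since Goodreads averages usually sit well above the middle of the 1 to 5 scale, there is simply more room to fall below them than to rise above them, which may explain part of the pattern.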

This was all I did with my data; I didn’t know what else to do. If you have any ideas, do let me know in the comments. I would love to try more analysis to see the patterns. These few patterns have given me good insights into the way I read books and how I rate them.

To recap the whole year of 2022, here is the list of my top books:

The best Fiction book that I read this year: Babel, or the Necessity of Violence by R. F. Kuang

This was easily one of the best fiction books I read, since it captures the idea of a rebellion so well and paints an accurate picture of what imperialist oppression looks like. I saw a few hate reviews for this book from white people on TikTok, so I had to read it, and I found that the only reason those reviews existed was that this book confronted them with their own racism and existence.

The best Non-Fiction book that I read this year: Educated by Tara Westover

I don’t usually read memoirs, but this one knocked it out of the park. It left me with so many questions about what home means. The book talks about how one learns and unlearns the ways of life and of one’s parents. It was hands down the best book I could have read in a tumultuous year like 2022.

The best Play that I read this year: Waiting for Godot by Samuel Beckett

Do I really need to even say anything about this? I mean, it is Samuel Beckett and his absurd play. It makes you confront the absurdity of existence and our role in it. It leaves you with a lot to think and not think about.

The worst book that I read this year: It Ends with Us by Colleen Hoover

It brings me no pleasure to shit on the TikTok author of the year, but this book isn’t what people make it out to be. There is a lot of criticism of it out there already, so I won’t add to it. I am glad I read it, though, because it has made sure that I won’t pick up another book by this author.

So we are at the end; it was a great year in terms of the books I read (54 books). I am keeping the target for next year at around 80 books. Hopefully, I will be able to achieve it! Here is the link to the GitHub repository with the code files needed to do the same analysis:

GitHub - Atotmyr/GoodreadsDataAnalysis: Data Analysis of my Goodreads Data from the books read in…

github.com

Stay tuned for more stuff related to space, books and philosophy.
Subscribe to my Substack Newsletter — The Grim Reader.

Written by Atotmyr