Tidy Data in Python
A tutorial on converting a messy Excel spreadsheet into a tidy, long-format Pandas DataFrame.
In this post we will be walking through the process of converting a messy Excel worksheet into tidy data. According to Hadley Wickham's excellent book, R for Data Science, tidy data follows three main principles:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
The initial work of organizing the data will pay dividends down the road as your data will be uniform and easier to work with.
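To make those three principles concrete, here is a minimal sketch (with made-up numbers, not the USDA figures) of the same data in wide and tidy form. In the wide table the variable "year" is spread across the column headers; melting it gives each observation its own row:

```python
import pandas as pd

# A toy "wide" table: each year is its own column,
# so the variable "year" lives in the column headers
wide = pd.DataFrame({
    'item': ['cash_receipts', 'farm_expenses'],
    '2019': [370.0, 345.0],
    '2020': [385.0, 350.0],
})

# Melting stacks the year columns so each (item, year) observation
# gets its own row and each variable gets its own column
tidy = wide.melt(id_vars='item', var_name='year', value_name='amount')
print(tidy)
```

This is exactly the transformation we will apply (with a few extra cleaning steps) to the USDA sheet below.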
For this tutorial we will be using the farm sector balance sheet provided by the United States Department of Agriculture (USDA).

The USDA worksheet contains a time series from 2014 to 2020 (with forecasted values for 2021). It is in a wide format, with the variables (the items in column 'A') stored as rows instead of columns. Our goal will be to transform the dataframe into the shape below, with five columns:
- Year
- Balance item
- Amount
- Forecast (a boolean column indicating true if the amount is a forecast or historical data)
- Report date
The last two columns (forecast and report date) may seem a little unnecessary. I included them because they could serve as helpful keys if we were to fold the report into a larger database. For instance, if we kept an archival database of all the farm sector balance sheets, we could quickly match observations to their report date. I also like to flag whether a value is a forecast, since that opens up interesting questions about how the forecasts change over time and how they compare to the actual data.
The first step is to load our packages and then the Excel data into a dataframe. All we need is Numpy and Pandas. We will use Pandas' read_excel() function to load the dataset. We'll pull data starting in row 3 of the Excel (we use header=2 here because read_excel() is zero-indexed while the spreadsheet is indexed at 1) and we'll read just the first table (nrows=25). Immediately after loading the data, we will pull the date of the report into a variable that will be helpful once we create the report date column.
import numpy as np
import pandas as pd
# file_path points to the downloaded USDA Excel workbook
farm_raw = pd.read_excel(file_path, sheet_name=0, header=2, nrows=25)
report_date = farm_raw.columns[4]
As you can tell, the above dataset is pretty messy. The first thing we will want to do is make the row at index 1 (remember, it's zero-indexed) the column names. After that we can drop rows and columns that have NaNs in them, as well as the two columns that contain year-over-year percent change (we will build a separate dataframe for those in a future post).
farm_raw.columns = farm_raw.iloc[1]
# Remove rows that have NaNs in them
farm_raw = farm_raw.dropna(axis=0, how='all')
# Remove the one column that is an NaN
## The below code slices the DataFrame to include only columns that do not have a null name
farm_raw = farm_raw.loc[:, farm_raw.columns.notnull()]
farm_raw = farm_raw.drop(columns=['2019 - 20', '2020 - 21F'])
Let's set the first column as the index so it is a little easier to work with. Also, we can go ahead and delete the first four rows of the dataframe, as they don't contain useful information.
## Select the first column's label and set that column as the index
farm_raw = farm_raw.set_index(farm_raw.columns[0])
farm_raw = farm_raw.drop(index=farm_raw.index[:4])
farm_raw.index
As we can see, the index items are messy, with various letters preceding the names and footnotes still present (e.g. '1/'). We will use a series of pandas string methods to clean up that text column.
farm_raw.index = farm_raw.index.str.lower()
# Using regular expressions to remove the lower-case row labels
# E.g. the 'a.' in 'a. Cash receipts'
## Since there is no str.remove function we will just replace the pattern we want to drop with an empty string
farm_raw.index = farm_raw.index.str.replace(r'[a-z]\.', '', regex=True)
# Remove all the parentheses and the characters within them
# E.g. the '(a+b+c)' in 'g. Gross cash income (a+b+c)'
farm_raw.index = farm_raw.index.str.replace(r'\(([^\)]+)\)', '', regex=True)
# Remove all the footnote labels
# E.g. the '2/' in 'Federal Government direct farm program payments'
farm_raw.index = farm_raw.index.str.replace(r'[1-9]/', '', regex=True)
# Remove all commas
farm_raw.index = farm_raw.index.str.replace(',', '')
# Remove all the white space before and after the string
farm_raw.index = farm_raw.index.str.strip()
# Replace spaces with underscores
farm_raw.index = farm_raw.index.str.replace(' ', '_')
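To see what this chain of replacements actually does, here is the same sequence applied to two labels written in the style of the USDA sheet (illustrative examples, not copied from the file):

```python
import pandas as pd

# Labels in the style of the worksheet's 'A' column
idx = pd.Index(['a. Cash receipts 1/',
                'g. Gross cash income (a+b+c)'])

idx = idx.str.lower()
idx = idx.str.replace(r'[a-z]\.', '', regex=True)       # drop 'a.' prefixes
idx = idx.str.replace(r'\(([^\)]+)\)', '', regex=True)  # drop '(a+b+c)'
idx = idx.str.replace(r'[1-9]/', '', regex=True)        # drop footnotes like '1/'
idx = idx.str.replace(',', '')
idx = idx.str.strip()
idx = idx.str.replace(' ', '_')
print(list(idx))
```

Both labels come out as clean, underscore-separated identifiers that are easy to filter on later.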
With the columns cleaned up we can put the data into long format. The first thing we will do is transpose our dataframe (make the columns the rows and the rows the columns).
## This will allow us to melt the data frame (next step) easier
farm_raw = farm_raw.transpose().reset_index()
Next we will melt the dataframe. This powerful function (pd.melt()) makes our current wide dataframe into a long dataframe. The documentation describes it as this:
"one or more columns are identifier variables (in our case, the year column), while all other columns, considered measured variables, are 'unpivoted' to the row axis, leaving just two non-identifier columns, 'variable' (the measurement, e.g. 'cash_receipts') and 'value' (the balance values... the numbers)."
# After the transpose, the year column is literally named 1 (the label of
# the row we promoted to column names earlier), so we pass id_vars=[1]
farm_raw = pd.melt(frame=farm_raw, id_vars=[1])
As you can see, our dataframe now consists of just three columns. We could have kept it unmelted and it would have technically been tidy: each row was an observation and each column was a separate variable. However, we want to add two more columns for each observation - whether it was a forecast, and the date of the report it came from.
First let's rename those columns that we already have.
farm_raw.columns = ['year', 'balance_item','value']
Next let's add a boolean column (true or false) to show whether the observation is a forecast or not. The Excel indicates forecasts by appending an "F" to the year (2021F). We will use Numpy's where function to mark False for years that are just 4 digits and True for everything else (years with an "F" for forecast).
farm_raw['forecast'] = np.where(farm_raw['year'].astype(str).str.fullmatch(r'\d{4}'), False, True)
Now we can add a column showing the report date (remember we pulled this earlier from the spreadsheet and saved it into a variable report_date).
farm_raw['report_date'] = report_date
We're almost there! Let's dig a little deeper into our columns (variables) and see what data types they are.
farm_raw.dtypes
Good thing we checked! Both our year column and our value column are objects when we would want them to be integers and floats, respectively. This makes sense if you remember our original dataset (especially since we never manually assigned the columns' data types - a habit I should admittedly get better about). The columns these values came from likely mixed strings and numbers, so pandas stored them as the generic object dtype. Luckily this is an easy fix.
For the year column we have to get rid of the "F" for forecasted values, make sure it's only 4 digits, and then convert it to an integer type.
farm_raw['year'] = farm_raw['year'].astype(str)
# Remove all non-digits (\D). This is meant to drop the 'F'
farm_raw['year'] = farm_raw['year'].str.replace(r'\D', '', regex=True)
# Only include the 4 numbers for a year
farm_raw['year'] = farm_raw['year'].str.slice(stop=4)
# Convert the column to an integer
farm_raw['year'] = farm_raw['year'].astype(int)
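Here is that same year-cleaning chain run on a few illustrative values (made up to cover the shapes the sheet can produce: a forecast string, a plain integer, and a plain string):

```python
import pandas as pd

# Illustrative year labels, not taken from the actual file
years = pd.Series(['2021F', 2014, '2020'])

years = years.astype(str)
years = years.str.replace(r'\D', '', regex=True)  # drop the 'F'
years = years.str.slice(stop=4)                   # keep only 4 digits
years = years.astype(int)
print(list(years))
```

All three values come out as plain integers, regardless of how Excel originally stored them.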
The value column is much easier. We can just convert it to a float using the same .astype() method as above.
farm_raw.value = farm_raw.value.astype(float)
farm_raw.dtypes
Much better! All our values are now the datatypes we would expect. And with that, we've cleaned the data! There's still much more we can do. We can easily navigate and filter this dataframe with Pandas, append previous USDA reports, and create graphics. In a future post I'll show how we can create a corresponding dataframe that represents the data in year-over-year percent change - a valuable way to look at economic data.
As a final step, let's rename farm_raw to simply farm and then take a look at our clean and tidy dataset!
farm = farm_raw
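As a quick taste of how the tidy layout pays off, here is a sketch of the kind of filtering and reshaping the final frame supports. The rows below are made up for illustration; the real farm frame has the same five columns:

```python
import pandas as pd

# A few made-up rows in the tidy shape we just built
farm = pd.DataFrame({
    'year': [2019, 2020, 2021, 2019, 2020, 2021],
    'balance_item': ['cash_receipts'] * 3 + ['farm_assets'] * 3,
    'value': [370.0, 385.0, 421.0, 3100.0, 3200.0, 3350.0],
    'forecast': [False, False, True, False, False, True],
    'report_date': pd.Timestamp('2021-02-05'),
})

# Historical (non-forecast) cash receipts only
historical = farm[(farm['balance_item'] == 'cash_receipts') & ~farm['forecast']]
print(historical[['year', 'value']])

# Pivot back to wide whenever a report-style view is needed
wide = farm.pivot(index='year', columns='balance_item', values='value')
```

Because each variable has its own column, conditions like "non-forecast cash receipts" are one boolean expression, and pivoting back to the original report layout is a single call.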