Tidy Data in Python
A tutorial on converting a messy Excel spreadsheet into a tidy, long-format Pandas DataFrame.
In this post we will be walking through the process of converting a messy Excel worksheet into tidy data. According to Hadley Wickham's excellent book, R for Data Science, tidy data follows three main principles:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
The initial work of organizing the data will pay dividends down the road as your data will be uniform and easier to work with.
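To make those three principles concrete, here is a minimal sketch (with made-up numbers, not the USDA figures) of the same data in wide and tidy form. In the wide table the variable "year" is spread across the column headers; melting it gives each observation its own row:

```python
import pandas as pd

# A toy "wide" table: each year is its own column,
# so the variable "year" lives in the column headers
wide = pd.DataFrame({
    'item': ['cash_receipts', 'farm_expenses'],
    '2019': [370.0, 345.0],
    '2020': [385.0, 350.0],
})

# Melting stacks the year columns so each (item, year) observation
# gets its own row and each variable gets its own column
tidy = wide.melt(id_vars='item', var_name='year', value_name='amount')
print(tidy)
```

This is exactly the transformation we will apply (with a few extra cleaning steps) to the USDA sheet below.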
For this tutorial we will be using the farm sector balance sheet provided by the United States Department of Agriculture (USDA).

The USDA worksheet contains a time series from 2014 to 2020 (with forecasted values for 2021). It is in a wide format, with the variables (the items in column 'A') stored as rows instead of columns. Our goal will be to transform the dataframe into the shape below, with five columns:
- Year
- Balance item
- Amount
- Forecast (a boolean column indicating true if the amount is a forecast or historical data)
- Report date
The last two columns (forecast and report date) may seem a little unnecessary. I included them because they could serve as helpful keys if we were to fold the report into a larger database. For instance, if we kept an archival database of all the farm sector balance sheets, we could quickly match observations to their report date. I also like to flag whether a value is a forecast, since that opens up interesting questions about how the forecasts change over time and how they compare to the actual data.
The first step is to load our packages and then the Excel data into a dataframe. All we need is Numpy and Pandas. We will use Pandas' read_excel() function to load the dataset. We'll pull data starting in row 3 of the Excel (we use header=2 here because read_excel() is zero-indexed while the spreadsheet is indexed at 1) and we'll read just the first table (nrows=25). Immediately after loading the data, we will pull the date of the report into a variable that will be helpful once we create the report date column.
import numpy as np
import pandas as pd
# file_path points to the downloaded USDA Excel workbook
farm_raw = pd.read_excel(file_path, sheet_name=0, header=2, nrows=25)
report_date = farm_raw.columns[4]
As you can tell, the above dataset is pretty messy. The first thing we will want to do is make the row at index 1 (remember, it's zero-indexed) the column names. After that we can drop rows and columns that have NaNs in them, as well as the two columns that contain year-over-year percent change (we will build a separate dataframe for those in a future post).
farm_raw.columns = farm_raw.iloc[1]
# Remove rows that have NaNs in them
farm_raw = farm_raw.dropna(axis=0, how='all')
# Remove the one column that is an NaN
## The below code slices the DataFrame to include only columns that do not have a null name
farm_raw = farm_raw.loc[:, farm_raw.columns.notnull()]
farm_raw = farm_raw.drop(columns=['2019 - 20', '2020 - 21F'])
Let's set the first column as the index so it is a little easier to work with. Also, we can go ahead and delete the first four rows of the dataframe, as they don't contain useful information.
## Select the first column's label and set that column as the index
farm_raw = farm_raw.set_index(farm_raw.columns[0])
farm_raw = farm_raw.drop(index=farm_raw.index[:4])
farm_raw.index
As we can see, the index items are messy, with various letters preceding the names and footnotes still present (e.g. '1/'). We will use a series of pandas string methods to clean up that text column.
farm_raw.index = farm_raw.index.str.lower()
# Using regular expressions to remove the lower-case row labels
# E.g. the 'a.' in 'a. Cash receipts'
## Since there is no str.remove function we will just replace the pattern we want to drop with an empty string
farm_raw.index = farm_raw.index.str.replace(r'[a-z]\.', '', regex=True)
# Remove all the parentheses and the characters within them
# E.g. the '(a+b+c)' in 'g. Gross cash income (a+b+c)'
farm_raw.index = farm_raw.index.str.replace(r'\(([^\)]+)\)', '', regex=True)
# Remove all the footnote labels
# E.g. the '2/' in 'Federal Government direct farm program payments'
farm_raw.index = farm_raw.index.str.replace(r'[1-9]/', '', regex=True)
# Remove all commas
farm_raw.index = farm_raw.index.str.replace(',', '')
# Remove all the white space before and after the string
farm_raw.index = farm_raw.index.str.strip()
# Replace spaces with underscores
farm_raw.index = farm_raw.index.str.replace(' ', '_')
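To see what this chain of replacements actually does, here is the same sequence applied to two labels written in the style of the USDA sheet (illustrative examples, not copied from the file):

```python
import pandas as pd

# Labels in the style of the worksheet's 'A' column
idx = pd.Index(['a. Cash receipts 1/',
                'g. Gross cash income (a+b+c)'])

idx = idx.str.lower()
idx = idx.str.replace(r'[a-z]\.', '', regex=True)       # drop 'a.' prefixes
idx = idx.str.replace(r'\(([^\)]+)\)', '', regex=True)  # drop '(a+b+c)'
idx = idx.str.replace(r'[1-9]/', '', regex=True)        # drop footnotes like '1/'
idx = idx.str.replace(',', '')
idx = idx.str.strip()
idx = idx.str.replace(' ', '_')
print(list(idx))
```

Both labels come out as clean, underscore-separated identifiers that are easy to filter on later.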
With the columns cleaned up we can put the data into long format. The first thing we will do is transpose our dataframe (make the columns the rows and the rows the columns).
## This will allow us to melt the data frame (next step) easier
farm_raw = farm_raw.transpose().reset_index()
Next we will melt the dataframe. This powerful function (pd.melt()) makes our current wide dataframe into a long dataframe. The documentation describes it as this:
"one or more columns are identifier variables (in our case, the year column), while all other columns, considered measured variables, are 'unpivoted' to the row axis, leaving just two non-identifier columns, 'variable' (the measurement, e.g. 'cash_receipts') and 'value' (the balance values... the numbers)."
# After the transpose, the year column is literally named 1 (the label of
# the row we promoted to column names earlier), so we pass id_vars=[1]
farm_raw = pd.melt(frame=farm_raw, id_vars=[1])
As you can see, our dataframe now consists of just three columns. We could have kept it unmelted and it would have technically been tidy: each row was an observation and each column was a separate variable. However, we want to add two more columns for each observation - whether it was a forecast, and the date of the report it came from.
First let's rename those columns that we already have.
farm_raw.columns = ['year', 'balance_item','value']
Next let's add a boolean column (true or false) to show whether the observation is a forecast or not. The Excel indicates forecasts by appending an "F" to the year (2021F). We will use Numpy's where function to mark False for years that are just 4 digits and True for everything else (years with an "F" for forecast).
farm_raw['forecast'] = np.where(farm_raw['year'].astype(str).str.fullmatch(r'\d{4}'), False, True)
Now we can add a column showing the report date (remember we pulled this earlier from the spreadsheet and saved it into a variable report_date).
farm_raw['report_date'] = report_date
We're almost there! Let's dig a little deeper into our columns (variables) and see what data types they are.
farm_raw.dtypes
Good thing we checked! Both our year column and our value column are objects when we would want them to be integers and floats, respectively. This makes sense if you remember our original dataset (especially since we never manually assigned the columns' data types - a habit I should admittedly get better about). The columns these values came from likely mixed strings and numbers, so pandas stored them as the generic object dtype. Luckily this is an easy fix.
For the year column we have to get rid of the "F" for forecasted values, make sure it's only 4 digits, and then convert it to an integer type.
farm_raw['year'] = farm_raw['year'].astype(str)
# Remove all non-digits (\D). This is meant to drop the 'F'
farm_raw['year'] = farm_raw['year'].str.replace(r'\D', '', regex=True)
# Only include the 4 numbers for a year
farm_raw['year'] = farm_raw['year'].str.slice(stop=4)
# Convert the column to an integer
farm_raw['year'] = farm_raw['year'].astype(int)
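Here is that same year-cleaning chain run on a few illustrative values (made up to cover the shapes the sheet can produce: a forecast string, a plain integer, and a plain string):

```python
import pandas as pd

# Illustrative year labels, not taken from the actual file
years = pd.Series(['2021F', 2014, '2020'])

years = years.astype(str)
years = years.str.replace(r'\D', '', regex=True)  # drop the 'F'
years = years.str.slice(stop=4)                   # keep only 4 digits
years = years.astype(int)
print(list(years))
```

All three values come out as plain integers, regardless of how Excel originally stored them.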
The value column is much easier. We can just convert it to a float using the same .astype() method as above.
farm_raw.value = farm_raw.value.astype(float)
farm_raw.dtypes
Much better! All our values are now the datatypes we would expect. And with that, we've cleaned the data! There's still much more we can do. We can easily navigate and filter this dataframe with Pandas, append previous USDA reports, and create graphics. In a future post I'll show how we can create a corresponding dataframe that represents the data in year-over-year percent change - a valuable way to look at economic data.
As a final step, let's rename farm_raw to simply farm and then take a look at our clean and tidy dataset!
farm = farm_raw
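As a quick taste of how the tidy layout pays off, here is a sketch of the kind of filtering and reshaping the final frame supports. The rows below are made up for illustration; the real farm frame has the same five columns:

```python
import pandas as pd

# A few made-up rows in the tidy shape we just built
farm = pd.DataFrame({
    'year': [2019, 2020, 2021, 2019, 2020, 2021],
    'balance_item': ['cash_receipts'] * 3 + ['farm_assets'] * 3,
    'value': [370.0, 385.0, 421.0, 3100.0, 3200.0, 3350.0],
    'forecast': [False, False, True, False, False, True],
    'report_date': pd.Timestamp('2021-02-05'),
})

# Historical (non-forecast) cash receipts only
historical = farm[(farm['balance_item'] == 'cash_receipts') & ~farm['forecast']]
print(historical[['year', 'value']])

# Pivot back to wide whenever a report-style view is needed
wide = farm.pivot(index='year', columns='balance_item', values='value')
```

Because each variable has its own column, conditions like "non-forecast cash receipts" are one boolean expression, and pivoting back to the original report layout is a single call.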