Easily Transform Data Using Pandas and Regular Expression
Introduction
Let’s Easily transform this dataset using pandas and regular expression in line with the problem statement.
In the extraction phase of this project I walked you through on how to scrape the ebay.co.uk website. You can find the link 👉 Extraction phase
In this transformation phase, you will understand all the thought process that went into transforming this 3 field dataset into 11 fields dataset.
To properly understand where you are going, it is important you remember what the problem statement is.
Problem Statement
The CEO of Global Sygnature, a company that sells new laptops in the city of Port Harcourt has gotten a lot of requests for fairly used quality laptops. He has decided to go into this line of business to expand the services they render.
He has met Primedsoft, a Business Intelligence Agency to help him extract data about laptops from ebay. He said he would like to see metrics such as:
- Average Price of laptops
- Most expensive laptop listed
- Least expensive laptop listed
- Average prices of major brands
- The top most popular brands
- Most listed laptops by their OS
- Prices of laptops base on their current state
- Average price of laptops by drive type, disk size and RAM size.
- Any other insight that can be gotten from the data that will help in decision making.
These metrics will form the core of the transformation phase. You want to transform this dataset into fields and data types that will help get these metrics.
Another important skill here is a proper understanding of the business domain. The importance of Domain knowledge cannot be overemphasized.
For instance, a proper understanding of the computer hardware will show that accessories such as RAM and processor are key data that affect the prices of laptops. Most buyers also consider brands before making up their minds on the system to go for.
A quick look at the dataset shows that there are so many data points in the “Lap detail” field. These data points are what we will aim to extract and form new fields.
Importing Numpy and Pandas Libraries and Regular REGEX Module
We will be using the pandas library and regular expression module in python to do this transformation. First import the necessary libraries and modules needed for this transformation.
The last setting in the code snippet is to tell pandas to display the entire width of each column. At this stage I expect that you have working knowledge of pandas. So I went ahead to import the csv file where the dataset is and I used the .head() method to inspect the first 5 columns.
Creating New Columns From Existing Columns With Regular Expression
Here, I wanted to extract the drives from the “Lab detail” column. So I used the extract method in regex to get all instances of any of those numbers that have GB attached to it. This will return the first occurrence of it.
Now, using the same method I extract and create “RAM”, “RAM size” and “Laptop OS” columns.
However, there are some columns that require a switch statement. To achieve this with pandas, I used the numpy where.() method.
So with the code snippet above I created “Lap OS”, “Drive Type”, “Processor Type”, and “Laptop Brand”. This completes all the necessary columns we need.
Here I renamed all the fields to snake case naming convention. This will enable consistency.
Then replace all instances of the comma with an empty string.
The reason is since we are saving it in a csv format, it will take every occurrence of the comma as a new column, which will in turn create more columns than declared in the variables.
Conclusion
We have been able to transform a single column into several other columns. Let’s save this dataframe in a csv file for the next phase of the project, which is to load it into a postgres database. Now this project is gradually looking like what we want.
One Comment