In this project, I go through the process of data cleaning or data wrangling using MySQL Workbench.

SQL is an extremely powerful tool for data cleaning/data wrangling. You can delete/populate duplicates, remove unnecessary columns and rows, standardize data and so on. This are necessary steps to take before data analysis for the best results.

The dataset used for this project is provided here. Layoffs Data.

Creating the Database/Schema and Importing the Table

The first step is to create a database or schema called world_lay_offs.

CREATE DATABASE world_layoffs;

After creating the new schema, right click on tables and use the Table Data Import Wizard to import a table called layoffs (provided)

Viewing the table created.

SELECT *
FROM world_layoffs.layoffs;

The second step is to create a DUPLICATE DATABASE and insert data from the original database. The duplicate is called layoffs_staging.

NOTE: As best practice always create a duplicate and avoid making changes to the original database.

Create duplicate database.

View the empty database created.

SELECT *
FROM world_layoffs.layoffs_staging;

Insert data from the original database.

INSERT layoffs_staging
SELECT *
FROM layoffs;

Data Cleaning Project – Steps

These are the 4 steps followed for the data cleaning process.

Step 1 – Removing Duplicates

We need to create a unique identifier. This will be called Row_Num that will help us identify any duplicates.

SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions) Row_Num
FROM layoffs_staging;

Create a CTE (Common Table Expression) called duplicate_cte.

WITH duplicate_cte AS
(SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions) Row_Num
FROM layoffs_staging
)

Identifying duplicate rows – where their Row_Num is greater than 1.

SELECT *
FROM duplicate_cte
WHERE Row_Num > 1;

Several Duplicated identify. Investigate further to ensure they are true duplicates. Below we explore a Company called Casper. We have a duplicate of this company.

SELECT *
FROM layoffs_staging
WHERE company = ‘Casper’;

We explored 2 options to delete the duplicates.

OPTION ONE
The code below should delete all the duplicate row, where the Row_Num is greater than 1 but it does not work.

WITH duplicate_cte AS
(
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions) Row_Num
FROM layoffs_staging
)
DELETE
FROM duplicate_cte
WHERE Row_Num > 1;

OPTION TWO

We have to create another staging table with the NEW row ,Row_Num and call it layoffs_staging2.
The easier way is to right click on the lay_off_staging, Click Copy to Clipboard Click Create Statement. Paste the statement.
This gives you all the rows and their data types. All you have to do is add Row_Num and specify data type as INT.

CREATE TABLE layoffs_staging2 (
company text,
location text,
industry text,
total_laid_off int DEFAULT NULL,
percentage_laid_off text,
date text,
stage text,
country text,
funds_raised_millions int DEFAULT NULL,
row_num int
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

View the new empty table created.

SELECT *
FROM layoffs_staging2;

Add data from layoffs_staging (The first staging table).

INSERT INTO layoffs_staging2
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions) Row_Num
FROM layoffs_staging;

Filter using the Row_Num to identify duplicates.

SELECT *
FROM layoffs_staging2
WHERE row_num > 1;

Deleting duplicates from layoffs_staging2.

DELETE
FROM layoffs_staging2
WHERE row_num > 1;

Confirm if the duplicate been deleted.

SELECT *
FROM layoffs_staging2;

Step 2 – Standardizing Data

In this step, we are trimming out leading and trailing white spaces noted on the company column.

SELECT DISTINCT (company)
FROM world_layoffs.layoffs_staging2;

SELECT company, TRIM(company)
FROM world_layoffs.layoffs_staging2;

In this step we note an an error in the industry column. The industry Crypto appears 3 times, 2 of which are spelling errors. This error is corrected below.

SELECT DISTINCT industry
FROM world_layoffs.layoffs_staging2
ORDER BY 1;

SELECT *
FROM world_layoffs.layoffs_staging2
WHERE industry LIKE ‘Crypto%’;

UPDATE world_layoffs.layoffs_staging2
SET industry = ‘Crypto’
WHERE industry LIKE ‘Crypto%’;

SELECT DISTINCT industry
FROM world_layoffs.layoffs_staging2;

In this step we tackle the country column. The country United States in listed twice, United states and United States. (with a period at the end).

SELECT DISTINCT country
FROM world_layoffs.layoffs_staging2
ORDER BY 1;

SELECT *
FROM world_layoffs.layoffs_staging2
WHERE country LIKE ‘%United States%’;

Solution 1 – Update the date column from United States. to United States in the same way as above.

Solution 2 – trim trailing . This will trim the . on United States.

SELECT DISTINCT country, TRIM(TRAILING ‘.’ FROM country)
FROM world_layoffs.layoffs_staging2;

SELECT DISTINCT country
FROM world_layoffs.layoffs_staging2
WHERE country LIKE ‘United States%’;

UPDATE world_layoffs.layoffs_staging2
SET country = TRIM(TRAILING ‘.’ FROM country)
WHERE country LIKE ‘United States%’;

In this step, we are changing the data type of the date column. We need to change it from Text to Date.

SELECT date,
STR_TO_DATE(date,’%m/%d/%Y’)
FROM world_layoffs.layoffs_staging2;

UPDATE world_layoffs.layoffs_staging2
SET date = STR_TO_DATE(date,’%m/%d/%Y’);

SELECT date
FROM world_layoffs.layoffs_staging2;

Having changed the date format in the date column, now we can alter the table to reflect the same.

DESCRIBE world_layoffs.layoffs_staging2;

This shows as all the rows and their data type.

ALTER TABLE world_layoffs.layoffs_staging2
MODIFY COLUMN date DATE ;

DESCRIBE world_layoffs.layoffs_staging2;

Check to see if the column has been modified and yes the data type has been changed from text to date.

Step 3 -Dealing with Nulls and Blank Values


Start by deciding what to do. The decision is to populate the nulls or blanks with the correct value by checking other rows in a similar industry or company starting with Total_laid_off and percentage_laid_off.

SELECT *
FROM world_layoffs.layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

UPDATE world_layoffs.layoffs_staging2
SET industry = NULL
WHERE industry = ”;

SELECT *
FROM world_layoffs.layoffs_staging2
WHERE industry IS NULL
OR industry = ”;

SELECT *
FROM world_layoffs.layoffs_staging2
WHERE company = ‘Airbnb’;

SELECT *
FROM world_layoffs.layoffs_staging2 t1
JOIN world_layoffs.layoffs_staging2 t2
ON t1.company = t2.company
WHERE (t1.industry IS NULL OR t1.industry = ”)
AND t2.industry IS NOT NULL ;

UPDATE world_layoffs.layoffs_staging2 t1
JOIN world_layoffs.layoffs_staging2 t2
ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
AND t2.industry IS NOT NULL;

Step 4 – Deleting Unnecessary Rows


Where total_laid_off and percentage_laid_off is NULL

SELECT *
FROM world_layoffs.layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

DELETE
FROM world_layoffs.layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

DELETE
FROM world_layoffs.layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

Delete the row_num column because it is no longer useful for our analysis.

SELECT *
FROM world_layoffs.layoffs_staging2; DROP COLUMN row_num

Data Cleaning – Conclusion

Having gone through the four data cleaning steps, the data in now ready for analysis. In my next project I go through the data analysis step using MySQL Workbench.

Leave a Reply

Your email address will not be published. Required fields are marked *

LinkedIn
LinkedIn
Follow by Email
RSS