In this project, I go through the process of Exploratory Data Analysis using MySQL Workbench. This is the second part of this project. In the previous video, I tackled the data cleaning part. Here is a link to part 1 – Data Cleaning and Preparation.
Once the data is clean, we now embark on the fun part, data analysis. Here, we want to analyze the data, breaking it down to bring out insights that will then guide decision making.
This is the cleaned dataset used World_Layoffs from my last video.
This query basically displayed the dataset for review.
SELECT *
FROM world_layoffs.layoffs_staging2;
This is a simple query that gives us the start and end data of our data.
SELECT MIN(date
), MAX(date
)
FROM world_layoffs.layoffs_staging2;
This query displays the max number of people and the max percentage of of people laid off in one go. The results show that 12000 people were laid off and the percentage was 100%. This basically means the entire company went under and laid off all its staff members.
SELECT MAX(total_laid_off), MAX(percentage_laid_off)
FROM world_layoffs.layoffs_staging2;
I was curious to know how many companies went under. The query below gives us the companies that had a percentage lay off of 100%.
SELECT *
FROM world_layoffs.layoffs_staging2
WHERE percentage_laid_off = 1
ORDER BY total_laid_off DESC;
Digging deeper, I ordered the data using funds raised in millions for the companies that went under. Clearly, they had raised a lot of money.
SELECT *
FROM world_layoffs.layoffs_staging2
WHERE percentage_laid_off = 1
ORDER BY funds_raised_millions DESC;
This query shows us the data for the one company – Uber. You can see the data on Uber for various countries. You can do this for any other company.
SELECT *
FROM world_layoffs.layoffs_staging2
WHERE company = ‘Uber’;
This query shows us the number of people laid off by each company, for all countries. The data is ordered in descending order by the total number of people laid off.
SELECT company, SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
GROUP BY company
ORDER BY 2 DESC;
This query gives us the sum total of people laid off by industry. The data is the ordered the Total Laid Off in descending order (highest to lowest)
SELECT industry, SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
GROUP BY industry
ORDER BY 2 DESC;
This gives us the sum of people laid off for each specific date, ordered by date from most current going down.
SELECT `date
`, SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
GROUP BY date
ORDER BY 1 DESC;
This query is similar to the one above but it gives the sum total of people laid off each year ordered from most current year going down.
SELECT YEAR(date
), SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
GROUP BY YEAR(date
)
ORDER BY 1 DESC;
This query gives us the sum of people laid off grouped by the stage the company is in and is ordered by the total laid off highest to lowest.
SELECT stage, SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
GROUP BY stage
ORDER BY 2 DESC;
This query gives us the average of percentage laid off. 1 represents 100% – Full closure of the a company from the highest percentage to the lowest.
SELECT company, AVG(percentage_laid_off)
FROM world_layoffs.layoffs_staging2
GROUP BY company
ORDER BY 2 DESC;
This query gives us the number of people laid off per month from the start (2020) to current (2023). Substring is used to extract the month and year from the full date (yyyy-mm-dd).
NOT NULL is used to exclude data where there is no date provided.
SELECT SUBSTRING(date
,1,7) AS Month
, SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
WHERE SUBSTRING(date
,1,7) IS NOT NULL
GROUP BY Month
ORDER BY 1 ASC;
In this query, I created a CTE (Common Table Expression) called Rolling Total Sum. This works just like a sub query but is easier to follow.
The CTE is then used to calculate the rolling total of people laid off for each month from the start of the data (2020) to current (2023). A rolling is just a cumulative total from the first month to the last month.
WITH Rolling_Total_Sum AS
(
SELECT SUBSTRING(date
,1,7) AS MONTH
, SUM(total_laid_off) AS Total_All
FROM world_layoffs.layoffs_staging2
WHERE SUBSTRING(date
,1,7) IS NOT NULL
GROUP BY MONTH
ORDER BY 1 ASC
)
SELECT MONTH
, Total_All, SUM(Total_All) OVER (ORDER BY MONTH
) Rolling_Total
FROM Rolling_Total_Sum;
Here is the resulting data showing the year/month, the total of layoffs and the rolling total.
Month 2020-03 | Total_All 9628 | Rolling_Total 9628 |
2020-04 | 26710 | 36338 |
2020-05 | 25804 | 62142 |
2020-06 | 7627 | 69769 |
2020-07 | 7112 | 76881 |
This query gives us the number of people laid off by each company every year ordered by the Total Laid Off highest to lowest.
SELECT company, YEAR(date
), SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
GROUP BY company, YEAR(date
)
ORDER BY 3 DESC;
In this query I used 2 CTE’s to rank companies in terms of total people laid off. The company with the highest number of people laid off is ranked 1 and so on.
WITH Company_Year (Company, Year, Total_Laid_Off) AS
(
SELECT company, YEAR(`date`), SUM(total_laid_off)
`)
FROM world_layoffs.layoffs_staging2
GROUP BY company, YEAR(`date
), Campany_Year_Rank AS
(
SELECT *, DENSE_RANK() OVER (PARTITION BY Year ORDER BY Total_Laid_Off DESC) AS Ranking
FROM Company_Year
WHERE Years IS NOT NULL
)
SELECT *
FROM Campany_Year_Rank
WHERE Ranking <=5;
We want to to see the top 5 companies with the highest number of lay offs for each year.
Company Uber | Year 2020 | Total_Laid_Off 7525 | Ranking 1 |
Booking.com | 2020 | 4375 | 2 |
Groupon | 2020 | 2800 | 3 |
Swiggy | 2020 | 2250 | 4 |
Airbnb | 2020 | 1900 | 5 |
Bytedance | 2021 | 3600 | 1 |
Katerra | 2021 | 2434 | 2 |
Zillow | 2021 | 2000 | 3 |
Instacart | 2021 | 1877 | 4 |
WhiteHat Jr | 2021 | 1800 | 5 |
Meta | 2022 | 11000 | 1 |
Amazon | 2022 | 10150 | 2 |
Cisco | 2022 | 4100 | 3 |
Peloton | 2022 | 4084 | 4 |
Carvana | 2022 | 4000 | 5 |
Philips | 2022 | 4000 | 5 |
Data Summarization
Below we have several queries that summarize the data to highlight different aspects.
Summary by Country
SELECT country, COUNT(company) No_of_Companies, SUM(total_laid_off) Total_Laid_Off
FROM world_layoffs.layoffs_staging2
WHERE Total_Laid_Off IS NOT NULL
GROUP BY country
ORDER BY 3 DESC;
Country United States | No_of_Companies 1027 | Total_Laid_Off 256559 |
India | 126 | 35993 |
Netherlands | 9 | 17220 |
Sweden | 16 | 11264 |
Brazil | 66 | 10391 |
Germany | 55 | 8701 |
Summary by Industry
SELECT industry, COUNT(company) No_of_Companies, SUM(total_laid_off) Total_Laid_Off, SUM(funds_raised_millions) Funds_Raised_in_Millions
FROM world_layoffs.layoffs_staging2
GROUP BY industry
ORDER BY 2 DESC;
Industry Finance | No_of_Companies 239 | Total_Laid_Off 28344 | Funds_Raised_in_Millions 95658 |
Healthcare | 164 | 25953 | 51166 |
Retail | 163 | 43613 | 61688 |
Transportation | 129 | 33748 | 286810 |
Marketing | 123 | 10258 | 20038 |
Summary by Stage
SELECT stage, COUNT(company) No_of_Companies, SUM(total_laid_off) Total_Laid_Off, SUM(funds_raised_millions) Funds_Raised_in_Millions
FROM world_layoffs.layoffs_staging2
GROUP BY stage
ORDER BY 4 DESC;
Stage Post-IPO | No_of_Companies 382 | Total_Laid_Off 204132 | Funds_Raised_in_Millions 1004572 |
Unknown | 315 | 40716 | 137924 |
Series D | 203 | 19225 | 72651 |
Series H | 21 | 7244 | 72574 |
Series E | 107 | 12697 | 58948 |
Summary by City/Location
SELECT location, COUNT(company) No_of_Companies, SUM(total_laid_off) Total_Lay_Offs
FROM world_layoffs.layoffs_staging2
GROUP BY location
ORDER BY 2 DESC;
Location SF Bay Area | No_of_Companies 527 | Total_Lay_Offs 125631 |
New York City | 216 | 29364 |
Boston | 95 | 10785 |
Bengaluru | 77 | 21787 |
Los Angeles | 72 | 6415 |
Seattle | 63 | 34743 |
With the data analysis carried out, this data is now ready for visualization.
1. Exploratory Data Analysis 2. Data Insights 3. Data Summarization