One of the first step a data analyst takes is removing duplicates from their dataset. Duplicates can read to inaccurate data analysis and therefore impact decisions made from the analyzed data.

Here are three simple ways that I use to remove duplicates from my dataset:

Remove Duplicates

Once you have your spreadsheet open, ensure that you have selected any sell in the dataset. Navigate to the toolbar and click on the Data tab. Navigate to the Data Tools section and click on the remove duplicates tab.

Using the Function Unique()

On an empty cell, type =. This allows you to type in the function. Type unique and hit your tab key. Select the table containing your dataset and press enter.

You will now have another table with unique values only.

The biggest advantage of using this method is the data will automatically be updated every time new data is added to the original table. You will not have to repeat this process again.

Removing duplicates in Power Query

Power Query is a useful and powerful tool used for data cleaning an manipulation.

With your spreadsheet open, navigate to the toolbar section and click on data. Got to “Get and transform Data” and select the table range. This will load your data on Power Query.

In Power Query, navigate to the tool bar and click on the home tab. Find the Remove Rows button and click on remove duplicates.

Deleting data in Power Query is awesome because you do not have to repeat the process. Once data is added to the original dataset, all you have to do is refresh your cleaned dataset and all changes will be updated.

You can also right click on the Title and find the remove duplicates button.

Caution

Power Query is case sensitive. Ensure that all your data is in the same case to ovoid mistakes. Example: The name Irene and irene is treated as 2 unique names in Power Query simply because of the case!

Watch out for that!

Leave a Reply

Your email address will not be published. Required fields are marked *

LinkedIn
LinkedIn
Follow by Email
RSS