How to delete words from column names using Python
While analyzing large datasets, we often find the same words repeated across multiple column names. These repeated words add nothing when comparing columns, so we remove them and keep only the part of each name that is distinctive.
There are multiple ways to approach any problem, and this one is no different. One of the most common approaches is to rename the columns individually, one by one. Let’s say we have a dataset on mobile phones, and its first column is Manufacture_name. We can easily replace that with Name or Company. However, if several columns contain Manufacture, renaming each one by hand becomes tedious. The steps below handle them all at once.
1. Get essential libraries
import pandas as pd #Loading packages
2. Import the dataset; here we assume a CSV file
data = pd.read_csv('filepath\\filename')
#filename should also include the file extension, like .csv or .txt
#For large files where pandas warns about mixed column types, use
data = pd.read_csv('filepath\\filename', low_memory = False)
3. (Optional) Check the dataset and column names
data.head()
4. Creating a list of custom words we want to remove
Instead of running the same program several times to remove different words, we create a list of words so that the code only needs to run once. For example, suppose our dataset has two common words, Manufacture and Supplier, in column names such as Manufacture_name, Manufacture_id, Manufacture_Location, Supplier_count, Supplier_Amount, and so on.
words = ['Manufacture_' , 'Supplier_'] #Python is case sensitive
#Each word needs to match the text in the column name exactly
5. Creating a new list for modified column names
c = data.columns.tolist() #Current column names as a list
for i in range(len(c)): #Loop over every column
    for word in words: #Loop over every word
        c[i] = c[i].replace(word, '') #Remove the word from the name
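The same loop can also be wrapped in a small reusable function. This is only a minimal sketch, not part of the walkthrough itself; the function name `strip_words` and the sample column names are assumptions chosen for illustration.

```python
def strip_words(columns, words):
    """Return a new list of column names with every entry of `words` removed."""
    cleaned = []
    for name in columns:
        for word in words:
            # str.replace is case sensitive, so each word must match exactly
            name = name.replace(word, '')
        cleaned.append(name)
    return cleaned

# Sample names modeled on the example columns (illustrative only)
columns = ['Manufacture_name', 'Manufacture_id', 'Supplier_count']
words = ['Manufacture_', 'Supplier_']
new_columns = strip_words(columns, words)
print(new_columns)  # ['name', 'id', 'count']
```

Because the function returns a new list instead of changing the original, the old names remain available for comparison before anything is assigned back to the dataset.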
6. Modifying the existing column names with new names
data.columns = c
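One thing to watch for: removing different words can leave two columns with the same name, for example if the dataset had both Manufacture_name and Supplier_name. pandas accepts duplicate column names without complaint, so a quick check before assigning may help. The sketch below uses a toy DataFrame with hypothetical values in place of the real dataset, and the duplicate check is an optional extra rather than a required step.

```python
import pandas as pd

# Toy DataFrame standing in for the real dataset (hypothetical values)
data = pd.DataFrame({
    'Manufacture_name': ['Acme'],
    'Manufacture_id': [1],
    'Supplier_name': ['Globex'],
})
words = ['Manufacture_', 'Supplier_']

# Same renaming loop as in step 5
c = data.columns.tolist()
for i in range(len(c)):
    for word in words:
        c[i] = c[i].replace(word, '')

# Collect any names that appear more than once after the words are removed
duplicates = sorted({name for name in c if c.count(name) > 1})
if duplicates:
    print('Skipping rename, duplicate names:', duplicates)
else:
    data.columns = c
```

Here both name columns collapse to the same label, so the rename is skipped and the original column names are left in place. In that situation, a shorter entry in the words list (or renaming one column by hand) keeps the names distinct.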
7. (Optional) Check the dataset again for updated column names
data.head()
8. Export the dataset for future use, assuming a CSV file
data.to_csv('filepath\\customname.csv',index=False)
Running this code before analysis leaves each column with a short, distinctive name: the relevant part of every name is kept, and the common, uninformative words are removed.