How to normalize multilingual categorical data with ChatGPT
I had a dynamically updated BigQuery table containing a categorical column, but its values were in a mix of languages and formats. Here is how I solved that problem quickly and easily with ChatGPT.
I had a table with a column like this:
| Document type |
|---|
| Datasheet |
| برگه داده |
| data sheet |
| datu fitxa |
| User guide |
| … |
I exported the top 100 unique values and created a CSV file with an additional column, giving one example for each of the categories I wanted the data converted to (a sketch of the export query follows the table):
| Document type | Normalized document type |
|---|---|
| Datasheet | Data sheet |
| برگه داده | |
| data sheet | |
| datu fitxa | |
| User guide | User guide |
| … | |
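For reference, the export step can be done with a query along these lines. This is a minimal sketch: the table name `mydataset.documents` and column name `document_type` are hypothetical stand-ins for my actual schema.

```sql
-- Pull the 100 most frequent distinct values, so the lookup
-- table covers the bulk of the rows.
SELECT
  document_type,
  COUNT(*) AS occurrences
FROM mydataset.documents
GROUP BY document_type
ORDER BY occurrences DESC
LIMIT 100;
```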
I then explained to ChatGPT what I was doing and asked it to fill in the missing rows, which it did easily:
| Document type | Normalized document type |
|---|---|
| Datasheet | Data sheet |
| برگه داده | Data sheet |
| data sheet | Data sheet |
| datu fitxa | Data sheet |
| User guide | User guide |
| … | … |
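With the table filled in, the completed CSV goes back into BigQuery as a lookup table. Uploading through the console works fine; a `LOAD DATA` statement is another option. A sketch, again under hypothetical names (the bucket path is an assumption):

```sql
-- Load the ChatGPT-completed CSV into a lookup table.
LOAD DATA OVERWRITE mydataset.doc_type_lookup (
  document_type STRING,
  normalized_document_type STRING
)
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- the CSV has a header row
  uris = ['gs://my-bucket/doc_type_lookup.csv']
);
```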
Finally, I wrote SQL to add a new column with the normalized document type to the existing table, using the ChatGPT-generated table as a lookup. It worked great, and the whole thing was quick and easy to do.
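That final step looked roughly like this, under the same hypothetical names. Rows whose document type is not in the lookup table are simply left NULL.

```sql
-- Add the column, then populate it from the lookup table.
ALTER TABLE mydataset.documents
  ADD COLUMN IF NOT EXISTS normalized_document_type STRING;

UPDATE mydataset.documents AS d
SET normalized_document_type = l.normalized_document_type
FROM mydataset.doc_type_lookup AS l
WHERE d.document_type = l.document_type;
```

Since the source table is dynamically updated, the UPDATE has to be re-run (or scheduled) as new rows arrive; a view with a LEFT JOIN against the lookup table is an alternative that avoids that.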
I think this is exactly the type of data processing that ChatGPT makes easy, and that would be very time-consuming to do without it.