How to normalize multilingual categorical data with ChatGPT

data processing
prompting
Published October 17, 2024

Background

I had a dynamically updated BigQuery table with a column of categorical data, but the values in that column were in multiple languages and formats. This is how I solved that problem quickly and easily with ChatGPT.

The column looked like this:

Document type
Datasheet
برگه داده
data sheet
datu fitxa
User guide

I exported the top 100 unique values and created a CSV file with an additional column for the normalized value. In that column, I filled in one example for each of the categories I wanted the data converted to:

Document type | Normalized document type
Datasheet | Data sheet
برگه داده |
data sheet |
datu fitxa |
User guide | User guide
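
A template like the one above could also be generated in code. This is a minimal sketch, assuming the unique values have already been exported into a Python list; the names `unique_values`, `seed_examples`, and `build_template` are hypothetical and not part of the original workflow.

```python
import csv
import io

# Hypothetical sample of the unique values exported from the table.
unique_values = ["Datasheet", "برگه داده", "data sheet", "datu fitxa", "User guide"]

# One hand-labeled example per target category; every other row stays blank.
seed_examples = {"Datasheet": "Data sheet", "User guide": "User guide"}


def build_template(values, examples):
    """Return CSV text whose second column is blank except for the seed examples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Document type", "Normalized document type"])
    for value in values:
        writer.writerow([value, examples.get(value, "")])
    return buffer.getvalue()


template_csv = build_template(unique_values, seed_examples)
print(template_csv)
```

The resulting text can be saved as a file or pasted directly into the chat.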

I then explained to ChatGPT what I was doing and asked it to fill in the missing values, which it did easily:

Document type | Normalized document type
Datasheet | Data sheet
برگه داده | Data sheet
data sheet | Data sheet
datu fitxa | Data sheet
User guide | User guide
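
Before using a filled-in table like this as a lookup, it may be worth checking it automatically. The sketch below is an optional extra, not a step from the original workflow. It assumes the completed table was saved as CSV text and that the set of target categories is known; `filled_csv`, `load_lookup`, and `check_lookup` are hypothetical names.

```python
import csv
import io

# Hypothetical copy of the completed table, saved as CSV text.
filled_csv = """Document type,Normalized document type
Datasheet,Data sheet
برگه داده,Data sheet
data sheet,Data sheet
datu fitxa,Data sheet
User guide,User guide
"""

allowed_categories = {"Data sheet", "User guide"}
expected_values = ["Datasheet", "برگه داده", "data sheet", "datu fitxa", "User guide"]


def load_lookup(csv_text):
    """Parse the CSV text into a dict: raw value -> normalized value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Document type"]: row["Normalized document type"] for row in reader}


def check_lookup(lookup, expected, allowed):
    """Return a list of problems; an empty list means the table looks usable."""
    problems = []
    for value in expected:
        if value not in lookup:
            problems.append(f"missing input value: {value!r}")
    for value, category in lookup.items():
        if category not in allowed:
            problems.append(f"unexpected category {category!r} for {value!r}")
    return problems


lookup = load_lookup(filled_csv)
problems = check_lookup(lookup, expected_values, allowed_categories)
print(problems or "lookup table looks consistent")
```

A check like this catches dropped rows and invented category names, but not a value assigned to the wrong valid category, so a quick manual read-through is still useful.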

Finally, I wrote SQL to add a new column with the normalized document type to the existing table, using the ChatGPT-generated table as a lookup table. It worked well and was quick to do.
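
The original SQL is not shown here, so what follows is only a rough sketch of the lookup-table join idea. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for BigQuery. The table names `documents` and `doc_type_lookup`, the column names, and the unmapped value "Manual" are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the existing table (hypothetical schema).
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, document_type TEXT)")
conn.executemany(
    "INSERT INTO documents (document_type) VALUES (?)",
    [("Datasheet",), ("برگه داده",), ("datu fitxa",), ("User guide",), ("Manual",)],
)

# Stand-in for the generated lookup table.
conn.execute(
    "CREATE TABLE doc_type_lookup ("
    "document_type TEXT PRIMARY KEY, normalized_document_type TEXT)"
)
conn.executemany(
    "INSERT INTO doc_type_lookup VALUES (?, ?)",
    [
        ("Datasheet", "Data sheet"),
        ("برگه داده", "Data sheet"),
        ("data sheet", "Data sheet"),
        ("datu fitxa", "Data sheet"),
        ("User guide", "User guide"),
    ],
)

# LEFT JOIN keeps every original row; values without a mapping come back as NULL.
rows = conn.execute(
    """
    SELECT d.document_type, l.normalized_document_type
    FROM documents AS d
    LEFT JOIN doc_type_lookup AS l
      ON d.document_type = l.document_type
    ORDER BY d.id
    """
).fetchall()

for document_type, normalized in rows:
    print(document_type, "->", normalized)
```

Using a left join rather than an inner join means new, unmapped values in a dynamically updated table show up as nulls instead of silently disappearing, which makes it easier to spot when the lookup table needs another pass.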

I think this is the type of data processing that becomes easy with ChatGPT and would be very time-consuming to do without it.