How to normalize multilingual categorical data with ChatGPT
I had a dynamically updated BigQuery table containing a column of categorical data, but the values were in a mix of languages and formats. Here is how I solved that problem quickly and easily with ChatGPT.
I had a table with a column like this:
| Document type |
|---|
| Datasheet |
| برگه داده |
| data sheet |
| datu fitxa |
| User guide |
| … |
I exported the top 100 unique values and created a CSV file with an additional column (a query sketch follows the table). I gave one example for each of the categories I wanted the data to be converted to:
| Document type | Normalized document type |
|---|---|
| Datasheet | Data sheet |
| برگه داده | |
| data sheet | |
| datu fitxa | |
| User guide | User guide |
| … | |
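The export step can be done with a query along these lines. This is a minimal sketch: the table name `my_dataset.documents`, the column name `document_type`, and the ordering by frequency are my assumptions, not details from the original setup.

```sql
-- Minimal sketch of the export step. `my_dataset.documents` and
-- `document_type` are placeholder names, and ordering by frequency
-- is an assumption about what "top 100" means here.
SELECT
  document_type,
  '' AS normalized_document_type  -- blank column to hand-fill with examples
FROM
  `my_dataset.documents`
GROUP BY
  document_type
ORDER BY
  COUNT(*) DESC
LIMIT 100;
```

The result can be downloaded as a CSV file directly from the BigQuery console.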
I then explained to ChatGPT what I was doing and asked it to fill in the missing values, which it did easily:
| Document type | Normalized document type |
|---|---|
| Datasheet | Data sheet |
| برگه داده | Data sheet |
| data sheet | Data sheet |
| datu fitxa | Data sheet |
| User guide | User guide |
| … | … |
Finally, I wrote SQL to add a new column with the normalized document type to the existing table, using the ChatGPT-generated table as a lookup table. It worked great and was quick and easy to do.
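Roughly, assuming the filled-in CSV has been loaded back into BigQuery (for example via `bq load` or the console) as `my_dataset.document_type_lookup`, the step could look like this; every table and column name below is a placeholder:

```sql
-- Rough sketch of the lookup step; all names are placeholders.
-- Assumes the filled-in CSV was loaded back as
-- `my_dataset.document_type_lookup` with the two columns
-- `document_type` and `normalized_document_type`.
ALTER TABLE `my_dataset.documents`
  ADD COLUMN IF NOT EXISTS normalized_document_type STRING;

UPDATE `my_dataset.documents` AS d
SET normalized_document_type = l.normalized_document_type
FROM `my_dataset.document_type_lookup` AS l
WHERE d.document_type = l.document_type;
```

Since the source table is dynamically updated, any new document types not yet in the lookup table are left NULL, so the update needs re-running as the mapping is extended with new values.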
I think this is the type of data processing that becomes easy with ChatGPT but would be very time-consuming to do without it.