How to normalize multilingual categorical data with ChatGPT
I had a dynamically updated BigQuery table containing a column of categorical data, but the values were in a mix of languages and formats. Here is how I solved that problem quickly and easily with ChatGPT.
I had a table with a column like this:
| Document type |
|---|
| Datasheet |
| برگه داده |
| data sheet |
| datu fitxa |
| User guide |
| … |
I exported the top 100 unique values and created a CSV file with an additional column (a query sketch follows the table). I gave one example for each of the categories I wanted the data to be converted to:
| Document type | Normalized document type |
|---|---|
| Datasheet | Data sheet |
| برگه داده | |
| data sheet | |
| datu fitxa | |
| User guide | User guide |
| … | |
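The export step can be done with a query along these lines. This is a minimal sketch: the table name `my_dataset.documents`, the column name `document_type`, and the ordering by frequency are my assumptions, not details from the original setup.

```sql
-- Minimal sketch of the export step. `my_dataset.documents` and
-- `document_type` are placeholder names, and ordering by frequency
-- is an assumption about what "top 100" means here.
SELECT
  document_type,
  '' AS normalized_document_type  -- blank column to hand-fill with examples
FROM
  `my_dataset.documents`
GROUP BY
  document_type
ORDER BY
  COUNT(*) DESC
LIMIT 100;
```

The result can be downloaded as a CSV file directly from the BigQuery console.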
I then explained to ChatGPT what I was doing and asked it to fill in the missing values, which it did easily:
| Document type | Normalized document type |
|---|---|
| Datasheet | Data sheet |
| برگه داده | Data sheet |
| data sheet | Data sheet |
| datu fitxa | Data sheet |
| User guide | User guide |
| … | … |
Finally, I wrote SQL to add a new column with the normalized document type to the existing table, using the ChatGPT-generated table as a lookup table. It worked great and was quick and easy to do.
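Roughly, assuming the filled-in CSV has been loaded back into BigQuery (for example via `bq load` or the console) as `my_dataset.document_type_lookup`, the step could look like this; every table and column name below is a placeholder:

```sql
-- Rough sketch of the lookup step; all names are placeholders.
-- Assumes the filled-in CSV was loaded back as
-- `my_dataset.document_type_lookup` with the two columns
-- `document_type` and `normalized_document_type`.
ALTER TABLE `my_dataset.documents`
  ADD COLUMN IF NOT EXISTS normalized_document_type STRING;

UPDATE `my_dataset.documents` AS d
SET normalized_document_type = l.normalized_document_type
FROM `my_dataset.document_type_lookup` AS l
WHERE d.document_type = l.document_type;
```

Since the source table is dynamically updated, any new document types not yet in the lookup table are left NULL, so the update needs re-running as the mapping is extended with new values.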
I think this is the type of data processing that becomes easy with ChatGPT but would be very time-consuming to do without it.