PostgreSQL COPY command using STDIN chokes on weird characters in CONCEPT.CONCEPT_NAME

Jeff_Stroup · October 13, 2022, 9:49pm

Hi,

We’ve been using the following PostgreSQL COPY command syntax to load Athena CSVs which have no quoted strings and are tab-delimited.

\COPY CONCEPT FROM 'CONCEPT.csv' WITH DELIMITER E '\t' CSV HEADER QUOTE E '\b' ;

This command is orders of magnitude faster than any other way but

requires superuser access,
must be run manually on the database server,
and the CSV files must be copied to the server beforehand.

Our DBA suggested that we leverage org.postgresql.copy.CopyManager to achieve similar speedy results. So I built a Java wrapper that passes a copy command and an input stream. Here’s the copy command that I’m passing in.

COPY concept FROM STDIN (DELIMITER E’\t’, FORMAT CSV, HEADER, QUOTE E’\b’)

As near as I can tell the arguments are identical to the \COPY command, yet there are many rows in the CONCEPT table with characters in CONCEPT_NAME that cause this STDIN version of the copy command to fail, while the \COPY version above handles them without issue. For example, concept ID 42764118 has hidden extended characters in its CONCEPT_NAME field starting at position 138. This causes the my STDIN version of the copy command to throw an error saying that the CONCEPT_NAME field is longer than its allotted 255 characters. Technically it is if you include the 2 hidden characters - but since they don’t take up any space in a text editor, the concept name appears to comply. The \COPY command not only handles this without error, it retains the hidden characters in the resulting CONCEPT table row. The result of the following query literally has over 255 characters in it yet Postgres somehow ignores them.

select concept_name from concept where concept_id = 42764118

Any insight or direction is appreciated.

Thanks,

Jeff

Eduard_Korchmar · October 14, 2022, 7:07pm

I’m not very familiar with how Java encodes text data, but maybe explicit conversion to UTF8 should help?

Jeff_Stroup · October 18, 2022, 3:21pm

Eduard,
Changing this…

BufferedReader inStream = new BufferedReader(new FileReader(sourcePath));

… to this

			BufferedReader inStream = new BufferedReader(
			           new InputStreamReader(
			                      new FileInputStream(sourcePath), "UTF8"));

did actually resolve the issue with accurately migrating special language symbols, but I’m still working on getting CopyManager.copyIn to not choke on the weird hidden characters. Someone, somewhere knows why the /COPY version from PSQL handles it without complaint.