We’ve been using the following PostgreSQL COPY command syntax to load Athena CSVs which have no quoted strings and are tab-delimited.
CSV HEADER QUOTE E
This command is orders of magnitude faster than any other way but
- requires superuser access,
- must be run manually on the database server,
- and the CSV files must be copied to the server beforehand.
Our DBA suggested that we leverage org.postgresql.copy.CopyManager to achieve similar speedy results. So I built a Java wrapper that passes a copy command and an input stream. Here’s the copy command that I’m passing in.
COPY concept FROM STDIN (DELIMITER E’\t’, FORMAT CSV, HEADER, QUOTE E’\b’)
As near as I can tell the arguments are identical to the \COPY command, yet there are many rows in the CONCEPT table with characters in CONCEPT_NAME that cause this STDIN version of the copy command to fail, while the \COPY version above handles them without issue. For example, concept ID 42764118 has hidden extended characters in its CONCEPT_NAME field starting at position 138. This causes the my STDIN version of the copy command to throw an error saying that the CONCEPT_NAME field is longer than its allotted 255 characters. Technically it is if you include the 2 hidden characters - but since they don’t take up any space in a text editor, the concept name appears to comply. The \COPY command not only handles this without error, it retains the hidden characters in the resulting CONCEPT table row. The result of the following query literally has over 255 characters in it yet Postgres somehow ignores them.
select concept_name from concept where concept_id = 42764118
Any insight or direction is appreciated.