Importing products hangs on many products

Description

Trying to import about 70,000 products hangs because of an inefficient sql statements in ImportProduct.java.

Original statement
UPDATE I_Product i
SET I_IsImported='E', I_ErrorMsg=I_ErrorMsg||'ERR=UPC not unique,'
WHERE I_IsImported<>'Y'
AND UPC IN (SELECT UPC FROM I_Product ii WHERE i.AD_Client_ID=ii.AD_Client_ID GROUP BY UPC HAVING COUNT > 1)
this statement uses hours and then terminates with error.
Running in less than a second is

UPDATE I_Product i
SET I_IsImported='E', I_ErrorMsg=I_ErrorMsg||'ERR=UPC not unique,'
WHERE I_IsImported<>'Y'
AND exists (SELECT 1 FROM I_Product ii WHERE i.AD_Client_ID=ii.AD_Client_ID and i.i_product_id <> ii.i_product_id and i.upc = ii.upc)

The same is true for the statement checking for duplicate values in 'value' which runs better as
UPDATE I_Product i
SET I_IsImported='E', I_ErrorMsg=I_ErrorMsg||'ERR=Value not unique,'
WHERE I_IsImported<>'Y'
AND exists (SELECT 1 FROM I_Product ii WHERE i.AD_Client_ID=ii.AD_Client_ID and i.i_product_id <> ii.i_product_id and i.value = ii.value)

Environment

None

Activity

Show:
Carlos Ruiz
December 18, 2019, 2:31 PM

Hi , after revisiting this I think is a matter of having proper indexes for the table I_Product.

When importing a big number of records I guess is necessary to have correct indexes, in the case of the actual code (using IN) some indexes would be required, and in the case of your changes (using EXISTS) other indexes would be required.

Different versions of postgres must resolve different the same query - I think that's why nobody has complained until today about the performance of ImportProduct on this specific statement. Also, same version with different table cost analysis can lead to different results.

I would think is better to leave the code as is (tested) and we can add a suggestion for the required indexes. Anyways, iDempiere is getting far from the old importers in favor of the new CSV importer.

Regards,

Carlos Ruiz

Martin Schönbeck
December 18, 2019, 3:46 PM

Hi Carlos,

I’ve tested with several indexes because that was my first idea, too. But I didn’t get a usable solution. Obviously the version with exists always must be cheaper, because it requires only to find one set while counting the sets requires to find all. So even if you might find an index, exists must be at least as good.

 

Regards

Martin

Martin Schönbeck
December 18, 2019, 3:51 PM

Perhaps we could ask in the group, whether anybody already imported more than 50000 lines with luck.

Regards

Martin

Carlos Ruiz
December 18, 2019, 4:33 PM

Obviously the version with exists always must be cheaper, because it requires only to find one set while counting the sets requires to find all

Is not necessarily always cheaper:

* EXISTS means executing a query for each record, which can be very expensive when there are not proper indexes

* IN means creating a subset one time and then comparing, which can be also expensive, but in some cases it could be less than the EXISTS version

It depends a lot on the number of records, and not just the number of records on the current ad_client_id, but in the whole table, and also the indexes.

Carlos Ruiz
April 22, 2020, 10:00 PM

Tested successfully

Assignee

Martin Schönbeck

Reporter

Martin Schönbeck

Labels

Tested By

None

Components

Fix versions

Affects versions

Priority

Minor
Configure