2011/12/30

NULLIF Function

As we know, Teradata provides the well-known NULLIFZERO function.
Teradata also provides another function named NULLIF, which is more powerful and more flexible;
it can be used to replace a query like
CASE WHEN COL1 = VALUE1 THEN NULL ELSE COL1 END
with simply
NULLIF(COL1, VALUE1)
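
For illustration, a minimal sketch (the table and column names are hypothetical) showing that NULLIFZERO is just a special case of NULLIF:

SELECT
    NULLIFZERO(SALES_AMT)    AS AMT_A   -- NULL when SALES_AMT = 0
   ,NULLIF(SALES_AMT, 0)     AS AMT_B   -- equivalent, written with NULLIF
   ,NULLIF(REGION_CD, 'XX')  AS REGION  -- NULLIF works with any value, not just zero
FROM SALES_TBL;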

http://forums.teradata.com/forum/database/coalesce


Compared with the widely known NULLIFZERO, the NULLIF function that Teradata provides is less commonly used. In fact NULLIF is more powerful and more flexible, and in some cases can be used to simplify and replace CASE expressions, as shown above.

CREATING A TABLE WITH A TIMESTAMP COLUMN IN SQL ASSISTANT, BUT THE COLUMN WAS CONVERTED TO CHAR(23)

We found that when we create a table using SQL Assistant and define a column as

HNKO_TIME TIMESTAMP(0)

the CREATE TABLE statement succeeds, but when we SHOW the created table, we find that the column was defined as

HNKO_TIME CHAR(23) CHARACTER SET UNICODE NOT CASESPECIFIC,

Thanks to the Teradata forums:
http://forums.teradata.com/forum/database/cast-string-to-time
http://forums.teradata.com/forum/enterprise/timestamp-interval-not-working-in-batch-jcl-but-works-in-sql-assistant

we now know that it was because of the "Allow Use of ODBC Extensions" option in SQL Assistant.
Uncheck it, and the problem is resolved.

It seems that if the option is checked, the ODBC driver will parse (and rewrite) the SQL query.
Anyway, the most important lesson is: after you create a table, never forget to check the definition of the table that was actually created. Make sure what you created is what you wanted.
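
A minimal sketch of that create-then-verify habit (the table name is hypothetical):

CREATE TABLE cvdtst.HNKO_TEST
(
    HNKO_ID   INTEGER NOT NULL
   ,HNKO_TIME TIMESTAMP(0)
)
PRIMARY INDEX (HNKO_ID);

SHOW TABLE cvdtst.HNKO_TEST;   -- confirm HNKO_TIME is still TIMESTAMP(0), not CHAR(23)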


Recently we found that, on one particular machine, DDL submitted through SQL Assistant had columns defined as TIMESTAMP replaced with CHAR(23), which is obviously incorrect.
The culprit was a setting in an old version of SQL Assistant: "Allow Use of ODBC Extensions".
When this option is checked, the ODBC driver appears to parse away certain parts of the query, such as TIMESTAMP definitions in DDL.

2011/12/26

DUPLICATE ROW PROBLEM WHEN DOING AN INSERT SELECT INTO A SET TABLE WITH A UPI

It seems that when dealing with duplicate rows inserted into a SET table with a UPI,
Teradata's actual behavior differs somewhat from the manual.

We know that when a SET table is defined with a UPI, Teradata performs the UPI
uniqueness check and skips the SET table check (the duplicate row check).


Teradata indeed does this when we INSERT into a SET-UPI table.

But things go differently when we do an INSERT SELECT:


Teradata will not show any error message but simply reports success.
Of course, the duplicate rows will not appear in the target table.

That really confused me.

I searched the internet and found a similar problem:

http://carlosal.wordpress.com/2009/02/16/tablas-set-y-multiset-en-teradata/



According to the textbook (see Teradata Factory), when handling a SET table defined with a UPI, Teradata replaces the SET table's duplicate-row check with the UPI uniqueness check.
In reality that is not entirely true, at least in some cases, such as INSERT SELECT.
With a direct INSERT into a SET-UPI table, inserting a duplicate row raises a duplicate-UPI error.
But with an INSERT SELECT, sometimes no error occurs at all: Teradata silently drops the duplicate rows, inserts only the non-duplicate part,
and returns no error.
Even with a textbook at hand, this can be quite confusing.


UPDATE 2012/6/26-------------------------------------------------------
I found the answer in the Teradata Application Design and Development manual:
an INSERT ... SELECT automatically discards duplicate rows without any notification, even if the target table has a UPI on it.
To trap this, we create a MULTISET table with a UPI on it.
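
Here is a minimal sketch (hypothetical table names and values) of the behaviors described above:

CREATE SET TABLE t_set (id INTEGER, val INTEGER) UNIQUE PRIMARY INDEX (id);

INSERT INTO t_set VALUES (1, 10);
INSERT INTO t_set VALUES (1, 10);   -- plain INSERT: fails with a duplicate-key error

INSERT INTO t_set SELECT 1, 10;     -- INSERT ... SELECT of the same row: reports
                                    -- success; the duplicate is silently discarded

-- to trap the duplicates instead, make the target MULTISET with a UPI:
CREATE MULTISET TABLE t_mset (id INTEGER, val INTEGER) UNIQUE PRIMARY INDEX (id);

INSERT INTO t_mset SELECT id, val FROM t_set;   -- first load succeeds
INSERT INTO t_mset SELECT id, val FROM t_set;   -- second load raises the UPI error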

PERFORMANCE IMPACT WHEN TERADATA DEALS WITH A "NOT IN" CLAUSE

There is a notable performance impact when Teradata deals with a "NOT IN" clause
if the columns involved do not have the NOT NULL property specified.


Here is the SQL script and its EXPLAIN in Teradata 13:

explain SELECT
       BOT_CD
      ,KAISYA_CD
      ,TKS_CD
      ,LOCA_CD
      ,COL_NO
      ,NHIN_BSHO_CD
      ,MAC_CD
      ,SHO_CD
      ,JSC_SHO_CD
      ,JSC_SHO_CD_EDA
      ,HEN_ZAIKO_SU
      ,KINAI_ZAIKO_SU
      ,HOTCOLD_FLG
FROM
    cvdtst.S_KINAIZAIKO_COL1221OU /* S_KINAIZAIKO_COL */
WHERE
    (
        BOT_CD
        ,KAISYA_CD
        ,TKS_CD
        ,LOCA_CD
        ,NHIN_BSHO_CD
        ,SHO_CD
        ,HOTCOLD_FLG
    )
    NOT IN
    (
        SELECT
            BOT_CD
            ,KAISYA_CD
            ,TKS_CD
            ,LOCA_CD
            ,NHIN_BSHO_CD
            ,SHO_CD
            ,BUNRUI_CD
        FROM
            cvdtst.S_SLIP1_SHO1221OU /* S_SLIP1_SHO */
        WHERE
            SURYOU > 0
    )
;

  1) First, we lock a distinct cvdtst."pseudo table" for read on a
     RowHash to prevent global deadlock for cvdtst.S_SLIP1_SHO1221OU.
  2) Next, we lock a distinct cvdtst."pseudo table" for read on a
     RowHash to prevent global deadlock for
     cvdtst.S_KINAIZAIKO_COL1221OU.
  3) We lock cvdtst.S_SLIP1_SHO1221OU for read, and we lock
     cvdtst.S_KINAIZAIKO_COL1221OU for read.
  4) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from cvdtst.S_SLIP1_SHO1221OU
          by way of an all-rows scan with a condition of (
          "cvdtst.S_SLIP1_SHO1221OU.SURYOU > 0.0") into Spool 7
          (all_amps), which is redistributed by the hash code of (
          cvdtst.S_SLIP1_SHO1221OU.BUNRUI_CD,
          cvdtst.S_SLIP1_SHO1221OU.SHO_CD,
          cvdtst.S_SLIP1_SHO1221OU.NHIN_BSHO_CD,
          cvdtst.S_SLIP1_SHO1221OU.LOCA_CD,
          cvdtst.S_SLIP1_SHO1221OU.TKS_CD,
          cvdtst.S_SLIP1_SHO1221OU.KAISYA_CD,
          cvdtst.S_SLIP1_SHO1221OU.BOT_CD) to all AMPs.  Then we do a
          SORT to order Spool 7 by row hash and the sort key in spool
          field1 eliminating duplicate rows.  The size of Spool 7 is
          estimated with no confidence to be 11,078,203 rows (
          1,074,585,691 bytes).  The estimated time for this step is
          7.68 seconds.
       2) We do an all-AMPs SUM step to aggregate from
          cvdtst.S_KINAIZAIKO_COL1221OU by way of an all-rows scan with
          no residual conditions.  Aggregate Intermediate Results are
          computed globally, then placed in Spool 4.
  5) We do an all-AMPs SUM step to aggregate from Spool 7 by way of an
     all-rows scan.  Aggregate Intermediate Results are computed
     globally, then placed in Spool 8.
  6) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from Spool 8 (Last Use) by
          way of an all-rows scan into Spool 2 (all_amps), which is
          duplicated on all AMPs.
       2) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by
          way of an all-rows scan into Spool 3 (all_amps), which is
          duplicated on all AMPs.
  7) We do an all-AMPs RETRIEVE step from cvdtst.S_KINAIZAIKO_COL1221OU
     by way of an all-rows scan with no residual conditions into Spool
     6 (all_amps), which is redistributed by the hash code of (
     cvdtst.S_KINAIZAIKO_COL1221OU.BOT_CD,
     cvdtst.S_KINAIZAIKO_COL1221OU.KAISYA_CD,
     cvdtst.S_KINAIZAIKO_COL1221OU.TKS_CD,
     cvdtst.S_KINAIZAIKO_COL1221OU.LOCA_CD,
     cvdtst.S_KINAIZAIKO_COL1221OU.NHIN_BSHO_CD,
     cvdtst.S_KINAIZAIKO_COL1221OU.SHO_CD,
     cvdtst.S_KINAIZAIKO_COL1221OU.HOTCOLD_FLG) to all AMPs.  Then we
     do a SORT to order Spool 6 by row hash, and null value information
     in Spool 3 and Spool 2.  Skip this retrieve step if null exists.
     The size of Spool 6 is estimated with low confidence to be
     3,257,496 rows (218,252,232 bytes).  The estimated time for this
     step is 1.91 seconds.
  8) We execute the following steps in parallel.
       1) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of
          an all-rows scan, which is joined to Spool 7 by way of an
          all-rows scan.  Spool 6 and Spool 7 are joined using an
          exclusion merge join, with a join condition of ("(BOT_CD =
          BOT_CD) AND ((KAISYA_CD = KAISYA_CD) AND ((TKS_CD = TKS_CD)
          AND ((LOCA_CD = LOCA_CD) AND ((NHIN_BSHO_CD = NHIN_BSHO_CD)
          AND ((SHO_CD = SHO_CD) AND (HOTCOLD_FLG = BUNRUI_CD ))))))"),
          and null value information in Spool 3 and Spool 2.  Skip this
          join step if null exists.  The result goes into Spool 1
          (group_amps), which is built locally on the AMPs.  The size
          of Spool 1 is estimated with no confidence to be 3,257,496
          rows (244,312,200 bytes).  The estimated time for this step
          is 1.02 seconds.
       2) We do an all-AMPs RETRIEVE step from
          cvdtst.S_KINAIZAIKO_COL1221OU by way of an all-rows scan with
          no residual conditions into Spool 10 (all_amps), which is
          redistributed by the hash code of (
          cvdtst.S_KINAIZAIKO_COL1221OU.BOT_CD) to all AMPs.  Then we
          do a SORT to order Spool 10 by row hash, and null value
          information in Spool 3 and Spool 2.  Skip this retrieve step
          if there is no null.  The size of Spool 10 is estimated with
          low confidence to be 3,257,496 rows (218,252,232 bytes).  The
          estimated time for this step is 1.91 seconds.
  9) We do an all-AMPs RETRIEVE step from Spool 7 (Last Use) by way of
     an all-rows scan into Spool 11 (all_amps), which is redistributed
     by the hash code of (cvdtst.S_SLIP1_SHO1221OU.BOT_CD) to all AMPs.
     Then we do a SORT to order Spool 11 by row hash, and null value
     information in Spool 3 and Spool 2.  Skip this retrieve step if
     there is no null.  The size of Spool 11 is estimated with no
     confidence to be 11,078,203 rows (1,074,585,691 bytes).  The
     estimated time for this step is 7.68 seconds.
 10) We do an all-AMPs JOIN step from Spool 10 (Last Use) by way of an
     all-rows scan, which is joined to Spool 11 (Last Use) by way of an
     all-rows scan.  Spool 10 and Spool 11 are joined using an
     exclusion merge join, with a join condition of ("(BOT_CD = BOT_CD)
     AND ((KAISYA_CD = KAISYA_CD) AND ((TKS_CD = TKS_CD) AND ((LOCA_CD
     = LOCA_CD) AND ((NHIN_BSHO_CD = NHIN_BSHO_CD) AND ((SHO_CD =
     SHO_CD) AND (HOTCOLD_FLG = BUNRUI_CD ))))))"), and null value
     information in Spool 3 (Last Use) and Spool 2 (Last Use).  Skip
     this join step if there is no null.  The result goes into Spool 1
     (group_amps), which is built locally on the AMPs.  The size of
     Spool 1 is estimated with no confidence to be 3,257,496 rows (
     244,312,200 bytes).  The estimated time for this step is 1.02
     seconds.
 11) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.

Teradata has to check whether a NULL value exists in every "NOT IN" column,
which is very costly and can greatly impact the system's performance.
We can avoid it by specifying the NOT NULL property on the "NOT IN" columns.

Here is another example in Japanese:
http://d.hatena.ne.jp/tgk/20100913




-------------------------------------------------------------------------------
The following is the SQL and its EXPLAIN in Teradata V2R6:


explain SELECT
       BOT_CD
      ,KAISYA_CD
      ,LOCA_CD
      ,COL_NO
      ,MAC_CD
      ,SHO_CD
      ,HEN_ZAIKO_SU
      ,KINAI_ZAIKO_SU
      ,HOTCOLD_FLG
FROM
    S_KINAIZAIKO_COL /* S_KINAIZAIKO_COL */
WHERE
    (
        BOT_CD
        ,KAISYA_CD
        ,LOCA_CD
        ,MAC_CD
        ,SHO_CD
        ,HOTCOLD_FLG
    )
    NOT IN
    (
        SELECT
            BOT_CD
            ,KAISYA_CD
            ,LOCA_CD
            ,MAC_CD
            ,SHO_CD
            ,BUNRUI_CD
        FROM
            S_SLIP1_SHO /* S_SLIP1_SHO */
        WHERE
            SURYOU > 0
    )



  1) First, we lock a distinct HWKRUN34."pseudo table" for read on a
     RowHash to prevent global deadlock for HWKRUN34.S_SLIP1_SHO.
  2) Next, we lock a distinct HWKRUN34."pseudo table" for read on a
     RowHash to prevent global deadlock for HWKRUN34.S_KINAIZAIKO_COL.
  3) We lock HWKRUN34.S_SLIP1_SHO for read, and we lock
     HWKRUN34.S_KINAIZAIKO_COL for read.
  4) We do an all-AMPs SUM step to aggregate from
     HWKRUN34.S_KINAIZAIKO_COL by way of an all-rows scan with no
     residual conditions.  Aggregate Intermediate Results are computed
     globally, then placed in Spool 3.
  5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
     an all-rows scan into Spool 2 (all_amps) (compressed columns
     allowed), which is duplicated on all AMPs.
  6) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from
          HWKRUN34.S_KINAIZAIKO_COL by way of an all-rows scan with no
          residual conditions into Spool 5 (all_amps) (compressed
          columns allowed), which is redistributed by hash code to all
          AMPs.  Then we do a SORT to order Spool 5 by row hash, and
          null value information in Spool 2.  Skip this retrieve step
          if null exists.  The size of Spool 5 is estimated with low
          confidence to be 3,663,600 rows.  The estimated time for this
          step is 4.55 seconds.
       2) We do an all-AMPs RETRIEVE step from HWKRUN34.S_SLIP1_SHO by
          way of an all-rows scan with a condition of (
          "HWKRUN34.S_SLIP1_SHO.SURYOU > 0.0") into Spool 6 (all_amps),
          which is redistributed by hash code to all AMPs.  Then we do
          a SORT to order Spool 6 by row hash and the sort key in spool
          field1 eliminating duplicate rows.  The input table will not
          be cached in memory, but it is eligible for synchronized
          scanning.  The size of Spool 6 is estimated with no
          confidence to be 11,532,920 rows.  The estimated time for
          this step is 18.35 seconds.
  7) We execute the following steps in parallel.
       1) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of
          an all-rows scan, which is joined to Spool 6 by way of an
          all-rows scan.  Spool 5 and Spool 6 are joined using an
          exclusion merge join, with a join condition of ("(BOT_CD =
          BOT_CD) AND ((KAISYA_CD = KAISYA_CD) AND ((LOCA_CD = LOCA_CD)
          AND ((MAC_CD = MAC_CD) AND ((SHO_CD = SHO_CD) AND
          (HOTCOLD_FLG = BUNRUI_CD )))))"), and null value information
          in Spool 2.  Skip this join step if null exists.  The result
          goes into Spool 1 (group_amps), which is built locally on the
          AMPs.  The size of Spool 1 is estimated with index join
          confidence to be 3,663,600 rows.  The estimated time for this
          step is 1.47 seconds.
       2) We do an all-AMPs RETRIEVE step from
          HWKRUN34.S_KINAIZAIKO_COL by way of an all-rows scan with no
          residual conditions into Spool 7 (all_amps) (compressed
          columns allowed), which is redistributed by hash code to all
          AMPs.  Then we do a SORT to order Spool 7 by row hash, and
          null value information in Spool 2.  Skip this retrieve step
          if there is no null.  The size of Spool 7 is estimated with
          low confidence to be 3,663,600 rows.  The estimated time for
          this step is 4.55 seconds.
  8) We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of
     an all-rows scan into Spool 8 (all_amps) (compressed columns
     allowed), which is redistributed by hash code to all AMPs.  Then
     we do a SORT to order Spool 8 by row hash, and null value
     information in Spool 2.  Skip this retrieve step if there is no
     null.  The size of Spool 8 is estimated with no confidence to be
     11,532,920 rows.  The estimated time for this step is 18.35
     seconds.
  9) We do an all-AMPs JOIN step from Spool 7 (Last Use) by way of an
     all-rows scan, which is joined to Spool 8 (Last Use) by way of an
     all-rows scan.  Spool 7 and Spool 8 are joined using an exclusion
     merge join, with a join condition of ("(BOT_CD = BOT_CD) AND
     ((KAISYA_CD = KAISYA_CD) AND ((LOCA_CD = LOCA_CD) AND ((MAC_CD =
     MAC_CD) AND ((SHO_CD = SHO_CD) AND (HOTCOLD_FLG = BUNRUI_CD )))))"),
     and null value information in Spool 2 (Last Use).  Skip this join
     step if there is no null.  The result goes into Spool 1
     (group_amps), which is built locally on the AMPs.  The size of
     Spool 1 is estimated with index join confidence to be 3,663,600
     rows.  The estimated time for this step is 1.47 seconds.
 10) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.



When Teradata 13 processes a NOT IN query, if the NOT IN columns are not defined as NOT NULL, Teradata has to check whether each column contains a NULL value,
which leads to several full-table scans and greatly hurts performance. The fix is to define the NOT IN columns as NOT NULL, or to rewrite the NOT IN as a left outer join.

However, older versions of Teradata (V2R6, for example) do not seem to have this problem. I cannot fully confirm this yet; it is left for future study.
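
A minimal sketch of the left-join rewrite, simplified to three of the columns; note this matches NOT IN semantics only when the join columns contain no NULLs:

SELECT a.BOT_CD, a.KAISYA_CD, a.SHO_CD
FROM cvdtst.S_KINAIZAIKO_COL1221OU a
LEFT OUTER JOIN
(
    SELECT BOT_CD, KAISYA_CD, SHO_CD
    FROM cvdtst.S_SLIP1_SHO1221OU
    WHERE SURYOU > 0
) b
ON  a.BOT_CD    = b.BOT_CD
AND a.KAISYA_CD = b.KAISYA_CD
AND a.SHO_CD    = b.SHO_CD
WHERE b.BOT_CD IS NULL;   -- keep only the rows with no match on the right side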

A WAY TO GET THE ROW COUNT OF EVERY TABLE IN A SPECIFIED DATABASE

Run the following SQL (it reads DBC.TABLES); the SQL script it generates can then be used to get every table's row count:


SELECT ' SELECT '' ' || TRIM(BOTH FROM T.TABLENAME) || ' '' AS TABLENAME , COUNT(*) AS ROWSNUM FROM '
       || TRIM(BOTH FROM T.DATABASENAME) || '.' || TRIM(BOTH FROM T.TABLENAME) || ' UNION ' AS X
FROM DBC.TABLES T
WHERE T.TABLEKIND = 'T' AND T.DATABASENAME = 'CVDTST'
ORDER BY T.TABLENAME;

FROM
https://downloads.teradata.com/forum/analytics/row-counts-for-every-table-in-a-given-database



Sometimes we need the row counts of every table under a given database; I saw this method on the Teradata forums.
First, fetch the list of table names from DBC and assemble them into a multi-statement SQL query.
Then execute the generated SQL to get the row counts of all the tables. (Note that every generated line ends with UNION, so remove the trailing UNION from the last line before running the script.)
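
For illustration, assuming CVDTST contains just two tables T1 and T2 (hypothetical names), the generator would emit something like:

SELECT ' T1 ' AS TABLENAME , COUNT(*) AS ROWSNUM FROM CVDTST.T1 UNION
SELECT ' T2 ' AS TABLENAME , COUNT(*) AS ROWSNUM FROM CVDTST.T2 UNION

-- drop the final UNION and run it, yielding one row per table:
SELECT ' T1 ' AS TABLENAME , COUNT(*) AS ROWSNUM FROM CVDTST.T1 UNION
SELECT ' T2 ' AS TABLENAME , COUNT(*) AS ROWSNUM FROM CVDTST.T2;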