Explanations
words and phrases related to provocation and provocative speech
Negative Logits
utex      -0.14
mad       -0.14
erez      -0.14
ern       -0.14
Rip       -0.14
optera    -0.14
ule       -0.13
絡        -0.13
ensch     -0.13
ä         -0.13
Positive Logits
žen       0.18
/assert   0.17
eyin      0.16
CHASE     0.16
čin       0.15
ząd       0.15
lint      0.14
nutím     0.14
ラック    0.14
ník       0.14
Activations Density 0.006%