INDEX
Explanations
phrases related to the negative impact of certain policies and actions
Negative Logits
oppers      -0.15
řez         -0.14
vb          -0.14
Demir       -0.14
928         -0.14
鉄          -0.14
odom        -0.13
ама         -0.13
Ư           -0.13
oriously    -0.13
Positive Logits
increase    0.19
only        0.19
instead     0.19
pand        0.17
increase    0.17
вмест       0.17
Increase    0.16
Div         0.16
Only        0.16
increased   0.16
Activations Density 0.321%