INDEX
Explanations
words that signify suffering or negative impacts of actions
Negative Logits
ipple        -0.15
Configurer   -0.15
rops         -0.15
ones         -0.14
osa          -0.14
af           -0.14
adla         -0.14
Wich         -0.14
iers         -0.13
lesh         -0.13
Positive Logits
nature       0.21
nature       0.16
involved     0.15
Nature       0.15
aspect       0.15
they         0.15
pirit        0.15
์ฟ           0.14
implicit     0.14
865          0.14
Activations Density: 0.263%