INDEX
Explanations
negation or concepts of unfairness and unreasonableness
Negative Logits
actively     -0.16
isible       -0.15
orgot        -0.14
ensitive     -0.14
singular     -0.14
岳           -0.14
usch         -0.14
áp           -0.14
disposed     -0.13
assume       -0.13
Positive Logits
scr          0.21
sustainable  0.19
wise         0.18
pal          0.17
mask         0.16
just         0.16
scr          0.16
Barg         0.16
brid         0.16
fruit        0.16
Activations Density 0.025%