INDEX
Explanations
negative descriptions about societal issues or institutions
Negative Logits
colors        -0.18
ehler         -0.17
favorable     -0.17
theaters      -0.17
favor         -0.17
unfavorable   -0.16
favors        -0.15
localized     -0.15
coloring      -0.15
Colors        -0.15
Positive Logits
mate          0.32
bol           0.31
mates         0.29
sod           0.28
blo           0.26
blo           0.25
nonce         0.23
bol           0.23
proper        0.23
Oi            0.23
Activations Density: 0.652%