Explanations
threats or situations that pose a potential danger or risk
terms related to threats or dangers, particularly in a political or social context
Negative Logits
gel              -0.76
linen            -0.73
donkey           -0.72
camel            -0.70
cycling          -0.69
carriage         -0.67
ppo              -0.66
bicycles         -0.66
atari            -0.66
complementary    -0.64
Positive Logits
Threat            3.28
threat            3.00
embold            1.11
fright            1.10
Resp              0.94
Attempts          0.93
hast              0.91
Hast              0.90
vironment         0.88
Stun              0.87
Activations Density: 0.040%