Explanations
phrases related to behaviors or actions
instances of the word "actions" in relation to moral responsibility or consequences
Negative Logits
bid          -0.73
BLE          -0.65
AES          -0.65
mbuds        -0.64
Dise         -0.64
orf          -0.62
inately      -0.62
ondo         -0.61
used         -0.61
definition   -0.60
Positive Logits
uations      1.05
ACTIONS      0.99
uate         0.98
actions      0.95
uated        0.89
uation       0.89
uary         0.86
hops         0.85
uating       0.84
ives         0.81
Activations Density: 0.025%