Explanations
fear, hesitation, then action
Negative Logits
…,            0.94
Additionally  0.91
...           0.89
...),         0.86
...).         0.84
,             0.83
。,            0.82
Также         0.80
...,          0.80
므로            0.77
Positive Logits
yeah          1.12
oblivious     1.10
yes           1.08
unable        1.04
powerless     0.95
refusing      0.95
unaware       0.93
bathed        0.93
afraid        0.93
humbled       0.92
Activations Density: 0.366%