Explanations
narratives involving escape and survival
Negative Logits
Token      Logit
bottom     -0.17
bottom     -0.16
forcer     -0.14
iston      -0.14
asing      -0.14
Feed       -0.14
Ulus       -0.14
-bottom    -0.14
巡         -0.14
435        -0.13
Positive Logits
Token      Logit
escape     0.75
Escape     0.62
escapes    0.61
escaping   0.59
escape     0.59
Escape     0.57
escaped    0.57
逃         0.57
flee       0.56
fleeing    0.51
Activations Density 0.243%
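
The density figure can be read as the share of sampled tokens on which the feature fires. The page does not say how it was computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which
    the feature is active, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical per-token activations (not taken from the dashboard):
# 3 of the 8 tokens are above the threshold.
sample = [0.0, 0.0, 1.2, 0.0, 0.4, 0.0, 0.0, 2.1]
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.243% would mean the feature is active on roughly 1 token in every 400 sampled.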