Explanations
Words related to escaping or avoiding difficult situations.
Negative Logits
Token      Logit
ём         -0.52
er         -0.50
wsj        -0.48
hã         -0.48
ECR        -0.47
ms         -0.47
mo         -0.47
stoi       -0.47
dinners    -0.47
ൂ          -0.47
Positive Logits
Token        Logit
escape       0.96
escaped      0.96
escapes      0.91
Escape       0.87
unt          0.83
escaping     0.81
escaping     0.81
attributes   0.80
granate      0.80
trained      0.77
Activations Density: 0.083%
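To give a sense of scale for the density figure, here is a minimal sketch that converts it into an expected count of activating tokens. It assumes that "activations density" means the fraction of tokens on which the feature has a nonzero activation. The page does not define the term, so treat that reading as an assumption. The function name and the one-million-token sample size are hypothetical, chosen only for illustration.

```python
def expected_active_tokens(density_percent: float, n_tokens: int) -> float:
    """Expected number of tokens on which the feature fires.

    Assumes density is the share of tokens with nonzero activation
    (an interpretation, not something stated on the page).
    """
    return density_percent / 100.0 * n_tokens


# Hypothetical sample of one million tokens at the reported 0.083% density.
count = expected_active_tokens(0.083, 1_000_000)
print(f"about {count:.0f} activating tokens per million")
```

Under that assumption, the feature would fire on roughly 830 of every million tokens, which is consistent with a sparse feature tied to a narrow set of words.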