Explanations
references to agents in various contexts
Negative Logits
оби     -0.17
678     -0.15
ara     -0.15
yat     -0.15
rk      -0.15
erd     -0.15
ستان    -0.15
ble     -0.15
ux      -0.14
erras   -0.14
Positive Logits
nesty    0.18
.Agent   0.17
provoc   0.16
inel     0.15
urons    0.15
415      0.15
ooled    0.15
apor     0.15
bab      0.14
otts     0.14
Activations Density 0.011%