INDEX
Explanations
words indicating significant changes or impactful transformations
Negative Logits
£½        -0.17
rica      -0.16
íĻĶ       -0.16
Král      -0.15
ieves     -0.15
otel      -0.15
Ñıк       -0.14
jac       -0.14
azel      -0.14
unfold    -0.14
Positive Logits
agent     0.16
agent     0.16
leta      0.16
çĽ        0.15
inator    0.15
/support  0.15
ILA       0.15
piece     0.15
Agent     0.14
Agent     0.14
Activations Density 0.159%