INDEX
Explanations
tokens from the user's messages (i.e., user-turn tokens/questions).
Negative Logits
Token     Logit
Rua       -0.07
串        -0.07
craper    -0.07
redi      -0.06
OUSE      -0.06
.raise    -0.06
_NOP      -0.06
ounded    -0.06
族        -0.06
Fairy     -0.06
Positive Logits
Token        Logit
img          0.07
comments     0.07
err          0.07
liberalism   0.06
_sid         0.06
enhance      0.06
affiliates   0.06
Expect       0.06
schnell      0.06
leicht       0.06
Activations Density: 1.264%