Explanations
elements related to novelty and new experiences
Negative Logits
bot               -0.14
æ°ĹãģĮ            -0.14
Anders            -0.14
ека               -0.14
asta              -0.13
ente              -0.13
014               -0.13
¤í                -0.13
617               -0.13
Initialization    -0.13
Positive Logits
never             0.61
never             0.54
Never             0.52
Never             0.48
nunca             0.47
NEVER             0.45
hadn              0.40
никогда           0.40
haven             0.39
jamais            0.38
Activations Density 0.220%