INDEX
Explanations
affirmative responses or confirmations to questions
Negative Logits
ittel     -0.17
umba      -0.15
Esp       -0.14
ej        -0.14
hyth      -0.14
aly       -0.14
ÙĨج       -0.14
Attrib    -0.14
arty      -0.14
ez        -0.14
Positive Logits
óst           0.14
mo            0.14
Lehr          0.14
_reduction    0.14
oslav         0.13
éİ            0.13
oproject      0.13
ordes         0.13
Trace         0.13
reak          0.13
Activations Density: 0.101%