Explanations
affirmative responses and positive confirmations
Negative Logits
hal     -0.17
ton     -0.17
ect     -0.16
loo     -0.15
weg     -0.15
jes     -0.14
lo      -0.14
uma     -0.14
cin     -0.14
pton    -0.14
Positive Logits
enia    0.17
Ñĥди    0.17
Ģìŀ¥    0.17
nick    0.16
udas    0.16
indy    0.15
gor     0.15
agher   0.15
Bias    0.15
/false  0.15
Activations Density: 0.043%