INDEX
Explanations
phrases indicating belief or conviction
Negative Logits
Token         Logit
ãĥ³ãĥĨãĤ£     -0.18
resh          -0.17
igo           -0.16
hof           -0.14
aq            -0.14
ër            -0.14
Stern         -0.14
plements      -0.14
£p            -0.14
chin          -0.13
Positive Logits
Token         Logit
atatype       0.18
452           0.15
auté          0.14
957           0.14
chema         0.14
adro          0.14
649           0.13
difference    0.13
å·®           0.13
LOB           0.13
Activations Density: 0.024%