INDEX
Explanations
references to goals, aims, and intentions within the text
Negative Logits
ossal    -0.17
lf       -0.15
uden     -0.15
681      -0.15
culus    -0.15
erves    -0.14
plit     -0.14
алÑĥ     -0.14
165      -0.14
conom    -0.14
Positive Logits
egot     0.19
IPA      0.16
maları   0.16
ewe      0.15
íķ       0.15
ivor     0.15
िव       0.14
lest     0.14
иÑĤом    0.14
Tro      0.14
Activations Density 0.143%