Explanations
references to individuals and their characteristics or societal roles
Negative Logits
kker     -0.16
lew      -0.15
stuff    -0.14
Honor    -0.14
agne     -0.13
logic    -0.13
ving     -0.13
eri      -0.13
ordo     -0.13
olib     -0.13
Positive Logits
known     0.43
known     0.42
Known     0.41
Known     0.38
-known    0.35
_known    0.33
извест    0.29
famous    0.26
bekannt   0.26
KNOWN     0.25
Activations Density: 0.006%