INDEX
Explanations
references to significant historical figures and events
Negative Logits
itself    -0.20
quine     -0.17
koje      -0.17
ibri      -0.16
ear       -0.15
estroy    -0.15
ocale     -0.15
它们      -0.15
ίθ        -0.15
stalo     -0.14
Positive Logits
himself   0.31
whom      0.28
his       0.22
who       0.20
/her      0.20
whose     0.20
Himself   0.18
his       0.17
نفسه      0.17
jeho      0.16
Activations Density 0.460%