Explanations
references to specific individuals associated with politics or current affairs
Negative Logits
antro      -0.17
itself     -0.14
oretical   -0.14
aroo       -0.14
dbg        -0.13
.fm        -0.13
quel       -0.13
egend      -0.13
azzi       -0.13
aan        -0.13
Positive Logits
Jr         0.28
çε         0.23
åįļ士      0.19
himself    0.18
nesty      0.18
jr         0.17
who        0.17
who        0.17
III        0.16
PhD        0.15
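The dashboard does not say how these logit values were computed. Feature dashboards often derive such lists by projecting the feature's decoder direction through the model's unembedding matrix and reading off the most extreme tokens. The following is a minimal sketch under that assumption. The names `feature_logit_weights`, `decoder_dir`, `W_U` and `vocab` are hypothetical, and the sketch ignores the final LayerNorm and any downstream layers.

```python
import numpy as np


def feature_logit_weights(decoder_dir, W_U, vocab, k=10):
    """Return the k most negative and k most positive (token, weight) pairs.

    Assumption (hypothetical): a feature's direct effect on each output token
    is approximated by decoder_dir @ W_U, ignoring LayerNorm and later layers.
    decoder_dir has shape (d_model,), W_U has shape (d_model, vocab_size).
    """
    logits = decoder_dir @ W_U              # one weight per vocabulary token
    order = np.argsort(logits)              # token indices, ascending by weight
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy usage with random weights (illustrative only, not the real model).
rng = np.random.default_rng(0)
toy_vocab = ["itself", "Jr", "who", "PhD", "III", "antro"]
W_U_toy = rng.normal(size=(4, len(toy_vocab)))
decoder_toy = rng.normal(size=4)
neg, pos = feature_logit_weights(decoder_toy, W_U_toy, toy_vocab, k=3)
```

Sorting once with `argsort` and slicing both ends avoids two separate top-k passes. For a real vocabulary of tens of thousands of tokens, `np.argpartition` would be a cheaper alternative.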
Activation Density: 0.829%
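Activation density is usually reported as the share of sampled tokens on which the feature is active. A density of 0.829% corresponds to roughly 8 of every 1,000 tokens. The following is a minimal sketch, assuming "active" means an activation strictly above a threshold of zero. The function `activation_density` and its `threshold` parameter are hypothetical and are not taken from the dashboard.

```python
import numpy as np


def activation_density(acts, threshold=0.0):
    """Percentage of tokens whose feature activation exceeds `threshold`.

    Assumption (hypothetical): `acts` holds one activation value per token
    in the sample, and a feature "fires" when its value is > threshold.
    """
    acts = np.asarray(acts, dtype=float)
    return 100.0 * float(np.mean(acts > threshold))


# Example: the feature fires on 1 of 4 tokens, giving a density of 25%.
example = activation_density([0.0, 0.0, 0.5, 0.0])
```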