INDEX
Explanations
specific names and proper nouns related to individuals or groups
Negative Logits
Token       Logit
keit        -0.22
tection     -0.18
stery       -0.17
er          -0.17
inton       -0.16
ób          -0.16
erer        -0.15
arness      -0.15
ged         -0.15
lette       -0.15
Positive Logits
Token       Logit
ertainment  0.25
ucky        0.21
itled       0.20
sov         0.18
t           0.17
tir         0.17
ech         0.17
ilation     0.17
roduced     0.17
une         0.16
Activations Density 0.090%