Explanations
specific names of individuals, likely related to certain professions or activities
proper nouns, particularly names of people and places
Negative Logits
ggle      -0.71
uma       -0.70
pport     -0.70
ction     -0.69
fights    -0.68
matic     -0.68
fight     -0.66
enic      -0.64
xx        -0.63
umat      -0.63
Positive Logits
imore     0.85
éĹĺ       0.81
lees      0.78
imer      0.75
arson     0.74
Redditor  0.74
stown     0.72
fences    0.71
espie     0.71
inelli    0.69
Activations Density 0.075%