INDEX
Explanations
proper nouns and names associated with specific entities or places
names and mentions of individuals, particularly those associated with specific stories or events
Negative Logits
smoker       -0.77
========     -0.71
iaries       -0.68
hetically    -0.68
wered        -0.67
¥ŀ           -0.67
indust       -0.67
oult         -0.66
avorite      -0.66
ictions      -0.66
Positive Logits
Ness          0.92
Rhodes        0.81
Loch          0.80
Afee          0.78
gow           0.77
inosaur       0.76
poke          0.75
Colossus      0.75
bones         0.74
Cro           0.74
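Sparse-autoencoder dashboards commonly produce lists like the two above by projecting the feature's decoder direction through the model's unembedding matrix and reading off the most positive and most negative vocabulary entries. The sketch below assumes that procedure; the function name `top_logits` and the arguments `w_dec_row` and `w_u` are hypothetical, and the dashboard's exact scaling of the values is not confirmed by this page.

```python
import numpy as np


def top_logits(w_dec_row: np.ndarray, w_u: np.ndarray, vocab: list, k: int = 10):
    """Project one feature's decoder direction through the unembedding.

    Assumption: w_dec_row has shape (d_model,) and is the feature's decoder
    vector; w_u has shape (d_model, d_vocab) and is the unembedding matrix.
    Returns (positive, negative) lists of (token, logit) pairs.
    """
    logits = w_dec_row @ w_u              # direct logit effect per token, shape (d_vocab,)
    order = np.argsort(logits)            # indices sorted from most negative to most positive
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative


# Toy usage: with an identity unembedding, the logits equal the decoder vector.
pos, neg = top_logits(np.array([0.5, -1.0, 2.0]), np.eye(3), ["a", "b", "c"], k=2)
print(pos)  # most positive tokens first
print(neg)  # most negative tokens first
```

Sorting once with `np.argsort` and slicing from both ends gives both lists without a second sort.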
Activation Density: 0.023%
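Activation density is usually read as the share of sampled tokens on which the feature fires, so 0.023% means roughly 23 of every 100,000 tokens. A minimal sketch, assuming "fires" means an activation strictly above a threshold of zero (the function name and threshold are hypothetical):

```python
import numpy as np


def activation_density(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold` (assumed 0)."""
    return 100.0 * float(np.mean(acts > threshold))


# Toy usage: 230 active tokens out of 1,000,000 gives a density of about 0.023%.
acts = np.zeros(1_000_000)
acts[:230] = 1.0
print(activation_density(acts))
```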