INDEX
Explanations
Mentions of specific locations, especially cities
Proper nouns and specific names
Negative Logits
Token      Logit
ments      -0.83
htaking    -0.81
ramer      -0.79
ontent     -0.76
comed      -0.75
tale       -0.73
rav        -0.73
isive      -0.72
region     -0.71
nut        -0.70

Positive Logits
Token      Logit
vous       0.79
ito        0.71
alez       0.66
--+        0.65
Corpus     0.64
Bee        0.64
Mercury    0.64
Od         0.63
Jinn       0.62
Perez      0.62

Activation density: 0.072%