INDEX
Explanations
words related to specific locations, potentially with references to events or people associated with them
references to proper nouns or entities, particularly names and titles
Negative Logits
theless        -0.65
REDACTED       -0.63
Sina           -0.62
ablishment     -0.61
props          -0.60
GOODMAN        -0.59
cops           -0.59
kids           -0.58
advertisers    -0.58
IUM            -0.57
Positive Logits
utsch          0.97
ymes           0.91
astery         0.89
eworks         0.86
actor          0.85
ule            0.84
iasco          0.80
utor           0.79
oub            0.78
iner           0.77
Activations Density: 0.272%