INDEX
Explanations
references to authority and control in the context of societal issues
Negative Logits
acquaintances     -0.66
overflow          -0.62
beautifully       -0.61
briefly           -0.60
aliases           -0.59
unsuccessfully    -0.58
linem             -0.58
contemporaries    -0.58
renown            -0.58
asionally         -0.58
Positive Logits
Same              0.69
footing           0.69
nuclear           0.68
osate             0.67
same              0.65
brunt             0.64
farious           0.63
ĪĴ                0.63
blame             0.62
ocratic           0.62
Activations Density: 0.176%