INDEX
Explanations
phrases related to politics, power dynamics, and societal issues
Negative Logits
ichen        -0.76
quire        -0.68
queue        -0.68
adish        -0.68
ades         -0.67
uled         -0.65
lette        -0.65
consulted    -0.63
mentioned    -0.62
reimb        -0.62
Positive Logits
seriousness      1.20
superiority      1.03
greatness        1.00
absurdity        0.99
resilience       0.99
individuality    0.98
masculinity      0.98
sincerity        0.98
versatility      0.98
willingness      0.98
Activations Density: 1.731%