INDEX
Explanations
mentions of authority or power dynamics
instances of the word "has" and its variations in context
Negative Logits
Token      Logit
Pair       -0.68
filming    -0.63
dot        -0.63
recalls    -0.62
etter      -0.62
TG         -0.61
etting     -0.59
burse      -0.59
hooting    -0.58
umping     -0.57
Positive Logits
Token      Logit
been       1.27
gotten     1.05
been       1.03
behaved    1.02
gone       1.01
become     0.99
begun      0.96
oken       0.94
ceased     0.93
fallen     0.93
Activations Density: 0.378%