INDEX
Explanations
phrases mentioning specific individuals or groups
the word "them" in various contexts
Negative Logits
Deal       -0.75
Press      -0.69
Jo         -0.66
order      -0.65
Chain      -0.65
ILE        -0.65
deal       -0.64
Patton     -0.63
Rush       -0.63
Monster    -0.62
Positive Logits
selves     1.14
atically   1.00
atic       0.99
selves     0.87
conduc     0.81
outwe      0.75
self       0.70
sinks      0.70
succeeded  0.70
atics      0.69
Activations Density 0.038%