Explanations
phrases related to actions being taken or accepted
instances of the word "taken" in various contexts
Negative Logits
eers           -0.72
reinforcement  -0.62
tions          -0.62
glers          -0.61
vine           -0.60
SPD            -0.60
cers           -0.59
ternity        -0.59
ichick         -0.57
ileaks         -0.56
Positive Logits
aback       1.53
advantage   1.10
care        1.07
aways       1.03
seriously   0.98
apart       0.91
hostage     0.90
away        0.89
orally      0.84
awa         0.82
Activation Density: 0.036%