Explanations
instances where actions or decisions are being taken
actions or measures taken in various contexts
Negative Logits
Token        Logit
utters       -0.64
Hitch        -0.64
ickers       -0.63
olls         -0.63
overed       -0.61
anny         -0.59
MPG          -0.58
ockey        -0.58
bey          -0.57
iddled       -0.56

Positive Logits
Token        Logit
toward        1.07
towards       1.00
against       0.95
internally    0.85
backward      0.84
whatsoever    0.80
forward       0.75
mitigating    0.75
regarding     0.75
backwards     0.74

Activations Density: 0.099%