INDEX
Explanations
instances of conditional statements describing potential actions
phrases emphasizing ability or potential actions
Negative Logits
Token        Logit
Uri          -0.62
UR           -0.60
IB           -0.59
Trin         -0.58
arch         -0.58
soever       -0.57
caution      -0.57
path         -0.56
contention   -0.55
Sod          -0.55
Positive Logits
Token        Logit
't           1.06
afford       0.94
muster       0.85
convince     0.82
help         0.78
adian        0.77
somehow      0.75
survive      0.74
reach        0.74
utils        0.73
Activations Density 0.084%