INDEX
Explanations
the word "way" with varying activation levels
Negative Logits
Token     Logit
iasco     -0.73
lict      -0.72
uster     -0.72
livest    -0.70
usters    -0.69
uctor     -0.64
iners     -0.64
uates     -0.63
fumes     -0.62
ividual   -0.62
Positive Logits
Token     Logit
fare      1.18
ward      1.18
finding   1.05
points    1.04
point     1.03
forward   0.96
bill      0.93
WARD      0.91
Forward   0.89
cross     0.89
Activations Density 0.029%