INDEX
Explanations
words related to causation or consequences
Negative Logits
rehabilit       -0.69
hitter          -0.68
abies           -0.60
veterinarian    -0.58
nurs            -0.57
batter          -0.57
neighb          -0.56
territ          -0.55
battered        -0.53
igger           -0.53
Positive Logits
forth           2.24
forward         1.44
why             1.01
why             0.89
noon            0.80
entimes         0.80
ably            0.80
videos          0.77
fter            0.76
xual            0.73
Activations Density: 0.009%