INDEX
Explanations
mentions of failure or lack of success
instances of the word "failure" in various contexts
Negative Logits
utra        -0.81
selves      -0.76
enfranch    -0.75
rete        -0.69
othy        -0.67
riel        -0.66
Sed         -0.66
enta        -0.63
ople        -0.62
atu         -0.62
Positive Logits
miser       1.20
failures    0.82
DEV         0.80
Failure     0.78
dism        0.77
lust        0.74
ulence      0.73
luster      0.73
failure     0.71
fail        0.70
Activations Density 0.030%