INDEX
Explanations
unintended negative outcomes
Negative Logits
suitably        0.43
fiducia         0.42
truss           0.42
culis           0.41
Nom             0.40
correctement    0.40
Gru             0.40
Valor           0.39
INAL            0.38
Deux            0.38
Positive Logits
unwanted           0.96
してしまう         0.89
unintended         0.86
undesired          0.86
uncontroll         0.85
unintentionally    0.85
uncontrollable     0.84
undesirable        0.83
ってしまう         0.82
inadvertently      0.77
Activations Density 0.258%