INDEX
Explanations
unintended or negative consequences
Negative Logits
enhancing      0.39
apos           0.38
enhances       0.37
ornamented     0.37
Enh            0.36
activité       0.35
सराह           0.34
enhance        0.34
favorably      0.34
uko            0.33
Positive Logits
consequences     3.36
Consequences     2.97
conséquences     2.88
consecuencias    2.86
repercussions    2.84
последствия      2.75
ramifications    2.69
consequências    2.64
implications     2.59
consequence      2.58
Activations Density 0.162%