Explanations
words related to causing harm or injury
causing harm or injury
Negative Logits
뀐          -0.43
koke        -0.40
뀜          -0.40
zelt        -0.39
decision    -0.39
lengu       -0.39
usias       -0.38
Forscher    -0.38
Erben       -0.38
negó        -0.38
Positive Logits
causing             0.83
causing             0.77
inflicting          0.66
harm                0.60
causando            0.60
damage              0.60
addGap              0.59
gây                 0.57
addPreferredGap     0.56
Damage              0.54
Activations Density: 0.022%