Explanations
harm and violence inflicted
Negative Logits
Captured    0.73
peligros    0.67
ogs         0.66
Dangerous   0.65
잡          0.64
ப்படி       0.64
難しい      0.63
catch       0.62
fooled      0.62
ITICAL      0.62
Positive Logits
inflicted   2.57
inflict     2.06
inflicting  1.86
infliction  1.85
imposed     1.79
imposition  1.57
heaped      1.43
directed    1.39
endured     1.38
suffered    1.37
Activations Density 0.221%