Explanations
negative sentiments and adverse outcomes
harm, falsehoods, or errors
Negative Logits
GenerationType               -0.76
[token text not captured]    -0.62
stage                        -0.60
BoxFit                       -0.57
:✨                           -0.54
special                      -0.54
uLocal                       -0.52
Stage                        -0.51
preside                      -0.51
seeds                        -0.51
Positive Logits
Gewalt              0.43
iestety             0.42
ويكيپيديا           0.39
śmier               0.37
victimes            0.34
KURZBESCHREIBUNG    0.33
locaust             0.33
Violence            0.33
niestety            0.33
toxicity            0.33
Activations Density 0.466%