Explanations
mentions of negative aspects or consequences
Negative Logits
rouse       -0.74
glas        -0.73
ynthesis    -0.72
Swords      -0.70
orthy       -0.70
aeda        -0.69
aukee       -0.68
Collider    -0.67
arya        -0.66
ILA         -0.66
Positive Logits
der         0.86
plag        0.80
fully       0.79
havoc       0.78
ulent       0.78
inflicted   0.77
Clown       0.77
asses       0.76
heap        0.76
ged         0.76
Activations Density: 4.465%