INDEX
Explanations
explicit violence or crime-related details in a news context
Negative Logits
»Ĵ             -0.60
ascript        -0.58
uously         -0.57
displayText    -0.57
izoph          -0.57
Tanz           -0.57
irtual         -0.56
itures         -0.56
iasm           -0.55
ENDED          -0.55
Positive Logits
wen            0.71
ghan           0.66
ewater         0.66
ster           0.65
erville        0.64
coe            0.61
wyn            0.61
sters          0.60
heit           0.60
vel            0.58
Activations Density 0.100%