INDEX
Explanations
phrases or words related to potential danger or harm
phrases that indicate various types of risk or danger
Negative Logits
lins      -0.70
anche     -0.61
icut      -0.61
cest      -0.61
Glock     -0.60
blem      -0.60
encers    -0.60
urd       -0.60
awoken    -0.60
ricks     -0.60
Positive Logits
EStream                     0.86
groups                      0.72
face                        0.71
taker                       0.70
taking                      0.67
ulkan                       0.67
mosqu                       0.66
pestic                      0.65
harbour                     0.64
________________________    0.64
Activations Density: 0.010%