INDEX
Explanations
phrases related to concerns, issues, or disruptions
negative sentiments or issues related to safety and disruptions
Negative Logits
liam        -0.71
eele        -0.66
ortment     -0.63
igham       -0.60
fw          -0.59
acha        -0.58
itled       -0.57
ku          -0.57
tsky        -0.56
ethyl       -0.56

Positive Logits
whatsoever  1.72
nor         1.57
anymore     1.46
nor         1.01
anything    1.00
slightest   0.98
anybody     0.95
either      0.95
anywhere    0.93
except      0.87
Activations Density: 0.264%