INDEX
Explanations
phrases related to potential risks or negative consequences
phrases that indicate potential risks or threats
Negative Logits
Token     Logit
agent     -0.58
Sec       -0.55
Fem       -0.55
bab       -0.54
Desk      -0.53
Pod       -0.52
Agent     -0.52
preced    -0.52
grips     -0.52
mascul    -0.51
Positive Logits
Token     Logit
VIDIA     0.74
etheless  0.73
IUM       0.66
Electric  0.63
urai      0.62
served    0.61
ILCS      0.61
maxwell   0.61
CLR       0.61
electric  0.60
Activations Density: 0.000%