INDEX
Explanations
safety and precaution-related information
Negative Logits
descendants    -0.69
monopoly       -0.65
ynthesis       -0.63
Patent         -0.62
created        -0.61
indle          -0.61
Ended          -0.60
transformed    -0.60
ode            -0.60
magically      -0.60
Positive Logits
beware          1.55
caution         1.46
avoid           1.12
heed            1.10
cautious        1.10
precautions     1.10
eware           1.08
careful         1.08
wary            1.08
Avoid           1.07
Activations Density: 2.055%