INDEX
Explanations
instances of critical errors, consequences, and health-related issues
Negative Logits
Token          Logit
uros           -0.15
actors         -0.15
uffers         -0.14
QRST           -0.14
aggrav         -0.14
Durch          -0.14
rawn           -0.14
734            -0.14
intimidating   -0.13
enty           -0.13
Positive Logits
Token          Logit
cost           0.40
cost           0.39
Cost           0.34
costs          0.34
Cost           0.33
COST           0.33
-cost          0.32
_cost          0.29
.cost          0.29
Costs          0.29
Activations Density: 0.265%