Explanations
terms related to corruption or unethical behavior
Negative Logits
Token      Logit
gap        -0.85
Anxiety    -0.78
ruck       -0.73
joice      -0.72
iphany     -0.69
zig        -0.68
fleet      -0.68
gain       -0.67
oleon      -0.67
plane      -0.67
Positive Logits
Token      Logit
dealings   0.87
ly         0.85
ibly       0.83
ible       0.80
ulent      0.79
nesses     0.78
NESS       0.78
ingly      0.74
glers      0.72
ness       0.72
Activations Density: 0.059%
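
An activation density is commonly read as the fraction of sampled token positions on which the feature fires. The page does not say how its figure was computed, so the following is only a sketch under that assumption. The function name, the zero threshold, and the sample data are all hypothetical and do not come from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" when its activation is
    strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 6 of them active.
sample = [0.0] * 9994 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.05]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low would mean the feature fires on well under one token in a thousand, which is consistent with a narrow, topic-specific feature.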