INDEX
Explanations
phrases emphasizing honesty
expressions related to honesty
Negative Logits
Krish          -0.76
Libraries      -0.74
LAN            -0.74
interrupted    -0.72
acid           -0.70
enegger        -0.70
Autism         -0.69
levels         -0.68
515            -0.67
Stud           -0.67
Positive Logits
honest         1.08
truthful       0.92
honesty        0.85
cipled         0.78
broker         0.76
onest          0.76
dece           0.73
parency        0.71
Honest         0.70
rencies        0.70
Activations Density: 0.009%