Explanations
instances of the word "honest"
Negative Logits
Token          Logit
acid           -0.74
joined         -0.71
chairs         -0.68
Krish          -0.68
arthy          -0.67
interrupted    -0.67
Ĥİ             -0.67
İĭ             -0.66
ombat          -0.65
lav            -0.65
Positive Logits
Token          Logit
broker          0.87
honest          0.87
appraisal       0.82
truthful        0.79
honesty         0.77
candid          0.73
spection        0.70
frank           0.69
mistake         0.69
Reporting       0.68
Activations Density: 0.025%