Explanations
words related to trustworthiness and credibility
terms related to worthiness or merit
Negative Logits
ACA      -0.71
oan      -0.66
WAR      -0.66
eq       -0.65
udeb     -0.64
Wah      -0.63
ATHER    -0.61
hran     -0.60
ERA      -0.60
Shank    -0.60
Positive Logits
worthy        1.13
nesses        1.04
ness          0.89
lihood        0.87
orthy         0.74
worthiness    0.74
iaries        0.72
worthy        0.70
otine         0.70
icles         0.69
Activations Density 0.012%