INDEX
Explanations
mentions of beneficence or beneficial actions
Negative Logits
ghi        -0.16
hop        -0.15
ÏĢη        -0.15
otope      -0.15
aseline    -0.15
ieri       -0.15
orc        -0.14
locks      -0.14
-shadow    -0.14
ork        -0.14
Positive Logits
volent     0.33
vol        0.25
iciary     0.21
ath        0.20
icial      0.20
icia       0.20
itting     0.19
ific       0.19
ift        0.18
fits       0.17
Activations Density: 0.009%