Explanations
verbs indicating actions towards other people or things
actions that involve treatment or outcomes impacting individuals or groups
Negative Logits
conn      -0.76
eur       -0.68
rea       -0.62
bomb      -0.61
Newman    -0.61
bow       -0.61
zh        -0.61
sw        -0.61
leases    -0.59
tone      -0.58
Positive Logits
ometimes   1.10
omething   1.01
paces      0.95
hift       0.90
ynthesis   0.87
pace       0.85
ilver      0.84
Jagu       0.84
ettings    0.78
heet       0.77
Activations Density 0.497%