INDEX
Explanations
expressions of visibility and acknowledgment of actions or qualities
Negative Logits
Token         Logit
Reputation    -0.19
reputation    -0.15
anonymity     -0.15
anyak         -0.14
isper         -0.14
nika          -0.14
reput         -0.14
ruba          -0.14
ilha          -0.14
fame          -0.14
Positive Logits
Token         Logit
signs         0.33
Signs         0.30
initiative    0.23
leadership    0.22
symptoms      0.21
prowess       0.19
interest      0.19
boat          0.19
improvement   0.19
evidence      0.19
Activations Density 0.085%