Explanations
phrases indicating feelings of empathy or sympathy towards others
Negative Logits
edin       -0.75
hess       -0.75
UP         -0.74
FORE       -0.73
ashtra     -0.72
forward    -0.69
oller      -0.67
orbit      -0.66
mare       -0.66
ohn        -0.65
Positive Logits
bidden     1.08
gotten     1.06
geries     0.87
starters   0.79
ties       0.78
sake       0.78
example    0.76
centuries  0.75
them       0.72
decades    0.71
Activations Density: 0.144%