INDEX
Explanations
references to emotions and reactions related to thoughts and actions
Negative Logits
ourselves   -0.71
we          -0.67
让我们         -0.66
讓我們         -0.59
нами        -0.57
yourselves  -0.56
vimos       -0.55
weil        -0.55
we          -0.54
我们在         -0.54
Positive Logits
Slowly      0.85
Glan        0.78
Slowly      0.70
“           0.69
Carefully   0.69
Sigh        0.65
Maybe       0.65
Surely      0.65
Turning     0.64
Something   0.64
Activations Density: 0.113%