INDEX
Explanations
references to individuals and their roles or actions within specific contexts
Negative Logits
ONLY            -0.64
honestly        -0.64
truly           -0.64
whatever        -0.62
finally         -0.62
nevertheless    -0.62
nonetheless     -0.61
anything        -0.60
aten            -0.60
only            -0.59
Positive Logits
ypes            0.80
ebted           0.72
sie             0.71
racted          0.70
Cosponsors      0.70
GOODMAN         0.69
agonists        0.67
urbed           0.65
iac             0.65
ãĤ©             0.64
Activations Density: 0.217%
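
An activation density figure like the one above is commonly read as the fraction of token positions on which the feature fires. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard that produced these numbers.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens where the feature is active.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1,000 token positions, the feature fires on 2 of them.
sample = [0.0] * 1000
sample[3] = 1.2
sample[500] = 0.4

density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.200%
```

Under this reading, a density of 0.217% would mean the feature is active on roughly 2 of every 1,000 tokens.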