INDEX
Explanations
informational phrases prompting action
requests for additional information
Negative Logits
neighb      -0.68
artifacts   -0.63
lifeless    -0.61
meter       -0.61
opped       -0.61
anmar       -0.60
lihood      -0.60
impro       -0.58
throats     -0.56
odied       -0.55
Positive Logits
about       1.18
regarding   1.15
ABOUT       1.02
pertaining  0.98
concerning  0.96
About       0.93
About       0.93
about       0.85
Regarding   0.83
relating    0.80
Activations Density: 0.056%