Explanations
phrases indicating responsibility or actions of specific individuals or groups
phrases that indicate accountability or responsibility
Negative Logits
aukee      -0.73
rique      -0.70
ilial      -0.70
ricane     -0.70
anus       -0.69
onut       -0.67
poon       -0.67
ruary      -0.65
avorite    -0.64
basil      -0.64
Positive Logits
...]       0.72
offs       0.62
ainer      0.61
theoret    0.60
erous      0.59
aders      0.57
urers      0.57
uary       0.57
attm       0.57
ography    0.57
Activations Density 0.015%