Explanations
mentions of responsible behavior or actions
references to responsibility and responsible behavior
Negative Logits
chu         -0.75
frey        -0.72
stals       -0.72
mirac       -0.69
ammy        -0.68
bows        -0.68
tantal      -0.68
forts       -0.66
yip         -0.65
OUT         -0.65
Positive Logits
behaviour   0.99
behavior    0.97
citizen     0.89
entreprene  0.88
governance  0.84
tarian      0.84
stewards    0.84
adult       0.81
manner      0.79
conduct     0.78
Activations Density: 0.140%