INDEX
Explanations
statements of responsibility or attribution for certain actions or situations
Negative Logits
quart     -0.82
ylon      -0.79
frey      -0.75
tering    -0.74
TERN      -0.73
cher      -0.73
chers     -0.72
zig       -0.70
ilet      -0.70
mare      -0.69
Positive Logits
citiz          0.98
Ohio           0.86
stewards       0.80
orate          0.79
mischief       0.76
responsible    0.75
axter          0.75
compe          0.74
explan         0.73
behav          0.73
Activations Density: 5.897%