INDEX
Explanations
words related to arguments or debates
linguistic forms that denote actions or characteristics
Negative Logits
ilities    -0.64
Seym       -0.63
ulates     -0.62
raints     -0.61
ility      -0.60
ij士       -0.60
ulating    -0.59
ADRA       -0.59
SU         -0.59
ordinary   -0.58
Positive Logits
oad        1.08
ength      1.00
oaded      0.93
gling      0.90
ibrary     0.89
uci        0.88
ogue       0.86
phrine     0.78
erie       0.77
xual       0.77
Activations Density 0.105%