INDEX
Explanations
phrases indicative of regulations or policies
Negative Logits
ãĥ£         -0.18
ازÙĩ        -0.16
xs          -0.15
ÏīÏĤ        -0.14
its         -0.14
fame        -0.13
allegedly   -0.13
ie          -0.13
tir         -0.13
ones        -0.13
Positive Logits
which       0.24
which       0.19
Which       0.19
Which       0.18
cui         0.16
.which      0.16
Fal         0.15
esen        0.15
sinh        0.15
WHICH       0.15
Activations Density 0.048%