INDEX
Explanations
mentions of criticism regarding media or artistic works
Negative Logits
--
-0.22
--↵
-0.22
our
-0.21
ï
-0.20
à¥ľ
-0.19
ours
-0.19
‘
-0.18
--↵
-0.18
---
-0.18
our
-0.18
Positive Logits
,[
0.46
[c
0.45
.[
0.42
:[
0.39
[
0.36
^{[
0.35
↵
0.35
[[
0.35
).[
0.35
{{
0.32
Activations Density 1.372%