Explanations
statements/denials
The neuron fires on tokens appearing in formal denial or disclaimer language, such as “defamatory,” “never,” “has,” and other words in statements asserting innocence or refuting claims.
Negative Logits
드로
-0.07
Kata
-0.06
ीदव
-0.06
guise
-0.06
-focused
-0.06
igrated
-0.06
.Auth
-0.06
basic
-0.06
เล
-0.06
.Project
-0.06
Positive Logits
approaching
0.07
`{
0.06
_css
0.06
ناب
0.06
route
0.06
pprint
0.06
[undecodable token]
0.06
getY
0.06
จำนวน
0.06
[undecodable token]
0.06
Activations Density 0.027%