Explanations
The neuron fires on tokens that signal rude or insulting language (e.g., insults and offensive words).
Negative Logits
Token          Logit
tempor         -0.07
.friend        -0.06
[a             -0.06
робота         -0.06
다             -0.06
conform        -0.06
976            -0.06
虎             -0.06
@[             -0.06
Jungle         -0.06

Positive Logits
Token          Logit
масс           0.07
orpor          0.07
ственно        0.06
เกม            0.06
elial          0.06
společnosti    0.06
země           0.06
Approximately  0.06
major          0.06
务             0.06
Activations Density: 0.245%