Explanations
punctuation
The neuron detects occurrences of the word “explanation” (e.g., in “do not give any explanation”).
Negative Logits
Token       Logit
th�         -0.06
طل          -0.06
�           -0.06
WIDTH       -0.06
◎           -0.06
CFR         -0.06
wol         -0.06
Seen        -0.06
thinly      -0.06
タル        -0.06
Positive Logits
Token       Logit
.vertical   0.07
Static      0.07
Algorithm   0.06
_gamma      0.06
Sampling    0.06
Ге          0.06
ahead       0.06
’da         0.06
ером        0.06
boyfriend   0.06
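The two lists rank vocabulary tokens by how strongly this neuron pushes their logits down or up. As a minimal sketch, assuming each value is the dot product of the neuron's output weight vector with a token's unembedding vector (plain direct logit attribution, ignoring any final layer norm), the lists could be produced as below. The function names `logit_effects` and `top_and_bottom` and the toy vectors are hypothetical, not taken from the source.

```python
from typing import Dict, List, Tuple

# Assumption: a token's logit effect is the dot product of the neuron's
# output weights (w_out) with that token's unembedding vector.
# Any final layer norm applied by a real model is ignored here.


def logit_effects(w_out: List[float], w_unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the direct logit effect of the neuron on every token."""
    return {tok: sum(a * b for a, b in zip(w_out, vec)) for tok, vec in w_unembed.items()}


def top_and_bottom(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


if __name__ == "__main__":
    # Hypothetical 2-dimensional toy example with made-up token names.
    effects = logit_effects(
        [1.0, -0.5],
        {"tok_a": [0.1, -0.02], "tok_b": [-0.05, 0.02], "tok_c": [0.03, 0.0]},
    )
    neg, pos = top_and_bottom(effects, 1)
    print(neg, pos)
```

In the toy example, `tok_b` gets the most negative effect and `tok_a` the most positive one; a real model would use its full unembedding matrix and a larger `k` (ten per list above).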
Activation Density: 0.008%
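Activation density summarizes how often the neuron fires. The source does not state the exact definition, so the sketch below assumes it is the percentage of token positions whose activation exceeds zero; the function name `activation_density` and the default threshold are hypothetical choices.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of positions whose activation exceeds `threshold`.

    Assumed definition: "firing" means activation > threshold (default 0).
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # 8 firing positions out of 100,000 tokens.
    acts = [0.0] * 99_992 + [1.0] * 8
    print(activation_density(acts))
```

Under this assumed definition, a density of 0.008% corresponds to roughly 8 firing positions per 100,000 tokens.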