INDEX
Explanations
instances of deception or misrepresentation
Negative Logits
circuits    -0.15
oren        -0.15
_OBJ        -0.14
celain      -0.14
zza         -0.14
Deutsche    -0.14
onne        -0.14
cases       -0.14
bs          -0.14
extr        -0.14
Positive Logits
aul                0.16
miêu               0.16
Ấ                  0.15
Cabin              0.15
άκ                 0.15
лиш                0.15
acho               0.15
cellForRowAt       0.14
_representation    0.14
lea                0.14
Activations Density 0.003%