INDEX
Explanations
duplicates or similarities
references to the concept of identicality or similarity
Negative Logits
stra      -0.74
raq       -0.72
================================================================  -0.70
Explore   -0.67
âĵĺ       -0.66
bane      -0.65
Pal       -0.65
Fan       -0.65
veland    -0.64
HI        -0.62
Positive Logits
twins      1.21
twin       0.99
icut       0.88
etrical    0.83
lihood     0.83
identical  0.81
minded     0.74
pairs      0.73
etry       0.72
sized      0.71
Activations Density 0.036%