Explanations
phrases that express comparisons and similarities between experiences or concepts
Negative Logits
mability    -0.67
Джерела     -0.59
明明        -0.57
fubject     -0.57
purpoſe     -0.57
Silla       -0.56
Enllaços    -0.55
diſt        -0.55
tetten      -0.55
losis       -0.55
Positive Logits
like        0.76
enumii      0.72
Like        0.65
like        0.64
Like        0.63
kuten       0.59
seperti     0.58
giống       0.57
LIKE        0.57
kuin        0.57
Activations Density 0.520%