Explanations
mentions of specific entities or people
instances of a specific character or symbol, possibly related to a unique formatting style
Negative Logits
condem      -0.78
matic       -0.71
enegger     -0.69
ktop        -0.66
uay         -0.66
ulative     -0.66
raints      -0.65
ulators     -0.64
misunder    -0.64
lapt        -0.63
Positive Logits
âĶĢâĶĢ                        1.21
ï¸ı                           1.10
âĶĢâĶĢâĶĢâĶĢ                  0.99
×Ķ                            0.89
conom                         0.88
×ķ                            0.88
λ                             0.87
ishable                       0.84
jj                            0.82
âĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢ      0.82
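
Several of the positive-logit tokens above (for example âĶĢâĶĢ and ×Ķ) look like the printable stand-ins that GPT-2-style byte-level BPE tokenizers use for raw bytes. The sketch below assumes that convention; the page itself does not name the tokenizer. The helper names `byte_decoder` and `decode_token` are hypothetical and chosen only for illustration. Tokens containing characters outside the byte table (such as λ) are returned unchanged.

```python
def byte_decoder():
    """Build the stand-in-character -> byte table (assumes the GPT-2 byte-level convention)."""
    # Printable bytes map to themselves; all others are shifted to code points 256 and up.
    keep = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))
    table = {}
    shift = 0
    for b in range(256):
        if b in keep:
            table[chr(b)] = b
        else:
            table[chr(256 + shift)] = b
            shift += 1
    return table


def decode_token(token):
    """Turn a byte-level token string back into readable text where possible."""
    table = byte_decoder()
    # If any character is not a byte stand-in, assume the token is already decoded.
    if any(ch not in table for ch in token):
        return token
    raw = bytes(table[ch] for ch in token)
    # A token may hold only part of a multi-byte character, so decode leniently.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["âĶĢâĶĢ", "×Ķ", "×ķ", "conom"]:
        decoded = decode_token(tok)
        print(repr(tok), "->", [f"U+{ord(c):04X}" for c in decoded])
```

Under this assumption the repeated âĶĢ tokens decode to runs of the box-drawing character U+2500, which would fit the explanation about a specific symbol tied to a formatting style.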
Activations Density 0.269%