Explanations
non-English characters or gibberish symbols
occurrences of a specific character or symbol in various contexts
Negative Logits
Token        Logit
Gap          -0.85
ierrez       -0.74
Extrem       -0.68
aneers       -0.67
raints       -0.65
htaking      -0.64
itches       -0.63
similarity   -0.63
ativity      -0.63
gewater      -0.62
Positive Logits
Token        Logit
İ            1.17
å§«          1.11
士           1.11
ãĤ¨ãĥ«       1.09
âĶĢâĶĢ       0.98
Ü            0.97
¯¯¯¯         0.89
âĸijâĸij       0.87
女           0.86
¯¯           0.84
Activations Density: 0.001%