Explanations
specific formatting or structural elements in the text, particularly those related to citations or references
Negative Logits
albert        -0.80
рян           -0.74
fout          -0.72
[toxicity=0]  -0.70
Blak          -0.70
viel          -0.70
lık           -0.69
Raton         -0.69
ASE           -0.67
inode         -0.67
Positive Logits
¡¡            1.15
wikipagina    0.91
)**           0.89
(**           0.88
.**           0.86
/****         0.86
]**           0.81
kwargs        0.80
¡¡            0.79
{!!           0.78
Activations Density 0.349%