INDEX
Explanations
specific terms or phrases related to toxicity and its effects
Negative Logits
eſ              -0.62
ftance          -0.61
phim            -0.61
ftant           -0.60
énario          -0.59
citenamefont    -0.58
DockStyle       -0.58
ftances         -0.57
iffance         -0.56
ebvre           -0.56
Positive Logits
(token not rendered)    0.67
f                       0.50
vecka                   0.48
investissements         0.48
w                       0.47
M                       0.47
h                       0.47
B                       0.46
ricerche                0.46
b                       0.45
Activations Density 0.548%