INDEX
Explanations
phrases or words related to subjective judgments or opinions
complex phrases related to social issues and human experiences
Negative Logits
arnaev    -0.61
confir    -0.56
ĪĴ        -0.56
Pok       -0.54
ãĥ¯ãĥ³    -0.53
Sorce     -0.53
arthed    -0.51
Jagu      -0.51
Orig      -0.50
streng    -0.50
Positive Logits
".        2.29
",        2.22
"?        2.16
";        2.14
"!        2.09
"         2.02
":        2.01
"...      1.99
"â̦      1.99
".[       1.98
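Some tokens in the logit lists (for example ãĥ¯ãĥ³) look like the printable stand-ins that a GPT-2-style byte-level BPE vocabulary uses for raw bytes. The source does not name the tokenizer, so that is an assumption. Under it, the sketch below rebuilds the standard byte-to-character table and maps a token string back to text; `decode_token` is a hypothetical helper name, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2-style byte-level BPE."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Unprintable bytes are shifted to code points from U+0100 upward.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: printable stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Map a byte-level token string back to UTF-8 text (hypothetical helper)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Tokens that hold only part of a multi-byte character decode to U+FFFD.
    return raw.decode("utf-8", errors="replace")


# The token shown as ãĥ¯ãĥ³, written with escapes to avoid copy/paste damage.
token = "\u00e3\u0125\u00af\u00e3\u0125\u00b3"
print(decode_token(token))
```

Tokens that contain only a fragment of a multi-byte character, which may be the case for ĪĴ, will not decode to readable text on their own.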
Activations Density 0.481%
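The density figure is commonly read as the fraction of sampled tokens on which the feature fires at all. The source does not define it, so the sketch below assumes that reading; the activation values are made up for illustration and `activation_density` is a hypothetical name.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 tokens, 5 of which activate the feature.
acts = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
print(f"{activation_density(acts):.3%}")  # 0.500%
```

On this reading, a density of 0.481% would mean the feature is active on roughly 1 in 208 tokens.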