INDEX

Explanations

don

words and phrases related to violent, destructive, or unethical actions and intentions.

Evil Persona

Negative Logits

"tool"  -0.07
" кле"  -0.06
".angle"  -0.06
" bladder"  -0.06
" jeep"  -0.06
" чет"  -0.06
"τισ"  -0.06
".look"  -0.06
"_GRID"  -0.06
" polish"  -0.06

Positive Logits

"_updates"  0.07
"�"  0.07
"	uint"  0.07
" likes"  0.07
";&#"  0.07
" Libraries"  0.07
"	conn"  0.06
"(Initialized"  0.06
"Zub"  0.06
"jure"  0.06

Activations Density: 0.048%