# @mtcute/html-parser

📖 [API Reference](https://ref.mtcute.dev/modules/_mtcute_html_parser.html)

HTML entities parser for mtcute

> **NOTE**: The syntax implemented here is **incompatible** with Bot API _HTML_.
>
> Please read [Syntax](#syntax) below for a detailed explanation.

## Features
- Supports all entities that Telegram supports
- Supports nested entities
- Proper newline/whitespace handling (just like in real HTML)
- [Interpolation](#interpolation)!

## Usage

```ts
import { html } from '@mtcute/html-parser'

tg.sendText(
    'me',
    html`
        Hello, <b>me</b>! Updates from the feed:<br>
        ${await getUpdatesFromFeed()}
    `
)
```

## Syntax

`@mtcute/html-parser` uses [`htmlparser2`](https://www.npmjs.com/package/htmlparser2) under the hood, so the parser
supports nearly any HTML. However, since the text is still processed in a custom way for Telegram, the supported subset
of features is documented below:

## Line breaks and spaces

Line breaks are **not** preserved; use `<br>` to break a line.
This makes the syntax very close to the one used when building web pages.

Multiple spaces and indents are collapsed (except in `pre`). When you do need multiple spaces, use `&nbsp;` instead.
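
To make these rules concrete, here is a minimal sketch that approximates them on a plain string. `approximateLayout` is a hypothetical helper written for this README, not part of `@mtcute/html-parser`, and the real parser may treat whitespace around tags differently.

```ts
// Rough approximation of the whitespace rules above.
// NOT the parser's actual implementation; it ignores `<pre>` and all other tags.
function approximateLayout(source: string): string {
    return source
        .replace(/[ \t\r\n]+/g, ' ') // runs of spaces, indents and newlines become one space
        .replace(/^ | $/g, '') // drop the single space left at either end
        .replace(/ ?<br> ?/g, '\n') // `<br>` is the only way to break a line
        .replace(/&nbsp;/g, '\u00A0') // non-breaking spaces are not collapsed
}

const laidOut = approximateLayout('\n    Hello,   world!<br>\n    Second&nbsp;&nbsp;line\n')
// laidOut === 'Hello, world!\nSecond\u00A0\u00A0line'
```

The indentation and the newline after `<br>` disappear, while the two `&nbsp;` entities remain as two separate non-breaking spaces.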

## Inline entities

Inline entities are entities that are in-line with other text. We support these entities:

| Name             | Code                                                             | Result (visual)              |
| ---------------- | ---------------------------------------------------------------- | ---------------------------- |
| Bold             | `<b>text</b>`                                                    | **text**                     |
| Italic           | `<i>text</i>`                                                    | _text_                       |
| Underline        | `<u>text</u>`                                                    | <u>text</u>                  |
| Strikethrough    | `<s>text</s>`                                                    | ~~text~~                     |
| Spoiler          | `<spoiler>text</spoiler>` (or `tg-spoiler`)                      | N/A                          |
| Monospace (code) | `<code>text</code>`                                              | `text`                       |
| Text link        | `<a href="https://google.com">Google</a>`                        | [Google](https://google.com) |
| Text mention     | `<a href="tg://user?id=1234567">Name</a>`                        | N/A                          |
| Custom emoji     | `<emoji id="12345">😄</emoji>` (or `<tg-emoji emoji-id="...">`) | N/A                          |

> **Note**: `<strong>`, `<em>`, `<ins>`, `<strike>`, `<del>` are not supported because they are redundant

> **Note**: It is up to the client to look up user's input entity by ID for text mentions.
> In most cases, you can only use IDs of users that were seen by the client while using given storage.
>
> Alternatively, you can explicitly provide access hash like this:
> `<a href="tg://user?id=1234567&hash=abc">Name</a>`, where `abc` is user's access hash
> written as a hexadecimal integer. Order of the parameters does matter, i.e.
> `tg://user?hash=abc&id=1234567` will not be processed as expected.
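
Since the parameter order matters, it can help to build mention links in one place. The sketch below uses two hypothetical helpers, `escapeHtml` and `mentionLink`, which are not part of `@mtcute/html-parser`. It assumes the access hash is already available as a hexadecimal string, and that the parser decodes standard entities such as `&lt;` (as the use of `&nbsp;` above suggests).

```ts
// Hypothetical helpers for illustration; not exported by @mtcute/html-parser.

// Escapes characters that would otherwise be read as markup in the link text.
function escapeHtml(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
}

// Builds the markup for a text mention.
// `id` always comes first; the optional `hash` follows it.
function mentionLink(name: string, userId: number, accessHashHex?: string): string {
    const query = accessHashHex === undefined
        ? `id=${userId}`
        : `id=${userId}&hash=${accessHashHex}`

    return `<a href="tg://user?${query}">${escapeHtml(name)}</a>`
}

const withHash = mentionLink('Name', 1234567, 'abc')
// withHash === '<a href="tg://user?id=1234567&hash=abc">Name</a>'
```

The result is a plain string, so it still has to go through the parser (for example by calling `html` as a simple function) before it becomes a mention entity.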

## Block entities

The only block entity that Telegram supports is `<pre>`, so it is the only block tag we support.

Optionally, the language for a `<pre>` block can be specified like this:

```html
<pre language="typescript">export type Foo = 42</pre>
```

| Code                                                                                | Result (visual)              |
| ----------------------------------------------------------------------------------- | ---------------------------- |
| <pre>&lt;pre&gt;multiline\ntext&lt;/pre&gt;</pre>                                   | <pre>multiline<br>text</pre> |
| <pre>&lt;pre language="javascript"&gt;<br>  export default 42<br>&lt;/pre&gt;</pre> | <pre>export default 42</pre> |

## Nested and overlapped entities

HTML is a nested language, and so is this parser. It does support nested entities, but overlapped entities will not work
as expected!

Overlapping entities are supported in `unparse()`, though.

| Code                                                                                                                | Result (visual)                                                          |
|---------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| `<b>Welcome back, <i>User</i>!</b>`                                                                                 | **Welcome back, _User_!**                                                |
| `<b>bold <i>and</b> italic</i>`                                                                                     | **bold _and_** italic<br>⚠️ <i>word "italic" is not actually italic!</i> |
| `<b>bold <i>and</i></b><i> italic</i>`<br>⚠️ <i>this is how <code>unparse()</code> handles overlapping entities</i> | **bold _and_** _italic_                                                  |
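
One way to reason about the rows above is to treat each entity as a range over the plain text `bold and italic`. The sketch below is an illustration under that assumption, not the library's code: `Range`, `partiallyOverlaps` and `splitAtOuterEnd` are hypothetical names, and real entities carry more fields than an offset and a length.

```ts
// Hypothetical shape for illustration; real entity types carry more fields.
interface Range {
    offset: number
    length: number
}

// True when two ranges share some text but neither fully contains the other.
function partiallyOverlaps(a: Range, b: Range): boolean {
    const aEnd = a.offset + a.length
    const bEnd = b.offset + b.length
    const intersects = a.offset < bEnd && b.offset < aEnd
    const aContainsB = a.offset <= b.offset && bEnd <= aEnd
    const bContainsA = b.offset <= a.offset && aEnd <= bEnd

    return intersects && !aContainsB && !bContainsA
}

// Splits `inner` where `outer` ends, so that both pieces can be written as nested tags.
// Assumes `inner` starts inside `outer` and ends after it.
function splitAtOuterEnd(outer: Range, inner: Range): [Range, Range] {
    const outerEnd = outer.offset + outer.length
    const innerEnd = inner.offset + inner.length

    return [
        { offset: inner.offset, length: outerEnd - inner.offset },
        { offset: outerEnd, length: innerEnd - outerEnd },
    ]
}

const bold: Range = { offset: 0, length: 8 } // "bold and"
const italic: Range = { offset: 5, length: 10 } // "and italic"

const overlapping = partiallyOverlaps(bold, italic) // true
const [insideBold, afterBold] = splitAtOuterEnd(bold, italic)
// insideBold covers "and" (offset 5, length 3)
// afterBold covers " italic" (offset 8, length 7)
```

The two pieces correspond to the `<i>and</i>` and `<i> italic</i>` tags in the last row of the table.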

## Interpolation

Being a tagged template literal, `html` supports interpolation.

You can interpolate one of the following:
- `string` - **will not** be parsed, and appended to plain text as-is
  - In case you want the string to be parsed, use `html` as a simple function: <code>html\`... ${html('&lt;b&gt;bold&lt;/b&gt;')} ...\`</code>
- `number` - will be converted to string and appended to plain text as-is
- `TextWithEntities` or `MessageEntity` - will add the text and its entities to the output. This is the type returned by `html` itself:
  ```ts
  const bold = html`<b>bold</b>`
  const text = html`Hello, ${bold}!`
  ```
- falsy value (i.e. `null`, `undefined`, `false`) - will be ignored

Note that because of interpolation, you almost never need to think about escaping anything,
since the values are not even parsed as HTML, and are appended to the output as-is.
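
This follows from how tagged templates work in JavaScript: the tag function receives the literal chunks and the interpolated values as separate arguments, so a tag can choose to parse only the former. The toy tag below makes that visible. `describeParts` is a hypothetical function written for this README and says nothing about how `html` is implemented internally.

```ts
// Toy tag for illustration only; this is not how `html` is implemented.
function describeParts(strings: TemplateStringsArray, ...values: unknown[]) {
    return {
        literals: strings.slice(), // the chunks written directly in the template
        values, // the interpolated values, kept separate from the markup
    }
}

const userInput = '<b>not bold</b>'
const parts = describeParts`Hello, ${userInput}!`
// parts.literals: ['Hello, ', '!']
// parts.values:   ['<b>not bold</b>']
```

Because `userInput` never becomes part of the literal chunks, a tag that parses only those chunks treats it as plain text.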