
Building a custom language

Yesterday, I wrote about the why of making a new syntax highlighter. Today I want to write about the how.

Let's explain how tempest/highlight works by implementing a new language. Blade is a good candidate; it looks something like this:

@if(! empty($items))
    <div class="container">
        Items: {{ count($items) }}.
    </div>
@endif

In order to build such a new language, you need to understand three concepts of how code is highlighted: patterns, injections, and languages.

1. Patterns

A pattern represents a part of the code that should be highlighted. It can target a single keyword such as return or class, or any other piece of code, for example a comment like /* this is a comment */ or an attribute like #[Get(uri: '/')].

Each pattern is represented by a simple class that provides a regex and a TokenType. The regex matches the content that is relevant to this specific pattern, while the TokenType is an enum value that determines how that match is colored.

Here's an example of a simple pattern to match the namespace of a PHP file:

use Tempest\Highlight\IsPattern;
use Tempest\Highlight\Pattern;
use Tempest\Highlight\Tokens\TokenType;

final readonly class NamespacePattern implements Pattern
{
    use IsPattern;

    public function getPattern(): string
    {
        return 'namespace (?<match>[\w\\\\]+)';
    }

    public function getTokenType(): TokenType
    {
        return TokenType::TYPE;
    }
}

Note that each pattern must include a regex capture group that's named match. The content that matched within this group will be highlighted.

For example, the regex namespace (?<match>[\w\\\\]+) matches the word namespace followed by a space and a namespace name, but only the part within the named group (?<match>…) is actually colored. In practice, that means the namespace name matched by [\w\\\\]+ is colored.
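
To see the named group in isolation, the regex can be run through preg_match() directly. This is a minimal sketch that only exercises the regex, not the highlighter itself; the "/" delimiters and the sample input are assumptions, and the library may wrap the pattern differently internally.

```php
<?php

// The same regex string returned by NamespacePattern::getPattern().
$pattern = 'namespace (?<match>[\w\\\\]+)';

// Sample input, made up for this example.
$code = 'namespace App\Http\Controllers;';

// Assumption: plain "/" delimiters. PREG_OFFSET_CAPTURE also reports where each group starts.
preg_match('/' . $pattern . '/', $code, $matches, PREG_OFFSET_CAPTURE);

// Each entry is a [text, byte offset] pair.
[$value, $offset] = $matches['match'];

echo $value, PHP_EOL;  // the namespace name only, without the "namespace " prefix
echo $offset, PHP_EOL; // where the named group starts within $code
```

The full match still includes the word namespace, but only the text captured by the match group is what a pattern would color.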

Yes, you'll need some basic knowledge of regex. Head over to https://regexr.com/ if you need help, or take a look at the existing patterns in this repository.

In summary:

  • Pattern classes provide a regex pattern that matches parts of code.
  • Those regexes should contain a group named match, written as (?<match>…). This group represents the code that will actually be highlighted.
  • Finally, a pattern provides a TokenType, which is used to determine the highlight style for the specific match.

2. Injections

Once you've understood patterns, the next step is to understand injections. Injections are used to highlight different languages within one code block. For example: HTML could contain CSS, which should be styled properly as well.

An injection will tell the highlighter that it should treat a block of code as a different language. For example:

<div>
    <x-slot name="styles">
        <style>
            body {
                background-color: red;
            }
        </style>
    </x-slot>
</div>

Everything within <style></style> tags should be treated as CSS. That's done by injection classes:

use Tempest\Highlight\Highlighter;
use Tempest\Highlight\Injection;
use Tempest\Highlight\IsInjection;
use Tempest\Highlight\ParsedInjection;

final readonly class CssInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '<style>(?<match>(.|\n)*)<\/style>';
    }

    public function parseContent(string $content, Highlighter $highlighter): ParsedInjection
    {
        return new ParsedInjection(
            content: $highlighter->parse($content, 'css')
        );
    }
}

Just like patterns, an injection must provide a pattern. This pattern, for example, will match anything between style tags: <style>(?<match>(.|\n)*)<\/style>.

The second step in providing an injection is to parse the matched content into another language. That's what the parseContent() method is for. In this case, we'll get all code between the style tags that was matched with the named (?<match>…) group, and parse that content as CSS instead of whatever language we're currently dealing with.
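
The injection regex can be tried on its own with preg_match() to see what ends up in the match group. This is a sketch under assumptions: the "/" delimiters, the sample inputs, and the lazy variant are not part of the library and are only here for comparison.

```php
<?php

// The same regex string returned by CssInjection::getPattern().
$pattern = '<style>(?<match>(.|\n)*)<\/style>';

// Sample input, made up for this example.
$html = "<div>\n<style>\nbody { color: red; }\n</style>\n</div>";

preg_match('/' . $pattern . '/', $html, $matches);
$css = $matches['match']; // everything between the tags, including the newlines

// With two style blocks, the greedy (.|\n)* runs up to the LAST closing tag.
$twoBlocks = '<style>a {}</style><p>text</p><style>b {}</style>';
preg_match('/' . $pattern . '/', $twoBlocks, $greedy);

// Hypothetical lazy variant (*? instead of *): stops at the first closing tag.
$lazyPattern = '<style>(?<match>(.|\n)*?)<\/style>';
preg_match('/' . $lazyPattern . '/', $twoBlocks, $lazy);

var_dump($css, $greedy['match'], $lazy['match']);
```

For a document with a single style block both variants capture the same content; the difference only shows up once several blocks appear in the same input.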

In summary:

  • Injections provide a regex that matches a blob of code in language A that is embedded within language B.
  • Just like patterns, injection regexes should contain a group named match, which is written like so: (?<match>…).
  • Finally, an injection will use the highlighter to parse its matched content into another language.

3. Languages

The last concept to understand: languages are classes that bring patterns and injections together. Take a look at the HtmlLanguage, for example:

class HtmlLanguage extends BaseLanguage
{
    public function getName(): string
    {
        return 'html';
    }
    
    public function getAliases(): array
    {
        return ['htm', 'xhtml'];
    }
    
    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
            new PhpInjection(),
            new PhpShortEchoInjection(),
            new CssInjection(),
            new CssAttributeInjection(),
        ];
    }

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
            new OpenTagPattern(),
            new CloseTagPattern(),
            new TagAttributePattern(),
            new HtmlCommentPattern(),
        ];
    }
}

This HtmlLanguage class specifies the following things:

  • PHP can be injected within HTML, both with the short echo tag <?= and longer <?php tags
  • CSS can be injected as well; JavaScript support is still a work in progress
  • There are a bunch of patterns to highlight HTML tags properly

On top of that, it extends from BaseLanguage. This is a language class that adds a bunch of cross-language injections, such as blurs and highlights. Your language doesn't need to extend from BaseLanguage and could implement Language directly if you want to.

With these three concepts in place, let's bring everything together to explain how you can add your own languages.

Adding custom languages

So we're adding Blade support. We could create a new language class and start from scratch, but it's probably easier to extend an existing language, and HtmlLanguage is the most natural fit. Let's create a new BladeLanguage class that extends from HtmlLanguage:

class BladeLanguage extends HtmlLanguage
{
    public function getName(): string
    {
        return 'blade';
    }
    
    public function getAliases(): array
    {
        return [];
    }
    
    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
        ];
    }

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
        ];
    }
}

With this class in place, we can start adding our own patterns and injections. Let's start with a pattern that matches all Blade keywords, which are always prefixed with the @ sign:

final readonly class BladeKeywordPattern implements Pattern
{
    use IsPattern;

    public function getPattern(): string
    {
        return '(?<match>\@[\w]+)\b';
    }

    public function getTokenType(): TokenType
    {
        return TokenType::KEYWORD;
    }
}

And register it in our BladeLanguage class:

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
            new BladeKeywordPattern(),
        ];
    }
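
To check what the keyword regex picks up, it can be run over a small template with preg_match_all(). This is only a sketch of the regex behavior; the "/" delimiters and both sample inputs are assumptions made for the example.

```php
<?php

// The same regex string returned by BladeKeywordPattern::getPattern().
$pattern = '(?<match>\@[\w]+)\b';

// Sample Blade template, made up for this example.
$template = <<<'BLADE'
@if(! empty($items))
    <div class="container">
        Items: {{ count($items) }}.
    </div>
@endif
BLADE;

preg_match_all('/' . $pattern . '/', $template, $matches);
$keywords = $matches['match'];

// Caveat: anything shaped like "@word" matches, including part of an email address.
preg_match_all('/' . $pattern . '/', 'Contact: me@example.com', $emailMatches);

print_r($keywords);
print_r($emailMatches['match']);
```

The second call shows that the regex is deliberately loose: it does not know the list of real Blade directives, so any @ followed by word characters is treated as a keyword.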

Next, there are a couple of places within Blade where you can write PHP code: between @php and @endphp, as well as within keyword brackets, like @if (count(…)). Let's write two injections for that:

final readonly class BladePhpInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '\@php(?<match>(.|\n)*?)\@endphp';
    }

    public function parseContent(string $content, Highlighter $highlighter): ParsedInjection
    {
        return new ParsedInjection(
            content: $highlighter->parse($content, 'php')
        );
    }
}

final readonly class BladeKeywordInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '(\@[\w]+)\s?\((?<match>.*)\)';
    }

    public function parseContent(string $content, Highlighter $highlighter): ParsedInjection
    {
        return new ParsedInjection(
            content: $highlighter->parse($content, 'php')
        );
    }
}
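
Both injection regexes can be tried with preg_match() to see which part would be handed to the PHP highlighter. This is a sketch of the regex behavior only; the "/" delimiters and the sample inputs are assumptions.

```php
<?php

// The same regex strings returned by the two injections above.
$phpBlockPattern = '\@php(?<match>(.|\n)*?)\@endphp';
$keywordPattern = '(\@[\w]+)\s?\((?<match>.*)\)';

// Sample inputs, made up for this example.
$block = "@php\n    \$total = count(\$items);\n@endphp";
preg_match('/' . $phpBlockPattern . '/', $block, $blockMatches);

$condition = '@if(! empty($items))';
preg_match('/' . $keywordPattern . '/', $condition, $conditionMatches);

var_dump($blockMatches['match']);     // the code between @php and @endphp
var_dump($conditionMatches['match']); // the expression inside the outer brackets
```

In the keyword regex the greedy .* is useful: it backtracks to the last closing bracket on the line, so nested calls such as empty($items) stay intact inside the match group.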

Let's add these to our BladeLanguage class as well:

    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
            new BladePhpInjection(),
            new BladeKeywordInjection(),
        ];
    }

Next, you can write {{ … }} and {!! … !!} to echo output. Whatever is between these brackets is also considered PHP, so we need one more injection:

final readonly class BladeEchoInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        // Lazy quantifier, so that two echoes on the same line are matched separately.
        return '({{|{!!)(?<match>.*?)(}}|!!})';
    }

    public function parseContent(string $content, Highlighter $highlighter): ParsedInjection
    {
        return new ParsedInjection(
            content: $highlighter->parse($content, 'php')
        );
    }
}
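
How much the match group captures depends on the quantifier inside it. The sketch below compares a greedy and a lazy form of the echo regex with preg_match_all(); the "/" delimiters, the variable names, and the sample inputs are assumptions made for the example.

```php
<?php

// Two forms of the echo regex: greedy (.*) and lazy (.*?).
$greedyPattern = '({{|{!!)(?<match>.*)(}}|!!})';
$lazyPattern = '({{|{!!)(?<match>.*?)(}}|!!})';

// With a single echo on the line, both forms capture the same expression.
$single = 'Items: {{ count($items) }}.';
preg_match('/' . $greedyPattern . '/', $single, $singleGreedy);
preg_match('/' . $lazyPattern . '/', $single, $singleLazy);

// With two echoes on one line, the greedy form swallows everything in between.
$double = '{{ $a }} and {!! $b !!}';
preg_match_all('/' . $greedyPattern . '/', $double, $doubleGreedy);
preg_match_all('/' . $lazyPattern . '/', $double, $doubleLazy);

print_r($doubleGreedy['match']);
print_r($doubleLazy['match']);
```

The lazy form yields one match per echo statement, which is usually what you want when each expression should be highlighted as its own piece of PHP.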

And, finally, you can write Blade comments like so: {{-- --}}. This can be a simple pattern:

final readonly class BladeCommentPattern implements Pattern
{
    use IsPattern;

    public function getPattern(): string
    {
        return '(?<match>\{\{\-\-(.|\n)*?\-\-\}\})';
    }

    public function getTokenType(): TokenType
    {
        return TokenType::COMMENT;
    }
}
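
To get a feel for how several patterns work together on one piece of code, here is a toy model. It is not how tempest/highlight is implemented: collectTokens() is a hypothetical helper, plain arrays stand in for Pattern classes, and strings stand in for TokenType values. It only shows that each regex contributes its match groups, tagged with a type.

```php
<?php

/**
 * Toy stand-in for a highlighter pass: runs every regex and collects the "match" groups.
 *
 * @param list<array{regex: string, type: string}> $patterns
 * @return list<array{type: string, value: string, offset: int}>
 */
function collectTokens(string $code, array $patterns): array
{
    $tokens = [];

    foreach ($patterns as $pattern) {
        // Assumption: plain "/" delimiters around the regex.
        preg_match_all('/' . $pattern['regex'] . '/', $code, $matches, PREG_OFFSET_CAPTURE);

        foreach ($matches['match'] as [$value, $offset]) {
            $tokens[] = ['type' => $pattern['type'], 'value' => $value, 'offset' => $offset];
        }
    }

    // Sort by position so the tokens read left to right.
    usort($tokens, fn (array $a, array $b): int => $a['offset'] <=> $b['offset']);

    return $tokens;
}

// The regex strings of the Blade keyword and comment patterns from this post.
$patterns = [
    ['regex' => '(?<match>\@[\w]+)\b', 'type' => 'keyword'],
    ['regex' => '(?<match>\{\{\-\-(.|\n)*?\-\-\}\})', 'type' => 'comment'],
];

$tokens = collectTokens('@if($show) {{-- hidden note --}} @endif', $patterns);

foreach ($tokens as $token) {
    echo $token['type'], ': ', $token['value'], PHP_EOL;
}
```

The real highlighter also has to deal with overlapping matches and injections, which this sketch leaves out entirely.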

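With all of that in place, and with BladeEchoInjection and BladeCommentPattern registered in BladeLanguage the same way as the earlier classes, the only thing left to do is to add our language to the highlighter: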

$highlighter->addLanguage(new BladeLanguage());

And we're done! Blade support with just a handful of patterns and injections!