Extending ctags with Regex parser (optlib)

Maintainer: Masatake YAMATO <yamato@redhat.com>

Writing a regex parser and using it as an option library (optlib)

exuberant-ctags provides a way to customize ctags with options like --langdef=<LANG> and --regex-<LANG>. An option file in which such options are written can be loaded with --options=OPTION_FILE.

This feature was extended so that ctags treats option files as libraries. Developers of universal-ctags can maintain option files as part of universal-ctags, making them part of its release. With make install they are installed along with the ctags command.

universal-ctags prepares directories where the option files are installed.

Consider a GNU/Linux distribution. The following directories are searched when loading an option file:

  1. ~/.ctags.d/optlib
  2. /etc/ctags/optlib
  3. /usr/share/ctags/optlib

The name of an option file must have .conf or .ctags as suffix.

If ctags is invoked with the following command line:

$ ctags --options=m4 ...

The following files are searched, in this order, to find m4:

  1. ~/.ctags.d/optlib/m4.conf
  2. ~/.ctags.d/optlib/m4.ctags
  3. /etc/ctags/optlib/m4.conf
  4. /etc/ctags/optlib/m4.ctags
  5. /usr/share/ctags/optlib/m4.conf
  6. /usr/share/ctags/optlib/m4.ctags

These are called built-in search paths.

If these search paths are not desired, the full path of the option file can be specified directly with --options. The parameter must start with / (absolute path) or ./ (relative path), like:

$ ctags --options=/home/user/test/m4.cf
$ ctags --options=./test/m4.cf

In this case the suffix restriction does not apply.

On GNU/Linux more directories can be added with the environment variable CTAGS_DATA_PATH.

$ CTAGS_DATA_PATH=A:B ctags --options=m4 ...

The files are searched in the order described below to find m4:

  1. A/optlib/m4.conf
  2. A/optlib/m4.ctags
  3. B/optlib/m4.conf
  4. B/optlib/m4.ctags
  5. ~/.ctags.d/optlib/m4.conf
  6. ...

Furthermore, --data-path=[+]PATH can be used to add more directories on top of those given by the environment variable:

$ CTAGS_DATA_PATH=A:B ctags --data-path=+C --options=m4 ...

In this case files are searched in the following order to find m4:

  1. C/optlib/m4.conf
  2. C/optlib/m4.ctags
  3. A/optlib/m4.conf
  4. A/optlib/m4.ctags
  5. B/optlib/m4.conf
  6. B/optlib/m4.ctags
  7. ~/.ctags.d/optlib/m4.conf
  8. ...

If + is omitted, the directory is set instead of added:

$ CTAGS_DATA_PATH=A:B ctags --data-path=C --options=m4 ...

In this case files are searched in the following order to find m4:

  1. C/optlib/m4.conf
  2. C/optlib/m4.ctags

The directory list can be emptied using the reserved file name NONE:

$ CTAGS_DATA_PATH=A:B ctags --data-path=NONE --options=m4 ...

In this case ctags only tries to load ./m4.
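
The search rules above can be summarized in a few lines of Python. This is only a sketch of the ordering described in this section, not the code ctags itself runs: the function name optlib_candidates and its parameters are made up for illustration, and the subdirectory is assumed to be optlib in every case, as in the other listings.

```python
# Built-in optlib directories as listed in this section (GNU/Linux layout).
BUILTIN_DIRS = ["~/.ctags.d/optlib", "/etc/ctags/optlib", "/usr/share/ctags/optlib"]
SUFFIXES = (".conf", ".ctags")


def optlib_candidates(name, env_data_path="", data_path_opt=None):
    """Return the files tried, in order, for --options=<name>.

    env_data_path models CTAGS_DATA_PATH ("A:B"); data_path_opt models the
    argument of --data-path ("+C", "C", "NONE", or None when not given).
    """
    # A leading "/" or "./" bypasses both the search path and the suffix rule.
    if name.startswith(("/", "./")):
        return [name]
    # The reserved name NONE empties the directory list.
    if data_path_opt == "NONE":
        return ["./" + name]
    if data_path_opt is not None and not data_path_opt.startswith("+"):
        # Without "+", the directory replaces the whole list.
        dirs = [data_path_opt + "/optlib"]
    else:
        dirs = [d + "/optlib" for d in env_data_path.split(":") if d] + BUILTIN_DIRS
        if data_path_opt:
            # With "+", the directory is put in front of everything else.
            dirs.insert(0, data_path_opt[1:] + "/optlib")
    return [d + "/" + name + suffix for d in dirs for suffix in SUFFIXES]


if __name__ == "__main__":
    for path in optlib_candidates("m4", "A:B", "+C"):
        print(path)
```

Running the module prints the C, A, B and built-in candidates in the order given in the listing for --data-path=+C.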

See also “Loading option recursively”.

How a directory is set or added to the search path can be reviewed using the --verbose option. This is useful for debugging this feature.

Pull requests with updated or new option files are welcome by ctags developers.

NOTE: Although --data-path has the highest priority, it doesn’t affect the automatic option file loading stage. The following files are automatically loaded when ctags starts:

  1. /ctags.cnf (on MSDOS, MSWindows only)
  2. /etc/ctags.conf
  3. /usr/local/etc/ctags.conf
  4. $HOME/.ctags
  5. $HOME/ctags.cnf (on MSDOS, MSWindows only)
  6. .ctags
  7. ctags.cnf (on MSDOS, MSWindows only)

NOTE: This feature is still experimental. The names of directories, suffix rules and other conventions may change.

See “Contributing an optlib” if you have a good optlib.

Loading option recursively

The actual option file loading rules are more complex than those explained in “Option library”. If a directory is specified as the parameter for --options instead of a file, universal-ctags loads the option files under the directory recursively.

Consider the following command line on a GNU/Linux distribution:

$ ctags --options=bundle ...

The following directories are searched first:

  1. ~/.ctags.d/optlib/bundle.d
  2. /etc/ctags/optlib/bundle.d
  3. /usr/share/ctags/optlib/bundle.d

If bundle.d is found and is a directory, the files (*.ctags and *.conf) and directories (*.d) under it are loaded recursively.
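
As a rough picture of what "loaded recursively" means, the following Python sketch walks a directory and keeps only the names that the suffix rules accept. The function names are invented for this example, and the alphabetical traversal order is an assumption; the text does not say in which order ctags visits the entries.

```python
import os
import tempfile


def collect_option_files(directory):
    """List the option files under `directory`: *.ctags and *.conf files are
    taken, *.d subdirectories are descended into, anything else is skipped.
    Alphabetical order is an assumption of this sketch."""
    loaded = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isdir(path) and entry.endswith(".d"):
            loaded.extend(collect_option_files(path))
        elif os.path.isfile(path) and entry.endswith((".ctags", ".conf")):
            loaded.append(path)
    return loaded


def demo():
    """Build a throwaway bundle.d and return the relative paths picked up."""
    with tempfile.TemporaryDirectory() as root:
        bundle = os.path.join(root, "bundle.d")
        os.makedirs(os.path.join(bundle, "extra.d"))
        os.makedirs(os.path.join(bundle, "ignored"))  # no .d suffix: skipped
        for parts in (["a.ctags"], ["b.conf"], ["README"],
                      ["extra.d", "c.ctags"], ["ignored", "d.ctags"]):
            with open(os.path.join(bundle, *parts), "w"):
                pass
        return [os.path.relpath(p, bundle) for p in collect_option_files(bundle)]


if __name__ == "__main__":
    print(demo())
```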

NOTE: If bundle.d is not found in the list above, the file bundle.ctags or bundle.conf is searched for. This rule is a bit ugly. The following search order would look better:

  1. ~/.ctags.d/optlib/bundle.d
  2. ~/.ctags.d/optlib/bundle.ctags
  3. ~/.ctags.d/optlib/bundle.conf
  4. /etc/ctags/optlib/bundle.d
  5. /etc/ctags/optlib/bundle.ctags
  6. /etc/ctags/optlib/bundle.conf
  7. /usr/share/ctags/optlib/bundle.d
  8. /usr/share/ctags/optlib/bundle.ctags
  9. /usr/share/ctags/optlib/bundle.conf

NOTE: This feature requires the scandir library function. It may be disabled on platforms where scandir is not available. Check for option-directory in the supported features:

$ ./ctags --list-features --with-list-header=no

Directories for preloading

As written in “Option library”, option libraries can be loaded with the --options option. However, loading them without specifying them explicitly may be desired.

The following files can be used for this purpose.

  • ~/.ctags
  • /ctags.cnf (on MSDOS, MSWindows only)
  • /etc/ctags.conf
  • /usr/local/etc/ctags.conf

This preloading feature comes from exuberant-ctags. However, its implementation has two weaknesses.

  • The file must be edited when an option library is to be loaded.

    If one wants to add or remove an --options= entry in a ctags.conf, one currently has to use sed or a similar tool to add or remove the line for the entry in /usr/local/etc/ctags.conf (or /etc/ctags.conf).

    There is a discussion about a similar issue in http://marc.info/?t=129794755000003&r=1&w=2 about /etc/exports of NFS.

  • The configuration defined by the system administrator cannot be overridden.

    A user must accept all configuration including --options= in /etc/ctags.conf and /usr/local/etc/ctags.conf.

The following directories were introduced for preloading purposes.

  1. ~/.ctags.d/preload
  2. /etc/ctags/preload
  3. /usr/share/ctags/preload

All files and directories under the directories are loaded recursively, with two restrictions:

  • file/directory name

    The same suffix rules written in “Option library” and “Loading option recursively” are applied in preloading, too.

  • overriding

    The traversing and loading are done in the order listed above. Once a file is loaded, another file with the same name is not loaded. Once a directory is traversed, another directory with the same name is not traversed.

    universal-ctags prepares /usr/share/ctags/preload/default.ctags. If you want ctags not to load it, make an empty file at ~/.ctags.d/preload/default.ctags. To customize /usr/share/ctags/preload/default.ctags, copy the file to ~/.ctags.d/preload/default.ctags and edit it as desired.

    Assume /usr/share/ctags/preload/something.d exists and some .ctags files are in the directory. By making an empty directory at ~/.ctags.d/preload/something.d, you can make ctags not traverse /usr/share/ctags/preload/something.d. As a result, the .ctags files under /usr/share/ctags/preload/something.d are not loaded.

    To customize one of the files under /usr/share/ctags/preload/something.d, copy /usr/share/ctags/preload/something.d to ~/.ctags.d/preload/something.d recursively. Symbolic links can also be used. After copying or symbolic linking, edit the copied file.
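
The overriding rule can be modelled with a small Python function. This is a simplified sketch: the directory trees are plain dictionaries rather than a real file system, the names preload_plan and EXAMPLE are invented here, and the suffix rules are left out to keep the first-name-wins logic visible.

```python
def preload_plan(trees):
    """Return the files that would be loaded.

    `trees` is a list of (directory, tree) pairs in search order. A tree maps
    a name to a string (a file) or to a nested dict (a *.d directory).
    """
    seen_files, seen_dirs, loaded = set(), set(), []

    def walk(base, tree):
        for name in sorted(tree):
            child, path = tree[name], base + "/" + name
            if isinstance(child, dict):
                if name in seen_dirs:
                    continue  # a directory with this name was already traversed
                seen_dirs.add(name)
                walk(path, child)
            else:
                if name in seen_files:
                    continue  # a file with this name was already loaded
                seen_files.add(name)
                loaded.append(path)

    for base, tree in trees:
        walk(base, tree)
    return loaded


# An empty default.ctags and an empty something.d in the user's directory
# mask the system-wide entries of the same name.
EXAMPLE = [
    ("~/.ctags.d/preload", {"default.ctags": "", "something.d": {}}),
    ("/usr/share/ctags/preload", {
        "default.ctags": "--some-option",
        "other.ctags": "",
        "something.d": {"x.ctags": ""},
    }),
]

if __name__ == "__main__":
    print(preload_plan(EXAMPLE))
```

In the example only the user's empty default.ctags and the system-wide other.ctags end up in the plan; x.ctags is never reached.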

This feature is heavily inspired by systemd.

Long regex flag

The regex parser is made more useful by adding more kinds of flags to the --regex-<LANG> expression. As explained in the ctags.1 man page, b, e and i are defined as flags in exuberant-ctags.

Even if more flags such as x, y, z, ... are added, users may not use them well because they are difficult to memorize. In addition, if many “option libraries” are contributed, we have to maintain them.

For both users and developers, a proliferation of short flags is a nightmare.

So universal-ctags now includes an API for defining long flags, which can be used as aliases for short flags. Long flags require more typing but are more readable.

Here is the mapping between the standard short flag names and long flag names:

  short flag   long flag
  b            basic
  e            extend
  i            icase

Long flags are specified by surrounding them with { and }. For example, a --regex-<LANG> expression that ends with the long flag {icase} is the same as one that ends with the short flag i.

The characters { and } may not be suitable for command line use, but long flags are mostly intended for option libraries.

The notion of long flags is also introduced in the --langdef option.

Exclusive flag in regex

By default, a line read from an input file is matched against all regular expressions defined with --regex-<LANG>. Each regular expression that matches emits a tag.

In some cases another policy, exclusive-matching, is preferable to the all-matching policy. Exclusive-matching means that the remaining regular expressions are not tried once one regular expression has matched.

To specify exclusive-matching, the flags exclusive (long) and x (short) were introduced. They are used in data/optlib/m4.ctags for ignoring a line:


In many uses of the m4 language, comments start with # or dnl. With the above options ctags can ignore define in comments.

If an empty name pattern (//) is found in a --regex-<LANG> option, ctags warns that the option is used incorrectly. However, if the flag exclusive or x is specified, the warning is suppressed. This is an imperfect approach for ignoring text inside comments, but it may be better than nothing. The ghost kind is assigned to the empty name pattern. (See “Ghost kind in regex parser”.)
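
The difference between the two policies can be seen in a short Python model. It is only an illustration: the two rules below are hypothetical stand-ins rather than the patterns used in m4.ctags, and Python's re module takes the place of the regex engine ctags uses.

```python
import re


def tag_lines(lines, rules):
    """Apply `rules` to each line and return the emitted tag names.

    A rule is (pattern, name_template, exclusive). An empty name_template
    emits no tag, mirroring an empty name pattern (//).
    """
    tags = []
    for line in lines:
        for pattern, name, exclusive in rules:
            m = re.search(pattern, line)
            if not m:
                continue
            if name:
                tags.append(m.expand(name))
            if exclusive:
                break  # exclusive-matching: skip the remaining rules
    return tags


LINES = ["# define(foo)", "define(bar)"]
DEFINE = (r"define\((\w+)", r"\1", False)
COMMENT_X = (r"#.*define", "", True)      # hypothetical comment rule, exclusive
COMMENT_ALL = (r"#.*define", "", False)   # same rule without the flag

if __name__ == "__main__":
    print(tag_lines(LINES, [COMMENT_X, DEFINE]))    # comment line is skipped
    print(tag_lines(LINES, [COMMENT_ALL, DEFINE]))  # foo leaks out of the comment
```

The rule order matters: the exclusive rule has to come before the rule it is meant to shadow.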

NOTE: This flag doesn’t make sense in --mline-regex-<LANG>.

Ghost kind in regex parser

If a whitespace is used as a kind letter, it is never printed when ctags is called with --list-kinds option. This kind is automatically assigned to an empty name pattern.

Normally you don’t need to know this.

Passing parameter for long regex flag

In the implemented API, long flags can take parameters. Conceptual example:


Scope tracking in a regex parser

With the scope long flag, you can record and track scope context. A stack is used for tracking the scope context.

{scope=push}

Push the tag captured with a regex pattern onto the top of the stack. If you don’t want to record this tag but just push it, use the placeholder long flag together with it.

{scope=ref}

Refer to the item at the top of the stack as the scope in which the tag captured with a regex pattern resides. The stack is not modified by this specification. If the stack is empty, this flag is just ignored.

{scope=pop}

Pop the item at the top of the stack. If the stack is empty, this flag is just ignored.

{scope=clear}

Make the stack empty.

{scope=set}

Clear, then push.

{placeholder}

Don’t print a tag captured with a regex pattern to the tags file. This is useful when you need to push unnamed context information onto the stack. A well-known unnamed scope in the C language is the one established with {. An unnamed scope never appears in a tags file as a name or a scope name. However, pushing it is important to keep push and pop balanced.

Example 1:

$ cat /tmp/input.foo
class foo:
    def bar(baz):
class goo:
    def gar(gaz):

$ cat /tmp/foo.ctags

$ ~/var/ctags/ctags --options=/tmp/foo.ctags -o - /tmp/input.foo
bar /tmp/input.foo  /^    def bar(baz):$/;" d       class:foo
foo /tmp/input.foo  /^class foo:$/;"        c
gar /tmp/input.foo  /^    def gar(gaz):$/;" d       class:goo
goo /tmp/input.foo  /^class goo:$/;"        c
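
To make the stack handling concrete, here is a small Python model that reproduces the scopes of Example 1. It is a sketch under assumptions: the two rules are hypothetical, written to fit the input shown above, and the action labels ("set" for clear then push, "ref" for referring to the top of the stack, and so on) are just labels inside this sketch. Whether a pushed tag itself also receives a scope is not modelled.

```python
import re

# Hypothetical rules fitted to Example 1:
# (pattern, kind letter, kind name, scope action)
RULES = [
    (r"^class\s+(\w+):", "c", "class", "set"),
    (r"^\s+def\s+(\w+).*:", "d", "definition", "ref"),
]


def tag_with_scope(lines, rules=RULES):
    """Return (name, kind letter, scope) triples in input order."""
    stack, tags = [], []
    for line in lines:
        for pattern, letter, kind, action in rules:
            m = re.search(pattern, line)
            if not m:
                continue
            name, scope = m.group(1), None
            if action in ("set", "clear"):
                stack.clear()
            elif action == "pop" and stack:
                stack.pop()
            elif action == "ref" and stack:
                scope = "%s:%s" % stack[-1]  # the stack itself is untouched
            tags.append((name, letter, scope))
            if action in ("set", "push"):
                stack.append((kind, name))
    return tags


INPUT = [
    "class foo:",
    "    def bar(baz):",
    "class goo:",
    "    def gar(gaz):",
]

if __name__ == "__main__":
    for tag in tag_with_scope(INPUT):
        print(tag)
```

The result lists the tags in input order, whereas the tags output above is sorted by name; the scopes attached to bar and gar are the same.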

Example 2:

$ cat /tmp/input.pp
class foo {
    include bar

$ cat /tmp/pp.ctags

$ ~/var/ctags/ctags --options=/tmp/pp.ctags -o - /tmp/input.pp
bar /tmp/input.pp   /^    include bar$/;"   i       class:foo
foo /tmp/input.pp   /^class foo {$/;"       c

NOTE: Giving a scope long flag implies setting useCork of the parser to TRUE. See cork API.

NOTE: This flag doesn’t work well with --mline-regex-<LANG>=.

Override the letter for file kind

(See also #317.)

Overriding the letter for file kind is not allowed in Universal-ctags. Don’t use F as a kind letter in your parser.

Multiline pattern match

The newly introduced --mline-regex-<LANG>= is similar to --regex-<LANG>, but the pattern is applied to the whole file contents, not line by line.

The next example is based on issue #219, posted by @andreicristianpetcu:

$ cat input.java
public void catchEvent(SomeEvent e)

public void
    recover(Exception e)

$ cat spring.ctags
--mline-regex-javaspring=/@Subscribe([[:space:]])*([a-z ]+)[[:space:]]*([a-zA-Z]*)\(([a-zA-Z]*)/\3-\4/s,subscription/{mgroup=3}

$ ./ctags -o - --options=./spring.ctags input.java
Event-SomeEvent     input.java      /^public void catchEvent(SomeEvent e)$/;"       s       line:2  language:javaspring
recover-Exception   input.java      /^    recover(Exception e)$/;"  s       line:10 language:javaspring

{mgroup=N}

This tells ctags that the pattern should be applied to the whole file contents, not line by line. N is the number of a group in the pattern. The specified group is used to record the line number and the pattern of the tag. In the above example 3 is specified, so the start position of group 3 within the whole file contents is used.

{_advanceTo=N[start|end]}

A pattern is applied to the whole file contents iteratively. This long flag specifies where the pattern should be applied from in the next iteration after the pattern has matched. When a pattern matches, the next pattern application starts from the start or the end of group N. By default it starts from the end of N. If this long flag is not given, 0 is assumed for N.

Consider the following input:

def def abc

Consider two sets of options, foo and bar.

foo.ctags:

  --mline-regex-foo=/def *([a-z]+)/\1/a/{mgroup=1}

bar.ctags:

  --mline-regex-bar=/def *([a-z]+)/\1/a/{mgroup=1}{_advanceTo=1start}

foo.ctags emits the following tags output:

  def  input.foo       /^def def abc$/;"       a

bar.ctags emits the following tags output:

  def  input-0.bar     /^def def abc$/;"       a
  abc  input-0.bar     /^def def abc$/;"       a

_advanceTo=1start is specified in bar.ctags. That allows ctags to capture "abc".

At the first iteration, the patterns of both foo.ctags and bar.ctags match as follows:

  0   1       (start)
  v   v
  def def abc
         ^
         0,1  (end)

“def” at group 1 is captured as a tag in both languages. At the next iteration, the positions where the pattern matching is applied are not the same in the two languages.

foo.ctags:

         0end (default)
         v
  def def abc

bar.ctags:

      1start (as specified in the _advanceTo long flag)
      v
  def def abc

This difference in positions causes the difference in the tags output.
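
The iteration can be replayed with Python's re module. This is a sketch of the advancing rule only, with an invented function name and parameters; the guard that forces progress is an addition of this sketch and is not described in the text.

```python
import re


def mline_tags(text, pattern, advance_group=0, advance_edge="end"):
    """Apply `pattern` to `text` repeatedly and collect group 1 as tag names.

    After each match the next search starts at the start or the end of
    group `advance_group`, mimicking {_advanceTo=N[start|end]}.
    """
    regex = re.compile(pattern)
    tags, pos = [], 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if not m:
            break
        tags.append(m.group(1))
        if advance_edge == "start":
            nxt = m.start(advance_group)
        else:
            nxt = m.end(advance_group)
        # Guard added for this sketch only: always move forward.
        pos = nxt if nxt > pos else pos + 1
    return tags


if __name__ == "__main__":
    text = "def def abc\n"
    print(mline_tags(text, r"def *([a-z]+)"))              # default: 0end
    print(mline_tags(text, r"def *([a-z]+)", 1, "start"))  # 1start
```

With the default the second search starts at the space before abc and finds nothing more; with 1start it starts at the second def, so abc is captured as well.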

NOTE: This flag doesn’t work well with scope related flags and exclusive flags.

Byte oriented pattern matching with multiple regex tables

(This is a highly experimental feature. It will not go into the man page of 6.0.)

The --_tabledef-<LANG> and --_mtable-regex-<LANG> options are experimental, and are for defining a parser using multiple regex tables. The feature is inspired by lex, the fast lexical analyzer generator, which is a popular tool in Unix environments for writing parsers, and by RegexLexer of Pygments. Knowledge of them helps you understand the options.

As usual, let me explain the feature with an example. Consider an imaginary language “X” whose syntax is similar to JavaScript: “var” is used for defining variable(s), and “/* ... */” makes a block comment.


/* BLOCK COMMENT
var dont_capture_me;
*/
var a /* ANOTHER BLOCK COMMENT */, b;

Here ctags should capture a and b. It is difficult to write a parser that ignores dont_capture_me in the comment with a classical regex parser defined with --regex-<LANG>=.

A classical regex parser has no way to know whether the input is in a comment or not.

A classical regex parser is line oriented, so capturing b will be hard.

A parser written with the --_tabledef-<LANG> and --_mtable-regex-<LANG> options (an mtable parser) can capture just a and b.

Here is the 1st version of X.ctags.


Not so interesting.

When writing an mtable parser, you have to think about the states that parsing requires. For this input, the parser should have the following states.

  • toplevel (initial state)
  • comment (inside comment)
  • vars (var statements)

Before enumerating regular expressions, you have to declare a table for each state with the --_tabledef-<LANG>=<TABLE> option:

Here is the 2nd version of X.ctags.



In a table name, the characters in [0-9a-zA-Z_] are acceptable. An mtable parser chooses the first table for each new input. In X.ctags, toplevel is that table.

--_mtable-regex-<LANG> is an option for adding a regex pattern to a table.


The parameters for --_mtable-regex-<LANG> look complicated. However, <PATTERN>, <NAME>, and <KIND> are the same as the parameters of --regex-<LANG>. <TABLE> is the name of a table defined with the --_tabledef-<LANG> option.

A regex added to a parser with --_mtable-regex-<LANG> is matched against the input at the current byte position, not against a line. Even if you do not specify ^ at the start of the pattern, ctags adds ^ to the pattern automatically. Unlike in the --regex-<LANG> option, ^ does not mean “beginning of line” in --_mtable-regex-<LANG>; it means the current byte position.

Skipping block comments

The most interesting part is LONGFLAGS.

Here is the 3rd version of X.ctags.





Four --_mtable-regex-X lines are added for skipping the block comment.

Let’s look at them one by one.

For new input, ctags chooses the first pattern of the first table of the parser.


A pattern for /* is added to the toplevel table. It tells ctags where a block comment starts. Backslashes are used to prevent the characters / and * from being evaluated as meta characters. The last // means ctags should not tag /*. tenter is a long flag for switching the table. {tenter=comment} means “switch the table from toplevel to comment”.

ctags chooses the first pattern of the new table of the parser.


A pattern for */ tells ctags that */ is the end of block comment.


/* BLOCK COMMENT
var dont_capture_me;
*/
var a /* ANOTHER BLOCK COMMENT */, b;

The pattern doesn’t match at the position just after /*, because the character at that position is a whitespace. So ctags tries the next pattern in the same table.


This pattern matches any one byte; the current position moves one byte forward. Now the character at the current position is B. The first pattern of the table, the one for */, still does not match the input, so ctags uses the next pattern again. This repeats until the current position reaches the */ on the 3rd line of the input.


The pattern finally matches the input. In this pattern, {tleave} is specified. This triggers table switching again. {tleave} makes ctags switch back to the table that was in use before {tenter}. In this case, toplevel is that table. ctags manages a stack that holds references to tables. {tenter} pushes the current table onto the stack. {tleave} pops the table at the top of the stack and chooses it.


This version of X.ctags does nothing more; toplevel table ignores all other than the comment starter.

Capturing variables in a sequence

Here is the 4th version of X.ctags.



--_mtable-regex-X=toplevel/var[ \n\t]//{tenter=vars}



One pattern is added to toplevel and four patterns are added to vars.

--_mtable-regex-X=toplevel/var[ \n\t]//{tenter=vars}

The first pattern, added to toplevel, is intended to switch to the vars table when the var keyword is found in the input stream.


vars table is for capturing variables. vars table is used till ; is found.


Block comments can be in variable definitions:


To skip a block comment in such a position, the pattern for /* is matched in the vars table as well.


This is nothing special: capturing a variable name as variable kind tag.


This makes ctags ignore the rest like ,.


$ cat input.x
/* BLOCK COMMENT
var dont_capture_me;
*/
var a /* ANOTHER BLOCK COMMENT */, b;

$ u-ctags -o - --fields=+n --options=X.ctags input.x
a       input.x /^var a \/* ANOTHER BLOCK COMMENT *\/, b;$/;"   v       line:4
b       input.x /^var a \/* ANOTHER BLOCK COMMENT *\/, b;$/;"   v       line:4


See the puppetManifest parser for a serious example. It is the primary parser for testing the mtable meta parser.
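
For readers who want to trace the table switching by hand, the behaviour described in this section can be imitated in Python. This is a sketch under assumptions, not the X.ctags file itself: the table contents are inferred from the prose above, the identifier pattern and the four-line input are assumed, and Python's re module stands in for the regex engine ctags uses.

```python
import re

# Tables inferred from the description: (pattern, capture group or None, action).
# An action is ("tenter", table), ("tleave",) or None.
TABLES = {
    "toplevel": [
        (r"/\*", None, ("tenter", "comment")),
        (r"var[ \n\t]", None, ("tenter", "vars")),
        (r".", None, None),
    ],
    "comment": [
        (r"\*/", None, ("tleave",)),
        (r".", None, None),
    ],
    "vars": [
        (r";", None, ("tleave",)),
        (r"/\*", None, ("tenter", "comment")),
        (r"([a-zA-Z_][a-zA-Z0-9_]*)", 1, None),  # assumed identifier syntax
        (r".", None, None),
    ],
}

INPUT = (
    "/* BLOCK COMMENT\n"
    "var dont_capture_me;\n"
    "*/\n"
    "var a /* ANOTHER BLOCK COMMENT */, b;\n"
)


def run_mtable(text, tables=TABLES, first="toplevel"):
    """Return (name, line number) pairs captured while walking `text`."""
    table, stack, pos, tags = first, [], 0, []
    while pos < len(text):
        for pattern, group, action in tables[table]:
            # match() anchors at pos, like the implicit ^ described above.
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            if not m:
                continue
            if group is not None:
                line = text.count("\n", 0, m.start(group)) + 1
                tags.append((m.group(group), line))
            if action and action[0] == "tenter":
                stack.append(table)      # remember where to come back to
                table = action[1]
            elif action and action[0] == "tleave":
                table = stack.pop() if stack else first
            pos = m.end()
            break
        else:
            break  # nothing matched here; stopping is a choice of this sketch
    return tags


if __name__ == "__main__":
    print(run_mtable(INPUT))
```

Because the comment table never looks for var, dont_capture_me is skipped, while a and b are both reported on line 4.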

Conditional tagging with extras

If a pattern matching should be done only when an extra is enabled, mark a pattern with {_extra=XNAME}. Here XNAME is the name of extra. You must define XNAME with --extradef-<LANG>=XNAME,DESCRIPTION option before defining a pattern marked {_extra=XNAME}.

if __name__ == '__main__':

To capture the above line in a Python program (input.py), an extra can be used.

--extradef-Python=main,__main__ entry points
--regex-Python=/^if __name__ == '__main__':/__main__/f/{_extra=main}

The above optlib(python-main.ctags) introduces main extra to Python parser. The pattern matching is done only when the main is enabled.

$ ./ctags --options=python-main.ctags -o - --extras-Python='+{main}' input.py
__main__        input.py        /^if __name__ == '__main__':$/;"        f

Attaching parser own fields

Exuberant-ctags allows one of the groups in a regex pattern to be used as part of the name of a tagEntry. Universal-ctags additionally offers ways to use the other groups in the regex pattern.

An optlib parser can have its own fields. The groups can be used as a value of the fields of a tagEntry.

Let’s think about Unknown, an imaginary language. Here is a source file(input.unknown) written in Unknown:

public func foo(n, m);
protected func bar(n);
private func baz(n,...);

With --regex-Unknown=... Exuberant-ctags can capture foo, bar, and baz as names. Universal-ctags can attach extra context information to the names as values for fields. Let’s focus on bar. protected is a keyword that controls how widely the identifier bar can be accessed. (n) is the parameter list of bar. protected and (n) are extra context information for bar.

With the following optlib file (unknown.ctags), ctags can attach protected to the protection field and (n) to the signature field.


--_fielddef-unknown=protection,access scope

--regex-unknown=/^((public|protected|private) +)?func ([^\(]+)\((.*)\)/\3/f/{_field=protection:\1}{_field=signature:(\4)}


For the line protected func bar(n); you will get the following tags output:

bar     input.unknown   /^protected func bar(n);$/;"    f       protection:protected    signature:(n)
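
A quick way to see which group carries which piece of text is to run the same pattern through Python's re module. This is only a sketch: it assumes Python's regex dialect behaves like the one ctags uses for this particular pattern, and the function tag_for and the dictionary layout are invented for illustration.

```python
import re

# The pattern from unknown.ctags, reused as written.
PATTERN = re.compile(r"^((public|protected|private) +)?func ([^\(]+)\((.*)\)")


def tag_for(line):
    """Return the name and field values the pattern yields for `line`."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "name": m.group(3),
        # Group 1 also spans the spaces after the keyword, so this sketch
        # strips them; group 2 would hold the bare keyword.
        "protection": (m.group(1) or "").strip(),
        "signature": "(" + m.group(4) + ")",
    }


if __name__ == "__main__":
    for line in ("public func foo(n, m);",
                 "protected func bar(n);",
                 "private func baz(n,...);"):
        print(tag_for(line))
```

The optional first group is missing when a line starts directly with func, which is why the sketch falls back to an empty string there.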

Let’s see the detail of unknown.ctags.

--_fielddef-unknown=protection,access scope

--_fielddef-<LANG>=name,description defines a new field for the parser specified by <LANG>. Before defining a new field for the parser, the parser must be defined with --langdef=<LANG>. protection is the field name used in tags output. access scope is the description used in the output of --list-fields and --list-fields=Unknown.


This defines a field named signature.

--regex-unknown=/^((public|protected|private) +)?func ([^\(]+)\((.*)\)/\3/f/{_field=protection:\1}{_field=signature:(\4)}

This option requests making a tag for the name specified by group 3 of the pattern, attaching group 1 to the tag as the value of the protection field, and attaching group 4 to the tag as the value of the signature field. You can use the long regex flag _field for attaching fields to a tag with the following notation:

{_field=FIELDNAME:GROUP}


--fields-<LANG>=[+|-]{FIELDNAME} can be used to enable or disable the specified field.

A newly defined parser own field is disabled by default. Enable the field explicitly to use it. See Parser own fields about the --fields-<LANG> option.

The passwd parser is a simple example that uses the --fields-<LANG> option.

Submitting an optlib to universal-ctags project

You are welcome.

universal-ctags provides a facility for “Option library”. Read “Option library” about the concept and usage first.

Here I will explain how to merge your .ctags into universal-ctags as part of the option library. I assume you are considering contributing an option library in which a regex based language parser is defined. See How to Add Support for a New Language to Exuberant Ctags (EXTENDING) about how to write a regex based language parser. In this section I explain the next step.

I use Swine as the name of programming language which your parser deals with. Assume source files written in Swine language have a suffix .swn. The file name of option library is swine.ctags.

Units test cases

We, the universal-ctags developers, don’t have enough time to learn all the languages supported by ctags. In other words, we cannot review the code. Only test cases help us know whether a contributed option library works well or not. We may reject any contribution without a test case.

Read “Using Units” about how to write Units test cases. Don’t write one big test case. Some smaller cases are helpful to know about the intent of the contributor.

  • Units/sh-alias.d
  • Units/sh-comments.d
  • Units/sh-quotes.d
  • Units/sh-statements.d

are good examples of small test cases. Big test cases are good if smaller test cases exist.

See also parser-m4.r/m4-simple.d especially parser-m4.r/m4-simple.d/args.ctags. Your test cases need ctags having already loaded your option library, swine.ctags. You must specify loading it in the test case own args.ctags.

Assume your test name is swine-simile.d. Put --options=swine in Units/swine-simile.d/args.ctags.


Add your optlib file, swine.ctags to PRELOAD_OPTLIB variable of Makefile.in.

If you don’t want your optlib loaded automatically when ctags starts up, put your optlib file in OPTLIB of Makefile.in instead of PRELOAD_OPTLIB.


Let’s verify all your work here.

  1. Run the tests and check whether your test case passes or fails:

    $ make units
  2. Verify your files are installed as expected:

    $ mkdir /tmp/tmp
    $ ./configure --prefix=/tmp/tmp
    $ make
    $ make install
    $ /tmp/tmp/bin/ctags -o - --options=swine something_input.swn


Remember that your .ctags is a treasure and can be shared as a first-class software component in universal-ctags. Again, pull requests are welcome.