License: GPL v3

About

ParaFin is a collection of Finnish nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.

The data is encoded in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.
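As a minimal sketch of how the CSV tables might be consumed, the snippet below groups phonemic forms by lexeme and cell using only the Python standard library. The column names (`form_id`, `lexeme`, `cell`, `orth_form`, `phon_form`) follow our understanding of the Paralex forms table and should be checked against the actual files. The sample rows and cell labels are hypothetical and are not taken from ParaFin.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the shape of a Paralex forms table.
# Assumption: phonemic forms are stored as space-separated segments.
SAMPLE = """form_id,lexeme,cell,orth_form,phon_form
1,talo,nom.sg,talo,t ɑ l o
2,talo,gen.sg,talon,t ɑ l o n
3,katu,nom.sg,katu,k ɑ t u
"""


def load_paradigms(handle):
    """Group phonemic forms by lexeme, then by cell.

    Each cell maps to a list, because a cell may hold several forms.
    """
    paradigms = defaultdict(dict)
    for row in csv.DictReader(handle):
        cell_forms = paradigms[row["lexeme"]].setdefault(row["cell"], [])
        cell_forms.append(row["phon_form"])
    return dict(paradigms)


paradigms = load_paradigms(io.StringIO(SAMPLE))
print(paradigms["talo"]["gen.sg"])
```

To read a real table, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to `load_paradigms`.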

Please cite as:

  • Jules Bouton. ParaFin: Finnish Paradigms in Phonemic Notation. 2024. doi:10.5281/zenodo.13736131.
  • Jules Bouton. Towards standardized inflected lexicons for the Finnic languages. In Proceedings of the Ninth International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2024). Helsinki, Finland, 2024. Association for Computational Linguistics. (forthcoming)

The data can be downloaded from Zenodo or from the GitLab repository.

How this lexicon was prepared

We selected the 5000 most frequent nouns according to the LASTU dataset and produced their inflectional paradigms with the Omorfi software. We used Epitran rules to convert these paradigms into phonemic notation, and we verified the output extensively by hand. The Epitran rules take annotated orthographic forms as input, where:

  • - a minus indicates orthographic hyphen boundaries in compounds
  • # a hash indicates boundaries in compounds (except when written with a hyphen, see above)
  • + a plus indicates the end of the immutable part of the stem
  • ˣ a superscript x indicates a morph that triggers sandhi
  • ' a straight apostrophe indicates an intervocalic glottal feature
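The marker inventory above can be handled with a few lines of Python. The sketch below is not part of the ParaFin pipeline. It assumes that `#`, `+` and `ˣ` are annotation-only symbols, while the hyphen and the apostrophe also belong to ordinary spelling and are therefore kept. The function names and the example form are hypothetical.

```python
# Marker descriptions are taken from the list above.
MARKERS = {
    "-": "orthographic hyphen boundary in a compound",
    "#": "compound boundary (not written with a hyphen)",
    "+": "end of the immutable part of the stem",
    "ˣ": "morph that triggers sandhi",
    "'": "intervocalic glottal feature",
}

# Assumption: these symbols never occur in the plain orthographic form.
ANNOTATION_ONLY = {"#", "+", "ˣ"}


def strip_annotations(form: str) -> str:
    """Remove annotation-only markers, keeping hyphens and apostrophes."""
    return "".join(ch for ch in form if ch not in ANNOTATION_ONLY)


def list_markers(form: str):
    """Return (position, marker, description) for every marker in the form."""
    return [(i, ch, MARKERS[ch]) for i, ch in enumerate(form) if ch in MARKERS]


example = "vesi#pullo+ssa"  # hypothetical annotated form, not from the dataset
print(strip_annotations(example))  # vesipullossa
for position, marker, meaning in list_markers(example):
    print(position, marker, meaning)
```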

Finally, we enriched the dataset with annotations for overabundance, defectivity, cells and features.
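To illustrate what such annotations capture, here is a minimal sketch that labels each cell of a paradigm by its number of distinct forms: none (defective), one, or several (overabundant). The tag names, the cell labels and the criteria are assumptions made for this example. The tags and criteria actually used in ParaFin may differ.

```python
def tag_cells(paradigm, all_cells):
    """Label each expected cell as 'defective', 'overabundant' or 'regular'.

    paradigm maps a cell label to the list of forms attested for it.
    all_cells lists every cell that a complete paradigm would fill.
    """
    tags = {}
    for cell in all_cells:
        distinct_forms = set(paradigm.get(cell, []))
        if not distinct_forms:
            tags[cell] = "defective"
        elif len(distinct_forms) > 1:
            tags[cell] = "overabundant"
        else:
            tags[cell] = "regular"
    return tags


# Illustrative input with hypothetical cell labels.
paradigm = {
    "nom.sg": ["omena"],
    "gen.pl": ["omenien", "omenoiden"],
}
print(tag_cells(paradigm, ["nom.sg", "gen.pl", "com.pl"]))
```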

Summary

flowchart LR
    A[(LASTU)]:::start ==> Z(Frequent
                            lexemes)
    A -.->|Token frequencies| F
    Z ===> O
    O{{Omorfi}}:::start ==> B

    B(Orthographic
            paradigms) ==> X
    E[["🖋 G2P rules"]]:::add -.-> X
    X{{Epitran}}:::start ==> C
    C(Phonemic
        paradigms) ==> D[(Paralex
            dataset)]:::aim
    F[["🖋 Rich annotations"]]:::add --> D

classDef start stroke:#f00
classDef aim stroke:#090
classDef add stroke:#ffa10a

How to re-generate the data

To ensure replicability, the package can be rebuilt from the sources by running the following commands:

$ git clone https://gitlab.com/finnic-morpho/parafin.git
$ cd parafin
$ make all

Please note that some tables (such as cells, features and tags) need to be created manually and are required to build the other tables. The different steps of the process are detailed below.

Getting the sources

You should first clone the git repository:

$ git clone https://gitlab.com/finnic-morpho/parafin.git
$ cd parafin

We then create a virtual environment with the required dependencies:

$ make venv

We need to insert our transcription rules in the right place:

$ make epitran

For the frequencies, we download a dataset from the Finnish Parsebank:

$ make osf

Then, to generate the forms, we recover two internal files from omorfi:

$ make omorfi

Building the dataset

We build the lexemes and forms table:

$ make parse

We evaluate the transcription on dev forms:

$ make evaluate
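For readers who want a feel for this kind of check, the sketch below computes exact-match accuracy between predicted and manually verified transcriptions, and lists the disagreements. It is not the evaluation script behind `make evaluate`. The function names and the sample forms are hypothetical.

```python
def exact_match_accuracy(predicted, gold):
    """Share of predicted transcriptions that are identical to the gold ones."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def mismatches(orth_forms, predicted, gold):
    """Return (orthographic, predicted, gold) triples where the two differ."""
    return [
        (o, p, g)
        for o, p, g in zip(orth_forms, predicted, gold)
        if p != g
    ]


# Hypothetical dev forms; the second prediction is deliberately wrong.
orth = ["talo", "katu"]
pred = ["t ɑ l o", "k ɑ d u"]
gold = ["t ɑ l o", "k ɑ t u"]
print(exact_match_accuracy(pred, gold))
print(mismatches(orth, pred, gold))
```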

Phonological transcription:

$ make transcription

Frequencies are extracted from the Finnish Parsebank:

$ make frequencies

Overabundant and defective forms are tagged:

$ make tag

Packaging & Validation

We produce Frictionless metadata:

$ make metadata

We check conformity with the Paralex standard:

$ make validate

It is possible to export a random sample (with a fixed seed) for manual verification:

$ make sample
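The idea of a fixed-seed sample can be sketched as follows. This is a generic illustration, not the code behind `make sample`. The function name, the default seed and the sample size are assumptions.

```python
import random


def reproducible_sample(rows, k, seed=0):
    """Draw up to k rows without replacement, using a dedicated seeded generator.

    A private random.Random instance keeps the draw independent of any
    other use of the random module, so the same seed yields the same sample.
    """
    rows = list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical row identifiers standing in for table rows.
first = reproducible_sample(range(1000), 5, seed=42)
second = reproducible_sample(range(1000), 5, seed=42)
print(first == second)  # True
```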

References

This dataset is derived from the Omorfi HFST. See:

  • Tommi A. Pirinen, Inari Listenmaa, Ryan Johnson, Francis M. Tyers, and Juha Kuokkala. Open morphology of Finnish. University of Helsinki, 2017.
  • Tommi A Pirinen. Development and Use of Computational Morphology of Finnish in the Open Source and Open Science Era: Notes on Experiences with Omorfi Development. SKY Journal of Linguistics, 28:381–393, 2015.

The frequencies are from the LASTU software Finnish dataset, derived from the Finnish Parsebank:

  • Sami Itkonen, Tuomo Häikiö, Seppo Vainio, and Minna Lehtonen. LASTU: A psycholinguistic search tool for Finnish lexical stimuli. Behavior Research Methods, 56(6):6165–6178, 2024. doi:10.3758/s13428-024-02347-x.
  • Juhani Luotolahti, Jenna Kanerva, Veronika Laippala, Sampo Pyysalo, and Filip Ginter. Towards Universal Web Parsebanks. In Joakim Nivre and Eva Hajičová, editors, Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), 211–220. Uppsala, Sweden, 2015. Uppsala University, Uppsala, Sweden.

The transcriptions were built with a modified version of Epitran's Finnish module:

  • David R. Mortensen, Siddharth Dalmia, and Patrick Littell. Epitran: Precision G2P for Many Languages. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2711–2714. Miyazaki, 2018. European Language Resources Association.