About
ParaFin is a collection of Finnish nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.
The data is encoded in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.
Please cite as:
- Jules Bouton. ParaFin: Finnish Paradigms in Phonemic Notation. 2024. doi:10.5281/zenodo.13736131.
- Jules Bouton. Towards standardized inflected lexicons for the Finnic languages. In Proceedings of the Ninth International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2024). Helsinki, Finland, 2024. Association for Computational Linguistics. (forthcoming)
The data can be downloaded from zenodo or from the gitlab repository.
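As a rough illustration of how the CSV tables can be consumed, the sketch below groups forms into paradigms using only the Python standard library. The column names (`form_id`, `lexeme`, `cell`, `orth_form`, `phon_form`) reflect our understanding of the Paralex forms table and should be checked against the actual files. The sample rows, cell labels, and transcriptions are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows standing in for a Paralex-style forms table.
# Column names and cell labels are assumptions, not copied from ParaFin.
SAMPLE = """form_id,lexeme,cell,orth_form,phon_form
1,talo,nom.sg,talo,t a l o
2,talo,ine.sg,talossa,t a l o s s a
3,katu,nom.sg,katu,k a t u
"""


def load_paradigms(handle):
    """Group phonemic forms by lexeme and cell.

    Returns a mapping lexeme -> cell -> list of forms; a list is used
    because a cell may hold more than one form.
    """
    paradigms = defaultdict(dict)
    for row in csv.DictReader(handle):
        cell_forms = paradigms[row["lexeme"]].setdefault(row["cell"], [])
        cell_forms.append(row["phon_form"])
    return paradigms


paradigms = load_paradigms(io.StringIO(SAMPLE))
for lexeme, cells in sorted(paradigms.items()):
    print(lexeme, cells)
```

To read a real table, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample; the path depends on how the dataset was downloaded.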
How this lexicon was prepared
We selected the 5000 most frequent nouns according to the LASTU dataset and produced their inflectional paradigms with the Omorfi software. We used Epitran rules to convert these paradigms into phonemic notation, and we verified the output extensively by hand. The input to the Epitran rules consists of annotated orthographic forms, where:
- a minus (-) indicates orthographic hyphen boundaries in compounds
- a hash (#) indicates boundaries in compounds (except when written with a hyphen, see above)
- a plus (+) indicates the end of the immutable part of the stem
- a superscript x (ˣ) indicates a morph that triggers sandhi
- a straight apostrophe (') indicates an intervocalic glottal feature
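The marker inventory above can be read mechanically. The following sketch locates the markers in an annotated form and splits it at compound boundaries. The function names and the example string `maa#ilma+ssa` are hypothetical and chosen for illustration only; they are not taken from the ParaFin sources.

```python
import re

# Marker inventory as described in the list above.
MARKERS = {
    "-": "orthographic hyphen boundary in a compound",
    "#": "compound boundary (not written with a hyphen)",
    "+": "end of the immutable part of the stem",
    "ˣ": "morph that triggers sandhi",
    "'": "intervocalic glottal feature",
}


def find_markers(annotated):
    """Return (position, marker) pairs for each annotation marker."""
    return [(i, ch) for i, ch in enumerate(annotated) if ch in MARKERS]


def split_compound(annotated):
    """Split an annotated form at compound boundaries ('-' or '#')."""
    return re.split(r"[-#]", annotated)


# Hypothetical annotated form, invented for this example.
example = "maa#ilma+ssa"
for position, marker in find_markers(example):
    print(position, marker, MARKERS[marker])
print(split_compound(example))
```

Whether a given marker should be dropped or kept when recovering the plain orthographic form is not specified here, so the sketch only reports and splits on the markers.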
Finally, we enriched the dataset with annotations for overabundance, defectivity, cells and features.
Summary
flowchart LR
A[(LASTU)]:::start ==> Z(Frequent lexemes)
A -.->|Token frequencies| F
Z ===> O
O{{Omorfi}}:::start ==> B
B(Orthographic paradigms) ==> X
E[["🖋 G2P rules"]]:::add -.-> X
X{{Epitran}}:::start ==> C
C(Phonemic paradigms) ==> D[(Paralex dataset)]:::aim
F[["🖋 Rich annotations"]]:::add --> D
classDef start stroke:#f00
classDef aim stroke:#090
classDef add stroke:#ffa10a
How to re-generate the data
To ensure replicability, the package can be rebuilt from the sources by running the following commands:
$ git clone https://gitlab.com/finnic-morpho/parafin.git
$ cd parafin
$ make all
Please note that some tables (such as cells, features, and tags) must be created manually and are required to build the other tables. The different steps of the process are detailed below.
Getting the sources
You should first clone the git repository:
$ git clone https://gitlab.com/finnic-morpho/parafin.git
$ cd parafin
We first create a virtual environment with the required dependencies:
$ make venv
We need to insert our transcription rules in the right place:
$ make epitran
For the frequencies, we download a dataset from the Finnish Parsebank:
$ make osf
Then, to generate the forms, we recover two internal files from omorfi:
$ make omorfi
Building the dataset
We build the lexemes and forms table:
$ make parse
Evaluating the transcription on dev forms:
$ make evaluate
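The metric reported by this step is not stated here. One simple possibility, shown as a sketch only, is exact-match accuracy between predicted and reference transcriptions. The function name and the toy transcriptions are assumptions made for the example.

```python
def exact_match_accuracy(predicted, gold):
    """Share of forms whose predicted transcription equals the reference."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy transcriptions, invented for illustration: the third one differs.
predicted = ["t a l o", "k a t u", "v e s i"]
gold = ["t a l o", "k a t u", "ʋ e s i"]
accuracy = exact_match_accuracy(predicted, gold)
print(f"{accuracy:.2f}")
```

A segment-level measure such as edit distance would be more lenient toward near-misses; exact match is used here only because it is the simplest to state.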
Phonological transcription:
$ make transcription
Frequencies are extracted from the Finnish Parsebank:
$ make frequencies
Overabundant and defective forms are tagged:
$ make tag
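To make the notion of overabundance concrete: a cell is overabundant when a lexeme has more than one distinct form for it. The sketch below detects such cells from (lexeme, cell, form) triples. It is a minimal illustration, not the tagging procedure used by `make tag`, and the cell labels are assumptions.

```python
from collections import defaultdict


def overabundant_cells(rows):
    """Return the (lexeme, cell) pairs filled by more than one distinct form."""
    forms = defaultdict(set)
    for lexeme, cell, form in rows:
        forms[(lexeme, cell)].add(form)
    return {key for key, variants in forms.items() if len(variants) > 1}


# The genitive plural of "omena" has several variants in Finnish;
# the cell labels used here are placeholders.
rows = [
    ("omena", "nom.sg", "omena"),
    ("omena", "gen.pl", "omenien"),
    ("omena", "gen.pl", "omenoiden"),
]
print(overabundant_cells(rows))
```

Defectivity is the converse situation (a cell with no form) and would need the full list of expected cells, which this sketch does not assume.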
Packaging & Validation
We produce Frictionless metadata:
$ make metadata
We check conformity with the Paralex standard:
$ make validate
It is possible to export a random sample (with a fixed seed) for manual verification:
$ make sample
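Fixing the seed makes the sample reproducible, so the same forms can be re-checked later. A minimal sketch of the idea follows; the function name, seed value, and sample size are assumptions and do not describe what `make sample` does internally.

```python
import random


def fixed_seed_sample(items, k, seed=0):
    """Draw up to k items without replacement, reproducibly for a given seed."""
    pool = list(items)
    rng = random.Random(seed)  # private generator, leaves global state untouched
    return rng.sample(pool, min(k, len(pool)))


# Placeholder identifiers standing in for rows of the forms table.
form_ids = [f"form_{i}" for i in range(100)]
first = fixed_seed_sample(form_ids, 5, seed=42)
second = fixed_seed_sample(form_ids, 5, seed=42)
print(first == second)
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the sampling independent of any other random calls in the same program.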
References
This dataset is derived from the Omorfi HFST. See:
- Tommi A. Pirinen, Inari Listenmaa, Ryan Johnson, Francis M. Tyers, and Juha Kuokkala. Open morphology of Finnish. University of Helsinki, 2017.
- Tommi A Pirinen. Development and Use of Computational Morphology of Finnish in the Open Source and Open Science Era: Notes on Experiences with Omorfi Development. SKY Journal of Linguistics, 28:381–393, 2015.
The frequencies are from the LASTU software Finnish dataset, derived from the Finnish Parsebank:
- Sami Itkonen, Tuomo Häikiö, Seppo Vainio, and Minna Lehtonen. LASTU: A psycholinguistic search tool for Finnish lexical stimuli. Behavior Research Methods, 56(6):6165–6178, 2024. doi:10.3758/s13428-024-02347-x.
- Juhani Luotolahti, Jenna Kanerva, Veronika Laippala, Sampo Pyysalo, and Filip Ginter. Towards Universal Web Parsebanks. In Joakim Nivre and Eva Hajičová, editors, Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), 211–220. Uppsala, Sweden, 2015. Uppsala University, Uppsala, Sweden.
The transcriptions were built with a modified version of Epitran's Finnish module:
- David R. Mortensen, Siddharth Dalmia, and Patrick Littell. Epitran: Precision G2P for Many Languages. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2711–2714. Miyazaki, 2018. European Language Resources Association.