Title: | Classify Multilingual Labour Market Free-Text to Standardized Hierarchical Occupations |
---|---|
Description: | Allows the user to map multilingual free-text of occupations to a broad range of standardized classifications. The package facilitates automatic occupation coding (see, e.g., Gweon et al. (2017) <doi:10.1515/jos-2017-0006> and Turrell et al. (2019) <doi:10.3386/w25837>), where the ISCO to ESCO mapping is exploited to extend the occupations hierarchy, Le Vrang et al. (2014) <doi:10.1109/mc.2014.283>. Document vectorization is performed using the multilingual ESCO corpus. A method based on the nearest neighbor search is used to suggest the closest ISCO occupation. |
Authors: | Alexandros Kouretsis [aut, cre], Andreas Bampouris [aut], Petros Morfiris [aut], Konstantinos Papageorgiou [aut], Stavros Ladas [ctb], Athanassios Siaperas [ctb], Philippe Tissot [ctb], Nikos Vaslamatzis [ctb], Eworx S.A [cph] |
Maintainer: | Alexandros Kouretsis <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1.9000 |
Built: | 2024-11-12 04:40:56 UTC |
Source: | https://github.com/eworx-org/labourr |
This function takes advantage of the hierarchical structure of the ESCO-ISCO mapping and matches multilingual free-text with the ESCO occupations vocabulary in order to map semi-structured vacancy data into the official ESCO-ISCO classification.
classify_occupation( corpus, id_col = "id", text_col = "text", lang = "en", num_leaves = 10, isco_level = 3, max_dist = 0.1, string_dist = NULL )
corpus |
A data.frame or a data.table that contains the id and the text variables. |
id_col |
The name of the id variable. |
text_col |
The name of the text variable. |
lang |
The language that the text is in. |
num_leaves |
The number of occupations/neighbors that are kept when matching. |
isco_level |
The ISCO level of the suggested occupations. Can be either 1, 2, 3, 4 for ISCO occupations, or NULL that returns ESCO occupations. |
max_dist |
String distance used for fuzzy matching. The |
string_dist |
String dissimilarity measurement. Available string distance metrics: |
First, the input text is cleansed and tokenized. The tokens are then matched with the ESCO occupations vocabulary, created from the preferred and alternative labels of the occupations. They are joined with the tf-idf weighted tokens of the ESCO occupations, and the sum of the tf-idf scores is used to retrieve the suggested occupations. Technically speaking, the suggested ESCO occupations are retrieved by solving the optimization problem

$$\arg\max_{j} \left( \vec{u}_{\text{binary}} \cdot \vec{u}_{j} \right),$$

where $\vec{u}_{\text{binary}}$ stands for the binary representation of a query in the ESCO-vocabulary space, while $\vec{u}_{j}$ is the normalized vector of the $j$-th ESCO occupation generated by the tf-idf numerical statistic.
If an ISCO level is specified, the k-nearest neighbors algorithm is used to determine the suggested occupation, classified by a plurality vote in the corresponding hierarchical level of its neighbors.
Before the suggestions are returned, the preferred label of each suggested occupation is added to the result, using occupations_bundle and isco_occupations_bundle as look-up tables.
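The scoring and voting steps described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the tf-idf weights, the occupation names (`occ_a`, `occ_b`, `occ_c`) and their ISCO groups are made-up toy values, and the real function additionally supports fuzzy token matching.

```r
# Toy tf-idf weights (assumed values): rows are hypothetical occupations,
# columns are vocabulary terms; each row has unit Euclidean norm.
tfidf <- matrix(
  c(0.8, 0.6, 0.0, 0.0,
    0.6, 0.0, 0.8, 0.0,
    0.0, 0.0, 0.0, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("occ_a", "occ_b", "occ_c"),
    c("architect", "engineer", "landscape", "cashier")
  )
)

# Hypothetical 3-digit ISCO groups of the three occupations.
isco_group <- c(occ_a = "216", occ_b = "216", occ_c = "523")

# Binary representation of the query in the vocabulary space:
# 1 if the vocabulary term occurs among the query tokens, 0 otherwise.
query_tokens <- c("junior", "architect", "engineer")
u_binary <- as.numeric(colnames(tfidf) %in% query_tokens)

# Score of each occupation = sum of the tf-idf weights of the matched terms.
scores <- drop(tfidf %*% u_binary)

# Keep the k best-scoring occupations (the role played by num_leaves).
k <- 2
neighbours <- names(sort(scores, decreasing = TRUE))[seq_len(k)]

# Plurality vote over the ISCO groups of the retained neighbours.
votes <- table(isco_group[neighbours])
suggested <- names(votes)[which.max(votes)]
suggested
```

With `isco_level = NULL` the analogue of `neighbours` would be returned directly; with an ISCO level set, only the voted group is kept.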
Either a data.table with the id, the preferred label and the suggested ESCO occupation URIs (num_leaves predictions for each id), or a data.table with the id, the preferred label and the suggested ISCO group at the requested level (one for each id).
van der Loo, M. P. J. (2014). The stringdist package for approximate string matching. R Journal, 6(1), 111-122.
Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M., & Steiner, S. (2017). Three Methods for Occupation Coding Based on Statistical Learning, Journal of Official Statistics, 33(1), 101-122.
Turrell, A., Speigner, B. J., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings.
ESCO Service Platform - The ESCO Data Model documentation
corpus <- data.frame(
  id = 1:3,
  text = c(
    "Junior Architect Engineer",
    "Cashier at McDonald's",
    "Priest at St. Martin Catholic Church"
  )
)
classify_occupation(corpus = corpus, isco_level = 3, lang = "en", num_leaves = 5)
This function cleanses text by removing escape characters, non-alphanumeric characters, overly long words and excess whitespace, and by converting all letters to lower case.
cleansing_corpus( text, escape_chars = TRUE, nonalphanum = TRUE, longwords = TRUE, whitespace = TRUE, tolower = TRUE )
text |
Character vector of free text to be cleansed. |
escape_chars |
If TRUE, removes escape characters for |
nonalphanum |
If TRUE, removes non-alphanumeric characters. |
longwords |
If TRUE, removes words with more than 35 characters. |
whitespace |
If TRUE, removes excess whitespace. |
tolower |
If TRUE, converts letters to lower case. |
A character vector of the cleansed text.
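The cleansing steps listed in the arguments can be approximated with base R regular expressions. The sketch below is an assumption about how such a pipeline might look, not the package's source: `cleanse_sketch` is a hypothetical helper, and the exact patterns and the order of the steps in `cleansing_corpus` may differ.

```r
# Hypothetical stand-in for the cleansing steps; not the package's code.
cleanse_sketch <- function(text, max_word_len = 35L) {
  # Replace escape characters (carriage return, newline, tab) with a space.
  text <- gsub("[\r\n\t]", " ", text)
  # Replace everything that is neither alphanumeric nor whitespace.
  text <- gsub("[^[:alnum:][:space:]]", " ", text)
  # Drop words longer than max_word_len characters.
  long_word <- sprintf("\\b\\w{%d,}\\b", as.integer(max_word_len) + 1L)
  text <- gsub(long_word, " ", text, perl = TRUE)
  # Collapse runs of whitespace and trim both ends.
  text <- trimws(gsub("\\s+", " ", text))
  # Convert to lower case.
  tolower(text)
}

cleansed <- cleanse_sketch(
  "It has\troots in a piece of classical Latin literature, from 45 BC!"
)
cleansed
```

Replacing unwanted characters with a space rather than deleting them avoids gluing neighbouring words together; the final whitespace pass then removes the surplus.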
txt <- "It has roots in a piece of classical Latin literature from 45 BC"
cleansing_corpus(txt)
Occupations' labels and structure are exposed at the ESCO web portal. This function extracts the language code from the file names of the downloadable CSVs; see Escopedia.
get_language_code(string)
string |
Filepath with a language code as given by ESCO downloadable .CSVs. |
A character vector with two-letter language codes.
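One plausible way to pull the code out of a file name such as `occupations_en.csv` is a regular expression on the part between the last underscore and the extension. The helper below, `extract_lang_sketch`, is hypothetical and only assumes that naming pattern; it is not the package's implementation.

```r
# Hypothetical helper: assumes file names end in "_<two letters>.csv".
extract_lang_sketch <- function(path) {
  file <- basename(path)
  # Capture the two letters between the last underscore and ".csv".
  hits <- regmatches(
    file,
    regexec("_([a-z]{2})\\.csv$", file, ignore.case = TRUE)
  )
  # Each hit is c(full match, captured group), or empty when nothing matched.
  vapply(
    hits,
    function(x) if (length(x) == 2) tolower(x[2]) else NA_character_,
    character(1)
  )
}

codes <- extract_lang_sketch(
  c("occupations_en.csv", "data/occupations_el.csv", "readme.txt")
)
codes
```

Returning `NA` for names that do not follow the pattern keeps the result aligned with the input vector.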
get_language_code("occupations_en.csv")
This function retrieves stopwords from the stopwords package using the ISO-639-1 encoding. For miscellaneous languages, data_stopwords_misc is used.
get_stopwords(code)
code |
A string with the language code of the stopwords. |
Character vector with the stopwords or NULL if the language code is unknown.
get_stopwords("en")[1:10]
This function performs language detection using the Compact Language Detector 2 from the CRAN package cld2. It is vectorised and guesses the language of each string. Note that it is not designed to do well on very short text, lists of proper names, part numbers, etc. CLD2 has the highest F1 score and is an order of magnitude faster than CLD3.
identify_language(text)
text |
A string with text to classify or a connection to read from. |
A character vector with ISO-639-1 two-letter language codes.
txt <- c("English is a West Germanic language ", "In espaniol, le lingua castilian")
identify_language(txt)
The International Standard Classification of Occupations (ISCO) is a four-level classification of occupation groups managed by the International Labour Organisation (ILO). Its structure follows a grouping by education level. The two latest versions of ISCO are ISCO-88 (dating from 1988) and ISCO-08 (dating from 2008).
The ESCO version used is ESCO v1 1.0.3, retrieved on 11/12/2019.
isco_occupations_bundle
A data.table with 2 variables, which are:
Four-level classification of occupation groups.
Preferred name of ISCO occupation concepts.
International Standard Classification of Occupations (ISCO).
The occupations pillar is one of the three pillars of ESCO. It organizes the occupation concepts. To structure the occupations, it uses hierarchical relationships between them, metadata, and mappings to the International Standard Classification of Occupations (ISCO). The description of each concept is provided only in English.
The ESCO version used is ESCO v1 1.0.3, retrieved on 11/12/2019.
occupations_bundle
A data.table with 5 variables, which are:
Uniform Resource Identifier of occupations.
Four-level classification of occupation groups, see ISCO.
Preferred name of ESCO occupation concepts.
Alternative labels of ESCO occupation concepts.
Description of ESCO occupation concepts.
European Skills/Competences, Qualifications and Occupations ESCO.
Measures the weighted amount of information concerning the specificity of terms in a corpus. Term frequency–inverse document frequency (tf–idf) is one of the most frequently applied weighting schemes in information retrieval systems. The tf–idf is a statistical measure that grows with the number of times a word appears in a document, but is offset by the number of documents in the corpus that contain the word. Variations of the tf–idf are often used to estimate a document's relevance given a free-text query.
tf_idf( corpus, stopwords = NULL, id_col = "id", text_col = "text", tf_weight = "double_norm", idf_weight = "idf_smooth", min_chars = 2, norm = TRUE )
corpus |
Input data, with an id column and a text column. Can be of type data.frame or data.table. |
stopwords |
A character vector of stopwords. Stopwords are filtered out before calculating numerical statistics. |
id_col |
Input data column name with the ids of the documents. |
text_col |
Input data column name with the documents. |
tf_weight |
Weighting scheme of term frequency. Choices are |
idf_weight |
Weighting scheme of inverse document frequency. Choices are |
min_chars |
Words with fewer characters than |
norm |
Boolean value for document normalization. |
A data.table with three columns, namely class (derived from the given document ids), term and tfIdf.
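To make the weighting concrete, the sketch below computes tf-idf weights by hand on a toy corpus in base R. It assumes the textbook definitions of double normalization (0.5 + 0.5 f / max f, applied to terms present in the document), smooth idf (log(N / (1 + n_t)) + 1) and Euclidean document normalization; the formulas behind the package's `"double_norm"` and `"idf_smooth"` options may differ in detail, and the documents are made up.

```r
# Toy tokenized corpus (made-up documents).
docs <- list(
  d1 = c("software", "engineer", "software"),
  d2 = c("civil", "engineer"),
  d3 = c("cashier")
)
n_docs <- length(docs)
vocab <- sort(unique(unlist(docs)))

# Document-by-term matrix of raw counts.
counts <- t(vapply(
  docs,
  function(tokens) as.numeric(table(factor(tokens, levels = vocab))),
  numeric(length(vocab))
))
colnames(counts) <- vocab

# Term frequency with double normalization (assumed textbook form),
# applied only to terms that occur in the document.
max_count <- apply(counts, 1, max)
tf <- ifelse(counts > 0, 0.5 + 0.5 * counts / max_count, 0)

# Smooth inverse document frequency (assumed textbook form).
doc_freq <- colSums(counts > 0)
idf <- log(n_docs / (1 + doc_freq)) + 1

# tf-idf weights, then scale each document to unit Euclidean norm.
weights <- sweep(tf, 2, idf, `*`)
weights <- weights / sqrt(rowSums(weights^2))
round(weights, 3)
```

Terms that are frequent in one document but rare across the corpus, such as `software` in `d1`, end up with a larger weight than terms shared by several documents, such as `engineer`.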
library(data.table)
corpus <- copy(occupations_bundle)
invisible(corpus[, text := paste(preferredLabel, altLabels)])
invisible(corpus[, text := cleansing_corpus(text)])
corpus <- corpus[, .(conceptUri, text)]
setnames(corpus, c("id", "text"))
tf_idf(corpus)