Savoy, Jacques
Name
Savoy, Jacques
Main affiliation
Position
Full professor
Email
jacques.savoy@unine.ch
Identifiers
Search results
Showing items 1 - 10 of 109
- Publication (open access): Catégorisation de documents: applications en attribution d’auteur et analyse stylistique [Document categorization: applications in authorship attribution and stylistic analysis] (2017)
Document categorization (assigning a text to one or more predefined categories) is a multifaceted problem. Automatic indexing is one such facet, and it relies on the semantics of documents. Other applications, however, analyze function words, forms that carry little or no meaning. These words largely make it possible to describe an author's style, and even to determine some aspects of the author's profile. On this basis, we present how to identify the true author of a document, or to determine whether it was written by a man or a woman. To illustrate, we examine the case of Elena Ferrante, a pseudonym known worldwide since the publication of the novel L'amie prodigieuse (Gallimard, 2016). As a further example, we analyze the speeches of American presidents from G. Washington (1789) to D. Trump (2017) to uncover traces of both stylistic and thematic evolution. In this last case, a summary is extracted from a corpus of speeches in the form of a graph depicting the similarities between presidencies.
- Publication (open access): Term Proximity Scoring for Keyword-Based Retrieval Systems (2015)
Rasolofo, Yves
This paper suggests the use of proximity measurement in combination with the Okapi probabilistic model. First, using the Okapi system, our investigation was carried out in a distributed retrieval framework to calculate the same relevance score as that achieved by a single centralized index. Second, by applying a term-proximity scoring heuristic to the top documents returned by a keyword-based system, our aim is to enhance retrieval performance. Our experiments were conducted using the TREC8, TREC9 and TREC10 test collections, and show that the suggested approach is stable and generally tends to improve retrieval effectiveness, especially among the top documents retrieved.
- Publication (open access): La voix du Président américain (1934-2014) [The voice of the American president (1934-2014)] (2014)
This paper describes a lexical study of the State of the Union addresses from 1934 to 2014. This corpus covers roughly 80 years of American government and contains 81 speeches delivered by thirteen presidents. The study shows that the most frequent lemmas do not provide useful and pertinent information. However, when analyzing the part-of-speech (POS) distribution for each president, we can see that some presidents, such as Eisenhower or Kennedy, use noun phrases more frequently, while others (e.g., Obama) tend to favor verbs. When observing sentence length, we notice that the mean sentence length tends to decrease slightly over the years. Based on an intertextual distance, this study shows that speeches given by the same president usually tend to cluster together. This is not a strong pattern and, for example, some of the speeches by Reagan or Bush (father) tend to cluster with other addresses. Using a topic model (latent Dirichlet allocation), we found that some presidents (e.g., Nixon, Bush (son), Obama) tend to concentrate on a single distinctive topic, while speeches given by other presidents (e.g., Kennedy) tend to cover several topics.
- Publication (metadata only)
- Publication (open access): Authorship attribution based on a probabilistic topic model (2013)
This paper describes, evaluates and compares the use of latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions, with each topic specifying a distribution over words. Based on author profiles (the aggregation of all texts written by the same writer), we suggest computing the distance to a disputed text to determine its possible writer. This distance is based on the difference between the two topic distributions. To evaluate different attribution schemes, we carried out an experiment based on 5408 newspaper articles (Glasgow Herald) written by 20 distinct authors. To complement this experiment, we used 4326 articles extracted from the Italian newspaper La Stampa and written by 20 journalists. This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback–Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms.
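The attribution step described in this abstract (compare the topic distribution of a disputed text with each author profile and pick the closest) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say which distance is used, so the L1 distance here is an assumption, and the names `l1_distance`, `attribute`, and the toy distributions are hypothetical. The topic distributions are assumed to have been inferred beforehand by an LDA model.

```python
from typing import Dict, List


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two topic distributions.

    Assumption: the abstract only says the distance is "based on the
    difference between the two topic distributions"; L1 is one simple choice.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(disputed: List[float], profiles: Dict[str, List[float]]) -> str:
    """Return the author whose profile is closest to the disputed text."""
    return min(profiles, key=lambda author: l1_distance(disputed, profiles[author]))


# Hypothetical topic distributions over three topics (not data from the paper).
profiles = {
    "author_a": [0.70, 0.20, 0.10],
    "author_b": [0.10, 0.30, 0.60],
}
disputed_text = [0.60, 0.30, 0.10]

closest_author = attribute(disputed_text, profiles)
print(closest_author)
```

Any other distance between probability distributions could be passed through the same nearest-profile logic by swapping out `l1_distance`.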
- Publication (open access): Cultural Heritage in CLEF (CHiC) (2013)
Petras, Vivien; Bogers, Toine; Toms, Elaine; Hall, Mark; Malak, Piotr; Pawłowski, Adam; Ferro, Nicola; Masiero, Ivano
The Cultural Heritage in CLEF 2013 lab comprised three tasks: multilingual ad-hoc retrieval and semantic enrichment in 13 languages (Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish, and Swedish), Polish ad-hoc retrieval, and the interactive task, which studied user behavior via log analysis and questionnaires. For the multilingual and Polish sub-tasks, more than 170,000 documents were assessed for relevance on a tertiary scale. The multilingual task had 7 participants submitting 30 multilingual and 41 monolingual runs. The Polish task comprised 3 participating groups submitting manual and automatic runs. The interactive task had 4 participating research groups and 208 user participants in the study. For the multilingual task, results show that more participants are necessary in order to provide comparative analyses. The interactive task created a rich data set comprising questionnaire and log data. Further analysis of the data is planned.
- Publication (open access): Information retrieval with Hindi, Bengali, and Marathi languages: evaluation and analysis (2013)
Akasereh, Mitra; Dolamic, Ljiljana
Our first objective in participating in the FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in the Hindi, Bengali and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both the Marathi and Hindi languages during this second campaign. We have compared their retrieval effectiveness with both a light stemming strategy and an n-gram language-independent approach. As another language-independent indexing strategy, we have evaluated the trunc-n method, in which the indexing term is formed by considering only the first n letters of each word. To evaluate these solutions, we have used various IR models, including models derived from Divergence from Randomness (DFR), the Language Model (LM), as well as Okapi and the classical tf idf vector-processing approach.
For the three studied languages, our experiments tend to show that IR models derived from the Divergence from Randomness (DFR) paradigm produce the best overall results. For these languages, our experiments also demonstrate that either an aggressive stemming procedure or the trunc-n indexing approach produces better retrieval effectiveness than other word-based or n-gram language-independent approaches. Applying the Z-score as a data fusion operator after blind-query expansion also tends to improve the MAP of the merged run over the best single IR system.
- Publication (open access): Ad hoc retrieval with Marathi language (2013)
Akasereh, Mitra
Our goal in participating in the FIRE 2011 evaluation campaign is to analyse and evaluate the retrieval effectiveness of our implemented retrieval system when using the Marathi language. We have developed a light and an aggressive stemmer for this language, as well as a stopword list. In our experiments, seven different IR models (language model, DFR-PL2, DFR-PB2, DFR-GL2, DFR-I(n_e)C2, tf idf and Okapi) were used to evaluate the influence of these stemmers, as well as of the n-gram and trunc-n language-independent indexing strategies, on retrieval performance. We also applied a pseudo-relevance feedback (blind-query expansion) approach to estimate its impact on retrieval effectiveness. Our results show that, for the Marathi language, the DFR-I(n_e)C2, DFR-PL2 and Okapi IR models yield the best performance. For this language, the trunc-n indexing strategy gives the best retrieval effectiveness compared to the other stemming and indexing approaches. The adopted pseudo-relevance feedback approach also tends to enhance retrieval effectiveness.
- Publication (metadata only)
- Publication (metadata only)
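The two language-independent indexing strategies mentioned in the FIRE entries above, trunc-n (keep only the first n letters of each word) and character n-grams, can be illustrated with a short sketch. This is a simplified reading of those abstracts, not the authors' code: the function names, whitespace tokenization, lowercasing, and the handling of words shorter than n are all assumptions made for the example.

```python
from typing import List


def trunc_n(word: str, n: int) -> str:
    """Keep only the first n letters of a word (shorter words are unchanged)."""
    return word[:n]


def char_ngrams(word: str, n: int) -> List[str]:
    """Overlapping character n-grams; assumption: short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, n: int, strategy: str = "trunc") -> List[str]:
    """Turn a text into indexing terms using trunc-n or character n-grams."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    if strategy == "trunc":
        return [trunc_n(w, n) for w in words]
    if strategy == "ngram":
        return [gram for w in words for gram in char_ngrams(w, n)]
    raise ValueError(f"unknown strategy: {strategy}")


print(index_terms("Information Retrieval", 5, "trunc"))
print(index_terms("index", 4, "ngram"))
```

Note that Python string slicing works on Unicode code points, so for Devanagari or Bengali text a sketch like this may cut between a consonant and its dependent vowel sign; a real system would need to decide how to count "letters" in those scripts.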