

CoPEP: Corpus of Portuguese from Academic Journals

The CoPEP Corpus (Corpus de Português Escrito em Periódicos) is a synchronic corpus of Portuguese made up of around 10,000 texts collected from academic journals in Brazil and Portugal. The corpus was prepared specifically for a lexicographic project focussed on designing an online corpus-driven dictionary of Portuguese for university students (Kuhn, 2017). It contains nearly 50 million tokens, distributed among three Schools of Knowledge and further divided into six Great Areas (according to the CAPES classification).

The subcorpora for the two language varieties are almost the same size and contain a similar number of words in each School and Great Area, which keeps the corpus evenly balanced. Text metadata, such as year of publication, Great Area of Knowledge and ISSN, have been carefully recorded to support advanced corpus search options.

For more detailed information, check the publications section.


How to cite this corpus:

Tanara Zingano Kuhn & José Pedro Ferreira (2018). CoPEP - Corpus de Português Escrito em Periódicos. Coimbra: CELGA-ILTEC.


Some data and numbers

Texts are distributed into three Schools of Knowledge, and further divided into six Great Areas (according to CAPES classification):

 

Schools of Knowledge and their Great Areas

Humanities (HU): Human Sciences (Hu), Applied Social Sciences (Ap)

Life Sciences (CV): Health Sciences (He), Agricultural Sciences (Ag)

Exact, Earth and Multidisciplinary Sciences (CE): Engineering (En), Exact and Earth Sciences (Ex)


The corpus contains nearly 50 million tokens and is finely balanced between varieties, both globally and within each scientific domain.

 

                                                   Corpus  Brazilian Portuguese   European Portuguese
Texts                                               9,900                 3,811                 6,089
Words                                          40,424,598            20,250,823            20,173,775
Tokens (also for data below)                   48,840,337            24,427,255            24,413,082
Humanities                                     30,988,552            15,460,402            15,528,150
  Human Sciences                               25,595,789            12,763,135            12,832,654
  Applied Social Sciences                       5,392,763             2,697,267             2,695,496
Life Sciences                                  16,151,841             8,112,981             8,038,860
  Health Sciences                              13,540,819             6,797,058             6,743,761
  Agricultural Sciences                         2,611,022             1,315,923             1,295,099
Exact, Earth and Multidisciplinary Sciences     1,699,944               853,872               846,072
  Exact and Earth Sciences                        829,983               409,500               420,483
  Engineering                                     869,961               444,372               425,589
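
The balance between varieties can be checked directly from the token counts reported above. The following is a minimal Python sketch, not part of the CoPEP distribution: the dictionary `TOKENS` and the helper `variety_shares` are names introduced here for illustration, and the figures are simply copied from the table. It sums the six Great Areas per variety and prints each variety's share of tokens within every area.

```python
# Token counts per Great Area, copied from the table above as
# (Brazilian Portuguese, European Portuguese). The structure and names
# are illustrative assumptions, not an official CoPEP data format.
TOKENS = {
    "Human Sciences": (12_763_135, 12_832_654),
    "Applied Social Sciences": (2_697_267, 2_695_496),
    "Health Sciences": (6_797_058, 6_743_761),
    "Agricultural Sciences": (1_315_923, 1_295_099),
    "Exact and Earth Sciences": (409_500, 420_483),
    "Engineering": (444_372, 425_589),
}


def variety_shares(bp: int, ep: int) -> tuple[float, float]:
    """Return the Brazilian and European shares of the combined token count."""
    total = bp + ep
    return bp / total, ep / total


# Totals per variety, obtained by summing the six Great Areas.
bp_total = sum(bp for bp, _ in TOKENS.values())
ep_total = sum(ep for _, ep in TOKENS.values())

# Share of each variety within every Great Area.
shares = {area: variety_shares(bp, ep) for area, (bp, ep) in TOKENS.items()}

for area, (bp_share, ep_share) in shares.items():
    print(f"{area:<26} BP {bp_share:6.2%}   EP {ep_share:6.2%}")

overall_bp, overall_ep = variety_shares(bp_total, ep_total)
print(f"{'Whole corpus':<26} BP {overall_bp:6.2%}   EP {overall_ep:6.2%}")
```

The per-variety sums reproduce the totals in the Tokens row (24,427,255 and 24,413,082), and within every Great Area each variety's share stays within about two percentage points of 50%.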

 


This study was partly financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES - Finance Code 001), and in part by the Fundação para a Ciência e a Tecnologia - Portugal, through CELGA-ILTEC's Strategic Project (POCI-01-0145-FEDER-006986 - UID/LIN/04887/2013).