
Natural Language Processing

TM-Town benefits from many open source natural language processing technologies and advancements. To give back to the community, TM-Town strives to provide educational materials (through the TM-Town blog) and NLP resources, and to open source some of its internal algorithms (such as Pragmatic Segmenter, TM-Town's Ruby segmentation gem). This page contains a collection of natural language processing resources focused mainly on segmentation, alignment, and pre-processed parallel corpora.


Segmentation

The importance of segmentation is often ignored in the literature on text alignment.

Segmentation in this section refers specifically to sentence segmentation - also known as sentence boundary disambiguation or sentence boundary detection. According to Wikipedia, sentence boundary disambiguation is defined as:

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.
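To make the ambiguity concrete, here is a minimal Ruby sketch of a naive splitter. The method name `naive_split` is hypothetical and is not part of any library discussed on this page; it simply treats every period, question mark, or exclamation point followed by whitespace as a boundary.

```ruby
# Naive approach: every ".", "?" or "!" followed by whitespace ends a sentence.
def naive_split(text)
  text.split(/(?<=[.?!])\s+/)
end

# Works when punctuation is unambiguous.
p naive_split("Hello World. My name is Jonas.")
# => ["Hello World.", "My name is Jonas."]

# Breaks on a one-letter abbreviation: "E." is mistaken for a sentence end.
p naive_split("My name is Jonas E. Smith.")
# => ["My name is Jonas E.", "Smith."]
```

The second call shows why punctuation alone is not enough to decide where a sentence ends.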

Pragmatic Segmenter

Pragmatic Segmenter is TM-Town's open source segmentation tool for Ruby. Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.

The goal of Pragmatic Segmenter is to provide a "real-world" segmenter that works without any setup across many languages and does a reasonable job when the format and domain of the input text are unknown. Pragmatic Segmenter does not use any machine-learning techniques and thus does not require training data.

Pragmatic Segmenter aims to improve on other segmentation engines in two main areas:

  1. Language support (most segmentation tools only focus on English)
  2. Text cleaning and preprocessing

Pragmatic Segmenter is opinionated and made for the explicit purpose of segmenting texts to create translation memories. Therefore, to maintain coherence, a parenthetical within a sentence is kept in one segment with its surrounding sentence, even if the segment technically contains two or more sentences. The algorithm is also conservative: when it comes across an ambiguous sentence boundary, it does not split.
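One way to picture the "keep parentheticals together" behavior is to hide sentence punctuation inside parentheses before splitting and restore it afterwards. The sketch below is only an illustration of that idea, not Pragmatic Segmenter's actual implementation; the method name, the constants, and the choice of control characters as placeholders are all assumptions.

```ruby
VISIBLE = ".?!".freeze
# Hypothetical placeholders; assumed never to occur in the input text.
MASKED = "\u0001\u0002\u0003".freeze

def split_keeping_parentheticals(text)
  # Mask sentence punctuation inside (...) so the splitter cannot see it.
  masked = text.gsub(/\([^()]*\)/) { |group| group.tr(VISIBLE, MASKED) }
  # Split on the remaining punctuation, then restore the masked characters.
  masked.split(/(?<=[.?!])\s+/).map { |segment| segment.tr(MASKED, VISIBLE) }
end

text = "He teaches science (He previously worked for 5 years as an engineer.) " \
       "at the local University. She agrees."
p split_keeping_parentheticals(text)
```

With this input the parenthetical stays inside the first segment, and "She agrees." becomes the second.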

Tools, Libraries and Algorithms

*GRS = Golden Rule Score. See below to download the full Golden Rule Test Set.

Name                 GRS (English)    GRS (Other Languages)  Speed
Pragmatic Segmenter  98.08% (51/52)   100.00% (35/35)        3.84 s
TactfulTokenizer     65.38% (34/52)   48.57% (17/35)         46.32 s
OpenNLP              59.62% (31/52)   45.71% (16/35)         1.27 s
Stanford CoreNLP     59.62% (31/52)   31.43% (11/35)         0.92 s
Splitta              55.77% (29/52)   37.14% (13/35)         N/A
Punkt                46.15% (24/52)   48.57% (17/35)         1.79 s
srx-english          30.77% (16/52)   28.57% (10/35)         6.19 s
Scapel               28.85% (15/52)   20.00% (7/35)          0.13 s
FreeLing
Alpino
trtok
segtok
LingPipe
Elephant
Ucto: Unicode Tokenizer
tokenizer

GRS (Other Languages) is the total of the Golden Rules listed below for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed below.
Speed is based on the performance benchmark results detailed in the section "Speed Performance Benchmarks" below. The number is an average of 10 runs.
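The percentages in the table are the number of Golden Rules passed divided by the number of rules tested. A small Ruby sketch of that arithmetic (the method name `golden_rule_score` is hypothetical):

```ruby
# Golden Rule Score: share of rules passed, as a percentage rounded to 2 decimals.
def golden_rule_score(passed, total)
  (passed.to_f / total * 100).round(2)
end

puts golden_rule_score(51, 52) # 98.08, the English score for Pragmatic Segmenter
puts golden_rule_score(17, 35) # 48.57, the other-languages score for Punkt
```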

Speed Performance Benchmarks

To test the relative performance of different segmentation tools and libraries I created a simple benchmark test. The test takes the 50 English Golden Rules combined into one string and runs it 100 times through the segmenter. This speed benchmark is by no means the most scientific benchmark, but it should help to give some relative performance data. The tests were done on a Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running OS X 10.9.5. For Punkt, Stanford CoreNLP and OpenNLP the tests were run using the Ruby port of the library.
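A benchmark of this shape could be sketched in Ruby roughly as follows. This is not the original benchmark script: the `average_runtime` method, the stand-in segmenter lambda, and the two sample rules are assumptions made for illustration. Only `Benchmark.realtime` comes from Ruby's standard library.

```ruby
require "benchmark"

# Hypothetical stand-in; replace with a call to the segmenter under test.
segmenter = ->(text) { text.split(/(?<=[.?!])\s+/) }

# Time `iterations` passes over the text, repeat `runs` times, return the mean.
def average_runtime(segmenter, text, iterations: 100, runs: 10)
  timings = Array.new(runs) do
    Benchmark.realtime { iterations.times { segmenter.call(text) } }
  end
  timings.sum / runs
end

# Two sample rules stand in for the full set of English Golden Rules.
rules = ["Hello World. My name is Jonas.", "There it is! I found it."]
seconds = average_runtime(segmenter, rules.join(" "))
puts format("%.4f s", seconds)
```

Averaging over several runs smooths out one-off effects such as garbage collection, although the result remains a rough relative measure.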

The Golden Rules

Download Golden Rules: [txt, Ruby RSpec]

The Golden Rules are a set of tests developed by TM-Town that can be run through a segmenter to check its accuracy. This list is by no means complete and will evolve and expand over time.
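Each rule pairs an input string with the segments a segmenter is expected to return. A minimal harness for running such rules might look like the sketch below; the `GoldenRule` struct, the `failed_rules` method, and the naive lambda are hypothetical names used for illustration, not part of the downloadable test set.

```ruby
GoldenRule = Struct.new(:name, :input, :expected)

RULES = [
  GoldenRule.new("Simple period to end sentence",
                 "Hello World. My name is Jonas.",
                 ["Hello World.", "My name is Jonas."]),
  GoldenRule.new("One letter upper case abbreviations",
                 "My name is Jonas E. Smith.",
                 ["My name is Jonas E. Smith."])
].freeze

# Returns the rules whose expected segments the segmenter fails to reproduce.
def failed_rules(rules, segmenter)
  rules.reject { |rule| segmenter.call(rule.input) == rule.expected }
end

# A deliberately naive segmenter, used here only to show a failing rule.
naive = ->(text) { text.split(/(?<=[.?!])\s+/) }
failed_rules(RULES, naive).each { |rule| puts "FAILED: #{rule.name}" }
```

Any object that responds to `call` and returns an array of strings can be passed in as the segmenter.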

English

  1. Simple period to end sentence
    Hello World. My name is Jonas.
    ["Hello World.", "My name is Jonas."]
  2. Question mark to end sentence
    What is your name? My name is Jonas.
    ["What is your name?", "My name is Jonas."]
  3. Exclamation point to end sentence
    There it is! I found it.
    ["There it is!", "I found it."]
  4. One letter upper case abbreviations
    My name is Jonas E. Smith.
    ["My name is Jonas E. Smith."]
  5. One letter lower case abbreviations
    Please turn to p. 55.
    ["Please turn to p. 55."]
  6. Two letter lower case abbreviations in the middle of a sentence
    Were Jane and co. at the party?
    ["Were Jane and co. at the party?"]
  7. Two letter upper case abbreviations in the middle of a sentence
    They closed the deal with Pitt, Briggs & Co. at noon.
    ["They closed the deal with Pitt, Briggs & Co. at noon."]
  8. Two letter lower case abbreviations at the end of a sentence
    Let's ask Jane and co. They should know.
    ["Let's ask Jane and co.", "They should know."]
  9. Two letter upper case abbreviations at the end of a sentence
    They closed the deal with Pitt, Briggs & Co. It closed yesterday.
    ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]
  10. Two letter (prepositive) abbreviations
    I can see Mt. Fuji from here.
    ["I can see Mt. Fuji from here."]
  11. Two letter (prepositive & postpositive) abbreviations
    St. Michael's Church is on 5th st. near the light.
    ["St. Michael's Church is on 5th st. near the light."]
  12. Possessive two letter abbreviations
    That is JFK Jr.'s book.
    ["That is JFK Jr.'s book."]
  13. Multi-period abbreviations in the middle of a sentence
    I visited the U.S.A. last year.
    ["I visited the U.S.A. last year."]
  14. Multi-period abbreviations at the end of a sentence
    I live in the E.U. How about you?
    ["I live in the E.U.", "How about you?"]
  15. U.S. as sentence boundary
    I live in the U.S. How about you?
    ["I live in the U.S.", "How about you?"]
  16. U.S. as non sentence boundary with next word capitalized
    I work for the U.S. Government in Virginia.
    ["I work for the U.S. Government in Virginia."]
  17. U.S. as non sentence boundary
    I have lived in the U.S. for 20 years.
    ["I have lived in the U.S. for 20 years."]
  18. A.M. / P.M. as non sentence boundary and sentence boundary
    At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.
    ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
  19. Number as non sentence boundary
    She has $100.00 in her bag.
    ["She has $100.00 in her bag."]
  20. Number as sentence boundary
    She has $100.00. It is in her bag.
    ["She has $100.00.", "It is in her bag."]
  21. Parenthetical inside sentence
    He teaches science (He previously worked for 5 years as an engineer.) at the local University.
    ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]
  22. Email addresses
    Her email is [email protected]. I sent her an email.
    ["Her email is [email protected].", "I sent her an email."]
  23. Web addresses
    The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.
    ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]
  24. Single quotations inside sentence
    She turned to him, 'This is great.' she said.
    ["She turned to him, 'This is great.' she said."]
  25. Double quotations inside sentence
    She turned to him, "This is great." she said.
    ["She turned to him, \"This is great.\" she said."]
  26. Double quotations at the end of a sentence
    She turned to him, "This is great." She held the book out to show him.
    ["She turned to him, \"This is great.\"", "She held the book out to show him."]
  27. Double punctuation (exclamation point)
    Hello!! Long time no see.
    ["Hello!!", "Long time no see."]
  28. Double punctuation (question mark)
    Hello?? Who is there?
    ["Hello??", "Who is there?"]
  29. Double punctuation (exclamation point / question mark)
    Hello!? Is that you?
    ["Hello!?", "Is that you?"]
  30. Double punctuation (question mark / exclamation point)
    Hello?! Is that you?
    ["Hello?!", "Is that you?"]
  31. List (period followed by parens and no period to end item)
    1.) The first item 2.) The second item
    ["1.) The first item", "2.) The second item"]
  32. List (period followed by parens and period to end item)
    1.) The first item. 2.) The second item.
    ["1.) The first item.", "2.) The second item."]
  33. List (parens and no period to end item)
    1) The first item 2) The second item
    ["1) The first item", "2) The second item"]
  34. List (parens and period to end item)
    1) The first item. 2) The second item.
    ["1) The first item.", "2) The second item."]
  35. List (period to mark list and no period to end item)
    1. The first item 2. The second item
    ["1. The first item", "2. The second item"]
  36. List (period to mark list and period to end item)
    1. The first item. 2. The second item.
    ["1. The first item.", "2. The second item."]
  37. List with bullet
    • 9. The first item • 10. The second item
    ["• 9. The first item", "• 10. The second item"]
  38. List with hyphen
    ⁃9. The first item ⁃10. The second item
    ["⁃9. The first item", "⁃10. The second item"]
  39. Alphabetical list
    a. The first item b. The second item c. The third list item
    ["a. The first item", "b. The second item", "c. The third list item"]
  40. Errant newline in the middle of a sentence (PDF)
    This is a sentence\ncut off in the middle because pdf.
    ["This is a sentence\ncut off in the middle because pdf."]
  41. Errant newline in the middle of a sentence
    It was a cold \nnight in the city.
    ["It was a cold night in the city."]
  42. Lower case list separated by newline
    features\ncontact manager\nevents, activities\n
    ["features", "contact manager", "events, activities"]
  43. Geo Coordinates
    You can find it at N°. 1026.253.553. That is where the treasure is.
    ["You can find it at N°. 1026.253.553.", "That is where the treasure is."]
  44. Named entities with an exclamation point
    She works at Yahoo! in the accounting department.
    ["She works at Yahoo! in the accounting department."]
  45. I as a sentence boundary and I as an abbreviation
    We make a good team, you and I. Did you see Albert I. Jones yesterday?
    ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]
  46. Ellipsis at end of quotation
    Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”
    ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"]
  47. Ellipsis with square brackets
    "Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).
    ["\"Bohr [...] used the analogy of parallel stairways [...]\" (Smith 55)."]
  48. Ellipsis as sentence boundary (standard ellipsis rules)
    If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.
    ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]
  49. Ellipsis as sentence boundary (non-standard ellipsis rules)
    I never meant that.... She left the store.
    ["I never meant that....", "She left the store."]
  50. Ellipsis as non sentence boundary
    I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.
    ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."]
  51. 4-dot ellipsis
    One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .
    ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."]
  52. No whitespace in between sentences (credit: Don_Patrick)
    Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.
    ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]
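Several of the English rules above hinge on abbreviations. As a rough illustration of one rule-based tactic, the sketch below refuses to end a segment on a prepositive abbreviation such as "Mt." or "Mr.". The abbreviation list and the method name are assumptions, and the sketch covers only a handful of the rules (for example, it does not handle "a.m." or "U.S."); it is not how any tool in the table is known to work.

```ruby
# Small illustrative list of abbreviations that precede a name (assumption).
PREPOSITIVE = %w[mr mrs dr mt st].freeze

def split_with_abbreviations(text)
  segments = []
  buffer = +""
  text.split(/(?<=[.?!])\s+/).each do |piece|
    buffer << " " unless buffer.empty?
    buffer << piece
    # If the buffer ends in a known prepositive abbreviation, keep accumulating.
    last_word = buffer[/(\w+)\.\z/, 1]
    next if last_word && PREPOSITIVE.include?(last_word.downcase)
    segments << buffer
    buffer = +""
  end
  segments << buffer unless buffer.empty?
  segments
end

p split_with_abbreviations("I can see Mt. Fuji from here.")
p split_with_abbreviations("Hello World. My name is Jonas.")
```

A fixed list like this cannot decide cases such as "co." or "U.S.", which may or may not end a sentence depending on context.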

German

  1. Quotation at end of sentence
    „Ich habe heute keine Zeit“, sagte die Frau und flüsterte leise: „Und auch keine Lust.“ Wir haben 1.000.000 Euro.
    ["„Ich habe heute keine Zeit“, sagte die Frau und flüsterte leise: „Und auch keine Lust.“", "Wir haben 1.000.000 Euro."]
  2. Abbreviations
    Es gibt jedoch einige Vorsichtsmaßnahmen, die Du ergreifen kannst, z. B. ist es sehr empfehlenswert, dass Du Dein Zuhause von allem Junkfood befreist.
    ["Es gibt jedoch einige Vorsichtsmaßnahmen, die Du ergreifen kannst, z. B. ist es sehr empfehlenswert, dass Du Dein Zuhause von allem Junkfood befreist."]
  3. Numbers
    Was sind die Konsequenzen der Abstimmung vom 12. Juni?
    ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]

Japanese

  1. Simple period to end sentence
    これはペンです。それはマーカーです。
    ["これはペンです。", "それはマーカーです。"]
  2. Question mark to end sentence
    それは何ですか?ペンですか?
    ["それは何ですか?", "ペンですか?"]
  3. Exclamation point to end sentence
    良かったね!すごい!
    ["良かったね!", "すごい!"]
  4. Quotation
    自民党税制調査会の幹部は、「引き下げ幅は3.29%以上を目指すことになる」と指摘していて、今後、公明党と合意したうえで、30日に決定する与党税制改正大綱に盛り込むことにしています。
    ["自民党税制調査会の幹部は、「引き下げ幅は3.29%以上を目指すことになる」と指摘していて、今後、公明党と合意したうえで、30日に決定する与党税制改正大綱に盛り込むことにしています。"]
  5. Errant newline in the middle of a sentence
    これは父の\n家です。
    ["これは父の家です。"]
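Because Japanese sentence-ending markers are largely unambiguous, the rules above can be approximated with very little logic. The sketch below is a simplification under stated assumptions: it deletes every newline (which would be wrong for text where newlines are meaningful), and the method name `split_japanese` is hypothetical.

```ruby
# Sentence-ending marks, in both full-width and ASCII forms.
JA_SENTENCE = /[^。?!？！]+[。?!？！]*/

def split_japanese(text)
  # Drop errant newlines, then take each run of text up to its ending mark(s).
  text.delete("\n").scan(JA_SENTENCE)
end

p split_japanese("これはペンです。それはマーカーです。")
p split_japanese("これは父の\n家です。")
```

The decimal point in "3.29%" from the quotation rule is not a problem here, since the ASCII period is not treated as a sentence-ending mark.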

Arabic

  1. Regular punctuation
    سؤال وجواب: ماذا حدث بعد الانتخابات الايرانية؟ طرح الكثير من التساؤلات غداة ظهور نتائج الانتخابات الرئاسية الايرانية التي أججت مظاهرات واسعة واعمال عنف بين المحتجين على النتائج ورجال الامن. يقول معارضو الرئيس الإيراني إن الطريقة التي اعلنت بها النتائج كانت مثيرة للاستغراب.
    ["سؤال وجواب:", "ماذا حدث بعد الانتخابات الايرانية؟", "طرح الكثير من التساؤلات غداة ظهور نتائج الانتخابات الرئاسية الايرانية التي أججت مظاهرات واسعة واعمال عنف بين المحتجين على النتائج ورجال الامن.", "يقول معارضو الرئيس الإيراني إن الطريقة التي اعلنت بها النتائج كانت مثيرة للاستغراب."]
  2. Abbreviations
    وقال د‪.‬ ديفيد ريدي و الأطباء الذين كانوا يعالجونها في مستشفى برمنجهام إنها كانت تعاني من أمراض أخرى. وليس معروفا ما اذا كانت قد توفيت بسبب اصابتها بأنفلونزا الخنازير.
    ["وقال د‪.‬ ديفيد ريدي و الأطباء الذين كانوا يعالجونها في مستشفى برمنجهام إنها كانت تعاني من أمراض أخرى.", "وليس معروفا ما اذا كانت قد توفيت بسبب اصابتها بأنفلونزا الخنازير."]
  3. Numbers and Dates
    ومن المنتظر أن يكتمل مشروع خط أنابيب نابوكو البالغ طوله 3300 كليومترا في 12‪/‬08‪/‬2014 بتكلفة تُقدر بـ 7.9 مليارات يورو أي نحو 10.9 مليارات دولار. ومن المقرر أن تصل طاقة ضخ الغاز في المشروع 31 مليار متر مكعب انطلاقا من بحر قزوين مرورا بالنمسا وتركيا ودول البلقان دون المرور على الأراضي الروسية.
    ["ومن المنتظر أن يكتمل مشروع خط أنابيب نابوكو البالغ طوله 3300 كليومترا في 12‪/‬08‪/‬2014 بتكلفة تُقدر بـ 7.9 مليارات يورو أي نحو 10.9 مليارات دولار.", "ومن المقرر أن تصل طاقة ضخ الغاز في المشروع 31 مليار متر مكعب انطلاقا من بحر قزوين مرورا بالنمسا وتركيا ودول البلقان دون المرور على الأراضي الروسية."]
  4. Time
    الاحد, 21 فبراير/ شباط, 2010, 05:01 GMT الصنداي تايمز: رئيس الموساد قد يصبح ضحية الحرب السرية التي شتنها بنفسه. العقل المنظم هو مئير داجان رئيس الموساد الإسرائيلي الذي يشتبه بقيامه باغتيال القائد الفلسطيني في حركة حماس محمود المبحوح في دبي.
    ["الاحد, 21 فبراير/ شباط, 2010, 05:01 GMT الصنداي تايمز:", "رئيس الموساد قد يصبح ضحية الحرب السرية التي شتنها بنفسه.", "العقل المنظم هو مئير داجان رئيس الموساد الإسرائيلي الذي يشتبه بقيامه باغتيال القائد الفلسطيني في حركة حماس محمود المبحوح في دبي."]
  5. Comma
    عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب، زرعها عملاء الموساد كما تقول مصادر إسرائيلية، وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية، وبدأت مراسم الحداد عليه
    ["عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب،", "زرعها عملاء الموساد كما تقول مصادر إسرائيلية،", "وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية،", "وبدأت مراسم الحداد عليه"]

Italian

  1. Abbreviations
    Salve Sig.ra Mengoni! Come sta oggi?
    ["Salve Sig.ra Mengoni!", "Come sta oggi?"]
  2. Quotations
    Una lettera si può iniziare in questo modo «Il/la sottoscritto/a.».
    ["Una lettera si può iniziare in questo modo «Il/la sottoscritto/a.»."]
  3. Numbers
    La casa costa 170.500.000,00€!
    ["La casa costa 170.500.000,00€!"]

Russian

  1. Abbreviations
    Объем составляет 5 куб.м.
    ["Объем составляет 5 куб.м."]
  2. Quotations
    Маленькая девочка бежала и кричала: «Не видали маму?».
    ["Маленькая девочка бежала и кричала: «Не видали маму?»."]
  3. Numbers
    Сегодня 27.10.14
    ["Сегодня 27.10.14"]

Spanish

  1. Question mark to end sentence
    ¿Cómo está hoy? Espero que muy bien.
    ["¿Cómo está hoy?", "Espero que muy bien."]
  2. Exclamation point to end sentence
    ¡Hola señorita! Espero que muy bien.
    ["¡Hola señorita!", "Espero que muy bien."]
  3. Abbreviations
    Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre, el Dr. Naser.
    ["Hola Srta. Ledesma.", "Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre, el Dr. Naser."]
  4. Numbers
    ¡La casa cuesta $170.500.000,00! ¡Muy costosa! Se prevé una disminución del 12.5% para el próximo año.
    ["¡La casa cuesta $170.500.000,00!", "¡Muy costosa!", "Se prevé una disminución del 12.5% para el próximo año."]
  5. Quotations
    «Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles.
    ["«Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles."]

Greek

  1. Question mark to end sentence
    Με συγχωρείτε· πού είναι οι τουαλέτες; Τις Κυριακές δε δούλευε κανένας. το κόστος του σπιτιού ήταν £260.950,00.
    ["Με συγχωρείτε· πού είναι οι τουαλέτες;", "Τις Κυριακές δε δούλευε κανένας.", "το κόστος του σπιτιού ήταν £260.950,00."]

Hindi

  1. Full stop
    सच्चाई यह है कि इसे कोई नहीं जानता। हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।
    ["सच्चाई यह है कि इसे कोई नहीं जानता।", "हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।"]

Armenian

  1. Sentence ending punctuation
    Ի՞նչ ես մտածում: Ոչինչ:
    ["Ի՞նչ ես մտածում:", "Ոչինչ:"]
  2. Ellipsis
    Ապրիլի 24-ին սկսեց անձրևել...Այդպես էի գիտեի:
    ["Ապրիլի 24-ին սկսեց անձրևել...Այդպես էի գիտեի:"]
  3. Period is not a sentence boundary
    Այսպիսով` մոտենում ենք ավարտին: Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:
    ["Այսպիսով` մոտենում ենք ավարտին:", "Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:"]

Burmese

  1. Sentence ending punctuation
    ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။၇ွင္ေနေကာင္းလား။
    ["ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။", "၇ွင္ေနေကာင္းလား။"]

Amharic

  1. Sentence ending punctuation
    እንደምን አለህ፧መልካም ቀን ይሁንልህ።እባክሽ ያልሽዉን ድገሚልኝ።
    ["እንደምን አለህ፧", "መልካም ቀን ይሁንልህ።", "እባክሽ ያልሽዉን ድገሚልኝ።"]

Persian

  1. Sentence ending punctuation
    خوشبختم، آقای رضا. شما کجایی هستید؟ من از تهران هستم.
    ["خوشبختم، آقای رضا.", "شما کجایی هستید؟", "من از تهران هستم."]

Urdu

  1. Sentence ending punctuation
    کیا حال ہے؟ ميرا نام ___ ەے۔ میں حالا تاوان دےدوں؟
    ["کیا حال ہے؟", "ميرا نام ___ ەے۔", "میں حالا تاوان دےدوں؟"]

Dutch

  1. Sentence starting with a number
    Hij schoot op de JP8-brandstof toen de Surface-to-Air (sam)-missiles op hem af kwamen. 81 procent van de schoten was raak.
    ["Hij schoot op de JP8-brandstof toen de Surface-to-Air (sam)-missiles op hem af kwamen.", "81 procent van de schoten was raak."]
  2. Sentence starting with an ellipsis
    81 procent van de schoten was raak. ...en toen barste de hel los.
    ["81 procent van de schoten was raak.", "...en toen barste de hel los."]

Papers and Books


Alignment

Papers and Books

Tools, Libraries and Algorithms

Coming soon...


Pre-processed Parallel Corpora

Coming soon...