Author | Message |
---|---|
pgatsby
Posts: 10
|
Posted 20:37 Nov 01, 2018 |
Hello,
As you can see, the fourth entry in the GUI is white space that is being counted as a "word". I thought about using the dictionary file from the previous lab to verify that each string is a word, but that seemed redundant, and the lab does not ask for it.
*EDIT* I tried iterating through the list again and removing empty strings. It works: this dramatically reduced the amount of white space, but for large PDF files some white space still slips by.
Last edited by pgatsby at 20:47 Nov 01, 2018.
|
pgatsby
Posts: 10
|
Posted 20:39 Nov 01, 2018 |
As you can see, the white space is being counted. |
kcrespi
Posts: 12
|
Posted 21:54 Nov 01, 2018 |
It is counting the space as a word. If you are saving the whole PDF text in a string and then splitting it by spaces, make sure to split on one or more whitespace characters rather than on a single one. So change string.split("\\s"); to string.split("\\s+"); |
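To illustrate the difference between the two patterns, here is a minimal sketch. The class and method names (SplitDemo, splitSingle, splitRuns) are made up for this example and are not from the lab. Note that even with "\\s+", Java's split returns an empty first element when the text starts with whitespace, so the sketch trims first.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on every single whitespace character, so a run of two
    // whitespace characters produces an empty token between them.
    static String[] splitSingle(String text) {
        return text.split("\\s");
    }

    // Splits on runs of whitespace. Trimming first avoids the empty
    // leading token that split() returns when the text starts with whitespace.
    static String[] splitRuns(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String sample = "the  quick\n\nbrown fox";
        System.out.println(Arrays.toString(splitSingle(sample)));
        System.out.println(Arrays.toString(splitRuns(sample)));
    }
}
```

With the sample string above, splitSingle yields six tokens (two of them empty) while splitRuns yields four.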
pgatsby
Posts: 10
|
Posted 22:17 Nov 01, 2018 |
Interesting, I implemented this: string.replaceAll("[^a-zA-Z0-9\\s+]", "").trim(); The issue with the program, or with my code, is that blank strings still get through, whether I clean each String as I parse it with a clean(String s) method (assuming s contains only letters) or iterate through the list of words and check whether each string is empty. However, this only happens with big PDF files that have many line breaks; small PDF files with few line breaks are not affected. |
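Two details may be worth checking here. Inside a character class the + is a literal plus sign, not a quantifier, so "[^a-zA-Z0-9\\s+]" keeps '+' characters in the text. Also, a token made up only of punctuation (for example "--") becomes an empty string after cleaning, so the emptiness check has to happen after the clean step. The sketch below assumes a per-token clean helper; its body is a guess, since the original clean(String s) is not shown in the thread, and the class name CleanWords is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CleanWords {
    // Hypothetical stand-in for the clean(String s) helper from the post:
    // drops every character that is not an ASCII letter or digit.
    static String clean(String token) {
        return token.replaceAll("[^a-zA-Z0-9]", "");
    }

    // Cleans each token and keeps it only if something is left afterwards.
    // Building a new list avoids removing elements while looping by index,
    // which can skip the element that follows each removal.
    static List<String> cleanAll(List<String> tokens) {
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            String cleaned = clean(token);
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("hello,", "--", "", "world");
        System.out.println(cleanAll(tokens));
    }
}
```

If the list is filtered in place instead, tokens.removeIf(String::isEmpty) on a mutable list is a safe alternative to a manual index loop.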
kcrespi
Posts: 12
|
Posted 23:01 Nov 01, 2018 |
This is how I cleaned my string: stripper.getText(pdf).trim().toLowerCase().replaceAll("[^A-Za-z0-9 ]", "").split("\\s+"); and I didn't have problems. |
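A self-contained sketch of the same pipeline, operating on an already extracted String so that it runs without a PDF library. It differs from the line above in two ways, both of which are my own assumptions rather than part of the posted solution: the pattern keeps all whitespace ("\\s") instead of only the space character, so a line break still separates the last word of one line from the first word of the next, and trim() is applied after the replacement, so punctuation removed at the start of the text cannot leave a leading blank token. The class name WordPipeline is hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

class WordPipeline {
    // Lower-cases the text, removes everything except ASCII letters,
    // digits, and whitespace, then splits on runs of whitespace.
    static String[] words(String extractedText) {
        String cleaned = extractedText
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9\\s]", "")
                .trim();
        // An empty string would otherwise split into one empty token.
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Hello,\nWorld!  foo-bar";
        System.out.println(Arrays.toString(words(text)));
    }
}
```

For the sample text this gives the three words hello, world, and foobar; with a pattern that allows only the space character, the line break would be deleted and the first two words would merge.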