Author Message
pgatsby
Posts: 10
Posted 20:37 Nov 01, 2018 |

Hello,
So I have the assignment done, but there is a bug: whenever there is a line break in the PDF, Java saves it as an entry in the list. I tried my best to remove it with if statements that test whether the String is empty so that it is not saved. However, when I debug it, Java lets the whitespace where the line break is pass through, while any other whitespace is removed. I find it odd. Does anyone else have this problem? I tried to clean both the list and the map.

 

As you can see, the fourth entry in the GUI is whitespace that is counted as a "word". I was thinking of using the dictionary file from the previous lab to verify that each string is a word, but I thought that would be redundant, and the lab does not ask for it.

 

*EDIT*

I tried iterating through the list again and removing empty strings. It works: iterating through the list dramatically reduced the amount of whitespace, but for large PDF files some whitespace still slips by.
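
One possible explanation, offered as a guess since the cleanup loop is not shown in the thread: if entries are removed by index inside a forward for loop, the element that slides into the vacated slot is never checked, so runs of consecutive blanks (which large PDFs with many line breaks tend to produce) partly survive. The sketch below contrasts that pattern with List.removeIf. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlankRemoval {
    // Hypothetical reconstruction of a forward index loop that removes blanks.
    static List<String> removeByIndex(List<String> input) {
        List<String> words = new ArrayList<>(input);
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).trim().isEmpty()) {
                // After this call the next element shifts into slot i,
                // and the loop's i++ steps over it without testing it.
                words.remove(i);
            }
        }
        return words;
    }

    // removeIf tests every element, so consecutive blanks are all dropped.
    static List<String> removeWithPredicate(List<String> input) {
        List<String> words = new ArrayList<>(input);
        words.removeIf(w -> w.trim().isEmpty());
        return words;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "", "", "b", " ", "\n", "c");
        System.out.println(removeByIndex(raw).size());   // prints 5
        System.out.println(removeWithPredicate(raw));    // prints [a, b, c]
    }
}
```

If the loop really is index based, iterating backwards or using an Iterator with remove() would also avoid the skipping.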

Last edited by pgatsby at 20:47 Nov 01, 2018.
pgatsby
Posts: 10
Posted 20:39 Nov 01, 2018 |

As you can see, the whitespace is being counted.

kcrespi
Posts: 12
Posted 21:54 Nov 01, 2018 |

It is counting the space as a word. If you are saving the whole PDF text in a string and then splitting it by spaces, make sure to split on one or more spaces. So in your string.split("\\s"); change it to string.split("\\s+");
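
A small sketch of the difference, using a made-up sample string in place of the PDF text. The class and method names are hypothetical.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits at every single whitespace character.
    static String[] splitOnEach(String text) {
        return text.split("\\s");
    }

    // Splits at runs of one or more whitespace characters.
    static String[] splitOnRuns(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "one  two\n\nthree";
        // Two adjacent separators leave an empty string between them.
        System.out.println(Arrays.toString(splitOnEach(text))); // prints [one, , two, , three]
        System.out.println(Arrays.toString(splitOnRuns(text))); // prints [one, two, three]
        // Leading whitespace still yields one empty first element.
        System.out.println(Arrays.toString(splitOnRuns("  lead"))); // prints [, lead]
    }
}
```

The last line is why calling trim() before split("\\s+") is worthwhile: leading whitespace otherwise produces an empty first element.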

pgatsby
Posts: 10
Posted 22:17 Nov 01, 2018 |
kcrespi wrote:

It is counting the space as a word. If you are saving the whole PDF text in a string and then splitting it by spaces, make sure to split on one or more spaces. So in your string.split("\\s"); change it to string.split("\\s+");

Interesting, I implemented this: string.replaceAll("[^a-zA-Z0-9\\s+]", "").trim();

The issue with the program, or with my code, is that blank strings still get in even though I clean each String as I parse it using a clean(String s) method (assuming s contains only letters). When I iterate through the list of words and check whether each string is empty, some blank strings still pass. However, this only happens with big PDF files that have many line breaks; small PDF files with few line breaks are not affected.
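
One thing that may be relevant here: inside a character class, \\s+ means "any whitespace character or a literal plus sign", so the replaceAll above keeps every line break (and every '+'), and trim() only strips the two ends of the string. The following is a minimal sketch of a cleaner that cannot let whitespace through. It assumes words should contain only letters and digits, and the names postedClean, clean and toWords are hypothetical rather than taken from the assignment.

```java
import java.util.ArrayList;
import java.util.List;

class WordCleaner {
    // The expression from the post: interior whitespace and '+' are kept.
    static String postedClean(String s) {
        return s.replaceAll("[^a-zA-Z0-9\\s+]", "").trim();
    }

    // Assumed alternative: drop everything that is not a letter or digit.
    static String clean(String s) {
        return s.replaceAll("[^A-Za-z0-9]", "");
    }

    // Split on whitespace runs, clean each token, skip tokens that end up empty.
    static List<String> toWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String word = clean(token).toLowerCase();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(toWords("Hello,\n\n  world!\r\n -- \t again")); // prints [hello, world, again]
    }
}
```

Because clean removes all whitespace, the isEmpty() check is enough. With a cleaner that keeps whitespace, a token such as "\n" is not empty and would pass that check.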

kcrespi
Posts: 12
Posted 23:01 Nov 01, 2018 |

This is how I cleaned my string 

stripper.getText(pdf).trim().toLowerCase().replaceAll("[^A-Za-z0-9 ]", "").split("\\s+");

and I didn’t have problems.
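
One thing worth checking with this chain, assuming the extracted text separates lines with line-break characters rather than spaces: the character class keeps only the space character, so line breaks are deleted before the split, and the last word of one line is glued to the first word of the next. The sketch below applies the same chain to a plain String standing in for stripper.getText(pdf). The anyWhitespace variant is an assumption, not something from the post.

```java
import java.util.Arrays;

class CleanChain {
    // The chain from the post, applied to a plain String.
    static String[] spacesOnly(String text) {
        return text.trim().toLowerCase().replaceAll("[^A-Za-z0-9 ]", "").split("\\s+");
    }

    // Assumed variant: keep all whitespace so line breaks still separate words.
    static String[] anyWhitespace(String text) {
        return text.trim().toLowerCase().replaceAll("[^A-Za-z0-9\\s]", "").split("\\s+");
    }

    public static void main(String[] args) {
        String text = "First line\nsecond line";
        System.out.println(Arrays.toString(spacesOnly(text)));    // prints [first, linesecond, line]
        System.out.println(Arrays.toString(anyWhitespace(text))); // prints [first, line, second, line]
    }
}
```

Neither version produces blank entries for this input. The difference is only in whether words on either side of a line break stay separate.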