Author Message
khsu
Posts: 30
Posted 17:43 Jan 24, 2016 |

I'm not sure this was intended when the assignment was written, so I'd like to point it out.

When we download the file as a CSV and import it as a DataFrame, all the empty fields have the value '?'. As a result, any column that contains an 'empty' field ends up with object as its dtype, including the one column that was meant to be float64. The second step of the assignment for this section is still doable, but I assume we were expected to use pandas' data_frame.describe() method to get the average, after we search the DataFrame for the empty values in that column.
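(As a rough sketch of what I mean, assuming the file is saved as crx.data with no header row, reading it like this keeps the numeric columns as float64 by treating '?' as missing at read time:)

import pandas as pd

# treat '?' as a missing value so the numeric columns are parsed as float64
# instead of falling back to object dtype
df = pd.read_csv('crx.data', header=None, na_values='?')
print(df.dtypes)       # the columns that held '?' in numeric fields should now be float64
print(df.describe())   # mean, std, etc. for the numeric columns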

Since there are no specific instructions on what to do with non-numerical columns that contain '?' for the supposedly empty values, are we supposed to remove those rows when we convert these columns to an integer representation?

ahnman341
Posts: 22
Posted 18:46 Jan 24, 2016 |

I was working on converting column 0 into integers.  I don't have a working implementation yet, but here is my plan:

1.  Find the ratio of a's to b's

2.  Replace the appropriate number of '?' with a proportional number of a's and b's.

My point is, I think it's possible to convert columns of objects into columns of integers if you do enough "data munging".
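Something along these lines is what I have in mind (just a sketch, assuming the frame was read with header=None, so the '?' strings are still sitting in column 0):

import numpy as np

col = df[0]
observed = col[col != '?']
proportions = observed.value_counts(normalize=True)   # ratio of a's to b's
mask = col == '?'

# draw replacement letters for the '?' entries according to that ratio
fill = np.random.choice(proportions.index, size=mask.sum(), p=proportions.values)
df.loc[mask, 0] = fill

# then map the letters to integers, e.g. a -> 1, b -> 2
df[0] = df[0].map({'a': 1, 'b': 2})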

The UCI site crashed (lol) so I can't verify if the data given was blank or had '?'s.  I'll post again once I can verify on the site. 

ahnman341
Posts: 22
Posted 18:49 Jan 24, 2016 |

It'd be really easy to do "replace all: ? with blank" in Excel. Just saying.

ahnman341
Posts: 22
Posted 18:55 Jan 24, 2016 |

Uhh, so regarding my last post: "find ?" doesn't seem to work for me.  Every time I hit "find next", it matches every cell in the .csv.  I think something weird happened after downloading the file and changing the file extension.

khsu
Posts: 30
Posted 19:10 Jan 24, 2016 |
ahnman341 wrote:

Uhh, so regarding my last post: "find ?" doesn't seem to work for me.  Every time I hit "find next", it matches every cell in the .csv.  I think something weird happened after downloading the file and changing the file extension.

Use '~?' in the search box instead of '?' (no quotes). Excel treats '?' as a wildcard that matches any character; the tilde escapes it so it matches a literal question mark.

This brings up another question: are we allowed to manually edit the file to change '?' into actual empty values?

Last edited by khsu at 19:11 Jan 24, 2016.
ahnman341
Posts: 22
Posted 19:15 Jan 24, 2016 |

Simple solution, lol (many apologies for the email notification spam): replace all the '?' with "" using Notepad++ or Sublime.  Pandas reads empty fields as NaN, so columns 1 and 13 come in as numerical types, and calling .isnull() on the DataFrame and Series then works.
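(To sanity-check it, assuming the edited file was saved as crx.csv — the name is just an example:)

import pandas as pd

# after the '?' -> '' edit, empty fields are read in as NaN by default
df = pd.read_csv('crx.csv', header=None)
print(df.isnull().sum())            # how many missing values each column has
print(df[1].dtype, df[13].dtype)    # should now be float64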

msargent
Posts: 519
Posted 19:19 Jan 24, 2016 |
The idea was to get you to find these values and figure out how to replace them using the tools of pandas. Regarding the non-numerical fields, one thing you could do is see how many different letters there are and assign each one an integer. It's up to you to figure out how to do this. It will require some thought on your part, but you are all at least seniors or graduate students, so you should be able to work it out on your own. However, you cannot manually alter the file; you need to use pandas and/or NumPy to do it.
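(One hedged sketch of that idea, assuming the '?' entries have already been turned into NaN — pandas' factorize will hand out the integer codes:)

import pandas as pd

# assign an integer code to each distinct letter in every object-dtype column
for col in df.select_dtypes(include='object').columns:
    codes, uniques = pd.factorize(df[col])   # NaN comes back as -1, handle separately
    df[col] = codes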
khsu
Posts: 30
Posted 19:44 Jan 24, 2016 |

"As there are no specific instructions to what to do with non numerical columns that contains '?' for the supposedly empty values, are we to remove these rows when we convert these columns with a integer representation?"

 

Allow me to clarify my original question.

The instructions ask us to convert non-numerical columns to an integer representation. That is indeed doable, but my concern is this:

Since there are empty values in these fields, how are we supposed to handle them during the conversion? This is a conceptual question; I am not asking how to do it in code.

For example, for numerical values we replace the empty values with the column average/median (a quick sketch of what I mean is at the end of this post).

For non-numerical values, take the first column: it is either 'a' or 'b', with some '?' values mixed in. During conversion, say I assign 'a' to be 1 and 'b' to be 2. For the '?' values, should I randomly assign 1 or 2, assign half of them 1 and the other half 2, or look at the proportion of 'a' and 'b' and assign 1 and 2 to the empty values according to that proportion?

In actual practice, or as a rule of thumb, how are these empty data fields handled?
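(Quick sketch of the numerical case, assuming the '?' values were already read in as NaN and that columns 1 and 13 are the numeric ones, as mentioned above:)

for col in [1, 13]:                            # the numeric columns with missing entries
    df[col] = df[col].fillna(df[col].mean())   # or .median()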

msargent
Posts: 519
Posted 20:27 Jan 24, 2016 |

One thing you could do is something like what we did with the Titanic: for each row where a value is missing, find rows that have the same values for, say, two other columns (here, since we don't know what they represent in real life, just pick two), look at the value those rows have in the missing column (take a majority vote, i.e. the mode, if there is more than one), and substitute it.
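(A rough sketch of that idea — the choice of columns 3 and 4 as the "two other columns" is arbitrary, and it assumes the '?' values are already NaN:)

import numpy as np

def group_mode(s):
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

# fill missing values in column 0 with the most common value among rows that
# share the same values in columns 3 and 4
df[0] = df[0].fillna(df.groupby([3, 4])[0].transform(group_mode))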

msargent
Posts: 519
Posted 20:44 Jan 24, 2016 |

One other, easier idea: I've seen people substitute random values for missing values. Find the set of possible letter values and randomly select one to fill each missing entry. That should be good enough for our purposes.
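(As a minimal sketch, assuming the '?' values are already NaN and column 0 is one of the letter columns:)

import numpy as np

observed = df[0].dropna().unique()      # the set of possible letter values
missing = df[0].isnull()
df.loc[missing, 0] = np.random.choice(observed, size=missing.sum())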

khsu
Posts: 30
Posted 22:21 Jan 24, 2016 |

Right, thanks!

I wanted to make sure it's alright to select a random value, since it will compromise the accuracy of the prediction. I was also curious what is normally done in cases like this.

 

P.S.

On the earlier discussion of editing '?' into empty values: instead of editing the file manually in Excel, we can accomplish the same thing by calling a pandas method to change them to NaN.
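(For example, a one-line sketch, assuming the frame was read in with the '?' strings still present:)

import numpy as np

df = df.replace('?', np.nan)   # every '?' cell becomes NaN, same effect as the manual edit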

LuisFisher
Posts: 12
Posted 23:38 Jan 24, 2016 |

I have had no luck with that link: https://archive.ics.uci.edu/ml/datasets/Credit+Approval

Does anyone have the dataset they could post?

khsu
Posts: 30
Posted 00:28 Jan 25, 2016 |
LuisFisher wrote:

I have had no luck with that link: https://archive.ics.uci.edu/ml/datasets/Credit+Approval

Does anyone have the dataset they could post?

 

This is the CSV file of crx.data. I'm not sure why that site went down.

Attachments: