Author Message
raywu64
Posts: 44
Posted 18:25 Feb 19, 2015 |

I'm struggling a bit with cluster programming. I'm doing lab4 with send and receive. I get the right output, but the program never terminates. I have MPI_Finalize at the end of the code, and I'm only using 4 nodes. Did anyone else have trouble with this?
 

alexnassif
Posts: 12
Posted 14:00 Feb 20, 2015 |

Yeah, I get the same thing, but only for the 4th lab.

jwarren6
Posts: 56
Posted 14:35 Feb 20, 2015 |

It's hard to say what is going on with your code. Double-check that your LAM environment is booted correctly (that seems to be a source of problems for a lot of different reasons). Also, Professor Booker has helped other people with their code problems. Go see him next week or, even better, email him with specific questions. If all else fails, Dr. Pamula seems to be very lenient about late submissions, with no penalties.

equinta4
Posts: 5
Posted 18:36 Feb 21, 2015 |

I'm having this exact same issue, but with MPI_Reduce for Lab4. I get the correct answer, but the program never terminates. I have MPI_Finalize at the end as well. Is it maybe because I use N to launch on all nodes even though I'm only using 4 of them? Are the other nodes hanging?

jwarren6
Posts: 56
Posted 21:10 Feb 21, 2015 |
equinta4 wrote:

I'm having this exact same issue, but with MPI_Reduce for Lab4. I get the correct answer, but the program never terminates. I have MPI_Finalize at the end as well. Is it maybe because I use N to launch on all nodes even though I'm only using 4 of them? Are the other nodes hanging?

Are you handling the extra nodes somewhere in your program? If you aren't, and your LAM environment has more than four nodes, that could be a source of problems. Use mpirun n0-3 <prog_name> to run your program on nodes 0 through 3 only.

raywu64
Posts: 44
Posted 09:04 Feb 22, 2015 |
jwarren6 wrote:
Are you handling the extra nodes somewhere in your program? If you aren't, and your LAM environment has more than four nodes, that could be a source of problems. Use mpirun n0-3 <prog_name> to run your program on nodes 0 through 3 only.

My lab4 only uses 3 nodes. I thought what you just said was the issue, so I did try mpirun n0-3, but the program still didn't terminate. I also tried implementing receive and send to the other nodes (not to do any data manipulation, just to have the data sent to the inactive nodes), but that didn't work either.

I talked to Tarik on Friday, and he said it could be a bugged environment and to talk to Dr. Pamula about it. On a related note, my other labs work well, so this is kind of weird.

jwarren6
Posts: 56
Posted 13:18 Feb 22, 2015 |
raywu64 wrote:
 
My lab4 only uses 3 nodes. I thought what you just said was the issue, so I did try mpirun n0-3, but the program still didn't terminate. I also tried implementing receive and send to the other nodes (not to do any data manipulation, just to have the data sent to the inactive nodes), but that didn't work either.

I talked to Tarik on Friday, and he said it could be a bugged environment and to talk to Dr. Pamula about it. On a related note, my other labs work well, so this is kind of weird.

I also had problems with this lab and it was because I was running the program with 8 nodes and only handling 4 of them.

By the way, your program should be using four nodes: node 0 to add everything and nodes 1 - 3 to handle each row/column sum.
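
To picture the mismatch I ran into, here is a rough sketch in plain C (no MPI calls, so it compiles anywhere). It assigns a role to each rank and compares how many receives node 0 would post against how many sends it would actually get. The names role_for_rank, sends_to_root, naive_receives, guarded_receives, and NUM_WORKERS are made up for illustration, and the "root loops over every other rank" pattern is only my guess at one common cause, not a diagnosis of anyone's code.

```c
#include <assert.h>

/* Assumed layout from this thread: rank 0 sums, ranks 1-3 each compute one term. */
#define NUM_WORKERS 3

enum role { ROLE_ROOT, ROLE_WORKER, ROLE_IDLE };

/* Decide what a rank should do. Ranks above NUM_WORKERS get no work. */
enum role role_for_rank(int rank)
{
    assert(rank >= 0);
    if (rank == 0)
        return ROLE_ROOT;
    if (rank <= NUM_WORKERS)
        return ROLE_WORKER;
    return ROLE_IDLE;
}

/* Number of sends addressed to the root when `size` ranks are started. */
int sends_to_root(int size)
{
    int count = 0;
    for (int rank = 1; rank < size; rank++) {
        if (role_for_rank(rank) == ROLE_WORKER)
            count++;
    }
    return count;
}

/* Receives the root posts if it naively loops over every other rank. */
int naive_receives(int size)
{
    return size - 1;
}

/* Receives the root posts if it only waits for ranks that really send. */
int guarded_receives(int size)
{
    int others = size - 1;
    return others < NUM_WORKERS ? others : NUM_WORKERS;
}
```

With 8 ranks, naive_receives(8) is 7 while sends_to_root(8) is 3, so a blocking receive on node 0 would wait forever on the other four. With 4 ranks the two counts match. Either start exactly the ranks you handle, or guard the code so the idle ranks skip the send/receive and go straight to MPI_Finalize.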

Last edited by jwarren6 at 13:21 Feb 22, 2015.
raywu64
Posts: 44
Posted 13:55 Feb 22, 2015 |
jwarren6 wrote:
I also had problems with this lab and it was because I was running the program with 8 nodes and only handling 4 of them.

By the way, your program should be using four nodes: node 0 to add everything and nodes 1 - 3 to handle each row/column sum.

For send/receive? If so, does that mean you're passing around an array of some kind, and the array finally gets to the 0th node to be added? What I did was just send around a "total", and each node added its row/col 2x2 determinant to that total.

jwarren6
Posts: 56
Posted 14:58 Feb 22, 2015 |
raywu64 wrote:

For send/receive? If so, does that mean you're passing around an array of some kind, and the array finally gets to the 0th node to be added? What I did was just send around a "total", and each node added its row/col 2x2 determinant to that total.

I let node0 handle input and then broadcast the array to the other 3 nodes. Then each of the other nodes does its calculation and sends the result to node0, and node0 receives them all. For the other version of the program, after broadcasting the array from node0, all nodes call the reduce function. Finally, node0 displays the total determinant.
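
For the arithmetic itself, here is a minimal sketch in plain C, assuming the lab is a 3x3 determinant expanded along the first row (the thread only mentions 2x2 determinants and three worker nodes, so that part is an assumption). Each call to cofactor_term stands in for what one worker node would compute, and det3 stands in for node0 adding the three results. The function names are hypothetical and there are no MPI calls here.

```c
#include <assert.h>

/* 2x2 determinant of | a b ; c d |. */
static double det2(double a, double b, double c, double d)
{
    return a * d - b * c;
}

/*
 * Signed first-row term for column k of a 3x3 matrix.
 * Taking the other two columns in cyclic order (k+1, k+2) builds the
 * alternating sign into the 2x2 determinant, so no explicit (-1)^k is needed.
 */
double cofactor_term(double m[3][3], int k)
{
    assert(k >= 0 && k < 3);
    int c1 = (k + 1) % 3;
    int c2 = (k + 2) % 3;
    return m[0][k] * det2(m[1][c1], m[1][c2], m[2][c1], m[2][c2]);
}

/* What the root would do: add up the three terms from the workers. */
double det3(double m[3][3])
{
    double total = 0.0;
    for (int k = 0; k < 3; k++)
        total += cofactor_term(m, k);
    return total;
}
```

In the send/receive version, each worker would send its one cofactor_term value to node0. In the reduce version, each node would pass its term to MPI_Reduce with a sum operation instead of node0 looping over receives.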