
Some clarifications on the LDA project



Hi All,

Some of you might have questions regarding the LDA project. I have made
some clarifications below. I hope they will help you finish your project.

Input file:
---
For the news corpus, please treat each file as a document. Then remove
the newlines from the document and write it as a single line of text in
the input file. In other words, you do not need to consider paragraphs.
As a result, you will have multiple lines in the input file, one line
per document.
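
In case it helps, here is a minimal Python sketch of the preprocessing
described above. It is only an illustration, not part of the provided
LDA program. The function names, the directory layout (one news file
per document in a single folder), and the UTF-8 encoding are my
assumptions; adjust them to match your copy of the corpus.

```python
from pathlib import Path


def document_to_line(text: str) -> str:
    """Collapse one document into a single line of text.

    str.split() with no arguments splits on any run of whitespace,
    including newlines, so paragraph breaks disappear as well.
    """
    return " ".join(text.split())


def build_input_file(corpus_dir: str, output_path: str) -> int:
    """Write one line per document and return the number of lines written.

    Assumption: every regular file directly inside corpus_dir is one document.
    """
    # List the files before opening the output, so the output file is not
    # picked up as a document if it happens to live in the same folder.
    files = sorted(p for p in Path(corpus_dir).iterdir() if p.is_file())
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in files:
            # errors="replace" is a guess at how to handle odd bytes in news text.
            text = path.read_text(encoding="utf-8", errors="replace")
            line = document_to_line(text)
            if line:  # skip empty documents
                out.write(line + "\n")
                count += 1
    return count
```

For example, build_input_file("news", "input.txt") would produce an
input.txt with as many lines as there are non-empty files in the
(hypothetical) folder "news".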

Output file:
---
Please follow this format for your output:

Topic id: top word 1, top word 2, ...... top word 15

So when the number of topics is 100, you will have exactly 100 lines in
your output. Please note that the words of each topic are ordered by
their probability under that topic (from highest to lowest).
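
As a rough illustration of this format, here is a short Python sketch.
It is not taken from the provided LDA program. It assumes you already
have, for each topic, a mapping from word to probability, that "Topic
id" means the numeric id of the topic, and that ids start at 0; the
function names are hypothetical.

```python
def format_topic_line(topic_id, word_probs, top_n=15):
    """Return one output line: 'id: word1, word2, ...'.

    word_probs maps each word to its probability under this topic
    (an assumed data layout). Words are ranked from highest to lowest
    probability and only the first top_n are kept.
    """
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_words = [word for word, _ in ranked[:top_n]]
    return f"{topic_id}: " + ", ".join(top_words)


def write_topics(topics, output_path, top_n=15):
    """Write one line per topic, so 100 topics give 100 lines.

    topics is a list of word-to-probability dicts, one per topic;
    the list index is used as the topic id (an assumption).
    """
    with open(output_path, "w", encoding="utf-8") as out:
        for topic_id, word_probs in enumerate(topics):
            out.write(format_topic_line(topic_id, word_probs, top_n) + "\n")
```

With a toy topic {"game": 0.5, "team": 0.4, "vote": 0.1} and top_n=2,
format_topic_line(0, ...) gives the line "0: game, team".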

Optional question
---
You can come up with as many suggestions as you want, as long as they
make sense to you (note that you need to justify each suggestion by
explaining why it might help improve the performance).

LDA program
---
There are two ways to run the program: 1) you can import the program
into Eclipse and run it there (you can specify the location of the
output file easily), or 2) you can use the command line. Either way is
fine.

Thanks.

--
Best regards,
Yuheng Hu