Some clarifications on the LDA project
- To: undisclosed-recipients:;
- Subject: Some clarifications on the LDA project
- From: Yuheng Hu <wonderfulhoo@gmail.com>
- Date: Tue, 4 Dec 2012 22:20:41 -0700
Hi All,
Some of you might have questions regarding the LDA project. I have added
some clarifications below. I hope they help you finish your project.
Input file:
---
For the news corpus, please treat each file as a document. Remove the
newlines from each document so that it becomes a single line of text in
the input file. In other words, you do not need to consider paragraphs.
As a result, the input file will have one line per document.
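
The input preparation described above could be sketched roughly as follows.
This is only an illustration, not part of the provided LDA program: the
function names, the directory layout (one news file per document in a single
folder), and the UTF-8 encoding are all assumptions.

```python
from pathlib import Path


def flatten_document(text: str) -> str:
    """Collapse all whitespace runs, including newlines, into single spaces."""
    return " ".join(text.split())


def build_input_file(corpus_dir: str, output_path: str) -> int:
    """Write one line per document and return the number of documents.

    Assumes every regular file directly inside corpus_dir is one document,
    and that output_path lies outside corpus_dir (otherwise the output file
    itself would be picked up as a document).
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(corpus_dir).iterdir()):
            if not path.is_file():
                continue
            # errors="replace" keeps the script going on odd bytes.
            text = path.read_text(encoding="utf-8", errors="replace")
            out.write(flatten_document(text) + "\n")
            count += 1
    return count
```

Calling build_input_file("news", "input.txt") on a hypothetical "news" folder
would then produce an input file with as many lines as there are documents.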
Output file:
---
Please follow this format for your output:
Topic id: top word 1, top word 2, ...... top word 15
So when the number of topics is 100, your output has exactly 100 lines.
Please note that the words for each topic are ranked by their
probability under the topic (from highest to lowest).
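
As a rough sketch of the output format, the snippet below ranks words by
probability and builds one line per topic. It assumes you already have, for
each topic, a list of word probabilities aligned with a vocabulary list; the
names format_topics, topic_word_probs, and vocab are hypothetical and are not
taken from the provided LDA program. It also assumes that "Topic id" means a
number starting at 0.

```python
def top_words(word_probs, vocab, n=15):
    """Return the n words with the highest probability, highest first."""
    ranked = sorted(range(len(vocab)), key=lambda i: word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def format_topics(topic_word_probs, vocab, n=15):
    """Build one output line per topic: '<topic id>: word 1, word 2, ...'."""
    lines = []
    for topic_id, word_probs in enumerate(topic_word_probs):
        words = top_words(word_probs, vocab, n)
        lines.append(f"{topic_id}: " + ", ".join(words))
    return lines


if __name__ == "__main__":
    # Toy example with 2 topics and a 3-word vocabulary (made-up numbers).
    vocab = ["game", "market", "election"]
    probs = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]]
    for line in format_topics(probs, vocab, n=2):
        print(line)
```

With 100 topics, format_topics returns 100 lines, one per topic.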
Optional question
---
You can come up with as many suggestions as you want, as long as they
make sense to you (note that you need to justify each suggestion by
explaining why it might help improve the performance).
LDA program
---
There are two ways to run the program: 1) you can import the program
into Eclipse and run it there (you can easily specify the location of
the output file), or 2) you can use the command line. Either way is
fine.
Thanks.
--
Best regards,
Yuheng Hu